docs: add Getting Started section with prerequisites, quick start, and reference installation #1227

Merged
Coco-Ben merged 4 commits into NVIDIA:main from andreliuNV:docs-getting-started
May 6, 2026

Conversation

@andreliuNV
Contributor

Summary

Adds the Getting Started section to the NICo documentation, covering the full deployment path from prerequisites through first host discovery.

  • Prerequisites (4 pages): Hardware, Network, Software, BMC/OOB — consolidated from site-reference-arch.md, site-setup.md, and networking_requirements.md
  • Building NICo Containers: repositioned in nav before Quick Start
  • Quick Start Guide: 7-step deployment walkthrough (build → K8s → configure → deploy → verify → OOB → first host) with site configuration pulled from helm-prereqs/README.md
  • Reference Installation: manual phase-by-phase installation from SETUP_PHASES.md, plus PKI architecture, PostgreSQL architecture, and troubleshooting
  • Provisioning (Day 0 Operations): new section with Ingesting Hosts, Host Validation, SKU Validation
  • helm-prereqs/README.md: slimmed to a lightweight pointer to the docs site
  • Old pages removed where fully superseded (site-reference-arch, site-setup, networking_requirements, expected_machine_update, bootstrap)
  • TLS/SPIFFE page moved from kubernetes/ to development/
  • CLI naming standardized: admin-cli (gRPC) vs carbide-cli (REST) with distinction table

Context

Part of NICo Docs Code Yellow (FORGE-8168). Follows the IA recommendation from CDEVS-2173 for the Getting Started section structure. Builds on the Overview section work in PR #1190.

Test plan

  • Preview with `fern docs dev` and verify that all nav links resolve
  • Walk through Quick Start Guide steps 1-7 for completeness
  • Verify prerequisite pages cover all content from removed old pages
  • Verify Reference Installation troubleshooting section is complete
  • Check landing page persona table links resolve to new page paths
  • Verify helm-prereqs/README.md links point to published docs URLs
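
The last check in the test plan can be partly automated. The following is a minimal sketch, not part of this PR: it extracts inline Markdown links from a README and reports any whose target is not an HTTPS URL on the published docs host. The host name `docs.example.com` and the function name `find_unpublished_links` are placeholders chosen for illustration; substitute the real docs domain.

```python
import re
from urllib.parse import urlparse

# Hypothetical published docs host (assumption); replace with the real one.
DOCS_HOST = "docs.example.com"

# Matches inline Markdown links of the form [label](target).
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)\s]+)\)")


def find_unpublished_links(markdown_text, docs_host=DOCS_HOST):
    """Return (label, target) pairs whose target is not an https URL on docs_host."""
    offenders = []
    for label, target in LINK_RE.findall(markdown_text):
        parsed = urlparse(target)
        if parsed.scheme != "https" or parsed.netloc != docs_host:
            offenders.append((label, target))
    return offenders


sample = (
    "See the [Quick Start](https://docs.example.com/getting-started/quick-start) "
    "and the [old setup page](docs/site-setup.md)."
)
print(find_unpublished_links(sample))
```

This only checks where links point, not whether the target page exists; previewing the site is still needed to confirm that the links resolve.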

🤖 Generated with Claude Code

@andreliuNV requested a review from a team as a code owner on April 29, 2026 05:44
@andreliuNV force-pushed the docs-getting-started branch 2 times, most recently from bfeb251 to cff6bad, on April 29, 2026 15:27
@andreliuNV requested review from a team and shayan1995 as code owners on April 29, 2026 15:27
@andreliuNV force-pushed the docs-getting-started branch 2 times, most recently from 8db8e6f to 3664db9, on April 29, 2026 20:19
@andreliuNV requested a review from Coco-Ben as a code owner on April 29, 2026 20:19
@andreliuNV force-pushed the docs-getting-started branch from 3664db9 to 64bc25b on April 29, 2026 20:23
@andreliuNV requested a review from a team as a code owner on April 29, 2026 20:23
@lachen-nv
Contributor

PR conflict, can you resolve it?

@andreliuNV force-pushed the docs-getting-started branch from 64bc25b to 7b32e7c on April 30, 2026 03:30
@andreliuNV force-pushed the docs-getting-started branch from 7b32e7c to 4d41ce9 on April 30, 2026 03:33
@andreliuNV
Contributor Author

> PR conflict, can you resolve it?

Merge conflicts resolved. @shayan1995, since this touches your helm-prereqs README and SETUP_PHASES.md, please review.

@shayan1995
Contributor

/ok to test 9ec63e8

@benhuntley
Contributor

benhuntley commented Apr 30, 2026

Three issues I noticed:

  • API Service (NICo Core) — ... Exposes a debug web UI on /admin for operators via HTTP with OIDC authentication.
    Pedantically, I think this should be via HTTPS.
  • PostgreSQL — stores all NICo system state in the forgedb database.
    I believe the DB is called "forge_system_carbide" in the code (e.g. https://github.com/NVIDIA/infra-controller-core/blob/main/helm-prereqs/templates/postgresql.yaml#L49).
  • Route Server — manages BGP routing between DPUs and the Ethernet fabric via FRR (free range routing). Advertises service VIPs and tenant routes.
    Is this accurate? I don't think there is a standalone Route Server service that runs on the site controller.
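
The first two issues are plain string corrections, so a small scan can catch leftovers on other pages. The sketch below is illustrative only and not part of this PR: the patterns come from the quoted doc text above, the suggested replacements are the reviewer's, and the function name `find_stale_terms` is hypothetical. The Route Server question is a factual one and cannot be checked this way.

```python
import re

# Stale wording mapped to the reviewer's suggested replacement.
# \b after "HTTP" keeps the pattern from matching "HTTPS".
STALE_TERMS = {
    r"\bforgedb\b": "forge_system_carbide",
    r"\bvia HTTP\b": "via HTTPS",
}


def find_stale_terms(text):
    """Return (line_number, suggested_replacement) for each stale term found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, suggestion in STALE_TERMS.items():
            if re.search(pattern, line):
                hits.append((lineno, suggestion))
    return hits


sample = (
    "Exposes a debug web UI on /admin via HTTP with OIDC authentication.\n"
    "Stores all NICo system state in the forgedb database.\n"
    "The UI is served via HTTPS.\n"
)
print(find_stale_terms(sample))
```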

@andreliuNV
Contributor Author

@benhuntley, these issues are addressed across both #1190 and #1227.

@andreliuNV force-pushed the docs-getting-started branch from 3ae2584 to 19507de on May 4, 2026 05:55
@andreliuNV
Contributor Author

/ok to test 19507de

@andreliuNV
Contributor Author

@Coco-Ben please review

Comment thread docs/getting-started/quick-start.md
Rewrites the entire Overview section of the NICo documentation and adds
a new landing page, following the IA recommendation from CDEVS-2173.

Overview section (5 pages):
- What is NICo? — intro, "Why NICo exists" (sourced from VDR/Code Yellow),
  architecture overview with NICo Components and Prerequisite Components
  matching the architecture diagram, "Where NICo fits" stack diagram
- Key Capabilities — hardware readiness, DPU lifecycle, multi-tenancy,
  trust/attestation, firmware control, deployment flexibility, GB200 rack-scale
- Operational Principles — 5 foundational design principles
- Day 0/1/2 Lifecycle — three operational phases
- Scope and Boundaries — two-column tables showing NICo vs platform
  responsibilities (renamed from "What NICo Does Not Cover")

Landing page (index.md):
- Persona-based entry points: Deploy & Operate, Integrate, Evaluate
- Quick links to HCL, release notes, FAQs, GitHub repos

Other changes:
- Replace "NCX Infra Controller" with "NVIDIA Infra Controller" in prose
- Replace "GB200/GB300-class AI infrastructure" with "AI factory-scale
  infrastructure"
- Fix /admin UI protocol: HTTP → HTTPS
- Fix database name: forgedb → forge_system_carbide
- Remove Route Server from NICo Components (not a NICo-deployed service)
- Remove hand-wavy/marketing language across all overview pages
- Remove duplicative content between overview pages
- Add explicit URL slugs to prevent Fern slug mangling (what-is-ni-co)
- Replace stale Introduction page (README.md) with new landing page

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Liu <andreliu@nvidia.com>

Signed-off-by: Peter Gambrill <pgambrill@nvidia.com>
andreliuNV and others added 3 commits May 5, 2026 16:15
…d reference installation

Adds the Getting Started section covering the full deployment path from
prerequisites through first host discovery, following the IA recommendation
from CDEVS-2173.

Prerequisites (4 pages):
- Hardware — site controller specs, compute system requirements, DPU details,
  BIOS/UEFI settings, link to HCL for supported GPUs/systems
- Network — IP sizing formulas, address pools, ASN/VNI allocations, underlay
  BGP, EVPN overlay (with route server explanation), route-targets, switch
  config, site controller topology, physical cabling
- Software — validated versions for all dependencies with "if you already
  have / if deploying reference" decision paths, installation order
- BMC/OOB Setup — DHCP relay, BMC credentials, expected machines manifest,
  Redfish requirements

Quick Start Guide (7 steps):
1. Build NICo Containers (links to existing build guide)
2. Prepare K8s Cluster (requirements, tools)
3. Configure Site (full checklist from helm-prereqs README)
4. Run setup.sh (phase table, what gets deployed)
5. Verify Site Controller (pod checks, LoadBalancer, Keycloak, carbide-cli)
6. Connect OOB Network (DHCP relay verification)
7. Discover First Host (credentials, manifest, TPM approval)

Reference Installation:
- Manual phase-by-phase installation (from SETUP_PHASES.md)
- PKI architecture (3-layer cert chain)
- PostgreSQL architecture (Zalando operator, credential flow)
- Full troubleshooting guide

Provisioning (Day 0 Operations):
- Ingesting Hosts (consolidated with expected machines management)
- Host Validation, SKU Validation (moved from Operations)

Other changes:
- helm-prereqs/README.md slimmed to config reference + pointers to docs
- helm-prereqs/SETUP_PHASES.md removed (content in reference installation)
- Old pages removed: site-reference-arch, site-setup, networking_requirements,
  expected_machine_update, kubernetes/bootstrap
- TLS/SPIFFE page moved from kubernetes/ to development/
- CLI naming standardized: admin-cli (gRPC) vs carbide-cli (REST)
- NTP clarified: not a NICo service, provided via DHCP option 42
- carbide-ntp removed from VIP table (chart doesn't exist)
- Database name corrected: forgedb → forge_system_carbide
- Landing page persona table updated to reference new pages
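
For readers unfamiliar with the DHCP option 42 item above: RFC 2132 defines option 42 as a list of IPv4 NTP server addresses, carried as a one-byte option code, a one-byte length, and four bytes per address. The sketch below shows that wire encoding for illustration only; it is not taken from the NICo code or docs, the function name `encode_ntp_option` is hypothetical, and the addresses are documentation-range placeholders.

```python
import ipaddress

NTP_SERVERS_OPTION = 42  # DHCP option code for NTP servers (RFC 2132)


def encode_ntp_option(servers):
    """Encode IPv4 NTP server addresses as a DHCP option 42 TLV."""
    if not servers:
        raise ValueError("option 42 needs at least one address")
    # Each IPv4 address contributes exactly 4 bytes to the payload.
    payload = b"".join(ipaddress.IPv4Address(s).packed for s in servers)
    if len(payload) > 255:
        raise ValueError("too many addresses for a single option")
    return bytes([NTP_SERVERS_OPTION, len(payload)]) + payload


encoded = encode_ntp_option(["192.0.2.10", "192.0.2.11"])
print(encoded.hex())  # 2a08c000020ac000020b
```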

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Liu <andreliu@nvidia.com>
Signed-off-by: Peter Gambrill <pgambrill@nvidia.com>
@Coco-Ben force-pushed the docs-getting-started branch from 80a32c6 to 02354fd on May 5, 2026 23:17
@Coco-Ben enabled auto-merge (squash) on May 5, 2026 23:18
@Coco-Ben merged commit 2ce1e99 into NVIDIA:main on May 6, 2026
43 checks passed
inf0rmatiker pushed a commit to inf0rmatiker/infra-controller that referenced this pull request May 7, 2026
…d reference installation (NVIDIA#1227)

Signed-off-by: Peter Gambrill <pgambrill@nvidia.com>
Signed-off-by: Andrew Liu <andreliu@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Peter Gambrill <pgambrill@nvidia.com>
6 participants