From 6f7dc8a3b457b6dbc14d70fe3d9a304a0b479069 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Sun, 8 Mar 2026 22:36:39 -0500 Subject: [PATCH 01/13] docs: add end-to-end installation guide and fill customer-facing gaps - Add book/src/manuals/installation-guide.md: 10-step deployment guide stitching together existing docs and filling gaps (Vault commands, Temporal setup, admin-cli build, Elektra OTP bootstrap, verification) - Update building_bmm_containers.md: add image summary table, tagging/ pushing section (auth before tag/push), REST image build steps, fix typo "perfrom" and stray backtick in tar command - Update site-setup.md: replace nvcr.io/nvidian internal image refs with placeholders and build-from-source links (fixes #476) - Update helm/PREREQUISITES.md: add Vault PKI engine/role/auth/policy commands, explicit carbide DB/user requirements, pg extensions, and new Temporal section (optional for core, required for REST) - Update book/src/SUMMARY.md: add installation guide entry, fix broken faqs.md link (file is faq.md) - Update README.md: add installation guide link in Getting Started Signed-off-by: vigneshv --- README.md | 1 + book/src/SUMMARY.md | 1 + book/src/manuals/building_nico_containers.md | 78 ++- book/src/manuals/installation-guide.md | 478 +++++++++++++++++++ book/src/manuals/pushing_containers.md | 39 ++ book/src/manuals/site-setup.md | 18 +- helm/PREREQUISITES.md | 94 +++- 7 files changed, 688 insertions(+), 21 deletions(-) create mode 100644 book/src/manuals/installation-guide.md create mode 100644 book/src/manuals/pushing_containers.md diff --git a/README.md b/README.md index 4be89996d4..25f7e1db99 100644 --- a/README.md +++ b/README.md @@ -10,6 +10,7 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe ## Getting Started - Go to the [NCX Infra Controller overview](https://nvidia.github.io/ncx-infra-controller-core/) to get an overview of NICo architecture and capabilities. 
+- Follow the [End-to-End Installation Guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/installation-guide.html) for a complete walkthrough from cluster setup to first provisioned host.
 - Or jump to the [Site Setup guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/site-setup.html) to start setting up your site for NICo.
 - Or jump to the [Building Containers guide](https://nvidia.github.io/ncx-infra-controller-core/manuals/building_nico_containers.html) to see an overview for building the containers.
 - Check out [Local Development with DevSpace](dev/deployment/devspace/README.md) to run NICo locally with mock systems.
diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md
index 6afafd1e5c..ec4852b535 100644
--- a/book/src/SUMMARY.md
+++ b/book/src/SUMMARY.md
@@ -25,6 +25,7 @@
 
 # Manuals
 
+- [End-to-End Installation Guide](manuals/installation-guide.md)
 - [Site Setup](manuals/site-setup.md)
 - [Site Reference Architecture](manuals/site-reference-arch.md)
 - [Networking Requirements](manuals/networking_requirements.md)
diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md
index b3684966ea..ef2bb31d88 100644
--- a/book/src/manuals/building_nico_containers.md
+++ b/book/src/manuals/building_nico_containers.md
@@ -1,12 +1,34 @@
 # Building NICo Containers
 
 This section provides instructions for building the containers for NCX Infra Controller (NICo).
+For the complete deployment workflow, see the [End-to-End Installation Guide](installation-guide.md).
+
+## Container Image Summary
+
+The following table lists all container images produced by this build process:
+
+| Image Name | Dockerfile | Purpose | Architecture |
+|------------|-----------|---------|-------------|
+| `nico-buildcontainer-x86_64` | `dev/docker/Dockerfile.build-container-x86_64` | Intermediate build container (Rust toolchain, libraries) | x86_64 |
+| `nico-runtime-container-x86_64` | `dev/docker/Dockerfile.runtime-container-x86_64` | Intermediate runtime base image | x86_64 |
+| `nico` (nvmetal-carbide) | `dev/docker/Dockerfile.release-container-sa-x86_64` | Carbide API, DHCP, DNS, PXE, hardware health, SSH console | x86_64 |
+| `boot-artifacts-x86_64` | `dev/docker/Dockerfile.release-artifacts-x86_64` | PXE boot artifacts for x86 hosts | x86_64 |
+| `boot-artifacts-aarch64` | `dev/docker/Dockerfile.release-artifacts-aarch64` | PXE boot artifacts for DPU BFB provisioning | x86_64 (bundles aarch64 binaries) |
+| `machine-validation-runner` | `dev/docker/Dockerfile.machine-validation-runner` | Machine validation / burn-in test runner | x86_64 |
+| `machine-validation-config` | `dev/docker/Dockerfile.machine-validation-config` | Machine validation config (bundles runner tar) | x86_64 |
+| `build-artifacts-container-cross-aarch64` | `dev/docker/Dockerfile.build-artifacts-container-cross-aarch64` | Intermediate cross-compile container for aarch64 | x86_64 |
+
+The intermediate images (`nico-buildcontainer-x86_64`, `nico-runtime-container-x86_64`,
+`build-artifacts-container-cross-aarch64`) are used during the build process and do not
+need to be pushed to your registry. The remaining images must be pushed to a registry
+accessible by your Kubernetes cluster.
## Installing Prerequisite Software Before you begin, ensure you have the following prerequisites: * An Ubuntu 24.04 Host or VM with 150GB+ of disk space (MacOS is not supported) +* For REST containers: Go 1.25.4 or later, Docker 20.10+ with BuildKit enabled Use the following steps to install the prerequisite software on the Ubuntu Host or VM. These instructions assume an `apt`-based distribution such as Ubuntu 24.04. @@ -55,27 +77,34 @@ cargo make --cwd pxe --env SA_ENABLEMENT=1 build-boot-artifacts-x86-host-sa docker build --build-arg "CONTAINER_RUNTIME_X86_64=alpine:latest" -t boot-artifacts-x86_64 -f dev/docker/Dockerfile.release-artifacts-x86_64 . ``` -## Building the Machine Validation images +## Building the Machine Validation Images ```sh -docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 -t machine-validation-runner -f dev/docker/Dockerfile.machine-validation-runner . +docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 \ + -t machine-validation-runner -f dev/docker/Dockerfile.machine-validation-runner . -docker save --output crates/machine-validation/images/machine-validation-runner.tar machine-validation-runner:latest - -// This copies `machine-validation-runner.tar` into the `/images` directory on the `machine-validation-config` container. When using a kubernetes deployment model -// this is the only `machine-validation` container you need to configure on the `carbide-pxe` pod. - -docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 -t machine-validation-config -f dev/docker/Dockerfile.machine-validation-config . +docker save --output crates/machine-validation/images/machine-validation-runner.tar \ + machine-validation-runner:latest +docker build --build-arg CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64 \ + -t machine-validation-config -f dev/docker/Dockerfile.machine-validation-config . 
``` -## Building nico-core container +The `machine-validation-config` container bundles `machine-validation-runner.tar` into its +`/images` directory. In a Kubernetes deployment, this is the only machine-validation +container you need to configure on the `carbide-pxe` pod. + +## Building nico-core Container ```sh -docker build --build-arg "CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64" --build-arg "CONTAINER_BUILD_X86_64=nico-buildcontainer-x86_64" -f dev/docker/Dockerfile.release-container-sa-x86_64 -t nico . +docker build \ + --build-arg "CONTAINER_RUNTIME_X86_64=nico-runtime-container-x86_64" \ + --build-arg "CONTAINER_BUILD_X86_64=nico-buildcontainer-x86_64" \ + -f dev/docker/Dockerfile.release-container-sa-x86_64 \ + -t nico . ``` -## Building the AARCH64 Containers and artifacts +## Building the AARCH64 Containers and Artifacts ### Building the Cross-compile container @@ -101,3 +130,30 @@ docker build --build-arg "CONTAINER_RUNTIME_AARCH64=alpine:latest" -t boot-artif ``` **NOTE**: The `CONTAINER_RUNTIME_AARCH64=alpine:latest` build argument must be included. The aarch64 binaries are bundled into an x86 container. + +## Building REST Containers + +The REST components (cloud-api, cloud-workflow, site-manager, site-agent, +db migrations, cert-manager) are built from the +[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) repository. 
+
+```sh
+cd bare-metal-manager-rest
+make docker-build IMAGE_REGISTRY= IMAGE_TAG=
+```
+
+### REST Image Summary
+
+| Image | Purpose |
+|-------|---------|
+| `carbide-rest-api` | REST API server (port 8388) |
+| `carbide-rest-workflow` | Temporal workflow worker (cloud-worker, site-worker) |
+| `carbide-rest-site-manager` | Site management / registry service |
+| `carbide-rest-site-agent` | On-site agent (elektra) |
+| `carbide-rest-db` | Database migration job (runs once per upgrade) |
+| `carbide-rest-cert-manager` | Native PKI certificate manager (credsmgr) |
+
+## Next Steps
+
+After building all images, tag and push them to your private registry.
+See [Tagging and Pushing Containers](pushing_containers.md).
diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md
new file mode 100644
index 0000000000..85bb0e503d
--- /dev/null
+++ b/book/src/manuals/installation-guide.md
@@ -0,0 +1,478 @@
+# End-to-End Installation Guide
+
+This guide ties together the build, deploy, and configuration steps needed to go from
+a ready Kubernetes cluster to your first provisioned bare-metal host. It links to
+existing documentation for each major step and fills the gaps between them.
+
+The order of operations below follows the sequence validated by NVIDIA engineering
+and SA teams during production deployments.
+
+## Order of Operations
+
+| Step | What | Where to find details |
+|------|------|----------------------|
+| 1 | [Build and push all container images](#1-build-and-push-containers) | [Building NICo Containers](building_nico_containers.md), [REST README](https://github.com/NVIDIA/bare-metal-manager-rest#building-docker-images) |
+| 2 | [Provision site controller OS and Kubernetes](#2-site-controller-and-kubernetes) | [Site Reference Architecture](site-reference-arch.md) |
+| 3 | [Deploy foundation services](#3-foundation-services) | [Site Setup](site-setup.md), [helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) |
+| 4 | [Deploy site CA, credsmgr, and Temporal](#4-site-ca-credsmgr-and-temporal) | This guide |
+| 5 | [Deploy Carbide REST / cloud components](#5-deploy-carbide-rest-components) | This guide, [REST repo](https://github.com/NVIDIA/bare-metal-manager-rest) |
+| 6 | [Deploy Carbide core](#6-deploy-carbide-core) | [Helm README](../../helm/README.md), [deploy/README.md](../../deploy/README.md) |
+| 7 | [Install admin-cli](#7-install-admin-cli) | This guide |
+| 8 | [Deploy Elektra site agent](#8-deploy-elektra-site-agent) | This guide |
+| 9 | [Ingest managed hosts](#9-ingest-hosts) | [Ingesting Hosts](ingesting_machines.md) |
+| 10 | [Verify end-to-end](#10-verification) | This guide |
+
+---
+
+## 1. Build and Push Containers
+
+All container images must be built from source and pushed to a registry your cluster
+can access. There are no pre-built public images available.
+
+```{note}
+If you encounter `nvcr.io/nvidian/...` image references in documentation or manifests,
+those are NVIDIA-internal paths not accessible externally. Replace them with your own
+registry paths after building from source.
+```
+
+### BMM Core
+
+Follow the [Building NICo Containers](building_nico_containers.md) guide for build steps,
+then [Tagging and Pushing Containers](pushing_containers.md) to push images to your
+private registry. Together, these guides cover
+prerequisites, build steps for x86_64 and aarch64, tagging, pushing to a private
+registry, and a summary table of all images produced.
+
+### BMM REST
+
+Clone [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest)
+and build with:
+
+```bash
+REGISTRY=
+TAG=
+
+make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG
+
+for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \
+    carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager; do
+  docker push "$REGISTRY/$image:$TAG"
+done
+```
+
+See the [bare-metal-manager-rest README](https://github.com/NVIDIA/bare-metal-manager-rest#building-docker-images)
+for the full list of images and build options.
+
+---
+
+## 2. Site Controller and Kubernetes
+
+Customers are expected to provision their own site controller OS and Kubernetes cluster.
+
+See the [Site Reference Architecture](site-reference-arch.md) for hardware requirements,
+Kubernetes versions, networking best practices, and IP pool sizing.
+
+In summary, you need:
+
+* 3 or 5 site controller nodes running Ubuntu 24.04 LTS with Kubernetes v1.30.x
+* CNI (Calico v3.28.1 validated), ingress controller (Contour), load balancer (MetalLB)
+* OOB switch VLANs with DHCP relay pointing at the Carbide DHCP service VIP
+* In-band ToR switches with BGP unnumbered on DPU-facing ports, EVPN enabled
+* IP pools allocated per the reference architecture
+
+---
+
+## 3. Foundation Services
+
+Deploy the following services before any Carbide components. The order within this
+step matters.
+
+**For baselines and versions**, see [Site Setup](site-setup.md).
+
+**For the Secrets, ConfigMaps, and ClusterIssuer** that the Helm chart expects, see
+[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) -- it provides `kubectl create`
+commands for every required resource.
+
+Deploy in this order:
+
+1. **External Secrets Operator (ESO)** -- optional, but simplifies secret management.
+   If you skip ESO, create all Kubernetes Secrets manually.
+
+2. **cert-manager** (v1.11.1+) with approver-policy (v0.6.3). Create the
+   `vault-forge-issuer` ClusterIssuer as described in
+   [helm/PREREQUISITES.md](../../helm/PREREQUISITES.md#5-clusterissuer).
+
+3. **PostgreSQL** -- SSL-enabled, with required extensions:
+
+```bash
+psql "postgres://:@:/?sslmode=require" \
+  -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \
+  -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;'
+```
+
+4. **Vault** -- deployed and unsealed, with:
+   * PKI secrets engine at mount path **`forgeca`**
+   * PKI role named **`forge-cluster`**
+   * Kubernetes auth enabled with a role for the cert-manager service account
+   * Vault policy granting sign/issue capabilities
+
+These Vault configuration steps are documented in detail in
+[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md#hashicorp-vault).
+
+---
+
+## 4. Site CA, credsmgr, and Temporal
+
+This step sets up the certificate infrastructure that both the REST / cloud components
+and Temporal depend on.
+
+### 4.1 Create Site CA Secrets
+
+Create root CA secrets in the `cert-manager` namespace:
+
+```bash
+kubectl -n cert-manager create secret generic vault-root-ca-certificate \
+  --from-file=certificate=./cacert.pem
+kubectl -n cert-manager create secret generic vault-root-ca-private-key \
+  --from-file=privatekey=./ca.key
+```
+
+If you need to generate a self-signed root CA for testing:
+
+```bash
+openssl req -x509 -newkey rsa:4096 -keyout ca.key -out cacert.pem \
+  -sha256 -days 3650 -nodes -subj "/CN=Carbide Root CA"
+```
+
+### 4.2 Deploy cloud-cert-manager (credsmgr)
+
+credsmgr runs an embedded Vault process and creates the `vault-issuer` ClusterIssuer
+used for Temporal TLS certificates and cloud component mTLS.
+
+From the `bare-metal-manager-rest` repository, update images in
+`deploy/kustomize/base/cert-manager/kustomization.yaml` to point at your registry,
+then:
+
+```bash
+kubectl apply -k deploy/kustomize/base/cert-manager
+kubectl get clusterissuer vault-issuer
+```
+
+Verify that `vault-issuer` shows `Ready=True` before proceeding.
+
+### 4.3 Provision Temporal TLS Certificates
+
+Apply Temporal certificate manifests (client certs for `cloud-api` and `cloud-workflow`,
+server certs for the `temporal` namespace). These manifests are in the
+`bare-metal-manager-rest` repository under `deploy/kustomize/base/temporal-certs`:
+
+```bash
+kubectl apply -k deploy/kustomize/base/temporal-certs
+```
+
+Verify:
+
+```bash
+kubectl -n cloud-api get certificate temporal-client-cloud-certs
+kubectl -n cloud-workflow get certificate temporal-client-cloud-certs
+kubectl -n temporal get secret server-cloud-certs server-interservice-certs server-site-certs
+```
+
+### 4.4 Deploy Temporal
+
+Deploy Temporal server v1.22.6 with Elasticsearch 7.17.3 for visibility.
+Use the TLS certificates provisioned above for mTLS.
+
+After all Temporal pods are `Running`, register the required namespaces:
+
+```bash
+tctl --ns cloud namespace register
+tctl --ns site namespace register
+```
+
+```{note}
+If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready.
+Check `kubectl -n temporal logs elasticsearch-master-0` and wait for ES to become
+healthy, or create the index manually.
+```
+
+---
+
+## 5. Deploy Carbide REST Components
+
+The REST / cloud layer provides the customer-facing API, workflow orchestration, and
+site management. Deploy from the
+[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) repository.
+
+For each component below, update the image reference in `kustomization.yaml` to
+your registry and adjust ConfigMaps for your Postgres, Temporal, and Vault endpoints.
+
+### 5.1 Database Migrations (cloud-db)
+
+Initializes the cloud database schema. This is a one-time job:
+
+```bash
+kubectl apply -k deploy/kustomize/base/db
+kubectl -n cloud-db get jobs -w
+```
+
+Wait for the job to complete before proceeding.
+
+### 5.2 cloud-workflow
+
+Deploys `cloud-worker` and `site-worker` Temporal workers:
+
+```bash
+kubectl apply -k deploy/kustomize/base/cloud-workflow
+kubectl -n cloud-workflow get pods
+```
+
+Both deployments should reach `Running`.
+
+### 5.3 cloud-api
+
+The customer-facing REST API:
+
+```bash
+kubectl apply -k deploy/kustomize/base/cloud-api
+kubectl -n cloud-api get pods
+```
+
+### 5.4 cloud-site-manager
+
+The site registry service:
+
+```bash
+kubectl apply -k deploy/kustomize/base/site-manager
+```
+
+```{note}
+If `carbide-rest-site-manager` fails with `unable to start container process`, the
+entrypoint in `deployment.yaml` does not match the production Dockerfile. Update
+`deployment.yaml` to use the correct binary path.
+```
+
+---
+
+## 6. Deploy Carbide Core
+
+This deploys the on-site gRPC API and all supporting services (DHCP, DNS, PXE,
+hardware health, SSH console, and optionally Unbound) into the `forge-system` namespace.
+
+There are two deployment methods: **Helm** (recommended) and **Kustomize** (legacy).
+
+### Helm (Recommended)
+
+See the [Helm chart README](../../helm/README.md) for full documentation and
+[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) for the Secrets and ConfigMaps
+that must exist before install.
+
+1. Copy `helm/examples/values-minimal.yaml` (or `values-full.yaml`) and customize:
+   * `global.image.repository` and `global.image.tag` -- your built core image
+   * `global.imagePullSecrets` -- if using a private registry
+   * `carbide-api.hostname` -- your API FQDN
+   * `carbide-api.siteConfig.carbideApiSiteConfig` -- site-specific TOML overrides
+   * MetalLB `externalService` annotations for each service VIP
+   * Kea DHCP configuration under `carbide-dhcp.config`
+
+2. Install:
+
+```bash
+helm upgrade --install carbide ./helm \
+  --namespace forge-system --create-namespace \
+  -f values-mysite.yaml
+```
+
+3. Verify:
+
+```bash
+kubectl -n forge-system get pods
+kubectl -n forge-system get certificates
+```
+
+The migration job runs automatically. Pods may briefly restart until the database is ready.
+
+### Kustomize (Alternative)
+
+See [deploy/README.md](../../deploy/README.md) for the full list of inputs.
+Populate `deploy/kustomization.yaml` and `deploy/files/`, then:
+
+```bash
+cd deploy
+kustomize build . --enable-helm --enable-alpha-plugins --enable-exec | kubectl apply -f -
+```
+
+### Verify the API
+
+```bash
+curl -k https://:1079/healthz
+```
+
+If the API VIP is not externally reachable:
+
+```bash
+kubectl port-forward svc/carbide-api 1079:1079 -n forge-system &
+curl -k https://localhost:1079/healthz
+```
+
+---
+
+## 7. Install admin-cli
+
+Build from source in the `bare-metal-manager-core` repository:
+
+```bash
+cargo make build-cli
+```
+
+The binary is at `target/release/admin-cli`. Point it at your API:
+
+```bash
+admin-cli -c https://api-. site info
+```
+
+If the API is not externally reachable:
+
+```bash
+kubectl port-forward svc/carbide-api 1079:1079 -n forge-system &
+admin-cli -c https://localhost:1079 site info
+```
+
+---
+
+## 8. Deploy Elektra Site Agent
+
+Elektra bridges the on-site Carbide core to the cloud REST layer via Temporal.
+
+1. Register a site through cloud-api or cloud-site-manager to get a ``.
+
+2. Register the per-site Temporal namespace:
+
+```bash
+tctl --ns namespace register
+```
+
+3. Generate an OTP for the site agent and create the bootstrap secret. The OTP is
+   issued by `cloud-site-manager` and stored as a Kubernetes secret in the
+   `elektra-site-agent` namespace:
+
+```bash
+# Issue a one-time password for the site
+OTP=$(curl -s -X POST https:///api/v1/sites//otp \
+  -H "Authorization: Bearer " | jq -r '.otp')
+
+kubectl -n elektra-site-agent create secret generic site-agent-bootstrap \
+  --from-literal=SITE_UUID= \
+  --from-literal=OTP="$OTP" \
+  --from-literal=CLOUD_API_ENDPOINT=https://
+```
+
+4. Update the image and site config in the site-agent manifests, then apply:
+
+```bash
+kubectl apply -k deploy/kustomize/base/site-agent
+```
+
+5. Verify:
+
+```bash
+kubectl -n elektra-site-agent get pods
+kubectl -n elektra-site-agent logs -l app=elektra --tail=50
+```
+
+---
+
+## 9. Ingest Hosts
+
+See [Ingesting Hosts](ingesting_machines.md) for the complete procedure.
+
+For each managed host, you need the **BMC MAC address**, **chassis serial number**, and
+**factory BMC username/password** (from your asset management system or server vendor).
+
+```bash
+# Set the credentials that BMM will apply to all hosts
+admin-cli -c credential add-bmc --kind=site-wide-root --password=''
+admin-cli -c credential add-uefi --kind=host --password=''
+
+# Upload expected machines manifest
+admin-cli -c credential em replace-all --filename expected_machines.json
+
+# Approve for measured boot ingestion
+admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6"
+```
+
+BMM then automatically assigns IPs via DHCP, discovers BMCs via Redfish, rotates
+credentials, provisions DPUs, PXE-boots hosts into Scout for hardware discovery, and
+moves machines to the `Available` pool.
+
+Monitor progress:
+
+```bash
+admin-cli -c machine list
+```
+
+---
+
+## 10. Verification
+
+Once hosts are `Available`, verify the full deployment:
+
+```bash
+# All core pods running
+kubectl -n forge-system get pods
+
+# API healthy
+curl -k https://:1079/healthz
+
+# Machines discovered and available
+admin-cli -c machine list
+
+# Admin UI accessible
+# https://api-./admin
+# Or via port-forward: kubectl port-forward svc/carbide-api 1079:1079 -n forge-system
+```
+
+To complete the hello-world test, create an instance to provision Ubuntu on a managed
+host, then SSH to verify:
+
+```bash
+ssh -p 22 @
+```
+
+---
+
+## Troubleshooting
+
+### Temporal Pods Stuck in Init
+
+Pods stuck in `Init:0/1` usually mean the Elasticsearch index is not ready yet.
+Check `kubectl -n temporal logs elasticsearch-master-0`.
+
+### kubectl Connection Refused
+
+When accessing the cluster through a jump host, forward the API server port: `ssh -L 6443:localhost:6443 `
+
+### External API Access Blocked
+
+Use port-forwarding: `kubectl port-forward svc/carbide-api 1079:1079 -n forge-system`
+
+### carbide-rest-site-manager Fails to Start
+
+An `unable to start container process` error indicates an entrypoint mismatch between
+`deployment.yaml` and the Dockerfile. Update `deployment.yaml` to use the correct binary path.
+
+### Pods Stuck in ImagePullBackOff
+
+This usually means `imagePullSecrets` is missing. Verify: `kubectl -n get secret imagepullsecret`
+
+### nvcr.io/nvidian Image References
+
+These are NVIDIA-internal paths. Build the images from source (Step 1) and replace the references with your registry URL.
+
+### Machines Not Progressing
+
+Check state controller logs:
+`kubectl -n forge-system logs -l app=carbide-api --tail=100 | grep state_controller`
+
+Common causes: DHCP relay not configured on OOB switch, BMC MACs not matching the
+expected machines table, network boot not first in boot order.
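Several checks in the installation guide above can fail transiently while services come up; the guide itself notes that pods may briefly restart until the database is ready. A minimal sketch of a retry wrapper for such checks follows, assuming a Bash shell. `retry` is a hypothetical helper written for illustration and is not part of NICo; the attempt count and delay are arbitrary.

```bash
#!/usr/bin/env bash
# retry: hypothetical helper (not part of NICo tooling).
# Runs a command until it succeeds or the attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    # Run the command exactly as passed; stop on the first success.
    if "$@"; then
      return 0
    fi
    # Give up once the final allowed attempt has failed.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 30 10 curl -ksf "$API_URL/healthz"` would poll the health endpoint every 10 seconds for up to 30 attempts. `API_URL` is a variable you would set to your own API address; it is an assumption here, not something the guide defines.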
diff --git a/book/src/manuals/pushing_containers.md b/book/src/manuals/pushing_containers.md
new file mode 100644
index 0000000000..7e610e3738
--- /dev/null
+++ b/book/src/manuals/pushing_containers.md
@@ -0,0 +1,39 @@
+# Tagging and Pushing Containers to a Private Registry
+
+After building all NICo container images (see [Building NICo Containers](building_nico_containers.md)),
+tag them for your private registry and push. Set your registry URL and version tag as
+environment variables:
+
+```sh
+REGISTRY=
+TAG=
+```
+
+## Authenticate with Your Registry
+
+```sh
+docker login
+```
+
+## Tag and Push NICo Core Images
+
+```sh
+docker tag nico $REGISTRY/nvmetal-carbide:$TAG
+docker tag boot-artifacts-x86_64 $REGISTRY/boot-artifacts-x86_64:$TAG
+docker tag boot-artifacts-aarch64 $REGISTRY/boot-artifacts-aarch64:$TAG
+docker tag machine-validation-config $REGISTRY/machine-validation-config:$TAG
+
+docker push $REGISTRY/nvmetal-carbide:$TAG
+docker push $REGISTRY/boot-artifacts-x86_64:$TAG
+docker push $REGISTRY/boot-artifacts-aarch64:$TAG
+docker push $REGISTRY/machine-validation-config:$TAG
+```
+
+## Tag and Push BMM REST Images
+
+```sh
+for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \
+    carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager; do
+  docker push "$REGISTRY/$image:$TAG"
+done
+```
diff --git a/book/src/manuals/site-setup.md b/book/src/manuals/site-setup.md
index aa32e3c4ec..8932a6e207 100644
--- a/book/src/manuals/site-setup.md
+++ b/book/src/manuals/site-setup.md
@@ -74,17 +74,23 @@ These components are not required for NICo setup, but are recommended site metri
 
 The following services are installed during the NICo installation process.
 
-- **NICo core (forge‑system)**
+- **NICo core (forge-system)**
 
-  - nvmetal-carbide:v2025.07.04-rc2-0-8-g077781771 (primary carbide-api, plus supporting workloads)
+  - `/nvmetal-carbide:` (primary carbide-api, plus supporting workloads).
+    Build from [bare-metal-manager-core](https://github.com/NVIDIA/bare-metal-manager-core).
+    See [Building NICo Containers](building_nico_containers.md).
 
-- **cloud‑api**: cloud-api:v0.2.72 (two replicas)
+- **cloud-api**: `/carbide-rest-api:` (two replicas).
+  Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest).
 
-- **cloud‑workflow**: cloud-workflow:v0.2.30 (cloud‑worker, site‑worker)
+- **cloud-workflow**: `/carbide-rest-workflow:` (cloud-worker, site-worker).
+  Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest).
 
-- **cloud‑cert‑manager (credsmgr)**: cloud-cert-manager:v0.1.16
+- **cloud-cert-manager (credsmgr)**: `/carbide-rest-cert-manager:`.
+  Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest).
 
-- **elektra-site-agent**: forge-elektra:v2025.06.20-rc1-0
+- **elektra-site-agent**: `/carbide-rest-site-agent:`.
+  Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest).
 
 ## Order of Operations
 
diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md
index 9955a80638..ccec86f1ab 100644
--- a/helm/PREREQUISITES.md
+++ b/helm/PREREQUISITES.md
@@ -26,9 +26,53 @@ helm install cert-manager jetstack/cert-manager \
 Required for PKI (certificate signing) and secret storage. Vault serves as the backend for the cert-manager issuer and provides secrets to various Carbide components.
 
 - Vault must be deployed and unsealed.
-- A PKI secrets engine must be configured for certificate signing.
+- A **PKI secrets engine** must be enabled at mount path **`forgeca`**:
+
+```bash
+vault secrets enable -path=forgeca pki
+vault secrets tune -max-lease-ttl=87600h forgeca
+```
+
+- A **PKI role** named **`forge-cluster`** must be created under the `forgeca` mount.
This role name is referenced by `carbide-api` via the `VAULT_PKI_ROLE_NAME` environment variable: + +```bash +vault write forgeca/roles/forge-cluster \ + allowed_domains="forge.local,svc.cluster.local" \ + allow_subdomains=true \ + allow_bare_domains=true \ + max_ttl=8760h +``` + +- **Kubernetes auth** must be enabled with a role for the **cert-manager service account**, so the `vault-forge-issuer` ClusterIssuer (Section 5) can authenticate to Vault: + +```bash +vault auth enable kubernetes +vault write auth/kubernetes/config \ + kubernetes_host="https://kubernetes.default.svc:443" +vault write auth/kubernetes/role/cert-manager \ + bound_service_account_names=cert-manager \ + bound_service_account_namespaces=cert-manager \ + policies=forge-pki-policy \ + ttl=1h +``` + +- A **Vault policy** must grant the cert-manager role permission to sign certificates: + +```bash +vault policy write forge-pki-policy - <:@:/?sslmode=require" \ + -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \ + -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;' +``` + +- **Schema creation:** The migration job included in the `carbide-api` subchart handles schema creation and migrations automatically after extensions are in place. You do not need to run migrations manually. - **Connection details:** Provided to the chart via a ConfigMap and a Secret (see Sections 3 and 4 below). +For additional PostgreSQL configuration details (TLS, ESO integration, per-namespace credentials), see the [Site Setup guide](book/src/manuals/site-setup.md#postgresql-db). + +--- + +## 2a. Temporal (Required for bare-metal-manager-rest only) + +Temporal is **not required** by the Carbide core Helm chart. You can operate Carbide core +standalone using `admin-cli` with direct gRPC commands. + +Temporal **is required** if you deploy the +[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) layer +(cloud-api, cloud-workflow, site-manager, elektra-site-agent). 
The REST components use +Temporal for workflow orchestration between the cloud control plane and site agents. + +If you plan to deploy bare-metal-manager-rest: + +- **Reference version:** Temporal server v1.22.6, admin tools v1.22.4, UI v2.16.2 +- **Visibility store:** Elasticsearch 7.17.3 +- **Persistence:** PostgreSQL (can share the same cluster as Carbide, with separate + databases `temporal` and `temporal_visibility`) +- **Frontend endpoint:** `temporal-frontend.temporal.svc:7233` (cluster-internal) +- **Required namespaces:** Register `cloud` and `site` after Temporal is running: + +```bash +tctl --ns cloud namespace register +tctl --ns site namespace register +``` + +- **mTLS:** The REST components expect Temporal client TLS certificates. These are + issued by the `vault-issuer` ClusterIssuer created by cloud-cert-manager (credsmgr), + which is part of bare-metal-manager-rest. See the + [End-to-End Installation Guide](book/src/manuals/installation-guide.md) for the + full deployment order. + --- ## 3. Kubernetes Secrets From 24d7f406a9761987779bed2389056dfd5020652f Mon Sep 17 00:00:00 2001 From: vigneshv Date: Tue, 10 Mar 2026 12:08:09 -0500 Subject: [PATCH 02/13] docs: add Vault AppRole/token generation steps to PREREQUISITES.md Add step-by-step instructions for obtaining VAULT_ROLE_ID, VAULT_SECRET_ID, and VAULT_TOKEN from Vault. These values were previously listed as required but with no explanation of how to generate them -- customers were blocked at this step. Signed-off-by: vigneshv --- helm/PREREQUISITES.md | 54 ++++++++++++++++-- work.md | 130 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 180 insertions(+), 4 deletions(-) create mode 100644 work.md diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index ccec86f1ab..b8708a35ea 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -166,23 +166,69 @@ Vault AppRole credentials for automated secret access by Carbide services. 
**Required keys:** `VAULT_ROLE_ID`, `VAULT_SECRET_ID` +To obtain these values, enable AppRole auth in Vault and create a role for Carbide: + +```bash +vault auth enable approle + +vault write auth/approle/role/carbide \ + token_policies="forge-pki-policy,forge-kv-policy" \ + token_ttl=1h \ + token_max_ttl=4h \ + secret_id_ttl=0 +``` + +Then read the role ID and generate a secret ID: + +```bash +vault read -field=role_id auth/approle/role/carbide/role-id +vault write -field=secret_id -f auth/approle/role/carbide/secret-id +``` + +Create the Kubernetes secret with the values returned above: + ```bash kubectl create secret generic carbide-vault-approle-tokens \ --namespace forge-system \ - --from-literal=VAULT_ROLE_ID='' \ - --from-literal=VAULT_SECRET_ID='' + --from-literal=VAULT_ROLE_ID='' \ + --from-literal=VAULT_SECRET_ID='' ``` ### `carbide-vault-token` -Vault token for direct API access. +Vault token for direct API access. This token is used by Carbide services that +authenticate to Vault directly rather than via AppRole. **Required keys:** `VAULT_TOKEN` +Generate a token with the policies Carbide needs: + +```bash +vault token create \ + -policy=forge-pki-policy \ + -policy=forge-kv-policy \ + -ttl=768h \ + -display-name=carbide-api +``` + +The `token` field in the output is your `VAULT_TOKEN`. Create the Kubernetes secret: + ```bash kubectl create secret generic carbide-vault-token \ --namespace forge-system \ - --from-literal=VAULT_TOKEN='' + --from-literal=VAULT_TOKEN='' +``` + +**Note:** The policies referenced above (`forge-pki-policy`, `forge-kv-policy`) must +be created first. See the [Vault section](#hashicorp-vault) above for the PKI policy. +For the KV policy: + +```bash +vault policy write forge-kv-policy - < "perform", removed stray backtick in tar command + +### Updated: book/src/manuals/site-setup.md + +Replaced 5 lines referencing nvcr.io/nvidian/nvforge-devel/... 
with: +- `/image-name:` placeholder format +- "Build from [repo-name](github-link)" for each component +This directly fixes GitHub issue #476. + +### Updated: helm/PREREQUISITES.md + +Added under "HashiCorp Vault": +- PKI secrets engine must be at mount path `forgeca` (with vault commands) +- PKI role must be named `forge-cluster` (with vault write command) +- Kubernetes auth must have a role for cert-manager SA (with vault commands) +- Vault policy for PKI signing (with vault policy write command) +- Link to site-setup.md for additional details + +Added new section "2a. Temporal": +- Temporal is NOT required for carbide-core (can use admin-cli with gRPC directly) +- Temporal IS required for bare-metal-manager-rest +- Reference versions, frontend endpoint, required namespaces, mTLS note + +Updated under "PostgreSQL Database": +- Explicit: "Create a dedicated database named carbide with a dedicated user named + carbide. Do not use the default postgres superuser." +- Added required extensions (btree_gin, pg_trgm) with psql command +- Link to site-setup.md for additional details + +### Updated: book/src/SUMMARY.md + +Added "End-to-End Installation Guide" as the first entry under Manuals. + +### Updated: README.md + +Added installation guide link in Getting Started section. Tweaked existing link +descriptions to be more specific. + +## Overlap with PR #479 + +Larry Chen's PR #479 ("docs: remove private repos") also modifies site-setup.md to +strip nvcr.io/nvidian prefixes. His change is narrower (just removes the prefix, +leaving bare image names). Our change is broader (replaces with YOUR_REGISTRY +placeholders and adds build-from-source links). Whoever merges second resolves the +conflict on that file. + +## How to review + +1. Start with `book/src/manuals/installation-guide.md` -- does the 10-step flow make + sense? Does it match what you'd actually do deploying Carbide? + +2. Check `helm/PREREQUISITES.md` -- are the Vault commands correct? 
Is the Temporal + section accurate (optional for core, required for REST)? + +3. Check `book/src/manuals/building_bmm_containers.md` -- is the image summary table + complete? Are the tag/push commands right? + +4. Check `book/src/manuals/site-setup.md` -- are the replacement image names and repo + links correct? + +## Source material + +- SA deployment notes for TLV01 (Chelsea Isaac's 10-step guide) +- Carbide Installation Walkthrough BYO K8s (~4000 line internal doc, sections 7.0-7.14) +- Customer questions from SMC/Rafay partner deployments +- GitHub issue #476 (nvcr.io references blocking customers) +- Slack threads from #carbide-sa-enablement and ext-rafay channels From 97b00ec0c14a1e84c639705ebea872dbedcf3003 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Tue, 10 Mar 2026 15:56:29 -0500 Subject: [PATCH 03/13] docs: fix stale BMM references after NICo rename - Update site-setup.md, SUMMARY.md, installation-guide.md, and pushing_containers.md to use NICo naming and correct repo links - Remove work.md from PR Signed-off-by: vigneshv --- book/src/SUMMARY.md | 14 ++- book/src/manuals/installation-guide.md | 10 +- book/src/manuals/pushing_containers.md | 2 +- book/src/manuals/site-setup.md | 16 +-- work.md | 130 ------------------------- 5 files changed, 23 insertions(+), 149 deletions(-) delete mode 100644 work.md diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index ec4852b535..d996d29c8d 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -1,4 +1,4 @@ -# NCX Infra Controller +# NVIDIA Bare Metal Manager - [Introduction](README.md) - [Hardware Compatbility List](hcl.md) @@ -11,7 +11,6 @@ - [Redfish Workflow](architecture/redfish_workflow.md) - [Redfish Endpoints Reference](architecture/redfish/endpoints_reference.md) - [Reliable state handling](architecture/state_handling.md) -- [Networking integrations](architecture/networking_integrations.md) - [DPU configuration](architecture/dpu_configuration.md) - [Health checks and health 
aggregation](architecture/health_aggregation.md) - [Health probe IDs](architecture/health/health_probe_ids.md) @@ -30,6 +29,7 @@ - [Site Reference Architecture](manuals/site-reference-arch.md) - [Networking Requirements](manuals/networking_requirements.md) - [Building NICo Containers](manuals/building_nico_containers.md) +- [Tagging and Pushing Containers](manuals/pushing_containers.md) - [Ingesting Hosts](manuals/ingesting_machines.md) - [Updating Expected Hosts Manifest](manuals/expected_machine_update.md) - [Host Validation](manuals/machine_validation.md) @@ -40,13 +40,17 @@ - [VPC Routing Profiles](manuals/vpc/vpc_routing_profiles.md) - [VPC Peering](manuals/vpc/vpc_peering_management.md) - [Metrics]() - - [Core metrics](manuals/metrics/core_metrics.md) + - [Core metrics](manuals/metrics/carbide_core_metrics.md) + +# Sites and site access + +- [carbide-admin-cli access](sites/forge_admin_cli.md) # Design -- [SPIFFE SVID Design](design/machine-identity/spiffe-svid-sdd.md) +- [SPIFFE SVID Design](design/spiffe-svid-sdd.md) # Development @@ -68,7 +72,7 @@ # Playbooks -- [Azure OIDC for NCX Infra Controller-Web UI](playbooks/carbide_web_oauth2.md) +- [Azure OIDC for NVIDIA Bare Metal Manager-Web UI](playbooks/carbide_web_oauth2.md) - [Force deleting and rebuilding Forge hosts](playbooks/force_delete.md) - [Rebooting a machine](playbooks/machine_reboot.md) - [Instance/Subnet/etc is stuck in a state]() diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index 85bb0e503d..bb87c1f062 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -35,7 +35,7 @@ those are NVIDIA-internal paths not accessible externally. Replace them with you registry paths after building from source. 
``` -### BMM Core +### NICo Core Follow the [Building NICo Containers](building_nico_containers.md) guide for build steps, then [Tagging and Pushing Containers](pushing_containers.md) to push images to your @@ -43,7 +43,7 @@ private registry. It covers prerequisites, build steps for x86_64 and aarch64, tagging, pushing to a private registry, and a summary table of all images produced. -### BMM REST +### NICo REST Clone [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) and build with: @@ -319,7 +319,7 @@ curl -k https://localhost:1079/healthz ## 7. Install admin-cli -Build from source in the `bare-metal-manager-core` repository: +Build from source in the `ncx-infra-controller-core` repository: ```bash cargo make build-cli @@ -390,7 +390,7 @@ For each managed host, you need the **BMC MAC address**, **chassis serial number **factory BMC username/password** (from your asset management system or server vendor). ```bash -# Set desired credentials BMM will apply to all hosts +# Set desired credentials NICo will apply to all hosts admin-cli -c credential add-bmc --kind=site-wide-root --password='' admin-cli -c credential add-uefi --kind=host --password='' @@ -401,7 +401,7 @@ admin-cli -c credential em replace-all --filename expected_machines.js admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6" ``` -BMM then automatically: assigns IPs via DHCP, discovers BMCs via Redfish, rotates +NICo then automatically: assigns IPs via DHCP, discovers BMCs via Redfish, rotates credentials, provisions DPUs, PXE-boots hosts into Scout for hardware discovery, and moves machines to the `Available` pool. 
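The host-ingestion hunk above says each managed host needs a BMC MAC address, a chassis serial number, and the factory BMC username/password. A minimal pre-flight sketch, assuming a hypothetical CSV inventory with the columns `bmc_mac,chassis_serial,bmc_username,bmc_password`, could catch malformed rows before the expected-machines manifest is built. The CSV layout and both function names are assumptions made for illustration; they are not part of NICo or `admin-cli`.

```bash
#!/bin/sh
# Hypothetical pre-flight check for a host inventory CSV.
# Assumed columns (not a NICo format): bmc_mac,chassis_serial,bmc_username,bmc_password

# Succeeds only for six colon-separated hex octets, e.g. aa:bb:cc:dd:ee:ff.
is_valid_mac() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Validates one CSV row; prints a one-line verdict and returns non-zero on failure.
check_inventory_line() {
  # Split on commas; the last variable keeps any remaining commas (e.g. in a password).
  IFS=',' read -r inv_mac inv_serial inv_user inv_pass <<EOF
$1
EOF
  if ! is_valid_mac "$inv_mac"; then
    echo "invalid BMC MAC: $inv_mac"
    return 1
  fi
  if [ -z "$inv_serial" ] || [ -z "$inv_user" ] || [ -z "$inv_pass" ]; then
    echo "missing field for $inv_mac"
    return 1
  fi
  echo "ok: $inv_mac"
}
```

One way to use it is to loop over the file with `while IFS= read -r row; do check_inventory_line "$row"; done < inventory.csv` and stop at the first non-`ok` line. The check only covers format and presence; it does not confirm that the credentials actually work against the BMC.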
diff --git a/book/src/manuals/pushing_containers.md b/book/src/manuals/pushing_containers.md index 7e610e3738..9a4893f4eb 100644 --- a/book/src/manuals/pushing_containers.md +++ b/book/src/manuals/pushing_containers.md @@ -29,7 +29,7 @@ docker push $REGISTRY/boot-artifacts-aarch64:$TAG docker push $REGISTRY/machine-validation-config:$TAG ``` -## Tag and Push BMM REST Images +## Tag and Push REST Images ```sh for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ diff --git a/book/src/manuals/site-setup.md b/book/src/manuals/site-setup.md index 8932a6e207..3d8faaeed1 100644 --- a/book/src/manuals/site-setup.md +++ b/book/src/manuals/site-setup.md @@ -1,6 +1,6 @@ # Site Setup Guide -This page outlines the software dependencies for a Kubernetes-based install of NCX Infra Controller (NICo). It includes the *validated baseline* of software dependencies, +This page outlines the software dependencies for a Kubernetes-based install of NVIDIA Bare Metal Manager (BMM). It includes the *validated baseline* of software dependencies, as well as the *order of operations* for site bringup, including what you must configure if you already operate some of the common services yourself. **Important Notes** @@ -16,7 +16,7 @@ as well as the *order of operations* for site bringup, including what you must c ## Validated Baseline -This section lists all software dependencies, including the versions validated for this release of NICo. +This section lists all software dependencies, including the versions validated for this release of BMM. ### Kubernetes and Node Runtime @@ -58,7 +58,7 @@ This section lists all software dependencies, including the versions validated f ### Monitoring and Telemetry (OPTIONAL) -These components are not required for NICo setup, but are recommended site metrics. +These components are not required for BMM setup, but are recommended site metrics. 
- **Monitoring System**: Prometheus Operator v0.68.0; Prometheus v2.47.0; Alertmanager v0.26.0 @@ -70,15 +70,15 @@ These components are not required for NICo setup, but are recommended site metri - **Host Monitoring** Node exporter v1.6.1 -### NICo Components +### BMM Components -The following services are installed during the NICo installation process. +The following services are installed during the BMM installation process. -- **NICo core (forge-system)** +- **NICo core (forge-system)** - `/nvmetal-carbide:` (primary carbide-api, plus supporting workloads). - Build from [bare-metal-manager-core](https://github.com/NVIDIA/bare-metal-manager-core). - See [Building BMM Containers](building_bmm_containers.md). + Build from [ncx-infra-controller-core](https://github.com/NVIDIA/ncx-infra-controller-core). + See [Building NICo Containers](building_nico_containers.md). - **cloud-api**: `/carbide-rest-api:` (two replicas). Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest). diff --git a/work.md b/work.md deleted file mode 100644 index c6ad6b3462..0000000000 --- a/work.md +++ /dev/null @@ -1,130 +0,0 @@ -# PR Review Context: docs/end-to-end-installation-guide - -## What this PR does - -This PR improves the deployment documentation for NVIDIA Bare Metal Manager (Carbide) -so external customers can actually deploy it end-to-end without needing internal NVIDIA -access, Slack channels, or video recordings. - -## The problem - -Customers following the public repos (bare-metal-manager-core and bare-metal-manager-rest) -hit three blockers: - -1. **nvcr.io/nvidian references**: `site-setup.md` listed container images from NVIDIA's - internal registry (nvcr.io/nvidian/nvforge-devel/...) with no instructions on how to - build them from source. This was filed as GitHub issue #476. - -2. **Vault/PostgreSQL gaps in helm/PREREQUISITES.md**: Customers asked (verbatim): - - "A PKI secrets engine is required for Vault. 
Is there any specific setting also?" - - "Do we have to create another one called 'carbide'?" - - "I don't see Temporal in prerequisites. Do carbide still need it?" - The answers were discoverable only by cross-referencing site-setup.md, Helm values - files, and the ClusterIssuer example. - -3. **No end-to-end guide**: The 10-step deployment flow (validated by SA teams at TLV01) - was only documented internally. Externally, customers had to piece together: - - building_bmm_containers.md (how to build core images) - - bare-metal-manager-rest README (how to build REST images) - - site-setup.md (foundation service baselines) - - helm/PREREQUISITES.md (secrets and configmaps) - - helm/README.md (Helm chart config) - - deploy/README.md (Kustomize alternative) - - ingesting_machines.md (host onboarding) - with no document explaining the ordering or the gaps between them. - -## Files changed (6 files, +685 -29) - -### New: book/src/manuals/installation-guide.md - -A lean "stitching" document that links to existing docs and fills gaps. Follows the -exact 10-step sequence SA teams used during production deployments: - -1. Build and push containers (links to building_bmm_containers.md + REST README) -2. Site controller and Kubernetes (links to site-reference-arch.md) -3. Foundation services (links to site-setup.md + PREREQUISITES.md) -4. Site CA, credsmgr, and Temporal (GAP FILLED: deployment order, vault commands) -5. Deploy Carbide REST components (GAP FILLED: cloud-db, workflow, api, site-manager) -6. Deploy Carbide core (links to helm/README.md, with Kustomize alternative) -7. Install admin-cli (GAP FILLED: build from source, port-forward workaround) -8. Deploy Elektra site agent (GAP FILLED: site registration, Temporal namespace, OTP) -9. Ingest hosts (links to ingesting_machines.md) -10. Verification (GAP FILLED: healthz, admin UI, hello-world test) - -Plus a troubleshooting section with real issues from SA deployments. 
- -### Updated: book/src/manuals/building_bmm_containers.md - -Added: -- Container image summary table (image name, Dockerfile, purpose, architecture) -- Which images are intermediate (don't push) vs deployable (must push) -- "Tagging and Pushing to a Private Registry" section with docker tag/push commands -- "Building BMM REST Containers" section with make docker-build + push loop -- REST image summary table -- Fixed typo: "perfrom" -> "perform", removed stray backtick in tar command - -### Updated: book/src/manuals/site-setup.md - -Replaced 5 lines referencing nvcr.io/nvidian/nvforge-devel/... with: -- `/image-name:` placeholder format -- "Build from [repo-name](github-link)" for each component -This directly fixes GitHub issue #476. - -### Updated: helm/PREREQUISITES.md - -Added under "HashiCorp Vault": -- PKI secrets engine must be at mount path `forgeca` (with vault commands) -- PKI role must be named `forge-cluster` (with vault write command) -- Kubernetes auth must have a role for cert-manager SA (with vault commands) -- Vault policy for PKI signing (with vault policy write command) -- Link to site-setup.md for additional details - -Added new section "2a. Temporal": -- Temporal is NOT required for carbide-core (can use admin-cli with gRPC directly) -- Temporal IS required for bare-metal-manager-rest -- Reference versions, frontend endpoint, required namespaces, mTLS note - -Updated under "PostgreSQL Database": -- Explicit: "Create a dedicated database named carbide with a dedicated user named - carbide. Do not use the default postgres superuser." -- Added required extensions (btree_gin, pg_trgm) with psql command -- Link to site-setup.md for additional details - -### Updated: book/src/SUMMARY.md - -Added "End-to-End Installation Guide" as the first entry under Manuals. - -### Updated: README.md - -Added installation guide link in Getting Started section. Tweaked existing link -descriptions to be more specific. 
- -## Overlap with PR #479 - -Larry Chen's PR #479 ("docs: remove private repos") also modifies site-setup.md to -strip nvcr.io/nvidian prefixes. His change is narrower (just removes the prefix, -leaving bare image names). Our change is broader (replaces with YOUR_REGISTRY -placeholders and adds build-from-source links). Whoever merges second resolves the -conflict on that file. - -## How to review - -1. Start with `book/src/manuals/installation-guide.md` -- does the 10-step flow make - sense? Does it match what you'd actually do deploying Carbide? - -2. Check `helm/PREREQUISITES.md` -- are the Vault commands correct? Is the Temporal - section accurate (optional for core, required for REST)? - -3. Check `book/src/manuals/building_bmm_containers.md` -- is the image summary table - complete? Are the tag/push commands right? - -4. Check `book/src/manuals/site-setup.md` -- are the replacement image names and repo - links correct? - -## Source material - -- SA deployment notes for TLV01 (Chelsea Isaac's 10-step guide) -- Carbide Installation Walkthrough BYO K8s (~4000 line internal doc, sections 7.0-7.14) -- Customer questions from SMC/Rafay partner deployments -- GitHub issue #476 (nvcr.io references blocking customers) -- Slack threads from #carbide-sa-enablement and ext-rafay channels From f5672e0f230b3248eb25eb4de7075a0750495546 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Sun, 19 Apr 2026 22:31:10 -0500 Subject: [PATCH 04/13] docs: add NGC account prerequisite for container builds Signed-off-by: vigneshv --- book/src/manuals/building_nico_containers.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index ef2bb31d88..ba0439851a 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -29,6 +29,7 @@ Before you begin, ensure you have the following prerequisites: * An Ubuntu 24.04 Host or VM with 150GB+ of disk 
space (MacOS is not supported) * For REST containers: Go 1.25.4 or later, Docker 20.10+ with BuildKit enabled +* An [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) account (free). Required for pulling base images such as the DOCA HBN container used in the aarch64 / DPU BFB build. Sign up at [ngc.nvidia.com](https://ngc.nvidia.com) and generate an API key under **API Keys** > **Generate Personal Key**. Use the following steps to install the prerequisite software on the Ubuntu Host or VM. These instructions assume an `apt`-based distribution such as Ubuntu 24.04. @@ -123,6 +124,13 @@ BUILD_CONTAINER_X86_URL="nico-buildcontainer-x86_64" cargo make build-cli ### Building the DPU BFB +The BFB build automatically pulls the HBN container from `nvcr.io`. You must +authenticate with NGC before building: + +```sh +docker login nvcr.io -u '$oauthtoken' -p +``` + ```sh cargo make --cwd pxe --env SA_ENABLEMENT=1 build-boot-artifacts-bfb-sa From 6088910e6c490a58aa6d9a7b9fd8a8c61ceb1d76 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 07:47:47 -0500 Subject: [PATCH 05/13] docs: add missing REST images and repo link Signed-off-by: vigneshv --- book/src/manuals/installation-guide.md | 3 ++- book/src/manuals/pushing_containers.md | 15 ++++++++++++++- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index bb87c1f062..5998284471 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -55,7 +55,8 @@ TAG= make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ - carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager; do + carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager \ + carbide-rla carbide-psm carbide-nsm; do docker push "$REGISTRY/$image:$TAG" done ``` diff --git a/book/src/manuals/pushing_containers.md 
b/book/src/manuals/pushing_containers.md index 9a4893f4eb..0e2bab80d8 100644 --- a/book/src/manuals/pushing_containers.md +++ b/book/src/manuals/pushing_containers.md @@ -31,9 +31,22 @@ docker push $REGISTRY/machine-validation-config:$TAG ## Tag and Push REST Images +REST images are built from the +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) +repository. The `make docker-build` command tags images at build time when you pass +`IMAGE_REGISTRY` and `IMAGE_TAG`: + +```sh +cd /path/to/ncx-infra-controller-rest +make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG +``` + +Then push all REST images: + ```sh for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ - carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager; do + carbide-rest-site-agent carbide-rest-db carbide-rest-cert-manager \ + carbide-rla carbide-psm carbide-nsm; do docker push "$REGISTRY/$image:$TAG" done ``` From 89552b2a6779907685c8f0fe7cbd8dd68fabfa15 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 08:01:58 -0500 Subject: [PATCH 06/13] docs: fix PKI role config and add missing REST images Signed-off-by: vigneshv --- book/src/manuals/building_nico_containers.md | 3 +++ helm/PREREQUISITES.md | 12 ++++++++---- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index ba0439851a..371455c31d 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -160,6 +160,9 @@ make docker-build IMAGE_REGISTRY= IMAGE_TAG=< | `carbide-rest-site-agent` | On-site agent (elektra) | | `carbide-rest-db` | Database migration job (runs once per upgrade) | | `carbide-rest-cert-manager` | Native PKI certificate manager (credsmgr) | +| `carbide-rla` | Rack Level Abstraction service | +| `carbide-psm` | Power Shelf Manager service | +| `carbide-nsm` | NVSwitch Manager service | ## Next 
Steps diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index b8708a35ea..6a769676a5 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -37,10 +37,14 @@ vault secrets tune -max-lease-ttl=87600h forgeca ```bash vault write forgeca/roles/forge-cluster \ - allowed_domains="forge.local,svc.cluster.local" \ - allow_subdomains=true \ - allow_bare_domains=true \ - max_ttl=8760h + allow_any_name=true \ + allowed_uri_sans="spiffe://*" \ + max_ttl=720h \ + ttl=720h \ + key_type=ec \ + key_bits=256 \ + require_cn=false \ + use_csr_common_name=true ``` - **Kubernetes auth** must be enabled with a role for the **cert-manager service account**, so the `vault-forge-issuer` ClusterIssuer (Section 5) can authenticate to Vault: From 6d0d64755210145efb268ce72f33cb8c806b010f Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 08:10:54 -0500 Subject: [PATCH 07/13] docs: fix REST repo references and installation guide accuracy Signed-off-by: vigneshv --- book/src/manuals/building_nico_containers.md | 4 +- book/src/manuals/installation-guide.md | 198 +++++++++---------- helm/PREREQUISITES.md | 8 +- 3 files changed, 105 insertions(+), 105 deletions(-) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index 371455c31d..0c270db485 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -143,10 +143,10 @@ docker build --build-arg "CONTAINER_RUNTIME_AARCH64=alpine:latest" -t boot-artif The REST components (cloud-api, cloud-workflow, site-manager, site-agent, db migrations, cert-manager) are built from the -[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) repository. +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repository. 
```sh -cd bare-metal-manager-rest +cd ncx-infra-controller-rest make docker-build IMAGE_REGISTRY= IMAGE_TAG= ``` diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index 5998284471..1d0f836db3 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -11,14 +11,14 @@ and SA teams during production deployments. | Step | What | Where to find details | |------|------|----------------------| -| 1 | [Build and push all container images](#1-build-and-push-containers) | [Building NICo Containers](building_nico_containers.md), [REST README](https://github.com/NVIDIA/bare-metal-manager-rest#building-docker-images) | +| 1 | [Build and push all container images](#1-build-and-push-containers) | [Building NICo Containers](building_nico_containers.md), [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | | 2 | [Provision site controller OS and Kubernetes](#2-site-controller-and-kubernetes) | [Site Reference Architecture](site-reference-arch.md) | | 3 | [Deploy foundation services](#3-foundation-services) | [Site Setup](site-setup.md), [helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) | -| 4 | [Deploy site CA, credsmgr, and Temporal](#4-site-ca-credsmgr-and-temporal) | This guide | -| 5 | [Deploy Carbide REST / cloud components](#5-deploy-carbide-rest-components) | This guide, [REST repo](https://github.com/NVIDIA/bare-metal-manager-rest) | +| 4 | [Deploy site CA, credsmgr, and Temporal](#4-site-ca-credsmgr-and-temporal) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | +| 5 | [Deploy Carbide REST / cloud components](#5-deploy-carbide-rest-components) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | | 6 | [Deploy Carbide core](#6-deploy-carbide-core) | [Helm README](../../helm/README.md), [deploy/README.md](../../deploy/README.md) | | 7 | [Install admin-cli](#7-install-admin-cli) | This guide | -| 8 | [Deploy Elektra 
site agent](#8-deploy-elektra-site-agent) | This guide | +| 8 | [Deploy Elektra site agent](#8-deploy-elektra-site-agent) | This guide, [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) | | 9 | [Ingest managed hosts](#9-ingest-hosts) | [Ingesting Hosts](ingesting_machines.md) | | 10 | [Verify end-to-end](#10-verification) | This guide | @@ -45,7 +45,7 @@ registry, and a summary table of all images produced. ### NICo REST -Clone [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) +Clone [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) and build with: ```bash @@ -61,7 +61,7 @@ for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ done ``` -See the [bare-metal-manager-rest README](https://github.com/NVIDIA/bare-metal-manager-rest#building-docker-images) +See the [ncx-infra-controller-rest README](https://github.com/NVIDIA/ncx-infra-controller-rest#building-docker-images) for the full list of images and build options. --- @@ -127,56 +127,50 @@ These Vault configuration steps are documented in detail in This step sets up the certificate infrastructure that both the REST / cloud components and Temporal depend on. -### 4.1 Create Site CA Secrets +### 4.1 Create Site CA Secret -Create root CA secrets in the `cert-manager` namespace: +Generate a root CA and create the `ca-signing-secret` used by the +`carbide-rest-ca-issuer` ClusterIssuer and credsmgr. From the +`ncx-infra-controller-rest` repository: ```bash -kubectl -n cert-manager create secret generic vault-root-ca-certificate \ - --from-file=certificate=./cacert.pem -kubectl -n cert-manager create secret generic vault-root-ca-private-key \ - --from-file=privatekey=./ca.key +./scripts/gen-site-ca.sh ``` -If you need to generate a self-signed root CA for testing: +This creates a `kubernetes.io/tls` secret named `ca-signing-secret` in both the +`carbide-rest` and `cert-manager` namespaces. 
Run `./scripts/gen-site-ca.sh --help` +for options (custom CN, output to disk, dry-run). -```bash -openssl req -x509 -newkey rsa:4096 -keyout ca.key -out cacert.pem \ - -sha256 -days 3650 -nodes -subj "/CN=Carbide Root CA" -``` - -### 4.2 Deploy cloud-cert-manager (credsmgr) +### 4.2 Create carbide-rest-ca-issuer and deploy credsmgr -credsmgr runs an embedded Vault process and creates the `vault-issuer` ClusterIssuer -used for Temporal TLS certificates and cloud component mTLS. - -From the `bare-metal-manager-rest` repository, update images in -`deploy/kustomize/base/cert-manager/kustomization.yaml` to point at your registry, -then: +Create the `carbide-rest-ca-issuer` ClusterIssuer (backed by `ca-signing-secret` +from Step 4.1) and deploy credsmgr. From the `ncx-infra-controller-rest` repository: ```bash +kubectl apply -k deploy/kustomize/base/cert-manager-io kubectl apply -k deploy/kustomize/base/cert-manager -kubectl get clusterissuer vault-issuer +kubectl get clusterissuer carbide-rest-ca-issuer ``` -Verify the `vault-issuer` shows `Ready=True` before proceeding. +Verify `carbide-rest-ca-issuer` shows `Ready=True` before proceeding. ### 4.3 Provision Temporal TLS Certificates -Apply Temporal certificate manifests (client certs for `cloud-api` and `cloud-workflow`, -server certs for the `temporal` namespace). These manifests are in the -`bare-metal-manager-rest` repository under `deploy/kustomize/base/temporal-certs`: +Apply the Temporal namespace, database credentials, and mTLS certificate manifests. 
+From the `ncx-infra-controller-rest` repository: ```bash -kubectl apply -k deploy/kustomize/base/temporal-certs +kubectl apply -f deploy/kustomize/base/temporal-helm/namespace.yaml +kubectl apply -f deploy/kustomize/base/temporal-helm/db-creds.yaml +kubectl apply -k deploy/kustomize/base/common ``` -Verify: +Verify the mTLS certificates are issued: ```bash -kubectl -n cloud-api get certificate temporal-client-cloud-certs -kubectl -n cloud-workflow get certificate temporal-client-cloud-certs -kubectl -n temporal get secret server-cloud-certs server-interservice-certs server-site-certs +kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s +kubectl wait --for=condition=Ready certificate/server-cloud-cert -n temporal --timeout=120s +kubectl wait --for=condition=Ready certificate/server-site-cert -n temporal --timeout=120s ``` ### 4.4 Deploy Temporal @@ -184,11 +178,25 @@ kubectl -n temporal get secret server-cloud-certs server-interservice-cer Deploy Temporal server v1.22.6 with Elasticsearch 7.17.3 for visibility. Use the TLS certificates provisioned above for mTLS. 
-After all Temporal pods are `Running`, register the required namespaces: +After all Temporal pods are `Running`, register the required namespaces via +`temporal-admintools`: ```bash -tctl --ns cloud namespace register -tctl --ns site namespace register +kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create --namespace cloud \ + --address temporal-frontend.temporal:7233 \ + --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ + --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ + --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ + --tls-server-name interservice.server.temporal.local + +kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create --namespace site \ + --address temporal-frontend.temporal:7233 \ + --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ + --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ + --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ + --tls-server-name interservice.server.temporal.local ``` ```{note} @@ -203,55 +211,32 @@ healthy, or create the index manually. The REST / cloud layer provides the customer-facing API, workflow orchestration, and site management. Deploy from the -[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) repository. - -For each component below, update the image reference in `kustomization.yaml` to -your registry and adjust ConfigMaps for your Postgres, Temporal, and Vault endpoints. +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repository. -### 5.1 Database Migrations (cloud-db) - -Initializes the cloud database schema. 
This is a one-time job: +All REST components deploy into the `carbide-rest` namespace via a single Helm +umbrella chart: ```bash -kubectl apply -k deploy/kustomize/base/db -kubectl -n cloud-db get jobs -w +helm upgrade --install carbide-rest helm/charts/carbide-rest \ + --namespace carbide-rest --create-namespace \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --timeout 600s --wait ``` -Wait for the job to complete before proceeding. - -### 5.2 cloud-workflow - -Deploys `cloud-worker` and `site-worker` Temporal workers: - -```bash -kubectl apply -k deploy/kustomize/base/cloud-workflow -kubectl -n cloud-workflow get pods -``` +This deploys: `carbide-rest-api`, `carbide-rest-workflow` (cloud-worker and +site-worker), `carbide-rest-site-manager`, `carbide-rest-db` (migration job), +`carbide-rest-cert-manager` (credsmgr), and Keycloak (dev IdP). -Both deployments should reach `Running`. - -### 5.3 cloud-api - -The customer-facing REST API: - -```bash -kubectl apply -k deploy/kustomize/base/cloud-api -kubectl -n cloud-api get pods -``` - -### 5.4 cloud-site-manager - -The site registry service: +Verify: ```bash -kubectl apply -k deploy/kustomize/base/site-manager +kubectl get pods -n carbide-rest ``` -```{note} -If `carbide-rest-site-manager` fails with `unable to start container process`, the -entrypoint in `deployment.yaml` does not match the production Dockerfile. Update -`deployment.yaml` to use the correct binary path. -``` +All deployments should reach `Running` and the db-migration job should show +`Completed`. --- @@ -344,41 +329,55 @@ admin-cli -c https://localhost:1079 site info ## 8. Deploy Elektra Site Agent Elektra bridges the on-site Carbide core to the cloud REST layer via Temporal. +It deploys as a StatefulSet in the `carbide-rest` namespace. -1. Register a site through cloud-api or cloud-site-manager to get a ``. - -2. Register the per-site Temporal namespace: +1. 
Pre-apply the gRPC client certificate so it exists before the pod starts: ```bash -tctl --ns namespace register +helm template carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --show-only templates/certificate.yaml | kubectl apply -f - + +kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \ + -n carbide-rest --timeout=120s ``` -3. Generate an OTP for the site agent and create the bootstrap secret. The OTP is - issued by `cloud-site-manager` and stored as a Kubernetes secret in the - `elektra-site-agent` namespace: +2. Create the per-site Temporal namespace (the site-agent panics without it): ```bash -# Issue a one-time password for the site -OTP=$(curl -s -X POST https:///api/v1/sites//otp \ - -H "Authorization: Bearer " | jq -r '.otp') - -kubectl -n elektra-site-agent create secret generic site-agent-bootstrap \ - --from-literal=SITE_UUID= \ - --from-literal=OTP="$OTP" \ - --from-literal=CLOUD_API_ENDPOINT=https:// +SITE_UUID= + +kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create --namespace "$SITE_UUID" \ + --address temporal-frontend.temporal:7233 \ + --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ + --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ + --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ + --tls-server-name interservice.server.temporal.local ``` -4. Update the image and site config in the site-agent manifests, then apply: +3. 
Install the site-agent Helm chart (the pre-install hook registers the site + and creates the `site-registration` secret): ```bash -kubectl apply -k deploy/kustomize/base/site-agent +helm upgrade --install carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --set "envConfig.CLUSTER_ID=$SITE_UUID" \ + --set "envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=$SITE_UUID" \ + --timeout 300s --wait ``` -5. Verify: +4. Verify: ```bash -kubectl -n elektra-site-agent get pods -kubectl -n elektra-site-agent logs -l app=elektra --tail=50 +kubectl get pods -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent +kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --tail=20 ``` --- @@ -459,8 +458,9 @@ Use port-forwarding: `kubectl port-forward svc/carbide-api 1079:1079 -n forge-sy ### carbide-rest-site-manager Fails to Start -`unable to start container process` -- entrypoint mismatch between `deployment.yaml` -and the Dockerfile. Update to the correct binary path. +`unable to start container process` -- verify the image was built with the production +Dockerfile (`docker/production/Dockerfile.carbide-rest-site-manager`), not the local +dev Dockerfile. ### Pods Stuck in ImagePullBackOff diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index 6a769676a5..28b52303a9 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -111,17 +111,17 @@ For additional PostgreSQL configuration details (TLS, ESO integration, per-names --- -## 2a. Temporal (Required for bare-metal-manager-rest only) +## 2a. Temporal (Required for ncx-infra-controller-rest only) Temporal is **not required** by the Carbide core Helm chart. You can operate Carbide core standalone using `admin-cli` with direct gRPC commands. 
Temporal **is required** if you deploy the -[bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest) layer +[ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) layer (cloud-api, cloud-workflow, site-manager, elektra-site-agent). The REST components use Temporal for workflow orchestration between the cloud control plane and site agents. -If you plan to deploy bare-metal-manager-rest: +If you plan to deploy ncx-infra-controller-rest: - **Reference version:** Temporal server v1.22.6, admin tools v1.22.4, UI v2.16.2 - **Visibility store:** Elasticsearch 7.17.3 @@ -137,7 +137,7 @@ tctl --ns site namespace register - **mTLS:** The REST components expect Temporal client TLS certificates. These are issued by the `vault-issuer` ClusterIssuer created by cloud-cert-manager (credsmgr), - which is part of bare-metal-manager-rest. See the + which is part of ncx-infra-controller-rest. See the [End-to-End Installation Guide](book/src/manuals/installation-guide.md) for the full deployment order. From 2b163afa47fb255581c0949cf63aa930fb155125 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 08:14:15 -0500 Subject: [PATCH 08/13] docs: fix Temporal cert source and Keycloak deployment in install guide Signed-off-by: vigneshv --- book/src/manuals/installation-guide.md | 27 ++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index 1d0f836db3..0af5442521 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -156,16 +156,24 @@ Verify `carbide-rest-ca-issuer` shows `Ready=True` before proceeding. ### 4.3 Provision Temporal TLS Certificates -Apply the Temporal namespace, database credentials, and mTLS certificate manifests. -From the `ncx-infra-controller-rest` repository: +Apply the Temporal namespace, database credentials, and mTLS server certificate +manifests. 
From the `ncx-infra-controller-rest` repository: + +```bash +kubectl apply -k deploy/kustomize/base/temporal-helm +``` + +This creates the `temporal` namespace, database credentials, and three server +mTLS certificates (`server-interservice-cert`, `server-cloud-cert`, +`server-site-cert`) issued by `carbide-rest-ca-issuer`. + +Then apply the common resources (Temporal client certs for the REST workers): ```bash -kubectl apply -f deploy/kustomize/base/temporal-helm/namespace.yaml -kubectl apply -f deploy/kustomize/base/temporal-helm/db-creds.yaml kubectl apply -k deploy/kustomize/base/common ``` -Verify the mTLS certificates are issued: +Verify the server certificates are issued: ```bash kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s @@ -227,7 +235,14 @@ helm upgrade --install carbide-rest helm/charts/carbide-rest \ This deploys: `carbide-rest-api`, `carbide-rest-workflow` (cloud-worker and site-worker), `carbide-rest-site-manager`, `carbide-rest-db` (migration job), -`carbide-rest-cert-manager` (credsmgr), and Keycloak (dev IdP). +and `carbide-rest-cert-manager` (credsmgr). 
+ +If you need a dev IdP, deploy Keycloak separately before the umbrella chart: + +```bash +kubectl apply -k deploy/kustomize/base/keycloak -n carbide-rest +kubectl rollout status deployment/keycloak -n carbide-rest --timeout=300s +``` Verify: From eb3c5e7d26eb00cbb486455cdc5af37b26d46725 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 08:26:00 -0500 Subject: [PATCH 09/13] docs: address review findings across PR 512 Signed-off-by: vigneshv --- book/src/manuals/building_nico_containers.md | 2 +- book/src/manuals/installation-guide.md | 38 +++++++++----------- helm/PREREQUISITES.md | 12 +++---- 3 files changed, 23 insertions(+), 29 deletions(-) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index 0c270db485..5b6b5c6725 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -28,7 +28,7 @@ accessible by your Kubernetes cluster. Before you begin, ensure you have the following prerequisites: * An Ubuntu 24.04 Host or VM with 150GB+ of disk space (MacOS is not supported) -* For REST containers: Go 1.25.4 or later, Docker 20.10+ with BuildKit enabled +* For REST containers: Go (see `go.mod` in the REST repo for the required version), Docker 20.10+ with BuildKit enabled * An [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) account (free). Required for pulling base images such as the DOCA HBN container used in the aarch64 / DPU BFB build. Sign up at [ngc.nvidia.com](https://ngc.nvidia.com) and generate an API key under **API Keys** > **Generate Personal Key**. Use the following steps to install the prerequisite software on the Ubuntu Host or VM. 
These instructions diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index 0af5442521..aba251b1fa 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -191,22 +191,16 @@ After all Temporal pods are `Running`, register the required namespaces via ```bash kubectl exec -n temporal deploy/temporal-admintools -- \ - temporal operator namespace create --namespace cloud \ - --address temporal-frontend.temporal:7233 \ - --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local + temporal operator namespace create cloud --address temporal-frontend.temporal:7233 kubectl exec -n temporal deploy/temporal-admintools -- \ - temporal operator namespace create --namespace site \ - --address temporal-frontend.temporal:7233 \ - --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local + temporal operator namespace create site --address temporal-frontend.temporal:7233 ``` +If your Temporal deployment uses mTLS, add the TLS flags to each command: +`--tls-cert-path`, `--tls-key-path`, `--tls-ca-path`, `--tls-server-name`. +See `helm-prereqs/SETUP_PHASES.md` for the full mTLS example. + ```{note} If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready. Check `kubectl -n temporal logs elasticsearch-master-0` and wait for ES to become @@ -240,7 +234,7 @@ and `carbide-rest-cert-manager` (credsmgr). 
If you need a dev IdP, deploy Keycloak separately before the umbrella chart: ```bash -kubectl apply -k deploy/kustomize/base/keycloak -n carbide-rest +(cd && kubectl apply -k deploy/kustomize/base/keycloak) kubectl rollout status deployment/keycloak -n carbide-rest --timeout=300s ``` @@ -326,17 +320,17 @@ Build from source in the `ncx-infra-controller-core` repository: cargo make build-cli ``` -The binary is at `target/release/admin-cli`. Point it at your API: +The binary is at `target/release/carbide-admin-cli`. Point it at your API: ```bash -admin-cli -c https://api-. site info +carbide-admin-cli -c https://api-. site info ``` If the API is not externally reachable: ```bash kubectl port-forward svc/carbide-api 1079:1079 -n forge-system & -admin-cli -c https://localhost:1079 site info +carbide-admin-cli -c https://localhost:1079 site info ``` --- @@ -406,14 +400,14 @@ For each managed host, you need the **BMC MAC address**, **chassis serial number ```bash # Set desired credentials NICo will apply to all hosts -admin-cli -c credential add-bmc --kind=site-wide-root --password='' -admin-cli -c credential add-uefi --kind=host --password='' +carbide-admin-cli -c credential add-bmc --kind=site-wide-root --password='' +carbide-admin-cli -c credential add-uefi --kind=host --password='' # Upload expected machines manifest -admin-cli -c credential em replace-all --filename expected_machines.json +carbide-admin-cli -c expected-machine replace-all --filename expected_machines.json # Approve for measured boot ingestion -admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6" +carbide-admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6" ``` NICo then automatically: assigns IPs via DHCP, discovers BMCs via Redfish, rotates @@ -423,7 +417,7 @@ moves machines to the `Available` pool. 
Monitor progress: ```bash -admin-cli -c machine list +carbide-admin-cli -c machine list ``` --- @@ -440,7 +434,7 @@ kubectl -n forge-system get pods curl -k https://:1079/healthz # Machines discovered and available -admin-cli -c machine list +carbide-admin-cli -c machine list # Admin UI accessible # https://api-./admin diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index 28b52303a9..1730a59977 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -75,7 +75,7 @@ EOF - The `VAULT_SERVICE` URL must be provided to the cluster via a ConfigMap (see Section 4). -For additional Vault configuration details, see the [Site Setup guide](book/src/manuals/site-setup.md#vault-pki-and-secrets). +For additional Vault configuration details, see the [Site Setup guide](../book/src/manuals/site-setup.md#vault-pki-and-secrets). ### External Secrets Operator (Optional) @@ -107,7 +107,7 @@ psql "postgres://:@: Date: Mon, 20 Apr 2026 08:28:20 -0500 Subject: [PATCH 10/13] docs: align step 8 Temporal namespace with step 4.4 pattern Signed-off-by: vigneshv --- book/src/manuals/installation-guide.md | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index aba251b1fa..31dd83c052 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -360,14 +360,11 @@ kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs SITE_UUID= kubectl exec -n temporal deploy/temporal-admintools -- \ - temporal operator namespace create --namespace "$SITE_UUID" \ - --address temporal-frontend.temporal:7233 \ - --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local + temporal operator namespace create 
"$SITE_UUID" --address temporal-frontend.temporal:7233 ``` +If your Temporal deployment uses mTLS, add the TLS flags as described in Step 4.4. + 3. Install the site-agent Helm chart (the pre-install hook registers the site and creates the `site-registration` secret): From 82347d1c97c013438770a4c36b738e3b543771f1 Mon Sep 17 00:00:00 2001 From: vigneshv Date: Mon, 20 Apr 2026 08:37:56 -0500 Subject: [PATCH 11/13] docs: fix vault token key, ssh public key, healthz route, KV engine step Signed-off-by: vigneshv --- book/src/manuals/installation-guide.md | 6 +++--- helm/PREREQUISITES.md | 14 ++++++++++++-- 2 files changed, 15 insertions(+), 5 deletions(-) diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index 31dd83c052..d3242830be 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -300,14 +300,14 @@ kustomize build . --enable-helm --enable-alpha-plugins --enable-exec | kubectl a ### Verify the API ```bash -curl -k https://:1079/healthz +curl -k https://:1079/ ``` If the API VIP is not externally reachable: ```bash kubectl port-forward svc/carbide-api 1079:1079 -n forge-system -curl -k https://localhost:1079/healthz +curl -k https://localhost:1079/ ``` --- @@ -428,7 +428,7 @@ Once hosts are `Available`, verify the full deployment: kubectl -n forge-system get pods # API healthy -curl -k https://:1079/healthz +curl -k https://:1079/ # Machines discovered and available carbide-admin-cli -c machine list diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index 1730a59977..d10e406cf7 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -220,13 +220,22 @@ The `token` field in the output is your `VAULT_TOKEN`. 
Create the Kubernetes sec ```bash kubectl create secret generic carbide-vault-token \ --namespace forge-system \ - --from-literal=VAULT_TOKEN='' + --from-literal=token='' ``` **Note:** The policies referenced above (`forge-pki-policy`, `forge-kv-policy`) must be created first. See the [Vault section](#hashicorp-vault) above for the PKI policy. For the KV policy: +Enable the KV v2 secrets engine at the `secrets` mount path (must match +`FORGE_VAULT_MOUNT` in the `vault-cluster-info` ConfigMap): + +```bash +vault secrets enable -version=2 -path=secrets kv +``` + +Then create the policy: + ```bash vault policy write forge-kv-policy - < Date: Mon, 20 Apr 2026 09:29:29 -0500 Subject: [PATCH 12/13] docs: clean up REST image descriptions and DB secret key docs Signed-off-by: vigneshv --- book/src/manuals/building_nico_containers.md | 8 ++++---- helm/PREREQUISITES.md | 4 +++- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index 5b6b5c6725..b1e679f346 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -155,11 +155,11 @@ make docker-build IMAGE_REGISTRY= IMAGE_TAG=< | Image | Purpose | |-------|---------| | `carbide-rest-api` | REST API server (port 8388) | -| `carbide-rest-workflow` | Temporal workflow worker (cloud-worker, site-worker) | -| `carbide-rest-site-manager` | Site management / registry service | -| `carbide-rest-site-agent` | On-site agent (elektra) | +| `carbide-rest-workflow` | Temporal workflow worker | +| `carbide-rest-site-manager` | Site management and registry service | +| `carbide-rest-site-agent` | On-site Temporal agent | | `carbide-rest-db` | Database migration job (runs once per upgrade) | -| `carbide-rest-cert-manager` | Native PKI certificate manager (credsmgr) | +| `carbide-rest-cert-manager` | PKI certificate manager | | `carbide-rla` | Rack Level Abstraction service | | 
`carbide-psm` | Power Shelf Manager service | | `carbide-nsm` | NVSwitch Manager service | diff --git a/helm/PREREQUISITES.md b/helm/PREREQUISITES.md index d10e406cf7..1db7ad1990 100644 --- a/helm/PREREQUISITES.md +++ b/helm/PREREQUISITES.md @@ -151,7 +151,9 @@ All secrets should be created in the `forge-system` namespace (or whichever name Database credentials for `carbide-api`. -**Required keys:** `username`, `password`, `host`, `port`, `dbname`, `uri` +**Required keys:** `username`, `password` + +The Helm chart reads only `username` and `password` from this secret; connection host, port, and database name come from the `forge-system-carbide-database-config` ConfigMap (Section 4). The additional keys below (`host`, `port`, `dbname`, `uri`) are optional conveniences for manual `psql` access or ESO integration. ```bash kubectl create secret generic forge-system.carbide.forge-pg-cluster.credentials \ From 48225a1988094f78ed87d034cbf6f9b01644660d Mon Sep 17 00:00:00 2001 From: Peter Gambrill Date: Mon, 20 Apr 2026 18:12:53 -0700 Subject: [PATCH 13/13] Copyedits and rebase fixes for end-to-end install changes -e Signed-off-by: Peter Gambrill --- book/src/SUMMARY.md | 15 +- book/src/manuals/building_nico_containers.md | 10 +- book/src/manuals/installation-guide.md | 244 +++++++++---------- book/src/manuals/pushing_containers.md | 15 +- book/src/manuals/site-setup.md | 37 +-- 5 files changed, 162 insertions(+), 159 deletions(-) diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index d996d29c8d..7c54009ddc 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -1,7 +1,7 @@ -# NVIDIA Bare Metal Manager +# NCX Infra Controller - [Introduction](README.md) -- [Hardware Compatbility List](hcl.md) +- [Hardware Compatibility List](hcl.md) - [Release Notes](release-notes.md) - [FAQs](faq.md) @@ -11,6 +11,7 @@ - [Redfish Workflow](architecture/redfish_workflow.md) - [Redfish Endpoints Reference](architecture/redfish/endpoints_reference.md) - [Reliable state 
handling](architecture/state_handling.md) +- [Networking integrations](architecture/networking_integrations.md) - [DPU configuration](architecture/dpu_configuration.md) - [Health checks and health aggregation](architecture/health_aggregation.md) - [Health probe IDs](architecture/health/health_probe_ids.md) @@ -40,17 +41,13 @@ - [VPC Routing Profiles](manuals/vpc/vpc_routing_profiles.md) - [VPC Peering](manuals/vpc/vpc_peering_management.md) - [Metrics]() - - [Core metrics](manuals/metrics/carbide_core_metrics.md) - -# Sites and site access - -- [carbide-admin-cli access](sites/forge_admin_cli.md) + - [Core metrics](manuals/metrics/core_metrics.md) # Design -- [SPIFFE SVID Design](design/spiffe-svid-sdd.md) +- [SPIFFE SVID Design](design/machine-identity/spiffe-svid-sdd.md) # Development @@ -72,7 +69,7 @@ # Playbooks -- [Azure OIDC for NVIDIA Bare Metal Manager-Web UI](playbooks/carbide_web_oauth2.md) +- [Azure OIDC for NCX Infra Controller-Web UI](playbooks/carbide_web_oauth2.md) - [Force deleting and rebuilding Forge hosts](playbooks/force_delete.md) - [Rebooting a machine](playbooks/machine_reboot.md) - [Instance/Subnet/etc is stuck in a state]() diff --git a/book/src/manuals/building_nico_containers.md b/book/src/manuals/building_nico_containers.md index b1e679f346..f5644b1206 100644 --- a/book/src/manuals/building_nico_containers.md +++ b/book/src/manuals/building_nico_containers.md @@ -1,7 +1,7 @@ # Building NICo Containers This section provides instructions for building the containers for NCX Infra Controller (NICo). -For the complete deployment workflow, see the [End-to-End Installation Guide](installation-guide.md). +For the complete deployment workflow, refer to the [End-to-End Installation Guide](installation-guide.md). ## Container Image Summary @@ -28,8 +28,8 @@ accessible by your Kubernetes cluster. 
Before you begin, ensure you have the following prerequisites: * An Ubuntu 24.04 Host or VM with 150GB+ of disk space (MacOS is not supported) -* For REST containers: Go (see `go.mod` in the REST repo for the required version), Docker 20.10+ with BuildKit enabled -* An [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) account (free). Required for pulling base images such as the DOCA HBN container used in the aarch64 / DPU BFB build. Sign up at [ngc.nvidia.com](https://ngc.nvidia.com) and generate an API key under **API Keys** > **Generate Personal Key**. +* For REST containers: Go (refer to the `go.mod` file in the [REST repo](https://github.com/NVIDIA/ncx-infra-controller-rest) for the current required version), Docker 20.10+ with BuildKit enabled +* An [NVIDIA NGC](https://www.nvidia.com/en-us/gpu-cloud/) account (free). Required for pulling base images such as the DOCA HBN container used in the aarch64/DPU BFB build. Sign up at [ngc.nvidia.com](https://ngc.nvidia.com) and generate an API key under **API Keys** > **Generate Personal Key**. Use the following steps to install the prerequisite software on the Ubuntu Host or VM. These instructions assume an `apt`-based distribution such as Ubuntu 24.04. @@ -166,5 +166,5 @@ make docker-build IMAGE_REGISTRY= IMAGE_TAG=< ## Next Steps -After building all images, tag and push them to your private registry. -See [Tagging and Pushing Containers](pushing_containers.md). +After building all images, you will need to tag them and push them to your private registry. +Refer to the [Tagging and Pushing Containers](pushing_containers.md) section for more details. 
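
The tag-and-push step mentioned above can be sketched as two small shell helpers. This is a minimal sketch, not the procedure from the pushing guide: the function names `image_ref` and `tag_and_push` are hypothetical, and it assumes the local image is named `<name>:<tag>` and that `docker login` against your registry has already succeeded.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the function names are illustrative, not part of NICo.

# Compose a fully qualified image reference: <registry>/<name>:<tag>.
# A single trailing slash on the registry path is tolerated.
image_ref() {
  local registry="$1" name="$2" tag="$3"
  printf '%s/%s:%s\n' "${registry%/}" "$name" "$tag"
}

# Tag a locally built image and push it to the private registry.
# Assumption: the local image is named <name>:<tag>.
tag_and_push() {
  local registry="$1" name="$2" tag="$3" ref
  ref="$(image_ref "$registry" "$name" "$tag")"
  docker tag "${name}:${tag}" "$ref" && docker push "$ref"
}
```

For example, `tag_and_push registry.example.com/nico carbide-rest-api v1.0.0` would tag and push one image; the registry path and tag here are placeholders, not values from this repository.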
diff --git a/book/src/manuals/installation-guide.md b/book/src/manuals/installation-guide.md index d3242830be..779b169fe9 100644 --- a/book/src/manuals/installation-guide.md +++ b/book/src/manuals/installation-guide.md @@ -4,8 +4,8 @@ This guide ties together the build, deploy, and configuration steps needed to go a ready Kubernetes cluster to your first provisioned bare-metal host. It links to existing documentation for each major step and fills the gaps between them. -The order of operations below follows the sequence validated by NVIDIA engineering -and SA teams during production deployments. +The order of operations below has been validated by NVIDIA engineering +and SA teams for production deployments. ## Order of Operations @@ -26,7 +26,7 @@ and SA teams during production deployments. ## 1. Build and Push Containers -All container images must be built from source and pushed to a registry your cluster +All container images must be built from source and pushed to a registry that your cluster can access. There are no pre-built public images available. ```{note} @@ -37,16 +37,15 @@ registry paths after building from source. ### NICo Core -Follow the [Building NICo Containers](building_nico_containers.md) guide for build steps, -then [Tagging and Pushing Containers](pushing_containers.md) to push images to your -private registry. It covers -prerequisites, build steps for x86_64 and aarch64, tagging, pushing to a private +Follow the [Building NICo Containers](building_nico_containers.md) guide to build the container images, +then follow the [Tagging and Pushing Containers](pushing_containers.md) guide to push the images to your +private registry. These sections cover prerequisites, build steps for x86_64 and aarch64, tagging, pushing to a private registry, and a summary table of all images produced. 
### NICo REST -Clone [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) -and build with: +Clone the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo and build the container images +as follows: ```bash REGISTRY= @@ -61,77 +60,73 @@ for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ done ``` -See the [ncx-infra-controller-rest README](https://github.com/NVIDIA/ncx-infra-controller-rest#building-docker-images) +Refer to the [ncx-infra-controller-rest README](https://github.com/NVIDIA/ncx-infra-controller-rest#building-docker-images) for the full list of images and build options. --- ## 2. Site Controller and Kubernetes -Customers are expected to provision their own site controller OS and Kubernetes cluster. +You will need to provision your own site controller OS and Kubernetes cluster. -See the [Site Reference Architecture](site-reference-arch.md) for hardware requirements, -Kubernetes versions, networking best practices, and IP pool sizing. +Refer to the [Site Reference Architecture](site-reference-arch.md) section for hardware requirements, +Kubernetes versions, networking best practices, and IP pool sizing recommendations. -In summary, you need: +In summary, you will need the following: * 3 or 5 site controller nodes running Ubuntu 24.04 LTS with Kubernetes v1.30.x * CNI (Calico v3.28.1 validated), ingress controller (Contour), load balancer (MetalLB) * OOB switch VLANs with DHCP relay pointing at the Carbide DHCP service VIP -* In-band ToR switches with BGP unnumbered on DPU-facing ports, EVPN enabled -* IP pools allocated per the reference architecture +* In-band ToR switches with BGP unnumbered on DPU-facing ports, with EVPN enabled +* IP pools allocated per the Site Reference Architecture recommendations --- ## 3. Foundation Services -Deploy the following services before any Carbide components. The order within this -step matters. 
+Deploy the following services before any Carbide components. -**For baselines and versions**, see [Site Setup](site-setup.md). +* *For baselines and versions*, refer to the [Site Setup](site-setup.md) section. -**For the Secrets, ConfigMaps, and ClusterIssuer** that the Helm chart expects, see -[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) -- it provides `kubectl create` -commands for every required resource. +* *For the Secrets, ConfigMaps, and ClusterIssuer* that the Helm chart expects, refer to +the [helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md) +file, which provides the `kubectl create` commands for every required resource. -Deploy in this order: +Deploy the services in this order: -1. **External Secrets Operator (ESO)** -- optional, but simplifies secret management. - If you skip ESO, create all Kubernetes Secrets manually. +1. **External Secrets Operator (ESO)**: This service is optional, but simplifies secret management. + If you skip ESO, you will need to create all Kubernetes Secrets manually. -2. **cert-manager** (v1.11.1+) with approver-policy (v0.6.3). Create the - `vault-forge-issuer` ClusterIssuer as described in - [helm/PREREQUISITES.md](../../helm/PREREQUISITES.md#5-clusterissuer). +2. **cert-manager** (v1.11.1+) with approver-policy (v0.6.3): Create the + `vault-forge-issuer` ClusterIssuer as described in the + [/helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md#5-clusterissuer). -3. **PostgreSQL** -- SSL-enabled, with required extensions: +3. **PostgreSQL**: SSL-enabled, with extensions. 
Create the required extensions using the following command: -```bash -psql "postgres://:@:/?sslmode=require" \ - -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \ - -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;' -``` + ```bash + psql "postgres://:@:/?sslmode=require" \ + -c 'CREATE EXTENSION IF NOT EXISTS btree_gin;' \ + -c 'CREATE EXTENSION IF NOT EXISTS pg_trgm;' + ``` -4. **Vault** -- deployed and unsealed, with: - * PKI secrets engine at mount path **`forgeca`** - * PKI role named **`forge-cluster`** +4. **Vault**: Deployed and unsealed, with the following configuration: + * PKI secrets engine at mount path `forgeca` + * PKI role named `forge-cluster` * Kubernetes auth enabled with a role for the cert-manager service account - * Vault policy granting sign/issue capabilities - -These Vault configuration steps are documented in detail in -[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md#hashicorp-vault). + * Vault policy granting sign/issue capabilities (Refer to the [Site Setup](site-setup.md#vault-pki-and-secrets) section for more details). --- ## 4. Site CA, credsmgr, and Temporal -This step sets up the certificate infrastructure that both the REST / cloud components +Next, set up the certificate infrastructure that both the REST cloud components and Temporal depend on. ### 4.1 Create Site CA Secret Generate a root CA and create the `ca-signing-secret` used by the -`carbide-rest-ca-issuer` ClusterIssuer and credsmgr. From the -`ncx-infra-controller-rest` repository: +`carbide-rest-ca-issuer` ClusterIssuer and credsmgr. Run the following command +from the `ncx-infra-controller-rest` repository: ```bash ./scripts/gen-site-ca.sh @@ -141,10 +136,11 @@ This creates a `kubernetes.io/tls` secret named `ca-signing-secret` in both the `carbide-rest` and `cert-manager` namespaces. Run `./scripts/gen-site-ca.sh --help` for options (custom CN, output to disk, dry-run). 
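
Before moving on, it may help to confirm that the secret landed in both namespaces with the expected type. The following is a minimal sketch under the assumptions stated in the text above (secret name `ca-signing-secret`, type `kubernetes.io/tls`, namespaces `carbide-rest` and `cert-manager`); the function name `check_ca_secret` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper; the function name is illustrative.
# Reports, per namespace, whether ca-signing-secret exists with type
# kubernetes.io/tls. Returns non-zero if any namespace fails the check.
check_ca_secret() {
  local ns secret_type rc=0
  for ns in carbide-rest cert-manager; do
    # An empty result means the secret is missing or kubectl failed.
    secret_type="$(kubectl get secret ca-signing-secret -n "$ns" \
      -o jsonpath='{.type}' 2>/dev/null)"
    if [ "$secret_type" = "kubernetes.io/tls" ]; then
      echo "$ns: ok"
    else
      echo "$ns: missing or wrong type (${secret_type:-none})"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_ca_secret` after `./scripts/gen-site-ca.sh` gives a quick pass/fail signal before the ClusterIssuer is created.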
-### 4.2 Create carbide-rest-ca-issuer and deploy credsmgr +### 4.2 Create carbide-rest-ca-issuer and Deploy credsmgr Create the `carbide-rest-ca-issuer` ClusterIssuer (backed by `ca-signing-secret` -from Step 4.1) and deploy credsmgr. From the `ncx-infra-controller-rest` repository: +from Step 4.1) and deploy credsmgr. Run the following commands from the `ncx-infra-controller-rest` +repository: ```bash kubectl apply -k deploy/kustomize/base/cert-manager-io @@ -152,12 +148,14 @@ kubectl apply -k deploy/kustomize/base/cert-manager kubectl get clusterissuer carbide-rest-ca-issuer ``` -Verify `carbide-rest-ca-issuer` shows `Ready=True` before proceeding. +Verify that `carbide-rest-ca-issuer` shows `Ready=True` before proceeding. ### 4.3 Provision Temporal TLS Certificates Apply the Temporal namespace, database credentials, and mTLS server certificate -manifests. From the `ncx-infra-controller-rest` repository: +manifests. + +First, run the following command from the `ncx-infra-controller-rest` repository: ```bash kubectl apply -k deploy/kustomize/base/temporal-helm @@ -167,13 +165,13 @@ This creates the `temporal` namespace, database credentials, and three server mTLS certificates (`server-interservice-cert`, `server-cloud-cert`, `server-site-cert`) issued by `carbide-rest-ca-issuer`. -Then apply the common resources (Temporal client certs for the REST workers): +Next, apply the common resources (Temporal client certs for the REST workers): ```bash kubectl apply -k deploy/kustomize/base/common ``` -Verify the server certificates are issued: +Verify that the server certificates have been issued: ```bash kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s @@ -199,20 +197,20 @@ kubectl exec -n temporal deploy/temporal-admintools -- \ If your Temporal deployment uses mTLS, add the TLS flags to each command: `--tls-cert-path`, `--tls-key-path`, `--tls-ca-path`, `--tls-server-name`. 
-See `helm-prereqs/SETUP_PHASES.md` for the full mTLS example. +Refer to `helm-prereqs/SETUP_PHASES.md` for the full mTLS example. ```{note} If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready. -Check `kubectl -n temporal logs elasticsearch-master-0` and wait for ES to become -healthy, or create the index manually. +Check the logs using `kubectl -n temporal logs elasticsearch-master-0` and wait for +Elasticsearch to become healthy, or create the index manually. ``` --- ## 5. Deploy Carbide REST Components -The REST / cloud layer provides the customer-facing API, workflow orchestration, and -site management. Deploy from the +The REST cloud layer provides the customer-facing API, along with workflow orchestration and +site management. The components are built from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repository. All REST components deploy into the `carbide-rest` namespace via a single Helm @@ -227,7 +225,7 @@ helm upgrade --install carbide-rest helm/charts/carbide-rest \ --timeout 600s --wait ``` -This deploys: `carbide-rest-api`, `carbide-rest-workflow` (cloud-worker and +This deploys the following: `carbide-rest-api`, `carbide-rest-workflow` (cloud-worker and site-worker), `carbide-rest-site-manager`, `carbide-rest-db` (migration job), and `carbide-rest-cert-manager` (credsmgr). 
@@ -238,7 +236,7 @@ If you need a dev IdP, deploy Keycloak separately before the umbrella chart: kubectl rollout status deployment/keycloak -n carbide-rest --timeout=300s ``` -Verify: +Verify the deployment as follows: ```bash kubectl get pods -n carbide-rest @@ -258,19 +256,19 @@ There are two deployment methods: **Helm** (recommended) and **Kustomize** (lega ### Helm (Recommended) -See the [Helm chart README](../../helm/README.md) for full documentation and -[helm/PREREQUISITES.md](../../helm/PREREQUISITES.md) for the Secrets and ConfigMaps +Refer to the [Helm chart README](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/README.md) for full documentation and +[helm/PREREQUISITES.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/helm/PREREQUISITES.md) for the Secrets and ConfigMaps that must exist before install. -1. Copy `helm/examples/values-minimal.yaml` (or `values-full.yaml`) and customize: - * `global.image.repository` and `global.image.tag` -- your built core image - * `global.imagePullSecrets` -- if using a private registry - * `carbide-api.hostname` -- your API FQDN - * `carbide-api.siteConfig.carbideApiSiteConfig` -- site-specific TOML overrides - * MetalLB `externalService` annotations for each service VIP - * Kea DHCP configuration under `carbide-dhcp.config` +1. Copy `helm/examples/values-minimal.yaml` (or `values-full.yaml`) and customize the following values: + * `global.image.repository` and `global.image.tag`: Your built core image + * `global.imagePullSecrets`: If using a private registry, add the secret name here + * `carbide-api.hostname`: Your API FQDN + * `carbide-api.siteConfig.carbideApiSiteConfig`: Site-specific TOML overrides + * `externalService`: MetalLB annotations for each service VIP + * `carbide-dhcp.config`: Add your Kea DHCP configuration in this section -2. Install: +2. 
Install the Helm chart: ```bash helm upgrade --install carbide ./helm \ @@ -278,7 +276,7 @@ helm upgrade --install carbide ./helm \ -f values-mysite.yaml ``` -3. Verify: +3. Verify the deployment as follows: ```bash kubectl -n forge-system get pods @@ -289,8 +287,8 @@ The migration job runs automatically. Pods may briefly restart until the databas ### Kustomize (Alternative) -See [deploy/README.md](../../deploy/README.md) for the full list of inputs. -Populate `deploy/kustomization.yaml` and `deploy/files/`, then: +Refer to [deploy/README.md](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/deploy/README.md) for the full list of inputs. +Populate `deploy/kustomization.yaml` and `deploy/files/`, then run the following command: ```bash cd deploy @@ -303,7 +301,7 @@ kustomize build . --enable-helm --enable-alpha-plugins --enable-exec | kubectl a curl -k https://:1079/ ``` -If the API VIP is not externally reachable: +If the API VIP is not externally reachable, you can use port-forwarding to access it locally: ```bash kubectl port-forward svc/carbide-api 1079:1079 -n forge-system @@ -314,19 +312,19 @@ curl -k https://localhost:1079/ ## 7. Install admin-cli -Build from source in the `ncx-infra-controller-core` repository: +Build the admin-cli from source in the `ncx-infra-controller-core` repository: ```bash cargo make build-cli ``` -The binary is at `target/release/carbide-admin-cli`. Point it at your API: +The binary is located at `target/release/carbide-admin-cli`. Point it to your API as follows: ```bash carbide-admin-cli -c https://api-. site info ``` -If the API is not externally reachable: +If the API is not externally reachable, you can use port-forwarding to access it locally: ```bash kubectl port-forward svc/carbide-api 1079:1079 -n forge-system & @@ -342,58 +340,58 @@ It deploys as a StatefulSet in the `carbide-rest` namespace. 1. 
Pre-apply the gRPC client certificate so it exists before the pod starts: -```bash -helm template carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ - --namespace carbide-rest \ - -f \ - --set global.image.repository= \ - --set global.image.tag= \ - --show-only templates/certificate.yaml | kubectl apply -f - + ```bash + helm template carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --show-only templates/certificate.yaml | kubectl apply -f - -kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \ - -n carbide-rest --timeout=120s -``` + kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \ + -n carbide-rest --timeout=120s + ``` 2. Create the per-site Temporal namespace (the site-agent panics without it): -```bash -SITE_UUID= + ```bash + SITE_UUID= -kubectl exec -n temporal deploy/temporal-admintools -- \ - temporal operator namespace create "$SITE_UUID" --address temporal-frontend.temporal:7233 -``` + kubectl exec -n temporal deploy/temporal-admintools -- \ + temporal operator namespace create "$SITE_UUID" --address temporal-frontend.temporal:7233 + ``` -If your Temporal deployment uses mTLS, add the TLS flags as described in Step 4.4. + If your Temporal deployment uses mTLS, add the TLS flags as described in Step 4.4. 3. 
Install the site-agent Helm chart (the pre-install hook registers the site and creates the `site-registration` secret): -```bash -helm upgrade --install carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ - --namespace carbide-rest \ - -f \ - --set global.image.repository= \ - --set global.image.tag= \ - --set "envConfig.CLUSTER_ID=$SITE_UUID" \ - --set "envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=$SITE_UUID" \ - --timeout 300s --wait -``` + ```bash + helm upgrade --install carbide-rest-site-agent helm/charts/carbide-rest-site-agent \ + --namespace carbide-rest \ + -f \ + --set global.image.repository= \ + --set global.image.tag= \ + --set "envConfig.CLUSTER_ID=$SITE_UUID" \ + --set "envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=$SITE_UUID" \ + --timeout 300s --wait + ``` -4. Verify: +4. Verify the deployment as follows: -```bash -kubectl get pods -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent -kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --tail=20 -``` + ```bash + kubectl get pods -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent + kubectl logs -n carbide-rest -l app.kubernetes.io/name=carbide-rest-site-agent --tail=20 + ``` --- ## 9. Ingest Hosts -See [Ingesting Hosts](ingesting_machines.md) for the complete procedure. +Refer to the [Ingesting Hosts](ingesting_machines.md) section for the complete ingestion procedure. -For each managed host, you need the **BMC MAC address**, **chassis serial number**, and -**factory BMC username/password** (from your asset management system or server vendor). +For each managed host, you need the BMC MAC address, chassis serial number, and +factory BMC username/password (from your asset management system or server vendor). 
```bash # Set desired credentials NICo will apply to all hosts @@ -407,11 +405,11 @@ carbide-admin-cli -c expected-machine replace-all --filename expected_ carbide-admin-cli -c mb site trusted-machine approve \* persist --pcr-registers="0,3,5,6" ``` -NICo then automatically: assigns IPs via DHCP, discovers BMCs via Redfish, rotates -credentials, provisions DPUs, PXE-boots hosts into Scout for hardware discovery, and +NICo then automatically assigns IPs via DHCP, discovers BMCs via Redfish, rotates +credentials, provisions DPUs, PXE-boots hosts into Scout for hardware discovery, and then moves machines to the `Available` pool. -Monitor progress: +Monitor progress as follows: ```bash carbide-admin-cli -c machine list @@ -439,7 +437,7 @@ carbide-admin-cli -c machine list ``` To complete the hello-world test, create an instance to provision Ubuntu on a managed -host, then SSH to verify: +host, then use SSH to verify: ```bash ssh -p 22 @ @@ -451,34 +449,36 @@ ssh -p 22 @ ### Temporal Pods Stuck in Init -Pods stuck in `Init:0/1` -- usually Elasticsearch index not ready. -Check `kubectl -n temporal logs elasticsearch-master-0`. +If Temporal pods are stuck in `Init:0/1`, the Elasticsearch index may not be ready. +Check the logs using `kubectl -n temporal logs elasticsearch-master-0` and wait for +Elasticsearch to become healthy, or create the index manually. 
### kubectl Connection Refused -When accessing through a jump host: `ssh -L 6443:localhost:6443 ` +When accessing through a jump host, use port-forwarding as follows: `ssh -L 6443:localhost:6443 ` ### External API Access Blocked -Use port-forwarding: `kubectl port-forward svc/carbide-api 1079:1079 -n forge-system` +Use port-forwarding as follows: `kubectl port-forward svc/carbide-api 1079:1079 -n forge-system` ### carbide-rest-site-manager Fails to Start -`unable to start container process` -- verify the image was built with the production -Dockerfile (`docker/production/Dockerfile.carbide-rest-site-manager`), not the local -dev Dockerfile. +If `carbide-rest-site-manager` fails with `unable to start container process`, verify that the image was built with the production +Dockerfile (`docker/production/Dockerfile.carbide-rest-site-manager`), not with the local dev Dockerfile. ### Pods Stuck in ImagePullBackOff -Missing `imagePullSecrets`. Verify: `kubectl -n get secret imagepullsecret` +If pods are stuck in `ImagePullBackOff`, verify that the `imagePullSecrets` are present. Run the following command to check: `kubectl -n get secret imagepullsecret` ### nvcr.io/nvidian Image References -Internal NVIDIA paths. Build from source (Step 1) and replace with your registry URL. +If you encounter `nvcr.io/nvidian/...` image references in documentation or manifests, +those are NVIDIA-internal paths that are not accessible externally. Replace them with your own +registry paths after building from source.
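To locate such references before deploying, a recursive search over your values files and manifests is usually enough. This is a minimal sketch. The function name `find_internal_image_refs` is hypothetical, and the directory to scan is whatever holds your site configuration.

```bash
# Hypothetical helper: list every file and line under a directory that
# still references an NVIDIA-internal nvcr.io/nvidian image path.
find_internal_image_refs() {
  local dir="${1:-.}"
  # -r: recurse, -n: show line numbers, -F: match the string literally
  grep -rnF 'nvcr.io/nvidian/' "$dir"
}
```

The function returns a non-zero exit status when no references remain, so it can also serve as a pre-deployment check in a script.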
### Machines Not Progressing -Check state controller logs: +Check the state controller logs as follows: `kubectl -n forge-system logs -l app=carbide-api --tail=100 | grep state_controller` Common causes: DHCP relay not configured on OOB switch, BMC MACs not matching the diff --git a/book/src/manuals/pushing_containers.md b/book/src/manuals/pushing_containers.md index 0e2bab80d8..2926d76fb5 100644 --- a/book/src/manuals/pushing_containers.md +++ b/book/src/manuals/pushing_containers.md @@ -1,8 +1,11 @@ # Tagging and Pushing Containers to a Private Registry -After building all NICo container images (see [Building NICo Containers](building_nico_containers.md)), -tag them for your private registry and push. Set your registry URL and version tag as -environment variables: +After building all NICo container images (refer to the [Building NICo Containers](building_nico_containers.md) section), +you will need to tag them and push them to your private registry. + +## Setting Environment Variables + +Set your registry URL and version tag as environment variables: ```sh REGISTRY= @@ -33,15 +36,15 @@ docker push $REGISTRY/machine-validation-config:$TAG REST images are built from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) -repository. The `make docker-build` command tags images at build time when you pass -`IMAGE_REGISTRY` and `IMAGE_TAG`: +repository. 
The `make docker-build` command tags images at build time when you pass the +`IMAGE_REGISTRY` and `IMAGE_TAG` variables: ```sh cd /path/to/ncx-infra-controller-rest make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG ``` -Then push all REST images: +Then, push all REST images to your private registry: ```sh for image in carbide-rest-api carbide-rest-workflow carbide-rest-site-manager \ diff --git a/book/src/manuals/site-setup.md b/book/src/manuals/site-setup.md index 3d8faaeed1..edef23061f 100644 --- a/book/src/manuals/site-setup.md +++ b/book/src/manuals/site-setup.md @@ -1,6 +1,6 @@ # Site Setup Guide -This page outlines the software dependencies for a Kubernetes-based install of NVIDIA Bare Metal Manager (BMM). It includes the *validated baseline* of software dependencies, +This page outlines the software dependencies for a Kubernetes-based install of NVIDIA NCX Infra Controller (NICo). It includes the *validated baseline* of software dependencies, as well as the *order of operations* for site bringup, including what you must configure if you already operate some of the common services yourself. **Important Notes** @@ -16,7 +16,7 @@ as well as the *order of operations* for site bringup, including what you must c ## Validated Baseline -This section lists all software dependencies, including the versions validated for this release of BMM. +This section lists all software dependencies, including the versions validated for this release of NICo. ### Kubernetes and Node Runtime @@ -58,7 +58,7 @@ This section lists all software dependencies, including the versions validated f ### Monitoring and Telemetry (OPTIONAL) -These components are not required for BMM setup, but are recommended site metrics. +These components are not required for NICo setup, but are recommended for site metrics.
- **Monitoring System**: Prometheus Operator v0.68.0; Prometheus v2.47.0; Alertmanager v0.26.0 @@ -70,27 +70,30 @@ These components are not required for BMM setup, but are recommended site metric - **Host Monitoring** Node exporter v1.6.1 -### BMM Components +### NICo Components -The following services are installed during the BMM installation process. +The following services are installed during the NICo installation process. -- **NICo core (forge-system)** +- **NICo core (forge-system)**: `/nvmetal-carbide:` (primary carbide-api, plus supporting workloads) + + - Build from the [ncx-infra-controller-core](https://github.com/NVIDIA/ncx-infra-controller-core) repo. + Refer to the [Building NICo Containers](building_nico_containers.md) section for more details. - - `/nvmetal-carbide:` (primary carbide-api, plus supporting workloads). - Build from [ncx-infra-controller-core](https://github.com/NVIDIA/ncx-infra-controller-core). - See [Building NICo Containers](building_nico_containers.md). +- **cloud-api**: `/carbide-rest-api:` (two replicas) + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. -- **cloud-api**: `/carbide-rest-api:` (two replicas). - Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest). +- **cloud-workflow**: `/carbide-rest-workflow:` (cloud-worker, site-worker) + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. -- **cloud-workflow**: `/carbide-rest-workflow:` (cloud-worker, site-worker). - Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest). - -- **cloud-cert-manager (credsmgr)**: `/carbide-rest-cert-manager:`. - Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest). 
+- **cloud-cert-manager (credsmgr)**: `/carbide-rest-cert-manager:` + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. - **elektra-site-agent**: `/carbide-rest-site-agent:`. - Build from [bare-metal-manager-rest](https://github.com/NVIDIA/bare-metal-manager-rest). + + - Build from the [ncx-infra-controller-rest](https://github.com/NVIDIA/ncx-infra-controller-rest) repo. ## Order of Operations