Commit d9df611

chore: merge main into fix/894-customisable-credential-placeholder
Resolve conflict from upstream consolidation of subsystem docs (#1184): sandbox-providers.md was removed; migrate the Selective Passthrough content to the Credentials section of architecture/sandbox.md.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
2 parents 8446a90 + 52097f2 commit d9df611

368 files changed

Lines changed: 27667 additions & 37118 deletions

.agents/skills/build-from-issue/SKILL.md

Lines changed: 18 additions & 33 deletions
@@ -185,7 +185,7 @@ gh issue comment <id> --body "$(cat <<'EOF'
 - <risk or unknown that may need human input>
 
 ### Documentation Impact
-- <which architecture/ docs will need updating, or "None expected">
+- <docs expected per AGENTS.md, or "None expected">
 
 ---
 *Revision 1 — initial plan*
@@ -402,29 +402,24 @@ git diff --name-only main -- e2e/
 
 If there are no changes under `e2e/`, skip this phase entirely.
 
-If E2E files were modified, deploy to the local cluster and run the E2E test suite:
+If E2E files were modified, run the relevant E2E lane for the driver touched by the change:
 
 ```bash
-# Deploy all changes to the local k3s cluster
-mise run cluster:deploy
-
-# Run the E2E sandbox tests
-mise run test:e2e:sandbox
+# Docker-backed gateway smoke E2E
+mise run e2e:docker
 ```
 
-`mise run test:e2e:sandbox` depends on `cluster:deploy` and `python:proto`, then runs `uv run pytest -o python_files='test_*.py' e2e/python`. However, since the cluster may need explicit deploy for code changes beyond just E2E test files, always run `mise run cluster:deploy` first as a separate step to ensure all sandbox/proxy/policy changes are live on the cluster before running E2E tests.
+Use `mise run e2e:podman`, `mise run e2e:vm`, or a Helm-backed Kubernetes E2E lane when the change targets those drivers.
 
 **E2E retry loop** (up to 3 attempts):
 
-1. Run `mise run cluster:deploy` (only on the first attempt, or if code was changed between attempts).
-2. Run `mise run test:e2e:sandbox`.
-3. If tests fail:
+1. Run the selected E2E lane.
+2. If tests fail:
 - Read the pytest output carefully — identify which tests failed and why.
 - Distinguish between **test bugs** (the test itself is wrong) and **implementation bugs** (the code under test is wrong).
 - Fix the failing code or tests.
-- If code changes were made (not just test fixes), re-run `mise run cluster:deploy` before retrying.
 - Decrement the retry counter and try again.
-4. If tests pass, Phase 2 is green.
+3. If tests pass, Phase 2 is green.
 
 **If all 3 E2E attempts fail**, stop and report to the user:
 - Which E2E tests are failing
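
Reviewer note: with the deploy step gone, the revised retry loop reduces to "pick a lane, run it up to three times". The sketch below illustrates that flow in POSIX shell. The function names `select_e2e_lane` and `run_with_retries` are hypothetical and are not part of the skill; only the three `mise` task names come from this diff, and the Helm-backed lane is left out because the diff does not name its task.

```shell
# Hypothetical helper: map a driver name to the mise task named in this diff.
select_e2e_lane() {
  case "$1" in
    docker) echo "e2e:docker" ;;
    podman) echo "e2e:podman" ;;
    vm)     echo "e2e:vm" ;;
    *)      echo "unknown driver: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: run a command until it succeeds, at most max_attempts times.
run_with_retries() {
  max_attempts="$1"; shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "attempt $attempt: passed"
      return 0
    fi
    echo "attempt $attempt: failed" >&2
    # In the skill, failures are diagnosed and fixed here before the next attempt.
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `run_with_retries 3 mise run "$(select_e2e_lane docker)"` would run the Docker lane up to three times. The skill itself interleaves each failed attempt with reading the pytest output and fixing the code or tests, which a plain loop like this does not capture.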
@@ -436,18 +431,9 @@ Do not proceed to PR creation if E2E verification is not green.
 
 ### Step 11: Update Documentation
 
-Use the `arch-doc-writer` sub-agent to update architecture documentation. Use the Task tool:
-
-```
-Task tool with subagent_type="arch-doc-writer"
-```
-
-In the prompt, provide:
-- Which files were changed and why (from the plan + any deviations)
-- The issue context (what was built/fixed)
-- Which architecture docs in `architecture/` are likely affected
-
-Launch one `arch-doc-writer` instance per documentation file that needs updating. If no documentation changes are needed, the `arch-doc-writer` will make that determination.
+Review the documentation requirements in `AGENTS.md` and update any affected
+docs as part of the implementation. Keep documentation changes scoped to the
+behavior or subsystem that changed.
 
 ### Step 12: Commit and Push
 
@@ -507,10 +493,9 @@ Closes #<issue-id>
 ## Checklist
 - [x] Follows Conventional Commits
 - [x] Commits are signed off (DCO)
-- [x] Architecture docs updated (if applicable)
 
 **Documentation updated:**
-- `<architecture/doc.md>`: <what was updated>
+- `<doc path>`: <what was updated, or "None needed">
 EOF
 )"
 ```
@@ -542,7 +527,7 @@ PR: [#<pr-number>](https://github.com/OWNER/REPO/pull/<pr-number>)
 - E2E: <count or N/A>
 
 ### Docs updated
-- <list of updated architecture docs, or "None needed">
+- <list of updated docs, or "None needed">
 
 The issue will auto-close when the PR is merged.
 EOF
@@ -576,8 +561,8 @@ Local E2E tests passed. CI does not currently run E2E tests, so this comment ser
 | Field | Value |
 |-------|-------|
 | **Commit** | `<commit-sha>` |
-| **Command** | `mise run test:e2e:sandbox` |
-| **Cluster deploy** | `mise run cluster:deploy` (completed before test run) |
+| **Command** | `<selected e2e command>` |
+| **Gateway mode** | `<docker / podman / vm / helm>` |
 | **Result** | ✅ All passed |
 
 ### Test Summary
@@ -645,8 +630,9 @@ If the `state:in-progress` label is present, the skill was previously started bu
 | `gh pr create --title "..." --body "..."` | Create a pull request |
 | `gh api user --jq '.login'` | Get current GitHub username |
 | `mise run pre-commit` | Run pre-commit checks (includes unit tests, lint, format) |
-| `mise run cluster:deploy` | Deploy all changes to local k3s cluster |
-| `mise run test:e2e:sandbox` | Run E2E sandbox tests (depends on cluster:deploy) |
+| `mise run e2e:docker` | Run smoke E2E against a standalone Docker-backed gateway |
+| `mise run e2e:podman` | Run smoke E2E against a Podman-backed gateway |
+| `mise run e2e:vm` | Run smoke E2E against the VM compute driver |
 
 ## Example Usage
 
@@ -695,7 +681,6 @@ User says: "Build issue #42"
 7. Add unit tests for pagination logic, integration tests for both endpoints
 8. `mise run pre-commit` passes on first attempt
 9. E2E tests skipped (no changes under `e2e/`)
-10. `arch-doc-writer` updates `architecture/gateway.md` with pagination details
 10. Commit, push, create PR with `Closes #42`
 11. Post summary comment on issue with PR link
 12. Update labels: remove `state:in-progress` + `state:review-ready`, add `state:pr-opened`

.agents/skills/create-github-pr/SKILL.md

Lines changed: 0 additions & 2 deletions
@@ -158,7 +158,6 @@ PR descriptions must follow the project's [PR template](.github/PULL_REQUEST_TEM
 ## Checklist
 - [ ] Follows Conventional Commits
 - [ ] Commits are signed off (DCO)
-- [ ] Architecture docs updated (if applicable)
 ```
 
 Populate the testing checklist based on what was actually run. Check boxes for steps that were completed.
@@ -193,7 +192,6 @@ Closes #456
 
 - [x] Follows Conventional Commits
 - [x] Commits are signed off (DCO)
-- [ ] Architecture docs updated (if applicable)
 EOF
 )"
 ```

.agents/skills/debug-inference/SKILL.md

Lines changed: 12 additions & 11 deletions
@@ -174,7 +174,7 @@ openshell sandbox create -- curl https://inference.local/v1/chat/completions --j
 
 Interpretation:
 
-- **`cluster inference is not configured`**: set the managed route with `openshell inference set`
+- **`cluster inference is not configured`**: set the managed gateway route with `openshell inference set`
 - **`connection not allowed by policy`** on `inference.local`: unsupported method or path
 - **`no compatible route`**: provider type and client API shape do not match
 - **Connection refused / upstream unavailable / verification failures**: base URL, bind address, topology, or credentials are wrong
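
Reviewer note: the interpretation list above amounts to a lookup from error text to likely cause. The POSIX shell sketch below shows one way to express it. `diagnose_inference_error` is a hypothetical helper, not an `openshell` command; the matched substrings for the first three cases are copied from the list, while the patterns for the last case and the fallback message are assumptions about how such errors are worded.

```shell
# Hypothetical helper: map an error message to the hint given in the list above.
diagnose_inference_error() {
  case "$1" in
    *"cluster inference is not configured"*)
      echo "set the managed gateway route with: openshell inference set" ;;
    *"connection not allowed by policy"*)
      echo "unsupported method or path on inference.local" ;;
    *"no compatible route"*)
      echo "provider type and client API shape do not match" ;;
    *[Cc]"onnection refused"*|*"upstream unavailable"*|*"verification"*)
      # Assumed wording; the skill only names these failure classes.
      echo "check base URL, bind address, topology, and credentials" ;;
    *)
      echo "unrecognized error; inspect the full output" ;;
  esac
}
```

A call such as `diagnose_inference_error "error: no compatible route"` prints the matching hint, which can help when scanning longer command output.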
@@ -232,7 +232,7 @@ In this case, OpenShell routing is usually working correctly. The failing hop is
 
 This is not the same issue as the Colima CoreDNS fix.
 
-OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox pods with `hostAliases`. That path bypasses cluster DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not CoreDNS.
+OpenShell injects `host.docker.internal` and `host.openshell.internal` into sandbox workloads when the selected compute platform supports it. That path bypasses runtime DNS lookup. If the request still times out, the usual cause is host firewall or network policy, not DNS.
 
 ### Verify the Problem
 
@@ -248,40 +248,41 @@ OpenShell injects `host.docker.internal` and `host.openshell.internal` into sand
 curl -sS http://172.17.0.1:11434/v1/models
 ```
 
-3. Test the same endpoint from the OpenShell cluster container:
+3. Test the same endpoint from a gateway or sandbox container on the Docker network:
 
 ```bash
-docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+docker ps --filter name=openshell --format '{{.Names}}'
+docker exec <container-name> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
 ```
 
 If steps 1 and 2 succeed but step 3 times out, the host firewall or network configuration is blocking the container-to-host path.
 
 ### Fix
 
-Allow the Docker bridge network used by the OpenShell cluster to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:
+Allow the Docker bridge network used by the OpenShell gateway and sandbox containers to reach the host-local inference port. The exact command depends on your firewall tooling (iptables, nftables, firewalld, UFW, etc.), but the rule should allow:
 
-- **Source**: the Docker bridge subnet used by the OpenShell cluster container (commonly `172.18.0.0/16`)
-- **Destination**: the host gateway IP injected into sandbox pods for `host.docker.internal` (commonly `172.17.0.1`)
+- **Source**: the Docker bridge subnet used by OpenShell containers (commonly `172.18.0.0/16`)
+- **Destination**: the host gateway IP injected into sandbox workloads for `host.docker.internal` (commonly `172.17.0.1`)
 - **Port**: the inference server port (e.g. `11434/tcp` for Ollama)
 
 To find the actual values on your system:
 
 ```bash
-# Docker bridge subnet for the OpenShell cluster network
+# Docker bridge subnet for the OpenShell network
 docker network inspect $(docker network ls --filter name=openshell -q) --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'
 
 # Host gateway IP visible from inside the container
-docker exec openshell-cluster-<gateway> cat /etc/hosts | grep host.docker.internal
+docker exec <container-name> cat /etc/hosts | grep host.docker.internal
 ```
 
 Adjust the source subnet, destination IP, or port to match your local Docker network layout.
 
 ### Verify the Fix
 
-1. Re-run the cluster container check:
+1. Re-run the container network check:
 
 ```bash
-docker exec openshell-cluster-<gateway> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
+docker exec <container-name> wget -qO- -T 5 http://host.docker.internal:11434/v1/models
 ```
 
 2. Re-test from a sandbox:
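
Reviewer note: the Fix section in this hunk describes the firewall rule by source, destination, and port but leaves the command to the reader. As an illustration only, the sketch below prints (and does not apply) one possible `iptables` form of that rule. It assumes the host firewall is managed with `iptables` and that container-to-host traffic is filtered in the `INPUT` chain; hosts using nftables, firewalld, or UFW need a different command. `build_allow_rule` is a hypothetical helper, and the sample values are the "commonly" cited defaults from the diff, not values verified for any particular system.

```shell
# Hypothetical helper: print an iptables rule allowing the container-to-host path.
# It only echoes the command so the rule can be reviewed before running it as root.
build_allow_rule() {
  subnet="$1"   # Docker bridge subnet used by OpenShell containers
  host_ip="$2"  # host gateway IP behind host.docker.internal
  port="$3"     # inference server TCP port
  printf 'iptables -I INPUT -s %s -d %s -p tcp --dport %s -j ACCEPT\n' \
    "$subnet" "$host_ip" "$port"
}

# Example with the commonly seen defaults and the Ollama port from the diff.
build_allow_rule 172.18.0.0/16 172.17.0.1 11434
```

Printing the rule instead of executing it keeps the sketch safe to run unprivileged; substitute the subnet and host IP reported by the `docker network inspect` and `/etc/hosts` commands in the hunk above.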
