feat: [ENG-2240] AutoHarness V2 SandboxService.loadHarness + harness.* injection#499
feat: [ENG-2240] AutoHarness V2 SandboxService.loadHarness + harness.* injection#499danhdoan merged 2 commits intoproj/autoharness-v2from
Conversation
|
Claude finished @danhdoan's task in 3m 22s —— View job PR Review:
|
| Location | Finding |
|---|---|
sandbox-service.ts:265 |
'no-version' used for both "disabled" and "no version trained" — conflates two distinct states |
sandbox-service.ts:297 |
if (harnessNs !== undefined) is always true here — redundant guard after sessionHarnessStates.set |
sandbox-service.ts:481 |
harness.curate exposes no way for the caller to inject custom env overrides — fine for v1.0 but worth a note for Phase 5 API review |
test:109 |
Missing assertion that sessionHarnessStates stays empty on builder failure |
test:163 |
Behavioral test preferred over private method cast |
What's Well Done
- Two-ordering correctness: The design where
loadHarnessinjects immediately if a sandbox exists, or defers toexecuteCode's creation branch, handles both call orderings cleanly without a polling loop or flag. readonlyonSessionHarnessState: Prevents accidental mutation of cross-call state; good hygiene.buildHarnessTools()captures services dynamically viathis: Eachharness.curate()call gets a fresh binding to the currentcurateService— consistent with how toolsSDK handles service updates.- Never-throws contract on
loadHarness: Every failure mode returns a typed result. Combined with theHarnessLoadResultdiscriminated union, downstream code can narrow safely without try/catch. resolveProjectType()usesProjectTypeSchema.safeParse: New project types added to the schema are automatically accepted without a code change — a quiet forward-compatibility win.cleanup()andclearSession()both clear the new maps: Lifecycle hygiene matches the existing patterns forsandboxesandpendingVariables.
injection
Phase 3 Task 3.3 — wires the module builder (ENG-2239) and store
(ENG-2228) into `SandboxService` as a user-facing capability.
`loadHarness(sessionId, projectId, commandType)` reads the latest
`HarnessVersion` via the store, evaluates it through the module
builder, and exposes `harness.curate`, `harness.query`,
`harness.meta` inside the session's sandbox context.
- `SandboxService` gets two new optional fields: `harnessModuleBuilder`
(class) + `harnessStore` (interface). Setters wire them
alphabetically.
- `service-initializer.ts` constructs `HarnessModuleBuilder` alongside
the existing `HarnessStore` + `HarnessOutcomeRecorder` block, and
calls both setters. No new service on `CipherAgentServices` — the
builder is internal to the sandbox.
- `src/agent/infra/harness/index.ts` barrel re-exports
`HarnessModuleBuilder`.
Never throws — encodes every failure in the returned
`HarnessLoadResult`.
- config disabled / deps missing → `{loaded:false,
reason:'no-version'}`
- store returns `undefined` → `{loaded:false, reason:'no-version'}`
- builder returns `{loaded:false,
reason:'syntax'|'meta-threw'|'meta-invalid'}`
→ propagates unchanged, sandbox untouched
- builder returns `{loaded:true}`:
1. `sessionHarnessStates` Map gets {commandType, meta, module,
projectType} keyed by sessionId
2. `harnessVersionIdBySession` populated so Phase 2 recorder
can attribute outcomes to the loaded version
3. If sandbox already exists → `updateContext({harness: ns})`;
otherwise `executeCode` injects on creation via
`buildHarnessNamespace(sessionId)`
`buildHarnessNamespace(sessionId)` returns `{meta, curate?, query?}`
per what the loaded module exports:
- `meta()` returns the captured `HarnessMeta` (no VM re-invocation)
- `curate()` / `query()` build a fresh `HarnessContext` per call
and invoke `module.curate(ctx)` / `module.query(ctx)` through the
module builder's timeout-wrapped, deep-freeze-enforced path
- `ctx.tools.curate` / `ctx.tools.readFile` bind to the session's
real curate service + file system; explicit error when unwired
Hardcoded to the Phase 3 assisted baseline. Phase 5's
`HarnessModeSelector` will layer mode gating on top — mode-specific
prompt contributors + Mode C safety caps.
Currently a fresh `AbortController().signal` per call — placeholder.
Phase 5's `AgentLLMService` hook will thread the session's real
abort signal through so user cancellation propagates into harness
code. Documented in-source.
6 unit tests in `sandbox-service-harness-load.test.ts` covering:
- config-disabled early return (no store call)
- no-version store return
- builder-failure propagation (no state populated)
- successful load returns {loaded:true} with stored version
- curate-only template → `harness.curate` present, `harness.query`
absent (end-to-end via real `HarnessModuleBuilder`)
- `harnessVersionIdBySession` populated on success for Phase 2
recorder attribution
- `rg "loadHarness" src/` → 1 definition + 0 production call sites.
Phase 5's mode selector will be the first real consumer. Comments +
JSDoc references present but no code calls.
- `rg "new HarnessModuleBuilder" src/` → 1 construction site in
service-initializer.ts.
Full suite: 6697 passing / 0 failing.
|
Claude finished @danhdoan's task in 1m 50s —— View job PR Re-Review:
|
| Location | Finding | Status |
|---|---|---|
sandbox-service.ts:260–265 |
'no-version' used for disabled config AND unconfigured wiring — conflates distinct failure modes |
unchanged |
sandbox-service.ts:297–300 |
if (harnessNs !== undefined) is always true here — sessionHarnessStates.set was called 13 lines above |
unchanged |
test:108–109 |
Missing assertion that sessionHarnessStates is NOT populated on builder failure; comment says it but doesn't assert it |
unchanged |
test:160–163 |
Private method cast tests implementation detail; behavioral executeCode assertion preferred |
unchanged |
For the redundant guard at line 297–300, the simplest fix is a comment acknowledging the guarantee, or just dropping the inner if:
// sessionHarnessStates was just set above — buildHarnessNamespace is guaranteed non-undefined here
sandbox.updateContext({harness: this.buildHarnessNamespace(sessionId)!})(or keep the outer sandbox !== undefined guard and inline the update.)
What's Well Done (unchanged from prior review)
- Two-ordering correctness: Load before or after first
executeCodeboth inject cleanly. readonlyonSessionHarnessState: Prevents cross-call mutation.- Never-throws contract: Every failure mode is a typed
HarnessLoadResult. cleanup()/clearSession()both clear the new maps: Lifecycle hygiene is consistent.resolveProjectType()usessafeParse: New project types accepted without code changes.buildHarnessTools()captures services viathis: Fresh bindings per call, consistent with toolsSDK pattern.
Action items before merge:
- Add
loadHarness,setHarnessModuleBuilder,setHarnessStoretoISandboxService(blocking) - Add test for the reverse-order injection path (executeCode → loadHarness) (blocking per TDD standard)
- (Nice-to-have) Add assertion to builder-failure test; address redundant guard
|
Claude finished @danhdoan's task in 1m 44s —— View job PR Re-Review:
|
| Finding | Resolution |
|---|---|
'no-version' used for disabled/not-configured/missing — conflation of distinct states |
Explicit intentional comment added at sandbox-service.ts:260–267: "Deliberate: three distinct conditions…collapse into the same 'no-version' result for v1.0. Keep this conflation intentional, not accidental." |
if (harnessNs !== undefined) redundant guard after unconditional sessionHarnessStates.set |
Guard removed; direct sandbox.updateContext({harness: this.buildHarnessNamespace(sessionId)}) with a comment: "buildHarnessNamespace is guaranteed non-undefined here: state was just set above." |
Missing assertion on sessionHarnessStates in builder-failure test |
Test at line 116 now asserts expect(internal.sessionHarnessStates.has('s1')).to.equal(false) and expect(internal.harnessVersionIdBySession.has('s1')).to.equal(false) |
Capability-driven injection test used private method cast (unknown cast to call buildHarnessNamespace) |
Test now exercises executeCode and inspects harness.* availability as user code would see it — behaviorally correct. Comment explicitly notes this approach. |
What's Well Done (carried from prior reviews)
- Two-ordering correctness: Both load-before-execute and execute-before-load paths are now tested and work correctly.
readonlyonSessionHarnessState: Prevents cross-call mutation.- Never-throws contract: Every failure mode is a typed
HarnessLoadResult. cleanup()/clearSession()both clear the new maps: Lifecycle hygiene consistent withsandboxesandpendingVariables.resolveProjectType()usessafeParse: New project types accepted without code changes.buildHarnessTools()captures services viathis: Fresh bindings per call, consistent with toolsSDK pattern.- JSDoc on
loadHarnessinISandboxService: Thorough, references Phase 5 as the first production consumer — useful for whoever writes that callsite.
No further action required. Ready to merge.
Summary
harness.*surface to call, and the per-session version id tracked by the outcome recorder stays empty.SandboxService.loadHarness. Phase 5 (mode selector) is the first consumer; Phase 3 Task 3.5's isolation integration test also depends on this wiring.loadHarness(sessionId, projectId, commandType): Promise<HarnessLoadResult>onSandboxService. Two new setters (setHarnessModuleBuilder,setHarnessStore).service-initializer.tsconstructs the builder and injects both.executeCodenow injectsharness.*into the sandbox context on creation when a module is loaded for the session. 6 unit tests covering every branch.loadHarnessyet — Phase 5 will be the first. No mode selection (hardcoded to Phase 3 assisted baseline). No attack-vector integration tests (Task 3.5). No template content (Phase 4). Sandbox.execute semantics unchanged when no harness is loaded.Type of change
Scope (select all touched areas)
Linked issues
HarnessContexttypes), ENG-2239 (Phase 3 Task 3.2 —HarnessModuleBuilder), ENG-2228 (Phase 1 Task 1.4 —HarnessStoreservice-initializer wiring)AgentLLMServicesession-start hook)Root cause (bug fixes only, otherwise write
N/A)Test plan
test/unit/infra/sandbox/sandbox-service-harness-load.test.tsharness.enabled === false→{loaded: false, reason: 'no-version'},store.getLatestnever calledstore.getLatest(...)returnsundefined→{loaded: false, reason: 'no-version'}, call target verified{loaded: false, reason: 'meta-threw'}→ propagates as-is, no session state populated{loaded: true}returned with the storedHarnessVersionreferenceHarnessModuleBuilder→harness.curatepresent,harness.queryabsentharnessVersionIdBySessionpopulation: successful load writes the version id so Phase 2 recorder can tag outcomesUser-visible changes
None. No consumer calls
loadHarnessyet, andharness.enabled = falseremains the public default. Sandbox execution without a loaded harness is byte-identical to before this PR.Evidence
Before this PR,
loadHarnessdidn't exist — no test in the suite referenced it. After: all 6 pass. Full suite: 6697 passing / 0 failing.Checklist
npm test) — 6 new tests; full suite 6697 passing / 0 failingnpm run lint) — 0 errors, 226 pre-existing warningsnpm run typecheck) — exit=0npm run build) — exit=0feat: [ENG-2240] ...features/autoharness-v2/tasks/phase_3/task_03-load-harness-and-injection.md(research repo) drove the scope; thectx.abortplaceholder + mode-hardcoded-to-assisted decisions flagged below for post-merge task doc tighteningmain— targetsproj/autoharness-v2, notmainRisks and mitigations
Risk:
ctx.abortis a placeholder — a freshAbortController().signalper call. If a harness function checksctx.abort.aborted, it will always seefalse, even when user code has requested cancellation.ctx.abort(they forward toctx.tools.curatewhich has its own cancellation plumbing). Phase 5'sAgentLLMServicehook threads the real session abort signal through. Placeholder is documented in-source atbuildHarnessNamespacewith a pointer to Phase 5. If a Phase 6 refined harness genuinely needsctx.abortbefore Phase 5 lands, the fix is plumbing the sandbox-level abort signal throughloadHarness— additive.Risk: Mode is hardcoded to "assisted" (all exported capabilities visible to user code). A session that should be in Mode B (filter) or Mode C (policy) would still see
harness.*without the mode-specific safety caps fromv1-design-decisions.md §2.5.loadHarnessyet — Phase 5 is the first. Phase 5's mode selector runs BEFOREloadHarnessand refuses to load below the Mode A threshold (H ≥ 0.30). So the "wrong mode injected" scenario can't materialize until a mode selector exists, and the mode selector arrives in the same phase that will swap the hardcoded mode. Zero production risk.Risk:
buildHarnessNamespacebindsctx.tools.curate/ctx.tools.readFileto the session's real curate service + file system. If those services aren't wired (e.g., early tests), the bound tools throw instead of silently no-op'ing.SandboxServicetests that don't wire those services also don't callloadHarness, so no test breakage. Future regression: flagged in thebuildHarnessToolsJSDoc.Risk:
loadHarnessuses session-scoped state on two maps (sessionHarnessStates,harnessVersionIdBySession).clearSessionandcleanupboth clear them, but any future bypass of those lifecycle hooks would leak state across sessions.sandboxesandpendingVariables— all maps follow the same lifecycle hooks. Any future field that forgets to clear is caught by the established test pattern ("sessions don't share state"). Acceptable v1.0 hygiene.Notes for reviewers
loadHarnessis API-complete but production-dormant. Shipping this PR doesn't change any user-visible behavior because nothing callsloadHarnessyet. Phase 5's mode selector +AgentLLMServicesession-start hook is the first production caller. Task 3.5 (isolation integration test) is the first test consumer. This shape — ship the capability, defer the caller — matches Phase 1.4's pattern and de-risks Phase 5 by letting the wiring soak inproj/autoharness-v2first.The
harness.*injection happens in two places, both callingbuildHarnessNamespace(sessionId):loadHarnessitself, when called for a session whose sandbox already exists (updateContext({harness: ns})).executeCode's sandbox-creation branch (new sandbox for this session), beforenew LocalSandbox(...).This handles both orderings:
loadHarnessbefore or after the firstexecuteCode. If Phase 5 ever calls them in a weird interleaving, the namespace lands correctly either way.SessionHarnessStateisreadonlyat the type level. The storedHarnessMetais the cached result ofmodule.meta()from Task 3.2, which is pure and invariant.readonlyprevents accidental mutation of state used across multipleexecuteCodecalls.Capability-driven injection is export-based, not meta-capabilities-based. If a template declares
meta().capabilities: ['curate', 'query']but only exportscurate, the injected namespace has justharness.curate. Strictly, this diverges from the task doc's wording ("capabilities declared inmeta()are injected"), but in practice well-formed templates export exactly what they declare, andHarnessMetaSchema's validation runs at save time in Phase 1's store. The module-builder already uses the same export-based logic. Flagging in case reviewers prefer an additional cross-check.Scope call —
loadHarnessis idempotent but not cheap on repeat. Each call re-evaluates the template through the VM. If Phase 5 callsloadHarnessonce per session (the intended pattern), this is a non-issue. If a future consumer calls it per-request, add a "already loaded?" check at the top.Related
src/agent/core/domain/harness/types.ts(ENG-2238 —HarnessContext,HarnessLoadResult)src/agent/infra/harness/harness-module-builder.ts(ENG-2239)src/agent/infra/harness/harness-store.ts(ENG-2227 + ENG-2228)features/autoharness-v2/tasks/phase_3_4_handoff.md §C2 §C3 §C4features/autoharness-v2/tasks/phase_3/task_03-load-harness-and-injection.md(research repo)HarnessModeSelector+AgentLLMServicesession-start hook`