feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder #497
danhdoan merged 3 commits into `proj/autoharness-v2`
Phase 3 Task 3.2 — the core security surface for harness code
execution. Evaluates a `HarnessVersion.code` string inside
`vm.createContext` and returns a callable `HarnessModule` wrapped in
a `HarnessLoadResult`. Every failure mode is normalized into a
`{loaded: false, reason}` variant; this class never throws.
- Fresh `vm.createContext` per `build()`. No state shared across
loads.
- Harness code prepended with `"use strict";\n` — makes
`Object.freeze(ctx)` a real invariant (sloppy-mode writes to frozen
properties silently no-op; strict mode throws TypeError).
- EVERY function invocation (`meta()`, `curate(ctx)`, `query(ctx)`)
runs inside `vm.Script.runInContext` with `{timeout: 5000}`. V8's
wall-clock timeout catches synchronous infinite loops.
- Returned Promises are additionally raced against a JS-level 5s
timer. Covers async never-resolving Promises that slip past the
vm timeout (which only gates sync execution).
- Errors are re-thrown via fresh `Error` construction — any
captured outer-scope properties (e.g. `capturedThis = globalThis`)
are stripped before propagation.
- `meta()` is invoked once at build time; result is cached.
Subsequent `module.meta()` calls return the captured reference
without re-invoking the VM.
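
The strict-mode point above can be seen in isolation with a small sketch. It assumes the frozen context is exposed to the script as a global named `ctx`; that name and the `attemptMutation` helper are illustrative, not taken from the PR.

```javascript
const vm = require('node:vm');

// Hypothetical helper: runs `code` against a frozen ctx exposed as a
// context global, then reports whether the script threw and what the
// host-side ctx looks like afterwards.
function attemptMutation(code) {
  const ctx = Object.freeze({ env: 'prod' });
  const context = vm.createContext({ ctx });
  try {
    new vm.Script(code).runInContext(context, { timeout: 1000 });
    return { threw: false, env: ctx.env };
  } catch (err) {
    // The error object is created inside the vm context, so compare by
    // name rather than with `instanceof` against the host's TypeError.
    return { threw: true, name: err.name, env: ctx.env };
  }
}

// Sloppy mode: the write is silently ignored and nothing is thrown.
const sloppy = attemptMutation('ctx.env = "leaked";');

// Strict mode: the same write throws a TypeError.
const strict = attemptMutation('"use strict";\nctx.env = "leaked";');

console.log(sloppy, strict);
```

In both runs the host-side `ctx.env` is unchanged; the difference is only whether the harness is told about it. That is why the prepend matters for tests that assert on a thrown error.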
Templates are CommonJS strings. The context exposes both
`module.exports` and `exports` pointing at the same object so
templates can write either idiom. After the script runs,
`module.exports` is canonical (handles the case where a template
reassigns it wholesale).
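
A minimal sketch of the `module.exports` / `exports` aliasing described above. The `loadTemplate` helper and the template strings are hypothetical; the real builder also validates and wraps what it loads.

```javascript
const vm = require('node:vm');

// Hypothetical loader, for illustration only.
function loadTemplate(code) {
  const moduleObj = { exports: {} };
  // Both names point at the same object when the script starts.
  const context = vm.createContext({
    module: moduleObj,
    exports: moduleObj.exports,
  });
  new vm.Script(`"use strict";\n${code}`).runInContext(context, { timeout: 1000 });
  // Read module.exports after the run: it reflects a wholesale
  // reassignment, which the original `exports` alias would miss.
  return moduleObj.exports;
}

// Idiom 1: attach properties to `exports`.
const viaExports = loadTemplate('exports.meta = () => ({ name: "a" });');

// Idiom 2: replace `module.exports` entirely.
const viaModule = loadTemplate('module.exports = { meta: () => ({ name: "b" }) };');

console.log(typeof viaExports.meta, typeof viaModule.meta);
```

Treating `module.exports` as canonical mirrors how Node's own CommonJS loader resolves the two idioms.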
4 happy paths (valid template, meta() result, curate() pass-through,
curate-only → no query) + meta caching invariant + 5 error
categorizations + sync vm timeout on meta() + sync vm timeout on
curate() + async Promise.race timeout on curate() +
frozen-context mutation rejection (2 shapes).
Sync infinite-loop tests each wait ~5s for the vm timeout to fire;
test file runtime ≈ 15s. Async Promise.race test runtime ~5s.
Acceptable given Phase 3's security focus.
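
The two timeout layers these tests exercise could be sketched roughly as follows. The helper names (`runSync`, `withTimeout`) and the error messages are assumptions for illustration, and the timeout is shortened from the PR's 5000 ms so the example runs quickly.

```javascript
const vm = require('node:vm');

const TIMEOUT_MS = 200; // illustrative; the PR uses 5000

// Layer 1: the vm timeout interrupts synchronous code that never returns.
function runSync(code, context) {
  try {
    const value = new vm.Script(code).runInContext(context, { timeout: TIMEOUT_MS });
    return { ok: true, value };
  } catch (err) {
    // Re-wrap in a fresh Error so nothing attached to the original
    // error object crosses the boundary.
    const message = String(err && err.message);
    return { ok: false, error: new Error(`invocation failed: ${message}`) };
  }
}

// Layer 2: a host-side timer catches Promises that never settle, which
// the vm timeout cannot see because the script has already returned.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`invocation exceeded ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Synchronous infinite loop: stopped by the vm timeout.
const loopResult = runSync('while (true) {}', vm.createContext({}));
console.log(loopResult.ok, loopResult.error.message);

// Never-resolving Promise: stopped by the race.
withTimeout(new Promise(() => {}), TIMEOUT_MS).catch((err) => console.log(err.message));
```

Clearing the timer in `finally` keeps a fast, successful call from leaving a pending timer behind.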
Dropped task doc scenario #10 ("meta() version !==
HarnessVersion.version").
That cross-invariant is enforced at the store boundary in ENG-2226's
`saveVersion`, not in the module builder. Documented in the PR body
for post-merge task doc tightening.
Claude finished @danhdoan's task in 3m 27s
PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder
Summary
Overall score: 4 / 5
Solid, well-scoped implementation. The class docstring is exemplary: isolation guarantees, CommonJS rationale, and concurrency caveats are all documented. Error normalization is thorough: every failure mode returns a discriminated
One real security gap to address before Task 3.3 wires this class as a consumer:
Issues
🔴 Security: Shallow freeze leaves nested context properties mutable (line 283)
Suggestions
🟡 Perf: The invocation script (
🟡 Test gap: All happy-path, timeout, and freeze tests use
🟡 Test gap: nested property freeze not tested (line 226). Companion tests for the shallow-freeze fix (once landed).
Nitpicks

The shallow freeze is the only blocker; all other items are clean-up. The architecture, isolation rationale, and test structure are all in good shape. Once the deep freeze lands with companion tests, this is ready to merge.
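
To make the shallow-freeze concern concrete, here is a minimal sketch contrasting `Object.freeze` with a recursive freeze. The `deepFreeze` and `tryWrite` helpers are illustrative assumptions and are not the fix that landed in the PR.

```javascript
// Illustrative recursive freeze over nested plain objects and arrays.
// Functions and accessor properties are deliberately not handled here.
function deepFreeze(value, seen = new WeakSet()) {
  if (value === null || typeof value !== 'object' || seen.has(value)) return value;
  seen.add(value); // guard against cycles
  for (const key of Reflect.ownKeys(value)) deepFreeze(value[key], seen);
  return Object.freeze(value);
}

// Attempt a nested write in strict mode and report what happened.
function tryWrite(obj) {
  'use strict';
  try {
    obj.env.mode = 'mutated';
    return 'written';
  } catch (err) {
    return err instanceof TypeError ? 'rejected' : 'other';
  }
}

// Shallow freeze: the top level is locked, the nested object is not.
const shallow = Object.freeze({ env: { mode: 'prod' } });
const shallowOutcome = tryWrite(shallow);

// Deep freeze: the nested write is rejected as well.
const deep = deepFreeze({ env: { mode: 'prod' } });
const deepOutcome = tryWrite(deep);

console.log(shallowOutcome, shallow.env.mode, deepOutcome, deep.env.mode);
```

A test for the nested case would assert on the inner property, since a top-level reassignment is already rejected by the shallow freeze.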
Claude finished @danhdoan's task in 2m 33s
PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder
Summary
Overall score: 4.5 / 5
The revision round addressed all five items from the previous review cleanly. The security blocker (shallow freeze) is fixed, performance is improved (pre-compiled
What was fixed ✅
Remaining issues
Nitpick:
Nitpick: async timeout still double-wraps (
When
The test (line 222–224) validates both
Minor observation

The shallow-freeze fix is correct and the new tests cover the attack vector thoroughly. The two remaining nitpicks are safe to clean up in a follow-on commit or as part of Task 3.3. Ready to merge once the team is happy with the outstanding cosmetics.
Claude finished @danhdoan's task in 3m 4s
PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder
Summary
Overall score: 4.5 / 5
This is the third review pass. All items from the previous 4/5 review have been addressed: deep freeze lands correctly (spreading
What was fixed ✅
Remaining issues
issue (CLAUDE.md violation): CLAUDE.md requires object parameters for functions with more than 3 parameters.
nitpick:
nitpick: async timeout error double-wraps (
When
nitpick:

Summary
- `HarnessModuleBuilder` class with `build(version)` returning `HarnessLoadResult`. Full per-invocation `vm.Script.runInContext` with timeout (Option B per design discussion). Strict-mode injection makes `Object.freeze(ctx)` enforceable. 15 unit tests covering every code path.
- Not in this PR: `SandboxService.loadHarness` wiring (Task 3.3). No attack-vector integration tests (Task 3.5). No template content (Phase 4). No consumer imports this class yet.
Type of change
Scope (select all touched areas)
Linked issues
- (`HarnessContext` + module contract types this PR consumes)
- (`SandboxService.loadHarness` + `harness.*` injection, next in the stream)
Root cause (bug fixes only, otherwise write N/A)
Test plan
- `test/unit/agent/harness/harness-module-builder.test.ts`
- `module.meta()` returns expected `HarnessMeta`, `module.curate(ctx)` passes through correctly, curate-only template has `module.query === undefined`.
- `module.meta()` returns the same object reference across calls (proof the VM function is invoked exactly once at build).
- `meta()` throws, `meta()` returns null, `meta()` returns object with missing required field: each maps to the right `reason` string.
- `meta()` infinite loop caught at build (~5s via vm timeout) → `reason: meta-threw`; `curate()` infinite loop caught per call (~5s via vm timeout) → normalized throw with "failed" in message; `curate()` never-resolving Promise caught per call (~5s via `Promise.race`) → normalized throw with "exceeded" in message.
- `ctx.env = {...}` throws TypeError due to `Object.freeze` + strict mode; `ctx.injected = 'leaked'` same.
User-visible changes
None. No consumer imports `HarnessModuleBuilder` yet; Task 3.3 (`SandboxService.loadHarness`) is the first. `harness.enabled = false` remains the public default.
Evidence
Before this PR, the class didn't exist. After: all 15 tests pass. Full suite: 6683 passing / 0 failing.
Checklist
- (`npm test`): 15 new tests; full suite 6683 passing / 0 failing
- (`npm run lint`): 0 errors, 226 pre-existing warnings
- (`npm run typecheck`): exit=0
- (`npm run build`): exit=0
- `feat: [ENG-2239] ...`
- `features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md` (research repo) drove the scope; the Option B choice + scenario-10 drop flagged below for post-merge task-doc tightening
- `main`: targets `proj/autoharness-v2`, not `main`
Risks and mitigations
Risk: Test-file runtime is ~15s because three tests each wait ~5s for vm timeouts to fire. Under CI load or on slow runners, these could flake if the vm timeout drifts past 5.1s.
- `.timeout(8000)` budget: 3x the vm timeout + headroom. V8's `vm.Script.runInContext` timeout is deterministic under normal load (it's a hard wall-clock limit, not a soft signal). Real flakes would point at CI CPU exhaustion, not test brittleness. If it becomes a problem, the timeout budgets are parametrizable via a constant.
Risk: Strict-mode injection (`"use strict";\n` prepended to harness code) changes the semantic context for all harness authors. A Phase 4 template author writing sloppy-mode code (e.g., implicit globals, `arguments.callee`) would get a runtime error they didn't expect: a ReferenceError for an implicit global assignment, a TypeError for `arguments.callee`.
- `Object.freeze` (e.g., Proxy with a throwing `set` trap). Strict-mode is the pragmatic v1.0 choice.
Risk: `__harnessArg` / `__harnessResult` globals on the shared `scriptContext` are mutated on every invocation. Concurrent invocations on the same module would trample each other's state.
- `SandboxService.executeCode` serializes user code execution per session. A concurrent-invocation race requires changes to the outer sandbox architecture that aren't on the v1.0 roadmap. If it becomes necessary, the fix is a fresh context per invocation (small perf cost) or a per-invocation mutex. Both additive.
Risk: `vm.createContext` in Node is NOT a true sandbox. It creates a new `globalThis` but runs in the same V8 isolate and thread as the host. CPU isolation requires worker threads. A pathological sync infinite loop inside harness code will eat a full CPU for 5s before the vm timeout fires (which does, in fact, fire; tested).
Notes for reviewers
- The strict-mode injection is the load-bearing decision. Without `"use strict";\n` prepended, `Object.freeze(ctx)` is a runtime suggestion: sloppy-mode writes silently fail. Tests 14 + 15 rely on strict-mode TypeError propagation. If the prepend is ever removed, both tests turn into false negatives (the harness seems to mutate successfully, but the stored ctx is actually unchanged, so the assertion still passes, which is very misleading). Keep the strict-mode prepend unless someone proposes an explicit alternative.
- `invokeInVm` uses context globals (`__harnessArg`, `__harnessResult`) to thread arguments through the boundary. An alternative is a fresh context per invocation: cleaner from a state-sharing perspective, but costs ~microseconds per call for `vm.createContext`. Shared-state variant is the v1.0 trade-off; Phase 6 refinement may revisit when refiners generate parallel-invocation harnesses.
- Scope tightening, dropped task doc scenario #10: "meta() returns object with version mismatching HarnessVersion.version". That cross-invariant is enforced in ENG-2226's `HarnessStore.saveVersion` at save time. By the time the module builder runs, the invariant has already been validated. Re-checking it here would be duplicate defense, which is acceptable in other contexts, but it's the store's contract, not the builder's. Suggest tightening Task 3.2's task doc post-merge to drop scenario 10 (same pattern as every prior task doc tightening in this phase).
- Option B full isolation was chosen over Option A minimal after discussion. Option A would have shipped without sync-hang protection in `curate` / `query` wrappers, relying on `Promise.race` only. Option B adds the vm timeout wrap to every invocation, catching sync infinite loops too. Cost: ~40 more LOC + shared-context globals. The security upside (sync CPU-DoS protection) is worth it for v1.0.
- Task 3.3's work starts immediately. Once this merges, `SandboxService.loadHarness(sessionId, projectId, commandType)` can call `harnessStore.getLatest(...)` + `harnessModuleBuilder.build(version)` + inject the result into the session sandbox. Task 3.3 is the next PR in this stream.
Related
- `src/agent/core/domain/harness/types.ts` (ENG-2238)
- `features/autoharness-v2/tasks/phase_3_4_handoff.md` §C4 §C5 §C7
- `HarnessMetaSchema` in the types file
- (`makeVersion`, `makeCtx`): matches the in-memory-store + recorder test patterns from Phase 1/2
- `features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md` (research repo)
- (`SandboxService.loadHarness` + injection)