Skip to content

feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder#497

Merged
danhdoan merged 3 commits intoproj/autoharness-v2from
feat/ENG-2239
Apr 21, 2026
Merged

feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder#497
danhdoan merged 3 commits intoproj/autoharness-v2from
feat/ENG-2239

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

@danhdoan danhdoan commented Apr 21, 2026

Summary

  • Problem: Phase 3 needs a trusted evaluator for harness code — every harness ever loaded (template at first, refiner-generated later) flows through this class. The evaluator must isolate harness execution from the outer sandbox and normalize every failure mode so the sandbox can degrade gracefully.
  • Why it matters: This is the security-critical component of Phase 3. A missed error path here would crash the agent; a missed isolation path would leak state between harness code and user code. Task 3.5 (isolation integration test) depends on this class existing before it can exercise the attack vectors.
  • What changed: New HarnessModuleBuilder class with build(version) returning HarnessLoadResult. Full per-invocation vm.Script.runInContext with timeout (Option B per design discussion). Strict-mode injection makes Object.freeze(ctx) enforceable. 15 unit tests covering every code path.
  • What did NOT change (scope boundary): No SandboxService.loadHarness wiring (Task 3.3). No attack-vector integration tests (Task 3.5). No template content (Phase 4). No consumer imports this class yet.

Type of change

  • New feature
  • Bug fix
  • Refactor (no behavior change)
  • Documentation
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL
  • Agent / Tools
  • LLM Providers
  • Server / Daemon
  • Shared (constants, types, transport events)
  • CLI Commands (oclif)
  • Hub / Connectors
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes ENG-2239
  • Related ENG-2238 (Phase 3 Task 3.1 — HarnessContext + module contract types this PR consumes)
  • Unblocks Phase 3 Task 3.3 (SandboxService.loadHarness + harness.* injection — next in the stream)
  • Unblocks Phase 3 Task 3.5 (isolation integration test — runs against this class)

Root cause (bug fixes only, otherwise write N/A)

  • Root cause: N/A
  • Why this was not caught earlier: N/A

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification only
  • Test file(s): test/unit/agent/harness/harness-module-builder.test.ts
  • Key scenario(s) covered (15 tests total):
    • Happy paths (4): valid template loads, module.meta() returns expected HarnessMeta, module.curate(ctx) passes through correctly, curate-only template has module.query === undefined.
    • Caching invariant (1): module.meta() returns the same object reference across calls (proof the VM function is invoked exactly once at build).
    • Error categorization (5): syntax error, missing meta export, meta() throws, meta() returns null, meta() returns object with missing required field — each maps to the right reason string.
    • Timeout paths (3): meta() infinite loop caught at build (~5s via vm timeout) → reason: meta-threw; curate() infinite loop caught per call (~5s via vm timeout) → normalized throw with "failed" in message; curate() never-resolving Promise caught per call (~5s via Promise.race) → normalized throw with "exceeded" in message.
    • Frozen-context enforcement (2): ctx.env = {...} throws TypeError due to Object.freeze + strict mode; ctx.injected = 'leaked' same.

User-visible changes

None. No consumer imports HarnessModuleBuilder yet — Task 3.3 (SandboxService.loadHarness) is the first. harness.enabled = false remains the public default.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording

Before this PR, the class didn't exist. After: all 15 tests pass. Full suite: 6683 passing / 0 failing.

$ perl -e 'alarm 60; exec @ARGV' npx mocha test/unit/agent/harness/harness-module-builder.test.ts --timeout 10000
  HarnessModuleBuilder
    ✔ valid template returns {loaded: true} with callable meta()
    ✔ module.meta() returns the validated HarnessMeta
    ✔ module.curate(ctx) returns the template result
    ✔ curate-only template has module.query === undefined
    ✔ module.meta() caches — subsequent calls return the same reference
    ✔ syntax error → {loaded: false, reason: syntax}
    ✔ missing meta export → {loaded: false, reason: syntax}
    ✔ meta() throws → {loaded: false, reason: meta-threw}
    ✔ meta() returns null → {loaded: false, reason: meta-invalid}
    ✔ meta() returns object missing commandType → {loaded: false, reason: meta-invalid}
    ✔ meta() infinite loop → {loaded: false, reason: meta-threw} (5001ms)
    ✔ curate() infinite loop → wrapper throws; module stays loaded (5003ms)
    ✔ curate() never-resolving Promise → Promise.race timer throws at 5s (5003ms)
    ✔ curate(ctx) cannot mutate ctx at runtime (Object.freeze boundary)
    ✔ curate(ctx) cannot add properties to ctx at runtime
  15 passing (15s)

Checklist

  • Tests added or updated and passing (npm test) — 15 new tests; full suite 6683 passing / 0 failing
  • Lint passes (npm run lint) — 0 errors, 226 pre-existing warnings
  • Type check passes (npm run typecheck) — exit=0
  • Build succeeds (npm run build) — exit=0
  • Commits follow Conventional Commits format — feat: [ENG-2239] ...
  • Documentation updated (if applicable) — task doc at features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md (research repo) drove the scope; the Option B choice + scenario-10 drop flagged below for post-merge task-doc tightening
  • No breaking changes (or clearly documented above) — purely additive
  • Branch is up to date with main — targets proj/autoharness-v2, not main

Risks and mitigations

  • Risk: Test-file runtime is ~15s because three tests each wait ~5s for vm timeouts to fire. Under CI load or on slow runners, these could flake if the vm timeout drifts past 5.1s.

    • Mitigation: Each test has a .timeout(8000) budget — 3x the vm timeout + headroom. V8's vm.Script.runInContext timeout is deterministic under normal load (it's a hard wall-clock limit, not a soft signal). Real flakes would point at CI CPU exhaustion, not test brittleness. If it becomes a problem, the timeout budgets are parametrizable via a constant.
  • Risk: Strict-mode injection ("use strict";\n prepended to harness code) changes the semantic context for all harness authors. A Phase 4 template author writing sloppy-mode code (e.g., implicit globals, arguments.callee) would get a syntax error they didn't expect.

    • Mitigation: Phase 4's templates are tiny pass-through strings authored carefully; they won't hit sloppy-only features. For Phase 6 refiner-generated code, the LLM's training data is overwhelmingly strict-mode JS. If a real case emerges where a harness must be sloppy, the fix is dropping the strict prepend and finding another way to enforce Object.freeze (e.g., Proxy with a throwing set trap). Strict-mode is the pragmatic v1.0 choice.
  • Risk: __harnessArg / __harnessResult globals on the shared scriptContext are mutated on every invocation. Concurrent invocations on the same module would trample each other's state.

    • Mitigation: Phase 2 + Phase 3 architecture is single-invocation-per-session — SandboxService.executeCode serializes user code execution per session. A concurrent-invocation race requires changes to the outer sandbox architecture that aren't on the v1.0 roadmap. If it becomes necessary, the fix is a fresh context per invocation (small perf cost) or a per-invocation mutex. Both additive.
  • Risk: vm.createContext in Node is NOT a true sandbox — it creates a new globalThis but shares the V8 realm. CPU isolation requires worker threads. A pathological sync infinite loop inside harness code will eat a full CPU for 5s before the vm timeout fires (which does, in fact, fire — tested).

    • Mitigation: Documented in the class docstring. Worker-thread isolation is a post-v1.0 concern. The v1.0 threat model assumes refiner-generated harness code from trusted LLM sources, not adversarial user input. If the threat model expands (e.g., user-authored harnesses shipped in plugins), worker threads become a hard requirement.

Notes for reviewers

The strict-mode injection is the load-bearing decision. Without "use strict";\n prepended, Object.freeze(ctx) is a runtime suggestion — sloppy-mode writes silently fail. Tests 14 + 15 rely on strict-mode TypeError propagation. If the prepend is ever removed, both tests turn into false negatives (the harness seems to mutate successfully, but the stored ctx is actually unchanged, so the assertion still passes — very misleading). Keep the strict-mode prepend unless someone proposes an explicit alternative.

invokeInVm uses context globals (__harnessArg, __harnessResult) to thread arguments through the boundary. An alternative is a fresh context per invocation — cleaner from a state-sharing perspective, but costs ~microseconds per call for vm.createContext. Shared-state variant is the v1.0 trade-off; Phase 6 refinement may revisit when refiners generate parallel-invocation harnesses.

Scope tightening — dropped task doc scenario #10: "meta() returns object with version mismatching HarnessVersion.version". That cross-invariant is enforced in ENG-2226's HarnessStore.saveVersion at save time. By the time the module builder runs, the invariant has already been validated. Re-checking it here would be duplicate defense — acceptable in other contexts, but it's the store's contract, not the builder's. Suggest tightening Task 3.2's task doc post-merge to drop scenario 10 (same pattern as every prior task doc tightening in this phase).

Option B full isolation was chosen over Option A minimal after discussion. Option A would have shipped without sync-hang protection in curate/query wrappers, relying on Promise.race only. Option B adds the vm timeout wrap to every invocation, catching sync infinite loops too. Cost: ~40 more LOC + shared-context globals. The security upside (sync CPU-DoS protection) is worth it for v1.0.

Task 3.3's work starts immediately. Once this merges, SandboxService.loadHarness(sessionId, projectId, commandType) can call harnessStore.getLatest(...) + harnessModuleBuilder.build(version) + inject the result into the session sandbox. Task 3.3 is the next PR in this stream.

Related

  • Types this class consumes: src/agent/core/domain/harness/types.ts (ENG-2238)
  • Handoff contract (evaluation semantics): features/autoharness-v2/tasks/phase_3_4_handoff.md §C4 §C5 §C7
  • Entity schemas validated against: HarnessMetaSchema in the types file
  • Test helper pattern (makeVersion, makeCtx): matches the in-memory-store + recorder test patterns from Phase 1/2
  • Task doc: features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md (research repo)
  • Next task: Phase 3 Task 3.3 (SandboxService.loadHarness + injection)
  • Depends on (merged): Phase 3 Task 3.1 (types — ENG-2238)
  • Consumers (not yet built): Phase 3 Task 3.3, Phase 5 (mode selector), Phase 6 (refinement evaluator)

Phase 3 Task 3.2 — the core security surface for harness code
execution. Evaluates a `HarnessVersion.code` string inside
`vm.createContext` and returns a callable `HarnessModule` wrapped in
a `HarnessLoadResult`. Every failure mode normalized into a
`{loaded: false, reason}` variant; this class never throws.

- Fresh `vm.createContext` per `build()`. No state shared across
  loads.
- Harness code prepended with `"use strict";\n` — makes
  `Object.freeze(ctx)` a real invariant (sloppy-mode writes to frozen
  properties silently no-op; strict mode throws TypeError).
- EVERY function invocation (`meta()`, `curate(ctx)`, `query(ctx)`)
  runs inside `vm.Script.runInContext` with `{timeout: 5000}`. V8's
  wall-clock timeout catches synchronous infinite loops.
- Returned Promises are additionally raced against a JS-level 5s
  timer. Covers async never-resolving Promises that slip past the
  vm timeout (which only gates sync execution).
- Errors are re-thrown via fresh `Error` construction — any
  captured outer-scope properties (e.g. `capturedThis = globalThis`)
  are stripped before propagation.
- `meta()` is invoked once at build time; result is cached.
  Subsequent `module.meta()` calls return the captured reference
  without re-invoking the VM.

Templates are CommonJS strings. The context exposes both
`module.exports` and `exports` pointing at the same object so
templates can write either idiom. After the script runs,
`module.exports` is canonical (handles the case where a template
reassigns it wholesale).

4 happy paths (valid template, meta() result, curate() pass-through,
curate-only → no query) + meta caching invariant + 5 error
categorizations + sync vm timeout on meta() + sync vm timeout on
curate() + async Promise.race timeout on curate() +
frozen-context mutation rejection (2 shapes).

Sync infinite-loop tests each wait ~5s for the vm timeout to fire;
test file runtime ≈ 15s. Async Promise.race test runtime ~5s.
Acceptable given Phase 3's security focus.

Dropped task doc scenario #10 ("meta() version !==
HarnessVersion.version").
That cross-invariant is enforced at the store boundary in ENG-2226's
`saveVersion`, not in the module builder. Documented in the PR body
for post-merge task doc tightening.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Claude finished @danhdoan's task in 3m 27s —— View job


PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

  • Read and analyze changed files
  • Review implementation correctness and security
  • Review test coverage and quality
  • Post inline comments
  • Write summary with overall score

Summary

Overall score: 4 / 5

Solid, well-scoped implementation. The class docstring is exemplary — isolation guarantees, CommonJS rationale, and concurrency caveats are all documented. Error normalization is thorough: every failure mode returns a discriminated {loaded: false, reason} without throwing. The vm-timeout + Promise.race two-layer hang detection is the right architecture for Option B. Test coverage is strong at 15 tests and every documented code path is exercised.

One real security gap to address before Task 3.3 wires this class as a consumer:


Issues

🔴 Security — Shallow freeze leaves nested context properties mutable (line 283)

Object.freeze(ctx) only prevents top-level reassignment (ctx.env = ...). A harness can still mutate ctx.env.commandType, ctx.env.workingDirectory, or any other nested property because ctx.env and ctx.tools are not frozen. Tests 14+15 confirm the top-level boundary works, but do not cover nested mutation. The fix is to also freeze ctx.env and ctx.tools at the call boundary, plus two additional tests confirming nested mutation throws.


Suggestions

🟡 Perf — vm.Script recompiled on every invokeInVm call (line 253)

The invocation script (__harnessResult = module.exports.curate(__harnessArg)) is a constant string for a given (name, versionId) pair but is re-compiled into a vm.Script on every call. Pre-compile once in buildModule and reuse.

🟡 Test gap — query export path untested (line 93)

All happy-path, timeout, and freeze tests use curate exclusively. A single smoke test confirming module.query is present and callable when exported would cover the query branch before Phase 5 consumes it.

🟡 Test gap — nested property freeze not tested (line 226)

Companion tests for the shallow-freeze fix (once landed).


Nitpicks

  • __harnessFn in ScriptContext is dead code (line 71) — declared but never used; remove it.
  • curateExport/queryExport parameters are string literals only used as undefined checks (lines 184–192) — replace with boolean.
  • Async timeout error double-wraps (line 313) — harness curate() failed: harness curate() exceeded 5000ms is redundant; minor cleanup.

The shallow freeze is the only blocker — all other items are clean-up. The architecture, isolation rationale, and test structure are all in good shape. Once the deep freeze lands with companion tests, this is ready to merge.

Comment thread src/agent/infra/harness/harness-module-builder.ts
Comment thread src/agent/infra/harness/harness-module-builder.ts Outdated
Comment thread src/agent/infra/harness/harness-module-builder.ts Outdated
Comment thread src/agent/infra/harness/harness-module-builder.ts Outdated
Comment thread src/agent/infra/harness/harness-module-builder.ts
Comment thread test/unit/agent/harness/harness-module-builder.test.ts
Comment thread test/unit/agent/harness/harness-module-builder.test.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

  • Read and analyze changed files
  • Review implementation correctness and security
  • Review test coverage and quality
  • Post inline comments
  • Write summary with overall score

View job run

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Claude finished @danhdoan's task in 2m 33s —— View job


PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

  • Read and analyze changed files
  • Review implementation correctness and security
  • Review test coverage and quality
  • Post inline comments
  • Write summary with overall score

Summary

Overall score: 4.5 / 5

The revision round addressed all five items from the previous review cleanly. The security blocker (shallow freeze) is fixed, performance is improved (pre-compiled vm.Script), the query export path is now tested, and the buildModule signature correctly uses an object parameter per the CLAUDE.md convention. Test count grew from 15 to 18. The two remaining items below are cosmetic and non-blocking.


What was fixed ✅

Item Resolution
🔴 Shallow Object.freeze ctx.env and ctx.tools are now spread + frozen individually at the invocation boundary (lines 307–311)
🟡 vm.Script recompiled every call New compileInvokeScript() method; scripts compiled once in build() and reused across invocations
🟡 query export path untested New test: 'template exporting both curate+query produces both wrappers…' (line 99–122)
🟡 Nested freeze not tested Two new tests: ctx.env.commandType mutation (line 253) and ctx.tools.curate replacement (line 278)
🟡 String-literal export params buildModule now takes {curateInvokeScript: vm.Script | undefined, queryInvokeScript: vm.Script | undefined, ...} — intent is unambiguous

Remaining issues

Nitpick — __harnessFn is still dead code (src/agent/infra/harness/harness-module-builder.ts:71)

ScriptContext.#__harnessFn (interface line 71, context init line 114) is still declared and initialized but never read or written after the refactor. invokeInVm now threads everything through __harnessArg/__harnessResult only. Remove the field and its initializer to keep the ScriptContext interface honest.

interface ScriptContext {
  __harnessArg?: unknown
  __harnessResult?: unknown
  exports: Record<string, unknown>
  module: {exports: Record<string, unknown>}
}

Nitpick — async timeout still double-wraps (src/agent/infra/harness/harness-module-builder.ts:344)

When Promise.race fires the timer, the message chain is:
harness curate() failed: harness curate() exceeded 5000ms

The test (line 222–224) validates both failed and exceeded are present — which confirms the double-wrap is baked in and tested. Not a bug, but the redundancy adds noise to callers reading error logs. Either re-throw the timer error directly (its message is already self-describing), or check if the caught error message already starts with harness and skip the outer wrap. Low priority; fine to defer to a later polish pass.


Minor observation

as HarnessModule['curate'] cast (src/agent/infra/harness/harness-module-builder.ts:230)

wrapInvocation returns (ctx: HarnessContext) => Promise<unknown>, which is cast to HarnessModule['curate'] (typed as Promise<CurateResult>). CLAUDE.md flags as assertions as something to avoid. The VM boundary makes unknown unavoidable here — the cast is justified — but if CurateResult grows a Zod schema in a later phase, the return value could be narrowed with safeParse instead of the cast, eliminating the type assertion entirely. Not blocking for v1.0; worth noting for Task 3.3's handoff.


The shallow-freeze fix is correct and the new tests cover the attack vector thoroughly. The two remaining nitpicks are safe to clean up in a follow-on commit or as part of Task 3.3. Ready to merge once the team is happy with the outstanding cosmetics. Fix __harnessFn dead code →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Claude finished @danhdoan's task in 3m 4s —— View job


PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

  • Read and analyze changed files
  • Review implementation correctness and security
  • Review test coverage and quality
  • Post inline comments
  • Write summary with overall score

Summary

Overall score: 4.5 / 5

This is the third review pass. All items from the previous 4/5 review have been addressed: deep freeze lands correctly (spreading ctx.env and ctx.tools before freezing so caller originals are untouched), vm.Script is pre-compiled via compileInvokeScript, the query export path is tested, and buildModule correctly uses an object parameter. Test count is now 18 and covers every documented code path.


What was fixed ✅

Item Resolution
🔴 Shallow Object.freeze ctx.env and ctx.tools spread + frozen independently (lines 307–311)
🟡 vm.Script recompiled per invocation compileInvokeScript() pre-compiles once in build()
🟡 query export path untested New test: both curate and query wrappers callable (line 99)
🟡 Nested freeze not tested Two new tests: ctx.env.commandType and ctx.tools.curate mutations throw
🟡 String-literal export params buildModule takes {curateInvokeScript, queryInvokeScript, ...}

Remaining issues

issue (CLAUDE.md violation) — wrapInvocation has 4 positional parameters (src/agent/infra/harness/harness-module-builder.ts:292)

CLAUDE.md requires object parameters for functions with more than 3 parameters. buildModule was updated to object-param style in the previous pass, but wrapInvocation (4 params: scriptContext, invokeScript, name, versionId) was missed. Should use the same object-param pattern. Fix this →

nitpick — __harnessFn still dead code (src/agent/infra/harness/harness-module-builder.ts:71)

ScriptContext.__harnessFn is declared (line 71), initialized in vm.createContext (line 113), and referenced in the JSDoc comment (line 67) but never written or read anywhere in the implementation. Three places to clean up: the interface field, the context initializer, and the JSDoc mention. Fix this →

nitpick — async timeout error double-wraps (src/agent/infra/harness/harness-module-builder.ts:344)

When Promise.race fires, the message is already harness curate() exceeded 5000ms; the catch block then wraps it as harness curate() failed: harness curate() exceeded 5000ms. The test at line 222–224 validates both strings, so fixing it requires updating the test too. Fine to defer to a polish pass.

nitpick — as HarnessModule[…] type assertions (src/agent/infra/harness/harness-module-builder.ts:230, 238)

wrapInvocation returns Promise<unknown> and is cast to HarnessModule['curate'] / HarnessModule['query']. CLAUDE.md discourages as assertions. The VM boundary justifies the cast for v1.0 — if CurateResult gets a Zod schema in a later phase, safeParse could eliminate the assertion entirely. Non-blocking.


The wrapInvocation 4-parameter issue is the only item that wasn't flagged before and violates a repo standard — worth fixing in the same commit as the __harnessFn cleanup rather than a separate PR. Both are mechanical refactors with no behavior change. The async double-wrap and type assertions are low-priority defers.

Comment thread src/agent/infra/harness/harness-module-builder.ts
Comment thread src/agent/infra/harness/harness-module-builder.ts
Comment thread src/agent/infra/harness/harness-module-builder.ts
Comment thread src/agent/infra/harness/harness-module-builder.ts
@danhdoan danhdoan merged commit ad6cc57 into proj/autoharness-v2 Apr 21, 2026
6 checks passed
@danhdoan danhdoan deleted the feat/ENG-2239 branch April 22, 2026 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants