feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder by danhdoan · Pull Request #497 · campfirein/byterover-cli

danhdoan · 2026-04-21T10:35:01Z

Summary

Problem: Phase 3 needs a trusted evaluator for harness code — every harness ever loaded (template at first, refiner-generated later) flows through this class. The evaluator must isolate harness execution from the outer sandbox and normalize every failure mode so the sandbox can degrade gracefully.
Why it matters: This is the security-critical component of Phase 3. A missed error path here would crash the agent; a missed isolation path would leak state between harness code and user code. Task 3.5 (isolation integration test) depends on this class existing before it can exercise the attack vectors.
What changed: New HarnessModuleBuilder class with build(version) returning HarnessLoadResult. Full per-invocation vm.Script.runInContext with timeout (Option B per design discussion). Strict-mode injection makes Object.freeze(ctx) enforceable. 15 unit tests covering every code path.
What did NOT change (scope boundary): No SandboxService.loadHarness wiring (Task 3.3). No attack-vector integration tests (Task 3.5). No template content (Phase 4). No consumer imports this class yet.

Type of change

Scope (select all touched areas)

Linked issues

Closes ENG-2239
Related ENG-2238 (Phase 3 Task 3.1 — HarnessContext + module contract types this PR consumes)
Unblocks Phase 3 Task 3.3 (SandboxService.loadHarness + harness.* injection — next in the stream)
Unblocks Phase 3 Task 3.5 (isolation integration test — runs against this class)

Root cause (bug fixes only, otherwise write `N/A`)

Root cause: N/A
Why this was not caught earlier: N/A

Test plan

Coverage added:
- Unit test
- Integration test
- Manual verification only
Test file(s): test/unit/agent/harness/harness-module-builder.test.ts
Key scenario(s) covered (15 tests total):
- Happy paths (4): valid template loads, module.meta() returns expected HarnessMeta, module.curate(ctx) passes through correctly, curate-only template has module.query === undefined.
- Caching invariant (1): module.meta() returns the same object reference across calls (proof the VM function is invoked exactly once at build).
- Error categorization (5): syntax error, missing meta export, meta() throws, meta() returns null, meta() returns object with missing required field — each maps to the right reason string.
- Timeout paths (3): meta() infinite loop caught at build (~5s via vm timeout) → reason: meta-threw; curate() infinite loop caught per call (~5s via vm timeout) → normalized throw with "failed" in message; curate() never-resolving Promise caught per call (~5s via Promise.race) → normalized throw with "exceeded" in message.
- Frozen-context enforcement (2): ctx.env = {...} throws TypeError due to Object.freeze + strict mode; ctx.injected = 'leaked' same.

User-visible changes

None. No consumer imports HarnessModuleBuilder yet — Task 3.3 (SandboxService.loadHarness) is the first. harness.enabled = false remains the public default.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording

Before this PR, the class didn't exist. After: all 15 tests pass. Full suite: 6683 passing / 0 failing.

$ perl -e 'alarm 60; exec @ARGV' npx mocha test/unit/agent/harness/harness-module-builder.test.ts --timeout 10000
  HarnessModuleBuilder
    ✔ valid template returns {loaded: true} with callable meta()
    ✔ module.meta() returns the validated HarnessMeta
    ✔ module.curate(ctx) returns the template result
    ✔ curate-only template has module.query === undefined
    ✔ module.meta() caches — subsequent calls return the same reference
    ✔ syntax error → {loaded: false, reason: syntax}
    ✔ missing meta export → {loaded: false, reason: syntax}
    ✔ meta() throws → {loaded: false, reason: meta-threw}
    ✔ meta() returns null → {loaded: false, reason: meta-invalid}
    ✔ meta() returns object missing commandType → {loaded: false, reason: meta-invalid}
    ✔ meta() infinite loop → {loaded: false, reason: meta-threw} (5001ms)
    ✔ curate() infinite loop → wrapper throws; module stays loaded (5003ms)
    ✔ curate() never-resolving Promise → Promise.race timer throws at 5s (5003ms)
    ✔ curate(ctx) cannot mutate ctx at runtime (Object.freeze boundary)
    ✔ curate(ctx) cannot add properties to ctx at runtime
  15 passing (15s)

Checklist

Tests added or updated and passing (npm test) — 15 new tests; full suite 6683 passing / 0 failing
Lint passes (npm run lint) — 0 errors, 226 pre-existing warnings
Type check passes (npm run typecheck) — exit=0
Build succeeds (npm run build) — exit=0
Commits follow Conventional Commits format — feat: [ENG-2239] ...
Documentation updated (if applicable) — task doc at features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md (research repo) drove the scope; the Option B choice + scenario-10 drop flagged below for post-merge task-doc tightening
No breaking changes (or clearly documented above) — purely additive
Branch is up to date with main — targets proj/autoharness-v2, not main

Risks and mitigations

Risk: Test-file runtime is ~15s because three tests each wait ~5s for vm timeouts to fire. Under CI load or on slow runners, these could flake if the vm timeout drifts past 5.1s.
- Mitigation: Each test has a .timeout(8000) budget — 3x the vm timeout + headroom. V8's vm.Script.runInContext timeout is deterministic under normal load (it's a hard wall-clock limit, not a soft signal). Real flakes would point at CI CPU exhaustion, not test brittleness. If it becomes a problem, the timeout budgets are parametrizable via a constant.
Risk: Strict-mode injection ("use strict";\n prepended to harness code) changes the semantic context for all harness authors. A Phase 4 template author writing sloppy-mode code (e.g., implicit globals, arguments.callee) would get a syntax error they didn't expect.
- Mitigation: Phase 4's templates are tiny pass-through strings authored carefully; they won't hit sloppy-only features. For Phase 6 refiner-generated code, the LLM's training data is overwhelmingly strict-mode JS. If a real case emerges where a harness must be sloppy, the fix is dropping the strict prepend and finding another way to enforce Object.freeze (e.g., Proxy with a throwing set trap). Strict-mode is the pragmatic v1.0 choice.
Risk: __harnessArg / __harnessResult globals on the shared scriptContext are mutated on every invocation. Concurrent invocations on the same module would trample each other's state.
- Mitigation: Phase 2 + Phase 3 architecture is single-invocation-per-session — SandboxService.executeCode serializes user code execution per session. A concurrent-invocation race requires changes to the outer sandbox architecture that aren't on the v1.0 roadmap. If it becomes necessary, the fix is a fresh context per invocation (small perf cost) or a per-invocation mutex. Both additive.
Risk: vm.createContext in Node is NOT a true sandbox — it creates a new globalThis but shares the V8 realm. CPU isolation requires worker threads. A pathological sync infinite loop inside harness code will eat a full CPU for 5s before the vm timeout fires (which does, in fact, fire — tested).
- Mitigation: Documented in the class docstring. Worker-thread isolation is a post-v1.0 concern. The v1.0 threat model assumes refiner-generated harness code from trusted LLM sources, not adversarial user input. If the threat model expands (e.g., user-authored harnesses shipped in plugins), worker threads become a hard requirement.

Notes for reviewers

The strict-mode injection is the load-bearing decision. Without "use strict";\n prepended, Object.freeze(ctx) is a runtime suggestion — sloppy-mode writes silently fail. Tests 14 + 15 rely on strict-mode TypeError propagation. If the prepend is ever removed, both tests turn into false negatives (the harness seems to mutate successfully, but the stored ctx is actually unchanged, so the assertion still passes — very misleading). Keep the strict-mode prepend unless someone proposes an explicit alternative.

invokeInVm uses context globals (__harnessArg, __harnessResult) to thread arguments through the boundary. An alternative is a fresh context per invocation — cleaner from a state-sharing perspective, but costs ~microseconds per call for vm.createContext. Shared-state variant is the v1.0 trade-off; Phase 6 refinement may revisit when refiners generate parallel-invocation harnesses.

Scope tightening — dropped task doc scenario #10: "meta() returns object with version mismatching HarnessVersion.version". That cross-invariant is enforced in ENG-2226's HarnessStore.saveVersion at save time. By the time the module builder runs, the invariant has already been validated. Re-checking it here would be duplicate defense — acceptable in other contexts, but it's the store's contract, not the builder's. Suggest tightening Task 3.2's task doc post-merge to drop scenario 10 (same pattern as every prior task doc tightening in this phase).

Option B full isolation was chosen over Option A minimal after discussion. Option A would have shipped without sync-hang protection in curate/query wrappers, relying on Promise.race only. Option B adds the vm timeout wrap to every invocation, catching sync infinite loops too. Cost: ~40 more LOC + shared-context globals. The security upside (sync CPU-DoS protection) is worth it for v1.0.

Task 3.3's work starts immediately. Once this merges, SandboxService.loadHarness(sessionId, projectId, commandType) can call harnessStore.getLatest(...) + harnessModuleBuilder.build(version) + inject the result into the session sandbox. Task 3.3 is the next PR in this stream.

Types this class consumes: src/agent/core/domain/harness/types.ts (ENG-2238)
Handoff contract (evaluation semantics): features/autoharness-v2/tasks/phase_3_4_handoff.md §C4 §C5 §C7
Entity schemas validated against: HarnessMetaSchema in the types file
Test helper pattern (makeVersion, makeCtx): matches the in-memory-store + recorder test patterns from Phase 1/2
Task doc: features/autoharness-v2/tasks/phase_3/task_02-harness-module-builder.md (research repo)
Next task: Phase 3 Task 3.3 (SandboxService.loadHarness + injection)
Depends on (merged): Phase 3 Task 3.1 (types — ENG-2238)
Consumers (not yet built): Phase 3 Task 3.3, Phase 5 (mode selector), Phase 6 (refinement evaluator)

Phase 3 Task 3.2 — the core security surface for harness code execution. Evaluates a `HarnessVersion.code` string inside `vm.createContext` and returns a callable `HarnessModule` wrapped in a `HarnessLoadResult`. Every failure mode normalized into a `{loaded: false, reason}` variant; this class never throws. - Fresh `vm.createContext` per `build()`. No state shared across loads. - Harness code prepended with `"use strict";\n` — makes `Object.freeze(ctx)` a real invariant (sloppy-mode writes to frozen properties silently no-op; strict mode throws TypeError). - EVERY function invocation (`meta()`, `curate(ctx)`, `query(ctx)`) runs inside `vm.Script.runInContext` with `{timeout: 5000}`. V8's wall-clock timeout catches synchronous infinite loops. - Returned Promises are additionally raced against a JS-level 5s timer. Covers async never-resolving Promises that slip past the vm timeout (which only gates sync execution). - Errors are re-thrown via fresh `Error` construction — any captured outer-scope properties (e.g. `capturedThis = globalThis`) are stripped before propagation. - `meta()` is invoked once at build time; result is cached. Subsequent `module.meta()` calls return the captured reference without re-invoking the VM. Templates are CommonJS strings. The context exposes both `module.exports` and `exports` pointing at the same object so templates can write either idiom. After the script runs, `module.exports` is canonical (handles the case where a template reassigns it wholesale). 4 happy paths (valid template, meta() result, curate() pass-through, curate-only → no query) + meta caching invariant + 5 error categorizations + sync vm timeout on meta() + sync vm timeout on curate() + async Promise.race timeout on curate() + frozen-context mutation rejection (2 shapes). Sync infinite-loop tests each wait ~5s for the vm timeout to fire; test file runtime ≈ 15s. Async Promise.race test runtime ~5s. Acceptable given Phase 3's security focus. Dropped task doc scenario #10 ("meta() version !== HarnessVersion.version"). That cross-invariant is enforced at the store boundary in ENG-2226's `saveVersion`, not in the module builder. Documented in the PR body for post-merge task doc tightening.

github-actions · 2026-04-21T10:35:20Z

Claude finished @danhdoan's task in 3m 27s —— View job

PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

Read and analyze changed files
Review implementation correctness and security
Review test coverage and quality
Post inline comments
Write summary with overall score

Summary

Overall score: 4 / 5

Solid, well-scoped implementation. The class docstring is exemplary — isolation guarantees, CommonJS rationale, and concurrency caveats are all documented. Error normalization is thorough: every failure mode returns a discriminated {loaded: false, reason} without throwing. The vm-timeout + Promise.race two-layer hang detection is the right architecture for Option B. Test coverage is strong at 15 tests and every documented code path is exercised.

One real security gap to address before Task 3.3 wires this class as a consumer:

Issues

🔴 Security — Shallow freeze leaves nested context properties mutable (line 283)

Object.freeze(ctx) only prevents top-level reassignment (ctx.env = ...). A harness can still mutate ctx.env.commandType, ctx.env.workingDirectory, or any other nested property because ctx.env and ctx.tools are not frozen. Tests 14+15 confirm the top-level boundary works, but do not cover nested mutation. The fix is to also freeze ctx.env and ctx.tools at the call boundary, plus two additional tests confirming nested mutation throws.

Suggestions

🟡 Perf — vm.Script recompiled on every invokeInVm call (line 253)

The invocation script (__harnessResult = module.exports.curate(__harnessArg)) is a constant string for a given (name, versionId) pair but is re-compiled into a vm.Script on every call. Pre-compile once in buildModule and reuse.

🟡 Test gap — query export path untested (line 93)

All happy-path, timeout, and freeze tests use curate exclusively. A single smoke test confirming module.query is present and callable when exported would cover the query branch before Phase 5 consumes it.

🟡 Test gap — nested property freeze not tested (line 226)

Companion tests for the shallow-freeze fix (once landed).

Nitpicks

__harnessFn in ScriptContext is dead code (line 71) — declared but never used; remove it.
curateExport/queryExport parameters are string literals only used as undefined checks (lines 184–192) — replace with boolean.
Async timeout error double-wraps (line 313) — harness curate() failed: harness curate() exceeded 5000ms is redundant; minor cleanup.

The shallow freeze is the only blocker — all other items are clean-up. The architecture, isolation rationale, and test structure are all in good shape. Once the deep freeze lands with companion tests, this is ready to merge.

github-actions · 2026-04-21T10:51:43Z

PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

Read and analyze changed files
Review implementation correctness and security
Review test coverage and quality
Post inline comments
Write summary with overall score

View job run

github-actions · 2026-04-21T10:53:17Z

Claude finished @danhdoan's task in 2m 33s —— View job

PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

Read and analyze changed files
Review implementation correctness and security
Review test coverage and quality
Post inline comments
Write summary with overall score

Summary

Overall score: 4.5 / 5

The revision round addressed all five items from the previous review cleanly. The security blocker (shallow freeze) is fixed, performance is improved (pre-compiled vm.Script), the query export path is now tested, and the buildModule signature correctly uses an object parameter per the CLAUDE.md convention. Test count grew from 15 to 18. The two remaining items below are cosmetic and non-blocking.

What was fixed ✅

Item	Resolution
🔴 Shallow `Object.freeze`	`ctx.env` and `ctx.tools` are now spread + frozen individually at the invocation boundary (lines 307–311)
🟡 `vm.Script` recompiled every call	New `compileInvokeScript()` method; scripts compiled once in `build()` and reused across invocations
🟡 `query` export path untested	New test: `'template exporting both curate+query produces both wrappers…'` (line 99–122)
🟡 Nested freeze not tested	Two new tests: `ctx.env.commandType` mutation (line 253) and `ctx.tools.curate` replacement (line 278)
🟡 String-literal export params	`buildModule` now takes `{curateInvokeScript: vm.Script \| undefined, queryInvokeScript: vm.Script \| undefined, ...}` — intent is unambiguous

Remaining issues

Nitpick — __harnessFn is still dead code (src/agent/infra/harness/harness-module-builder.ts:71)

ScriptContext.#__harnessFn (interface line 71, context init line 114) is still declared and initialized but never read or written after the refactor. invokeInVm now threads everything through __harnessArg/__harnessResult only. Remove the field and its initializer to keep the ScriptContext interface honest.

interface ScriptContext {
  __harnessArg?: unknown
  __harnessResult?: unknown
  exports: Record<string, unknown>
  module: {exports: Record<string, unknown>}
}

Nitpick — async timeout still double-wraps (src/agent/infra/harness/harness-module-builder.ts:344)

When Promise.race fires the timer, the message chain is:
harness curate() failed: harness curate() exceeded 5000ms

The test (line 222–224) validates both failed and exceeded are present — which confirms the double-wrap is baked in and tested. Not a bug, but the redundancy adds noise to callers reading error logs. Either re-throw the timer error directly (its message is already self-describing), or check if the caught error message already starts with harness and skip the outer wrap. Low priority; fine to defer to a later polish pass.

Minor observation

as HarnessModule['curate'] cast (src/agent/infra/harness/harness-module-builder.ts:230)

wrapInvocation returns (ctx: HarnessContext) => Promise<unknown>, which is cast to HarnessModule['curate'] (typed as Promise<CurateResult>). CLAUDE.md flags as assertions as something to avoid. The VM boundary makes unknown unavoidable here — the cast is justified — but if CurateResult grows a Zod schema in a later phase, the return value could be narrowed with safeParse instead of the cast, eliminating the type assertion entirely. Not blocking for v1.0; worth noting for Task 3.3's handoff.

The shallow-freeze fix is correct and the new tests cover the attack vector thoroughly. The two remaining nitpicks are safe to clean up in a follow-on commit or as part of Task 3.3. Ready to merge once the team is happy with the outstanding cosmetics. Fix __harnessFn dead code →

github-actions · 2026-04-21T11:31:53Z

Claude finished @danhdoan's task in 3m 4s —— View job

PR Review: feat: [ENG-2239] AutoHarness V2 HarnessModuleBuilder

Read and analyze changed files
Review implementation correctness and security
Review test coverage and quality
Post inline comments
Write summary with overall score

Summary

Overall score: 4.5 / 5

This is the third review pass. All items from the previous 4/5 review have been addressed: deep freeze lands correctly (spreading ctx.env and ctx.tools before freezing so caller originals are untouched), vm.Script is pre-compiled via compileInvokeScript, the query export path is tested, and buildModule correctly uses an object parameter. Test count is now 18 and covers every documented code path.

What was fixed ✅

Item	Resolution
🔴 Shallow `Object.freeze`	`ctx.env` and `ctx.tools` spread + frozen independently (lines 307–311)
🟡 `vm.Script` recompiled per invocation	`compileInvokeScript()` pre-compiles once in `build()`
🟡 `query` export path untested	New test: both `curate` and `query` wrappers callable (line 99)
🟡 Nested freeze not tested	Two new tests: `ctx.env.commandType` and `ctx.tools.curate` mutations throw
🟡 String-literal export params	`buildModule` takes `{curateInvokeScript, queryInvokeScript, ...}`

Remaining issues

issue (CLAUDE.md violation) — wrapInvocation has 4 positional parameters (src/agent/infra/harness/harness-module-builder.ts:292)

CLAUDE.md requires object parameters for functions with more than 3 parameters. buildModule was updated to object-param style in the previous pass, but wrapInvocation (4 params: scriptContext, invokeScript, name, versionId) was missed. Should use the same object-param pattern. Fix this →

nitpick — __harnessFn still dead code (src/agent/infra/harness/harness-module-builder.ts:71)

ScriptContext.__harnessFn is declared (line 71), initialized in vm.createContext (line 113), and referenced in the JSDoc comment (line 67) but never written or read anywhere in the implementation. Three places to clean up: the interface field, the context initializer, and the JSDoc mention. Fix this →

nitpick — async timeout error double-wraps (src/agent/infra/harness/harness-module-builder.ts:344)

When Promise.race fires, the message is already harness curate() exceeded 5000ms; the catch block then wraps it as harness curate() failed: harness curate() exceeded 5000ms. The test at line 222–224 validates both strings, so fixing it requires updating the test too. Fine to defer to a polish pass.

nitpick — as HarnessModule[…] type assertions (src/agent/infra/harness/harness-module-builder.ts:230, 238)

wrapInvocation returns Promise<unknown> and is cast to HarnessModule['curate'] / HarnessModule['query']. CLAUDE.md discourages as assertions. The VM boundary justifies the cast for v1.0 — if CurateResult gets a Zod schema in a later phase, safeParse could eliminate the assertion entirely. Non-blocking.

The wrapInvocation 4-parameter issue is the only item that wasn't flagged before and violates a repo standard — worth fixing in the same commit as the __harnessFn cleanup rather than a separate PR. Both are mechanical refactors with no behavior change. The async double-wrap and type assertions are low-priority defers.

danhdoan requested review from RyanNg1403, bao-byterover, cuongdo-byterover, leehpham and ngduyanhece as code owners April 21, 2026 10:35