feat: [ENG-2241] AutoHarness V2 graceful-degradation tests by danhdoan · Pull Request #500 · campfirein/byterover-cli

danhdoan · 2026-04-21T13:11:02Z

Summary

Problem: brutal-review A4 enumerated 7 harness-failure scenarios that Phase 3 must degrade on — none of them can crash the sandbox, leak a partial harness state, or surface an unhelpful error. Up to now these have only been individually tested at the module-builder level (Task 3.2) or the injection level (Task 3.3). No single test suite proves the whole pipeline degrades end-to-end.
Why it matters: Phase 3's security story leans on this. Task 3.5 (isolation integration test) assumes the degradation contract is sound; Phase 5's mode selector assumes loadHarness never throws regardless of what the harness code does. If any scenario here regressed, a bug in the evaluator would silently crash user sessions.
What changed: new test/unit/agent/harness/graceful-degradation.test.ts — 7 scenarios, one per A4 case, each exercising the full SandboxService.loadHarness → builder → sandbox pipeline. Three-invariant assertion per test: return shape, namespace presence/per-call throw, sandbox-stays-healthy.
What did NOT change (scope boundary): No new source code. No test helpers added. No mutations to HarnessModuleBuilder, SandboxService, or HarnessStore — only exercising them. This is a pure coverage PR.

Type of change

Scope (select all touched areas)

Linked issues

Closes ENG-2241
Closes brutal-review item A4 (graceful degradation of all 7 enumerated harness failure cases)
Depends on (merged): ENG-2239 (HarnessModuleBuilder), ENG-2240 (SandboxService.loadHarness + injection), ENG-2227/2228 (HarnessStore)
Unblocks Phase 3 Task 3.5 (isolation integration test — runs in parallel but shares the "degradation never crashes sandbox" contract)

Root cause (bug fixes only, otherwise write `N/A`)

Root cause: N/A
Why this was not caught earlier: N/A

Test plan

Coverage added:
- Unit test
- Integration test
- Manual verification only
Test file(s): test/unit/agent/harness/graceful-degradation.test.ts
Key scenario(s) covered (7 tests, one per brutal-review A4 case):
- Build-time failures (harness NOT loaded):
  - 1. Syntax error at module load → {loaded: false, reason: 'syntax'} + no harness in sandbox + 2 + 2 returns 4
  - 2. Throw in meta() → {loaded: false, reason: 'meta-threw'} + no harness in sandbox + 2 + 2 returns 4
- Per-invocation failures (harness loads; per-call wrapper throws):
  - 3. Throw in curate() → loaded: true; module.curate(ctx) throws with /curate\(\) failed/; sandbox healthy
  - 4. Infinite loop in curate() → loaded: true; V8 vm timeout (~5003ms); sandbox healthy
  - 5. Infinite recursion in curate() → loaded: true; stack overflow caught; sandbox healthy
  - 7. Never-resolving Promise from curate() → loaded: true; Promise.race timer (~5003ms) with /exceeded/ message; sandbox healthy
- Non-failure (legal undefined):
  - 6. Returns undefined from curate() → loaded: true; module.curate(ctx) resolves to undefined; sandbox healthy. Pinned explicitly as NOT a warning case so a future "warn on undefined returns" drift breaks this test.

User-visible changes

None. Pure test coverage addition. No consumer of the tested paths changes behavior.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording

Before this PR, the test file didn't exist. After: all 7 pass. Full suite: 6705 passing / 0 failing.

$ perl -e 'alarm 60; exec @ARGV' npx mocha test/unit/agent/harness/graceful-degradation.test.ts --timeout 10000
  graceful degradation — brutal-review A4
    ✔ 1. Syntax error at module load → {loaded:false, reason:syntax} (44ms)
    ✔ 2. Throw in meta() → {loaded:false, reason:meta-threw}
    ✔ 3. Throw in curate() → wrapper throws; module stays loaded
    ✔ 4. Infinite loop in curate() → vm timeout; wrapper throws (5003ms)
    ✔ 5. Infinite recursion in curate() → stack overflow; wrapper throws
    ✔ 6. Returns undefined from curate() → resolves to undefined (not a failure)
    ✔ 7. Never-resolving Promise from curate() → Promise.race timer throws (5003ms)
  7 passing (10s)

Checklist

Tests added or updated and passing (npm test) — 7 new tests; full suite 6705 passing / 0 failing
Lint passes (npm run lint) — 0 errors, 226 pre-existing warnings
Type check passes (npm run typecheck) — exit=0
Build succeeds (npm run build) — exit=0
Commits follow Conventional Commits format — feat: [ENG-2241] ...
Documentation updated (if applicable) — task doc at features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md (research repo) drove the scope; the stub-vs-real-store deviation + per-invocation-direct-call design choice flagged below for post-merge task-doc tightening
No breaking changes (or clearly documented above) — test-only addition
Branch is up to date with main — targets proj/autoharness-v2, not main

Risks and mitigations

Risk: Test-file runtime is ~10s because two tests wait ~5s each for vm / Promise.race timeouts to fire. Under CI load the vm timeout could drift past 5.1s.
- Mitigation: Each timeout test has a .timeout(8000) budget — vm timeout + 3s headroom. V8's vm.Script.runInContext timeout is a hard wall-clock limit, not a soft signal. Real flakes would point at CI CPU exhaustion rather than test brittleness. If the runtime becomes painful, extract the two timeout tests to a dedicated "slow" file that runs on a nightly cadence; correctness signals stay in the PR gate.
Risk: The per-invocation failure tests (3-7) call result.module.curate!(ctx) directly rather than through service.executeCode('harness.curate(...)', 's1'). If the sandbox's harness.curate wrapper in buildHarnessNamespace ever diverges from a naked module.curate(ctx) call, this test file wouldn't catch the divergence.
- Mitigation: Task 3.3's behavioral test (harness.* visible inside sandbox code) verifies the injection-wrapper contract directly — if a regression made the wrapper add behavior (logging, retry, state-tracking) beyond a pass-through call, that test would catch it. This file's job is degradation semantics, not injection plumbing.
Risk: Task doc prescribes "stub IHarnessStore"; shipped with real HarnessStore backed by FileKeyStorage({inMemory: true}).
- Mitigation: Documented in the commit message + "Notes for reviewers" below. The real store is a strict pass-through in these tests (save + getLatest); stubbing adds setup without coverage benefit. If stub-strict is preferred, swap to InMemoryHarnessStore (from test/helpers/, shipped by Phat in Phase 1 Task 1.1) — one-line change.

Notes for reviewers

All 7 titles lead with the verbatim scenario name from the execution plan, then append the expected outcome. Reading the titles in isolation tells you exactly what's under test: "1. Syntax error at module load → {loaded:false, reason:syntax}". If the outcome clause is ever wrong, a failing test report points directly at the contract drift.

The sandbox-healthy invariant is load-bearing. Every test ends with expectSandboxHealthy() — a one-liner that runs 2 + 2 through the sandbox and asserts 4. Cheap proof that the harness failure didn't corrupt sandbox state (e.g., by leaking a broken VM context or mangling session maps). If any test's sandbox-healthy assertion ever fails, the evaluator or the sandbox injection has a real bug; don't retry, debug.

Case 6 (returns undefined) is explicitly pinned as NOT a failure — the test comment documents this so a future "defensive warning on undefined return" drift would break the test and force an explicit decision. Templates can legally return undefined; the LLM-side caller handles that. The module builder must not emit a warning or wrap the result.

Cases 4 and 5 cover distinct timeout mechanisms: case 4 (while(true){}) is caught by V8's vm wall-clock timeout on vm.Script.runInContext; case 5 (function go(){go()}; go()) is a stack overflow caught by V8 as a native error. Same normalized outcome (curate() failed: ...) but the underlying mechanisms are orthogonal — both need to be wrapped by the same error-normalization path in HarnessModuleBuilder.wrapInvocation.

Task 3.5 (isolation integration test) will replace this file's unit-level setup with a real file-backed sandbox + attack fixtures. That PR exercises orthogonal concerns (cross-context isolation); this one proves per-scenario degradation. Both are in scope for Phase 3's ship gate.

Test file: test/unit/agent/harness/graceful-degradation.test.ts
Builder under test: src/agent/infra/harness/harness-module-builder.ts (ENG-2239)
Store under test: src/agent/infra/harness/harness-store.ts (ENG-2227 + ENG-2228)
Load pipeline under test: src/agent/infra/sandbox/sandbox-service.ts .loadHarness() (ENG-2240)
Brutal-review item closed: A4 (graceful degradation of 7 enumerated harness failure cases)
Task doc: features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md (research repo)
Next task in stream: Phase 3 Task 3.5 (isolation integration test — brutal-review A3, the 5 cross-VM attack vectors)

github-actions · 2026-04-21T13:11:22Z

Claude finished @danhdoan's task in 3m 27s —— View job

PR Review — ENG-2241 Graceful Degradation Tests

Tasks

Gather context and understand request
Read test file under review
Read source files under test
Analyze correctness, security, and code quality
Post inline comments for specific issues
Post summary review with score

Summary

Score: 4 / 5

This is a well-executed pure-coverage PR. The three-invariant-per-test structure (return shape, namespace presence, sandbox health) is the right model for end-to-end degradation testing, and every scenario from the A4 brutal-review item is represented with appropriate assertions. The test-file-level doc block accurately maps to the implementation. A few minor issues to address:

Correctness

editFile stub is missing .resolves() (line 97, inline)
All other stubs use .resolves(...) but editFile: sb.stub() returns undefined synchronously. This is a latent defect — currently harmless because none of the tested scenarios call editFile, but it will produce a confusing failure if any future code path reaches it during these tests.

Non-null assertions on result.module.curate! (line 179 and same pattern in tests 3–7, inline)
HarnessModule.curate is an optional property. The ! silences TypeScript but if the fixture were misconfigured and curate were absent, the test would throw TypeError: ... is not a function whose message doesn't match /curate\(\) failed/, producing a misleading failure. A one-line guard (if (result.module.curate === undefined) throw new Error('fixture must export curate')) makes broken fixtures self-describing without adding ceremony.

hasHarness.returnValue assertion could pass on undefined (line 152, inline)
returnValue is typed unknown, and .to.equal(false) passes for any falsy value. Adding an assertion message makes the intent explicit and distinguishes a correct false from an accidental undefined.

Style

as unknown as IFileSystem double-cast (line 104, inline)
Bypasses compile-time interface completeness checking. If IFileSystem gains a new method, this stub won't complain. Low priority but worth noting for interface-evolution hygiene.

ctx.tools stubs lack an explanatory comment (lines 85–88, inline)
The as unknown as ... casts are safe only because none of the tested scenarios actually invoke ctx.tools.*. A brief comment documents the assumption so a future test that adds a successful-invocation scenario doesn't inherit a silently wrong stub.

What's correct and well done

Seven scenarios map 1:1 to the A4 enumeration, with verbatim names in test titles — regression identification is instant.
expectSandboxHealthy() is the right design: a trivial arithmetic expression that proves VM context integrity without coupling to harness internals. Every test ends with it.
beforeEach isolation is solid: fresh FileKeyStorage, HarnessStore, HarnessModuleBuilder, and SandboxService per test — no shared state.
Cases 4 and 5 correctly target orthogonal timeout mechanisms (V8 wall-clock vs. stack overflow), and the PR correctly documents why both are needed.
Case 6 is explicitly pinned as non-failure: the test comment documenting the "not a warning case" intent means a future defensive-warning drift breaks with a clear signal rather than a silent regression.
The known direct-invocation gap (tests call result.module.curate directly rather than through the harness.* sandbox wrapper) is correctly scoped and documented; Task 3.3's injection test owns that contract.
.timeout(8000) gives 3 s headroom over the 5000 ms vm/Promise.race walls — appropriate CI budget.

Phase 3 Task 3.4 — closes brutal-review item A4 with 7 unit tests covering every harness-failure scenario from the execution plan. Each test exercises the full pipeline end-to-end: real `HarnessStore` (in-memory-backed) + real `HarnessModuleBuilder` + real `SandboxService.loadHarness` + `executeCode` for the "sandbox stays healthy" proof. Build-time failures (harness NOT loaded): 1. Syntax error at module load → reason: 'syntax' 2. Throw in meta() → reason: 'meta-threw' Per-invocation failures (harness loads; wrapper normalizes the throw): 3. Throw in curate() → "curate() failed: user error" 4. Infinite loop in curate() → vm timeout, ~5s 5. Infinite recursion in curate() → stack overflow, wrapper throws 7. Never-resolving Promise → Promise.race timer, ~5s Non-failure (legitimate legal behavior): 6. Returns undefined → resolves to undefined; no warn 1. `loadHarness` returns the right `HarnessLoadResult` variant (`{loaded: false, reason}` for 1-2; `{loaded: true}` for 3-7). 2. For build-time failures: sandbox context has no `harness.*` namespace (verified via `typeof harness !== 'undefined' === false`). For per-invocation failures: harness IS loaded, direct `module.curate(ctx)` throws (or resolves to undefined for case 6). 3. Sandbox continues executing unrelated code correctly after the harness failure — verified via `executeCode('2 + 2', 's1')` returning 4. If the harness failure corrupted sandbox state, this assertion would surface it. Cases 3-7 call `result.module.curate!(ctx)` directly rather than routing through the sandbox's `harness.curate()` injection. Reason: invoking an async harness function from REPL-executed sandbox code requires an IIFE that doesn't cleanly surface the thrown error (the REPL doesn't support top-level await in a way that propagates rejections for test assertions). The sandbox's `harness.curate` is a thin closure over `module.curate` (`() => module.curate(buildCtx())`); calling the module method directly tests the same code path. Task 3.3's behavioral test already verifies the sandbox injection wiring; this file's contribution is the failure-mode semantics. Task doc specified "stub `IHarnessStore`"; shipped with real `HarnessStore` backed by `FileKeyStorage({inMemory: true})`. The store is a pass-through in these tests (save version → getLatest returns it); stubbing adds ceremony without coverage benefit. If stub-strict is preferred, swap to `InMemoryHarnessStore` from `test/helpers/` in a one-line change.

github-actions · 2026-04-21T13:22:02Z

Claude finished @danhdoan's task in 4m 4s —— View job

PR Review — ENG-2241 Graceful Degradation Tests

Tasks

Gather context and understand request
Read test file under review
Read source files under test
Analyze correctness, security, and code quality
Post inline comments for specific issues
Post summary review with score

Score: 5 / 5

All four substantive issues from the prior review have been resolved. This is a clean, well-scoped coverage PR.

What was fixed since the last review

Prior concern	Status
`editFile: sb.stub()` missing `.resolves()`	✅ Fixed — now `.resolves({bytesWritten: 0, path: '/'})`
`result.module.curate!` non-null assertions silencing TypeScript	✅ Fixed — explicit guards in all 5 per-invocation tests: `if (result.module.curate === undefined) throw new Error('fixture must export curate')`
`hasHarness.returnValue` assertion had no message (couldn't distinguish `false` from `undefined`)	✅ Fixed — both build-time tests now pass `'harness namespace must not be injected on build-time failure'` as the assertion message
`ctx.tools` stubs lacked an explanatory comment	✅ Fixed — comment documents the assumption that no failure scenario invokes `ctx.tools.*`

Remaining / new observations

One nit remains (low priority, inline): The as unknown as IFileSystem double-cast at line 115 is still present — acceptable given the comment that explains it, but worth noting for interface-evolution hygiene.

One doc inaccuracy (inline): The file-level JSDoc says "≈ 15-20s" but the actual runtime is ~10s (only tests 4 and 7 wait ~5s; all others are fast).