Skip to content

feat: [ENG-2241] AutoHarness V2 graceful-degradation tests#500

Merged
danhdoan merged 1 commit intoproj/autoharness-v2from
feat/ENG-2241
Apr 21, 2026
Merged

feat: [ENG-2241] AutoHarness V2 graceful-degradation tests#500
danhdoan merged 1 commit intoproj/autoharness-v2from
feat/ENG-2241

Conversation

@danhdoan
Copy link
Copy Markdown
Collaborator

Summary

  • Problem: brutal-review A4 enumerated 7 harness-failure scenarios that Phase 3 must degrade on — none of them can crash the sandbox, leak a partial harness state, or surface an unhelpful error. Up to now these have only been individually tested at the module-builder level (Task 3.2) or the injection level (Task 3.3). No single test suite proves the whole pipeline degrades end-to-end.
  • Why it matters: Phase 3's security story leans on this. Task 3.5 (isolation integration test) assumes the degradation contract is sound; Phase 5's mode selector assumes loadHarness never throws regardless of what the harness code does. If any scenario here regressed, a bug in the evaluator would silently crash user sessions.
  • What changed: new test/unit/agent/harness/graceful-degradation.test.ts — 7 scenarios, one per A4 case, each exercising the full SandboxService.loadHarness → builder → sandbox pipeline. Three-invariant assertion per test: return shape, namespace presence/per-call throw, sandbox-stays-healthy.
  • What did NOT change (scope boundary): No new source code. No test helpers added. No mutations to HarnessModuleBuilder, SandboxService, or HarnessStore — only exercising them. This is a pure coverage PR.

Type of change

  • New feature
  • Bug fix
  • Refactor (no behavior change)
  • Documentation
  • Test
  • Chore (build, dependencies, CI)

Scope (select all touched areas)

  • TUI / REPL
  • Agent / Tools
  • LLM Providers
  • Server / Daemon
  • Shared (constants, types, transport events)
  • CLI Commands (oclif)
  • Hub / Connectors
  • Cloud Sync
  • CI/CD / Infra

Linked issues

  • Closes ENG-2241
  • Closes brutal-review item A4 (graceful degradation of all 7 enumerated harness failure cases)
  • Depends on (merged): ENG-2239 (HarnessModuleBuilder), ENG-2240 (SandboxService.loadHarness + injection), ENG-2227/2228 (HarnessStore)
  • Unblocks Phase 3 Task 3.5 (isolation integration test — runs in parallel but shares the "degradation never crashes sandbox" contract)

Root cause (bug fixes only, otherwise write N/A)

  • Root cause: N/A
  • Why this was not caught earlier: N/A

Test plan

  • Coverage added:
    • Unit test
    • Integration test
    • Manual verification only
  • Test file(s): test/unit/agent/harness/graceful-degradation.test.ts
  • Key scenario(s) covered (7 tests, one per brutal-review A4 case):
    • Build-time failures (harness NOT loaded):
      • 1. Syntax error at module load{loaded: false, reason: 'syntax'} + no harness in sandbox + 2 + 2 returns 4
      • 2. Throw in meta(){loaded: false, reason: 'meta-threw'} + no harness in sandbox + 2 + 2 returns 4
    • Per-invocation failures (harness loads; per-call wrapper throws):
      • 3. Throw in curate()loaded: true; module.curate(ctx) throws with /curate\(\) failed/; sandbox healthy
      • 4. Infinite loop in curate()loaded: true; V8 vm timeout (~5003ms); sandbox healthy
      • 5. Infinite recursion in curate()loaded: true; stack overflow caught; sandbox healthy
      • 7. Never-resolving Promise from curate()loaded: true; Promise.race timer (~5003ms) with /exceeded/ message; sandbox healthy
    • Non-failure (legal undefined):
      • 6. Returns undefined from curate()loaded: true; module.curate(ctx) resolves to undefined; sandbox healthy. Pinned explicitly as NOT a warning case so a future "warn on undefined returns" drift breaks this test.

User-visible changes

None. Pure test coverage addition. No consumer of the tested paths changes behavior.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording

Before this PR, the test file didn't exist. After: all 7 pass. Full suite: 6705 passing / 0 failing.

$ perl -e 'alarm 60; exec @ARGV' npx mocha test/unit/agent/harness/graceful-degradation.test.ts --timeout 10000
  graceful degradation — brutal-review A4
    ✔ 1. Syntax error at module load → {loaded:false, reason:syntax} (44ms)
    ✔ 2. Throw in meta() → {loaded:false, reason:meta-threw}
    ✔ 3. Throw in curate() → wrapper throws; module stays loaded
    ✔ 4. Infinite loop in curate() → vm timeout; wrapper throws (5003ms)
    ✔ 5. Infinite recursion in curate() → stack overflow; wrapper throws
    ✔ 6. Returns undefined from curate() → resolves to undefined (not a failure)
    ✔ 7. Never-resolving Promise from curate() → Promise.race timer throws (5003ms)
  7 passing (10s)

Checklist

  • Tests added or updated and passing (npm test) — 7 new tests; full suite 6705 passing / 0 failing
  • Lint passes (npm run lint) — 0 errors, 226 pre-existing warnings
  • Type check passes (npm run typecheck) — exit=0
  • Build succeeds (npm run build) — exit=0
  • Commits follow Conventional Commits format — feat: [ENG-2241] ...
  • Documentation updated (if applicable) — task doc at features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md (research repo) drove the scope; the stub-vs-real-store deviation + per-invocation-direct-call design choice flagged below for post-merge task-doc tightening
  • No breaking changes (or clearly documented above) — test-only addition
  • Branch is up to date with main — targets proj/autoharness-v2, not main

Risks and mitigations

  • Risk: Test-file runtime is ~10s because two tests wait ~5s each for vm / Promise.race timeouts to fire. Under CI load the vm timeout could drift past 5.1s.

    • Mitigation: Each timeout test has a .timeout(8000) budget — vm timeout + 3s headroom. V8's vm.Script.runInContext timeout is a hard wall-clock limit, not a soft signal. Real flakes would point at CI CPU exhaustion rather than test brittleness. If the runtime becomes painful, extract the two timeout tests to a dedicated "slow" file that runs on a nightly cadence; correctness signals stay in the PR gate.
  • Risk: The per-invocation failure tests (3-7) call result.module.curate!(ctx) directly rather than through service.executeCode('harness.curate(...)', 's1'). If the sandbox's harness.curate wrapper in buildHarnessNamespace ever diverges from a naked module.curate(ctx) call, this test file wouldn't catch the divergence.

    • Mitigation: Task 3.3's behavioral test (harness.* visible inside sandbox code) verifies the injection-wrapper contract directly — if a regression made the wrapper add behavior (logging, retry, state-tracking) beyond a pass-through call, that test would catch it. This file's job is degradation semantics, not injection plumbing.
  • Risk: Task doc prescribes "stub IHarnessStore"; shipped with real HarnessStore backed by FileKeyStorage({inMemory: true}).

    • Mitigation: Documented in the commit message + "Notes for reviewers" below. The real store is a strict pass-through in these tests (save + getLatest); stubbing adds setup without coverage benefit. If stub-strict is preferred, swap to InMemoryHarnessStore (from test/helpers/, shipped by Phat in Phase 1 Task 1.1) — one-line change.

Notes for reviewers

All 7 titles lead with the verbatim scenario name from the execution plan, then append the expected outcome. Reading the titles in isolation tells you exactly what's under test: "1. Syntax error at module load → {loaded:false, reason:syntax}". If the outcome clause is ever wrong, a failing test report points directly at the contract drift.

The sandbox-healthy invariant is load-bearing. Every test ends with expectSandboxHealthy() — a one-liner that runs 2 + 2 through the sandbox and asserts 4. Cheap proof that the harness failure didn't corrupt sandbox state (e.g., by leaking a broken VM context or mangling session maps). If any test's sandbox-healthy assertion ever fails, the evaluator or the sandbox injection has a real bug; don't retry, debug.

Case 6 (returns undefined) is explicitly pinned as NOT a failure — the test comment documents this so a future "defensive warning on undefined return" drift would break the test and force an explicit decision. Templates can legally return undefined; the LLM-side caller handles that. The module builder must not emit a warning or wrap the result.

Cases 4 and 5 cover distinct timeout mechanisms: case 4 (while(true){}) is caught by V8's vm wall-clock timeout on vm.Script.runInContext; case 5 (function go(){go()}; go()) is a stack overflow caught by V8 as a native error. Same normalized outcome (curate() failed: ...) but the underlying mechanisms are orthogonal — both need to be wrapped by the same error-normalization path in HarnessModuleBuilder.wrapInvocation.

Task 3.5 (isolation integration test) will replace this file's unit-level setup with a real file-backed sandbox + attack fixtures. That PR exercises orthogonal concerns (cross-context isolation); this one proves per-scenario degradation. Both are in scope for Phase 3's ship gate.

Related

  • Test file: test/unit/agent/harness/graceful-degradation.test.ts
  • Builder under test: src/agent/infra/harness/harness-module-builder.ts (ENG-2239)
  • Store under test: src/agent/infra/harness/harness-store.ts (ENG-2227 + ENG-2228)
  • Load pipeline under test: src/agent/infra/sandbox/sandbox-service.ts .loadHarness() (ENG-2240)
  • Brutal-review item closed: A4 (graceful degradation of 7 enumerated harness failure cases)
  • Task doc: features/autoharness-v2/tasks/phase_3/task_04-graceful-degradation.md (research repo)
  • Next task in stream: Phase 3 Task 3.5 (isolation integration test — brutal-review A3, the 5 cross-VM attack vectors)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Claude finished @danhdoan's task in 3m 27s —— View job


PR Review — ENG-2241 Graceful Degradation Tests

Tasks

  • Gather context and understand request
  • Read test file under review
  • Read source files under test
  • Analyze correctness, security, and code quality
  • Post inline comments for specific issues
  • Post summary review with score

Summary

Score: 4 / 5

This is a well-executed pure-coverage PR. The three-invariant-per-test structure (return shape, namespace presence, sandbox health) is the right model for end-to-end degradation testing, and every scenario from the A4 brutal-review item is represented with appropriate assertions. The test-file-level doc block accurately maps to the implementation. A few minor issues to address:


Correctness

editFile stub is missing .resolves() (line 97, inline)
All other stubs use .resolves(...) but editFile: sb.stub() returns undefined synchronously. This is a latent defect — currently harmless because none of the tested scenarios call editFile, but it will produce a confusing failure if any future code path reaches it during these tests.

Non-null assertions on result.module.curate! (line 179 and same pattern in tests 3–7, inline)
HarnessModule.curate is an optional property. The ! silences TypeScript but if the fixture were misconfigured and curate were absent, the test would throw TypeError: ... is not a function whose message doesn't match /curate\(\) failed/, producing a misleading failure. A one-line guard (if (result.module.curate === undefined) throw new Error('fixture must export curate')) makes broken fixtures self-describing without adding ceremony.

hasHarness.returnValue assertion could pass on undefined (line 152, inline)
returnValue is typed unknown, and .to.equal(false) passes for any falsy value. Adding an assertion message makes the intent explicit and distinguishes a correct false from an accidental undefined.


Style

as unknown as IFileSystem double-cast (line 104, inline)
Bypasses compile-time interface completeness checking. If IFileSystem gains a new method, this stub won't complain. Low priority but worth noting for interface-evolution hygiene.

ctx.tools stubs lack an explanatory comment (lines 85–88, inline)
The as unknown as ... casts are safe only because none of the tested scenarios actually invoke ctx.tools.*. A brief comment documents the assumption so a future test that adds a successful-invocation scenario doesn't inherit a silently wrong stub.


What's correct and well done

  • Seven scenarios map 1:1 to the A4 enumeration, with verbatim names in test titles — regression identification is instant.
  • expectSandboxHealthy() is the right design: a trivial arithmetic expression that proves VM context integrity without coupling to harness internals. Every test ends with it.
  • beforeEach isolation is solid: fresh FileKeyStorage, HarnessStore, HarnessModuleBuilder, and SandboxService per test — no shared state.
  • Cases 4 and 5 correctly target orthogonal timeout mechanisms (V8 wall-clock vs. stack overflow), and the PR correctly documents why both are needed.
  • Case 6 is explicitly pinned as non-failure: the test comment documenting the "not a warning case" intent means a future defensive-warning drift breaks with a clear signal rather than a silent regression.
  • The known direct-invocation gap (tests call result.module.curate directly rather than through the harness.* sandbox wrapper) is correctly scoped and documented; Task 3.3's injection test owns that contract.
  • .timeout(8000) gives 3 s headroom over the 5000 ms vm/Promise.race walls — appropriate CI budget.

Comment thread test/unit/agent/harness/graceful-degradation.test.ts Outdated
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Phase 3 Task 3.4 — closes brutal-review item A4 with 7 unit tests
covering every harness-failure scenario from the execution plan.
Each test exercises the full pipeline end-to-end: real
`HarnessStore` (in-memory-backed) + real `HarnessModuleBuilder` +
real `SandboxService.loadHarness` + `executeCode` for the "sandbox
stays healthy" proof.

Build-time failures (harness NOT loaded):
  1. Syntax error at module load      → reason: 'syntax'
  2. Throw in meta()                  → reason: 'meta-threw'

Per-invocation failures (harness loads; wrapper normalizes the
throw):
  3. Throw in curate()                → "curate() failed: user error"
  4. Infinite loop in curate()        → vm timeout, ~5s
  5. Infinite recursion in curate()   → stack overflow, wrapper throws
  7. Never-resolving Promise          → Promise.race timer, ~5s

Non-failure (legitimate legal behavior):
  6. Returns undefined                → resolves to undefined; no warn

  1. `loadHarness` returns the right `HarnessLoadResult` variant
     (`{loaded: false, reason}` for 1-2; `{loaded: true}` for 3-7).
  2. For build-time failures: sandbox context has no `harness.*`
     namespace (verified via `typeof harness !== 'undefined' ===
     false`).
     For per-invocation failures: harness IS loaded, direct
     `module.curate(ctx)` throws (or resolves to undefined for case 6).
  3. Sandbox continues executing unrelated code correctly after the
     harness failure — verified via `executeCode('2 + 2', 's1')`
     returning 4. If the harness failure corrupted sandbox state,
     this assertion would surface it.

Cases 3-7 call `result.module.curate!(ctx)` directly rather than
routing through the sandbox's `harness.curate()` injection.
Reason: invoking an async harness function from REPL-executed
sandbox code requires an IIFE that doesn't cleanly surface the
thrown error (the REPL doesn't support top-level await in a way
that propagates rejections for test assertions). The sandbox's
`harness.curate` is a thin closure over `module.curate`
(`() => module.curate(buildCtx())`); calling the module method
directly tests the same code path. Task 3.3's behavioral test
already verifies the sandbox injection wiring; this file's
contribution is the failure-mode semantics.

Task doc specified "stub `IHarnessStore`"; shipped with real
`HarnessStore` backed by `FileKeyStorage({inMemory: true})`. The
store is a pass-through in these tests (save version → getLatest
returns it); stubbing adds ceremony without coverage benefit. If
stub-strict is preferred, swap to `InMemoryHarnessStore` from
`test/helpers/` in a one-line change.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Claude finished @danhdoan's task in 4m 4s —— View job


PR Review — ENG-2241 Graceful Degradation Tests

Tasks

  • Gather context and understand request
  • Read test file under review
  • Read source files under test
  • Analyze correctness, security, and code quality
  • Post inline comments for specific issues
  • Post summary review with score

Score: 5 / 5

All four substantive issues from the prior review have been resolved. This is a clean, well-scoped coverage PR.


What was fixed since the last review

Prior concern Status
editFile: sb.stub() missing .resolves() ✅ Fixed — now .resolves({bytesWritten: 0, path: '/'})
result.module.curate! non-null assertions silencing TypeScript ✅ Fixed — explicit guards in all 5 per-invocation tests: if (result.module.curate === undefined) throw new Error('fixture must export curate')
hasHarness.returnValue assertion had no message (couldn't distinguish false from undefined) ✅ Fixed — both build-time tests now pass 'harness namespace must not be injected on build-time failure' as the assertion message
ctx.tools stubs lacked an explanatory comment ✅ Fixed — comment documents the assumption that no failure scenario invokes ctx.tools.*

Remaining / new observations

One nit remains (low priority, inline): The as unknown as IFileSystem double-cast at line 115 is still present — acceptable given the comment that explains it, but worth noting for interface-evolution hygiene.

One doc inaccuracy (inline): The file-level JSDoc says "≈ 15-20s" but the actual runtime is ~10s (only tests 4 and 7 wait ~5s; all others are fast).


What's correct and well done

  • Seven scenarios map 1:1 to the A4 enumeration, with verbatim names in test titles — regression identification is instant.
  • Three-invariant structure applied consistently: return shape, namespace presence/per-call throw, sandbox health — every test closes with expectSandboxHealthy().
  • expectSandboxHealthy() is correct by design: it evaluates 2 + 2 through service.executeCode, which uses the LocalSandbox VM context — orthogonal to the harness builder's scriptContext. A failure there means sandbox injection is broken, not just the harness function.
  • beforeEach isolation is solid: fresh FileKeyStorage({inMemory: true}), HarnessStore, HarnessModuleBuilder, and SandboxService per test — no shared state across scenarios.
  • Cases 4 and 5 correctly target orthogonal timeout mechanisms: V8 wall-clock timeout (vm.Script.runInContext) vs. stack overflow — same normalized outcome, different underlying V8 paths. Both must be covered.
  • Case 6 is explicitly pinned as non-failure: the inline comment documenting the "not a warning case" intent makes a future "defensive undefined warning" drift immediately visible rather than silent.
  • .timeout(8000) gives 3 s headroom over the 5 s walls — appropriate for CI variability.
  • The known direct-invocation gap (tests 3–7 call result.module.curate rather than harness.curate through the sandbox) is correctly scoped and documented; Task 3.3's injection test owns that contract.

Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
Comment thread test/unit/agent/harness/graceful-degradation.test.ts
@danhdoan danhdoan merged commit f922647 into proj/autoharness-v2 Apr 21, 2026
6 checks passed
@danhdoan danhdoan deleted the feat/ENG-2241 branch April 22, 2026 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants