
feat: Add startTime/endTime to trace types and persist traces to disk #172

@christso

Description


Trace Timestamps & Persistence Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add startTime/endTime to trace types, add aggregate threshold fields to tool_trajectory evaluator, and persist full traces to disk via --trace flag.

Architecture: Enrich existing ToolCall, OutputMessage, ProviderResponse, and TraceSummary interfaces with startTime/endTime ISO 8601 fields. Update computeTraceSummary to derive timing from span boundaries. Add aggregate assertion fields (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator. Add --trace CLI flag that writes full outputMessages to .agentv/traces/ as JSONL. Since we have few users, replace timestamp directly with startTime (no soft deprecation).

Tech Stack: TypeScript 5.x, Bun, Vitest, cmd-ts (CLI), Zod (provider schemas)


Task 1: Add startTime/endTime to core type interfaces

Files:

  • Modify: packages/core/src/evaluation/providers/types.ts (ToolCall, OutputMessage, ProviderResponse)
  • Modify: packages/core/src/evaluation/trace.ts (TraceSummary, ExecutionMetrics)

Step 1: Update ToolCall interface

In packages/core/src/evaluation/providers/types.ts, replace timestamp with startTime/endTime on ToolCall:

export interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
  readonly output?: unknown;
  readonly id?: string;
  /** ISO 8601 start time */
  readonly startTime?: string;
  /** ISO 8601 end time */
  readonly endTime?: string;
  readonly durationMs?: number;
}

Step 2: Update OutputMessage interface

Replace timestamp with startTime/endTime on OutputMessage:

export interface OutputMessage {
  readonly role: string;
  readonly name?: string;
  readonly content?: unknown;
  readonly toolCalls?: readonly ToolCall[];
  /** ISO 8601 start time */
  readonly startTime?: string;
  /** ISO 8601 end time */
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly metadata?: Record<string, unknown>;
}

Step 3: Update ProviderResponse interface

Add startTime/endTime to ProviderResponse:

export interface ProviderResponse {
  readonly raw?: unknown;
  readonly usage?: JsonObject;
  readonly outputMessages?: readonly OutputMessage[];
  readonly tokenUsage?: ProviderTokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly startTime?: string;
  readonly endTime?: string;
}

Step 4: Update TraceSummary interface

In packages/core/src/evaluation/trace.ts, add startTime/endTime/llmCallCount to TraceSummary:

export interface TraceSummary {
  readonly eventCount: number;
  readonly toolNames: readonly string[];
  readonly toolCallsByName: Readonly<Record<string, number>>;
  readonly errorCount: number;
  readonly tokenUsage?: TokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly toolDurations?: Readonly<Record<string, readonly number[]>>;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly llmCallCount?: number;
}

Step 5: Update ExecutionMetrics interface

Add startTime/endTime to ExecutionMetrics:

export interface ExecutionMetrics {
  readonly tokenUsage?: TokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly startTime?: string;
  readonly endTime?: string;
}

Step 6: Fix all compile errors from timestamp→startTime rename

Search the entire codebase for references to the old timestamp field on ToolCall/OutputMessage and update to startTime. Key files:

  • packages/core/src/evaluation/providers/cli.ts — Zod schemas (rename timestamp → start_time, add end_time) and mapping
  • packages/core/src/evaluation/providers/pi-agent-sdk.ts — timestamp extraction
  • packages/core/src/evaluation/providers/pi-coding-agent.ts — timestamp extraction
  • packages/core/test/ — any test fixtures referencing timestamp
  • packages/core/src/evaluation/loaders/evaluator-parser.ts — if referenced
  • packages/eval/ — Zod schema for ToolCall/OutputMessage if defined there

Step 7: Build and fix all type errors

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build
Expected: Clean build with no errors.

Step 8: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All 390 tests pass (the new fields are optional, so existing tests remain valid).

Step 9: Commit

git add -A && git commit -m "feat: add startTime/endTime to ToolCall, OutputMessage, ProviderResponse, TraceSummary"

Task 2: Update computeTraceSummary to derive timing from spans

Files:

  • Modify: packages/core/src/evaluation/trace.ts
  • Create: packages/core/test/evaluation/trace-summary.test.ts

Step 1: Write failing tests

Create packages/core/test/evaluation/trace-summary.test.ts:

import { describe, expect, it } from 'bun:test';
import { computeTraceSummary } from '../../src/evaluation/trace.js';

describe('computeTraceSummary', () => {
  it('derives startTime/endTime from message boundaries', () => {
    const messages = [
      {
        role: 'assistant',
        startTime: '2024-01-15T09:00:00Z',
        endTime: '2024-01-15T09:00:02Z',
        toolCalls: [{ tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 }],
      },
      {
        role: 'assistant',
        startTime: '2024-01-15T09:00:02Z',
        endTime: '2024-01-15T09:00:05Z',
        toolCalls: [{ tool: 'fetch', startTime: '2024-01-15T09:00:03Z', endTime: '2024-01-15T09:00:04Z', durationMs: 1000 }],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.startTime).toBe('2024-01-15T09:00:00Z');
    expect(summary.endTime).toBe('2024-01-15T09:00:05Z');
    expect(summary.eventCount).toBe(2);
  });

  it('computes toolDurations from tool call durationMs', () => {
    const messages = [
      {
        role: 'assistant',
        toolCalls: [
          { tool: 'search', durationMs: 100 },
          { tool: 'search', durationMs: 200 },
          { tool: 'fetch', durationMs: 300 },
        ],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.toolDurations).toEqual({ fetch: [300], search: [100, 200] });
  });

  it('computes durationMs from startTime/endTime on tool calls when durationMs not provided', () => {
    const messages = [
      {
        role: 'assistant',
        toolCalls: [
          { tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01.500Z' },
        ],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.toolDurations).toEqual({ search: [1500] });
  });

  it('counts llmCallCount from assistant messages', () => {
    const messages = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
      { role: 'tool' },
      { role: 'assistant', toolCalls: [{ tool: 'fetch' }] },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.llmCallCount).toBe(2);
  });

  it('handles messages with no timing data', () => {
    const messages = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.startTime).toBeUndefined();
    expect(summary.endTime).toBeUndefined();
    expect(summary.toolDurations).toBeUndefined();
    expect(summary.llmCallCount).toBe(1);
  });
});

Step 2: Run test to verify it fails

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: FAIL (computeTraceSummary doesn't accept full messages or return timing fields yet).

Step 3: Update OutputMessageLike and computeTraceSummary

In packages/core/src/evaluation/trace.ts, expand OutputMessageLike and rewrite computeTraceSummary:

interface OutputMessageLike {
  readonly role?: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly toolCalls?: readonly {
    readonly tool: string;
    readonly durationMs?: number;
    readonly startTime?: string;
    readonly endTime?: string;
  }[];
}

export function computeTraceSummary(messages: readonly OutputMessageLike[]): TraceSummary {
  const toolCallCounts: Record<string, number> = {};
  const toolDurationsMap: Record<string, number[]> = {};
  let totalToolCalls = 0;
  let llmCallCount = 0;
  let earliestStart: string | undefined;
  let latestEnd: string | undefined;
  let hasAnyDuration = false;

  for (const message of messages) {
    if (message.role === 'assistant') {
      llmCallCount++;
    }

    // Collect message-level timestamps for overall boundaries
    if (message.startTime) {
      if (!earliestStart || message.startTime < earliestStart) {
        earliestStart = message.startTime;
      }
    }
    if (message.endTime) {
      if (!latestEnd || message.endTime > latestEnd) {
        latestEnd = message.endTime;
      }
    }

    if (!message.toolCalls) continue;

    for (const toolCall of message.toolCalls) {
      toolCallCounts[toolCall.tool] = (toolCallCounts[toolCall.tool] ?? 0) + 1;
      totalToolCalls++;

      // Derive duration: prefer explicit durationMs, fall back to startTime/endTime
      let duration = toolCall.durationMs;
      if (duration === undefined && toolCall.startTime && toolCall.endTime) {
        duration = new Date(toolCall.endTime).getTime() - new Date(toolCall.startTime).getTime();
      }

      if (duration !== undefined) {
        hasAnyDuration = true;
        if (!toolDurationsMap[toolCall.tool]) {
          toolDurationsMap[toolCall.tool] = [];
        }
        toolDurationsMap[toolCall.tool].push(duration);
      }

      // Tool call timestamps also contribute to overall boundaries
      if (toolCall.startTime) {
        if (!earliestStart || toolCall.startTime < earliestStart) {
          earliestStart = toolCall.startTime;
        }
      }
      if (toolCall.endTime) {
        if (!latestEnd || toolCall.endTime > latestEnd) {
          latestEnd = toolCall.endTime;
        }
      }
    }
  }

  const toolNames = Object.keys(toolCallCounts).sort();

  return {
    eventCount: totalToolCalls,
    toolNames,
    toolCallsByName: toolCallCounts,
    errorCount: 0,
    ...(earliestStart ? { startTime: earliestStart } : {}),
    ...(latestEnd ? { endTime: latestEnd } : {}),
    ...(hasAnyDuration ? { toolDurations: toolDurationsMap } : {}),
    ...(llmCallCount > 0 ? { llmCallCount } : {}),
  };
}

Step 4: Update mergeExecutionMetrics to pass through startTime/endTime

export function mergeExecutionMetrics(
  summary: TraceSummary,
  metrics?: ExecutionMetrics,
): TraceSummary {
  if (!metrics) return summary;

  return {
    ...summary,
    tokenUsage: metrics.tokenUsage,
    costUsd: metrics.costUsd,
    durationMs: metrics.durationMs,
    // Provider-level startTime/endTime override message-derived ones
    ...(metrics.startTime ? { startTime: metrics.startTime } : {}),
    ...(metrics.endTime ? { endTime: metrics.endTime } : {}),
  };
}

Step 5: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: All new tests pass.

Step 6: Run full test suite

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All tests pass.

Step 7: Commit

git add -A && git commit -m "feat: derive startTime/endTime/toolDurations/llmCallCount in computeTraceSummary"

Task 3: Update orchestrator to pass startTime/endTime through mergeExecutionMetrics

Files:

  • Modify: packages/core/src/evaluation/orchestrator.ts

Step 1: Update both mergeExecutionMetrics call sites

In orchestrator.ts, at lines ~466-471 and ~634-639, add startTime/endTime from providerResponse:

const traceSummary = baseSummary
  ? mergeExecutionMetrics(baseSummary, {
      tokenUsage: providerResponse.tokenUsage,
      costUsd: providerResponse.costUsd,
      durationMs: providerResponse.durationMs,
      startTime: providerResponse.startTime,
      endTime: providerResponse.endTime,
    })
  : undefined;

Step 2: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 3: Commit

git add -A && git commit -m "feat: pass startTime/endTime from ProviderResponse through mergeExecutionMetrics"

Task 4: Update providers to populate startTime/endTime

Files:

  • Modify: packages/core/src/evaluation/providers/cli.ts (Zod schemas + mapping)
  • Modify: packages/core/src/evaluation/providers/pi-agent-sdk.ts
  • Modify: packages/core/src/evaluation/providers/pi-coding-agent.ts

Step 1: Update CLI provider Zod schemas

In cli.ts, update ToolCallSchema and OutputMessageInputSchema:

  • Rename timestamp → start_time (snake_case for external schema)
  • Add end_time
  • Map to startTime/endTime in the conversion code

Per the no-soft-deprecation decision in the Architecture note, do not keep timestamp as an alias in the Zod schema (e.g. via .or() or .transform()) — rename it to start_time outright and add end_time:

const ToolCallSchema = z.object({
  tool: z.string(),
  input: z.unknown().optional(),
  output: z.unknown().optional(),
  id: z.string().optional(),
  start_time: z.string().optional(),
  end_time: z.string().optional(),
  duration_ms: z.number().optional(),
});

const OutputMessageInputSchema = z.object({
  role: z.string(),
  name: z.string().optional(),
  content: z.unknown().optional(),
  tool_calls: z.array(ToolCallSchema).optional(),
  start_time: z.string().optional(),
  end_time: z.string().optional(),
  duration_ms: z.number().optional(),
  metadata: z.record(z.unknown()).optional(),
});

Update the mapping code that converts parsed Zod output into ToolCall/OutputMessage objects to use startTime/endTime.
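
The snake_case-to-camelCase conversion can look like the following sketch; the field names follow the schemas above, but toToolCall is a hypothetical helper, not the actual function in cli.ts:

```typescript
// Hypothetical mapping helper from parsed Zod output (snake_case) to the
// internal ToolCall shape (camelCase). Minimal local types for illustration.
interface ParsedToolCall {
  tool: string;
  input?: unknown;
  output?: unknown;
  id?: string;
  start_time?: string;
  end_time?: string;
  duration_ms?: number;
}

interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
  readonly output?: unknown;
  readonly id?: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
}

function toToolCall(parsed: ParsedToolCall): ToolCall {
  // Spread-conditionals keep absent optional fields off the object entirely.
  return {
    tool: parsed.tool,
    ...(parsed.input !== undefined ? { input: parsed.input } : {}),
    ...(parsed.output !== undefined ? { output: parsed.output } : {}),
    ...(parsed.id !== undefined ? { id: parsed.id } : {}),
    ...(parsed.start_time !== undefined ? { startTime: parsed.start_time } : {}),
    ...(parsed.end_time !== undefined ? { endTime: parsed.end_time } : {}),
    ...(parsed.duration_ms !== undefined ? { durationMs: parsed.duration_ms } : {}),
  };
}
```

The OutputMessage mapping follows the same shape, additionally mapping tool_calls through toToolCall.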

Step 2: Update Pi Agent SDK provider

In pi-agent-sdk.ts, change timestamp extraction to startTime:

// Where it currently sets timestamp, change to startTime
startTime: typeof msg.timestamp === 'number'
  ? new Date(msg.timestamp).toISOString()
  : typeof msg.timestamp === 'string'
    ? msg.timestamp
    : undefined,

Step 3: Update Pi Coding Agent provider

Same pattern as pi-agent-sdk.
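
Both Pi providers can share one conversion helper; the sketch below assumes the SDK emits timestamps as either epoch milliseconds or ISO strings (an assumption about the SDK payloads, and toIsoStartTime is a hypothetical name):

```typescript
// Hypothetical shared helper for pi-agent-sdk.ts / pi-coding-agent.ts.
// Assumes timestamps arrive as epoch milliseconds or ISO 8601 strings.
function toIsoStartTime(timestamp: unknown): string | undefined {
  if (typeof timestamp === 'number') return new Date(timestamp).toISOString();
  if (typeof timestamp === 'string') return timestamp;
  return undefined; // no timestamp available on this message
}
```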

Step 4: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 5: Commit

git add -A && git commit -m "feat: update providers to populate startTime/endTime"

Task 5: Add aggregate threshold fields to tool_trajectory evaluator

Files:

  • Modify: packages/core/src/evaluation/trace.ts (ToolTrajectoryEvaluatorConfig)
  • Modify: packages/core/src/evaluation/evaluators/tool-trajectory.ts (evaluation logic)
  • Modify: packages/core/src/evaluation/loaders/evaluator-parser.ts (YAML parsing)
  • Modify: packages/core/test/evaluation/tool-trajectory-evaluator.test.ts (tests)

Step 1: Write failing tests

Add to packages/core/test/evaluation/tool-trajectory-evaluator.test.ts:

describe('aggregate thresholds', () => {
  it('fails when max_total_duration_ms exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxTotalDurationMs: 1000,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, durationMs: 2000,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).toContainEqual(expect.stringContaining('total duration'));
  });

  it('passes when max_total_duration_ms not exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxTotalDurationMs: 5000,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, durationMs: 2000,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).not.toContainEqual(expect.stringContaining('total duration'));
  });

  it('fails when max_llm_calls exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxLlmCalls: 2,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, llmCallCount: 5,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).toContainEqual(expect.stringContaining('LLM calls'));
  });

  it('fails when max_tool_calls exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxToolCalls: 5,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: Array.from({ length: 10 }, () => ({ tool: 'search' })) },
    ];
    const result = evaluator.evaluate(createContext({ outputMessages }));
    expect(result.misses).toContainEqual(expect.stringContaining('tool calls'));
  });
});

Step 2: Run tests to verify they fail

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: FAIL (fields don't exist on config type).

Step 3: Update ToolTrajectoryEvaluatorConfig

In packages/core/src/evaluation/trace.ts:

export interface ToolTrajectoryEvaluatorConfig {
  readonly name: string;
  readonly type: 'tool_trajectory';
  readonly mode: 'any_order' | 'in_order' | 'exact';
  readonly minimums?: Readonly<Record<string, number>>;
  readonly expected?: readonly ToolTrajectoryExpectedItem[];
  readonly weight?: number;
  readonly maxTotalDurationMs?: number;
  readonly maxLlmCalls?: number;
  readonly maxToolCalls?: number;
}

Step 4: Add aggregate checks to ToolTrajectoryEvaluator

In tool-trajectory.ts, add a private method that checks aggregate thresholds and call it from evaluate(). The method checks against traceSummary (for durationMs, llmCallCount) and against the extracted tool calls count. Violations count as misses in the scoring.

private checkAggregateThresholds(
  toolCallCount: number,
  traceSummary?: TraceSummary,
): { hits: string[]; misses: string[] } {
  const hits: string[] = [];
  const misses: string[] = [];

  if (this.config.maxTotalDurationMs !== undefined) {
    const actual = traceSummary?.durationMs;
    if (actual !== undefined) {
      if (actual <= this.config.maxTotalDurationMs) {
        hits.push(`total duration ${actual}ms within limit (max: ${this.config.maxTotalDurationMs}ms)`);
      } else {
        misses.push(`total duration ${actual}ms exceeded limit (max: ${this.config.maxTotalDurationMs}ms)`);
      }
    }
  }

  if (this.config.maxLlmCalls !== undefined) {
    const actual = traceSummary?.llmCallCount;
    if (actual !== undefined) {
      if (actual <= this.config.maxLlmCalls) {
        hits.push(`${actual} LLM calls within limit (max: ${this.config.maxLlmCalls})`);
      } else {
        misses.push(`${actual} LLM calls exceeded limit (max: ${this.config.maxLlmCalls})`);
      }
    }
  }

  if (this.config.maxToolCalls !== undefined) {
    if (toolCallCount <= this.config.maxToolCalls) {
      hits.push(`${toolCallCount} tool calls within limit (max: ${this.config.maxToolCalls})`);
    } else {
      misses.push(`${toolCallCount} tool calls exceeded limit (max: ${this.config.maxToolCalls})`);
    }
  }

  return { hits, misses };
}

Call this from evaluate() and merge the results into the final score. Aggregate threshold checks count toward the total assertion count for scoring.
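
The merge step in evaluate() could follow this sketch; the score formula (hits over total assertions) is an assumption about how the evaluator scores, and mergeAssertionResults is an illustrative name:

```typescript
// Illustrative merge of baseline trajectory checks with the aggregate
// threshold checks; assumes score = hits / (hits + misses).
function mergeAssertionResults(
  base: { hits: string[]; misses: string[] },
  aggregate: { hits: string[]; misses: string[] },
): { hits: string[]; misses: string[]; score: number } {
  const hits = [...base.hits, ...aggregate.hits];
  const misses = [...base.misses, ...aggregate.misses];
  const total = hits.length + misses.length;
  // An evaluator with no assertions at all is treated as a pass here.
  return { hits, misses, score: total === 0 ? 1 : hits.length / total };
}
```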

Step 5: Fix evaluator-parser.ts to parse new fields and maxDurationMs

In evaluator-parser.ts, within the tool_trajectory block (~line 347):

// Parse aggregate thresholds
const maxTotalDurationMs = typeof rawEvaluator.max_total_duration_ms === 'number'
  ? rawEvaluator.max_total_duration_ms : undefined;
const maxLlmCalls = typeof rawEvaluator.max_llm_calls === 'number'
  ? rawEvaluator.max_llm_calls : undefined;
const maxToolCalls = typeof rawEvaluator.max_tool_calls === 'number'
  ? rawEvaluator.max_tool_calls : undefined;

const config: ToolTrajectoryEvaluatorConfig = {
  name,
  type: 'tool_trajectory',
  mode,
  ...(minimums ? { minimums } : {}),
  ...(expected ? { expected } : {}),
  ...(weight !== undefined ? { weight } : {}),
  ...(maxTotalDurationMs !== undefined ? { maxTotalDurationMs } : {}),
  ...(maxLlmCalls !== undefined ? { maxLlmCalls } : {}),
  ...(maxToolCalls !== undefined ? { maxToolCalls } : {}),
};

Also fix the existing bug where maxDurationMs is not parsed from expected items (~line 325):

const maxDurationMs = typeof item.max_duration_ms === 'number' ? item.max_duration_ms : undefined;
expected.push({
  tool: item.tool,
  ...(args !== undefined ? { args } : {}),
  ...(maxDurationMs !== undefined ? { maxDurationMs } : {}),
});

Step 6: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: All pass.

Step 7: Run full suite

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 8: Commit

git add -A && git commit -m "feat: add aggregate thresholds (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator"

Task 6: Add --trace flag and TraceWriter for trace persistence

Files:

  • Create: apps/cli/src/commands/eval/trace-writer.ts
  • Modify: apps/cli/src/commands/eval/index.ts (add --trace flag)
  • Modify: apps/cli/src/commands/eval/run-eval.ts (thread trace flag, write traces)
  • Create: apps/cli/test/commands/eval/trace-writer.test.ts

Step 1: Write TraceWriter tests

Create apps/cli/test/commands/eval/trace-writer.test.ts:

import { describe, expect, it, afterEach } from 'bun:test';
import { existsSync, readFileSync, rmSync, mkdirSync } from 'node:fs';
import path from 'node:path';
import { TraceWriter } from '../../../src/commands/eval/trace-writer.js';

const TEMP_DIR = path.join(import.meta.dir, '.test-traces');

afterEach(() => {
  if (existsSync(TEMP_DIR)) rmSync(TEMP_DIR, { recursive: true });
});

describe('TraceWriter', () => {
  it('writes trace records as JSONL', async () => {
    mkdirSync(TEMP_DIR, { recursive: true });
    const filePath = path.join(TEMP_DIR, 'test.trace.jsonl');
    const writer = await TraceWriter.open(filePath);

    await writer.append({
      evalId: 'test-1',
      startTime: '2024-01-15T09:00:00Z',
      endTime: '2024-01-15T09:00:05Z',
      durationMs: 5000,
      spans: [
        { type: 'tool', name: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 },
      ],
    });

    await writer.close();

    const content = readFileSync(filePath, 'utf8').trim();
    const record = JSON.parse(content);
    expect(record.eval_id).toBe('test-1');
    expect(record.spans).toHaveLength(1);
  });
});

Step 2: Implement TraceWriter

Create apps/cli/src/commands/eval/trace-writer.ts:

import { createWriteStream } from 'node:fs';
import { mkdir } from 'node:fs/promises';
import path from 'node:path';
import { finished } from 'node:stream/promises';
import { Mutex } from 'async-mutex';
import { toSnakeCaseDeep } from '../../utils/case-conversion.js';

export interface TraceRecord {
  readonly evalId: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly spans: readonly TraceSpan[];
  readonly tokenUsage?: { readonly input: number; readonly output: number; readonly cached?: number };
  readonly costUsd?: number;
}

export interface TraceSpan {
  readonly type: 'tool';
  readonly name: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly input?: unknown;
  readonly output?: unknown;
}

export class TraceWriter {
  private readonly stream: ReturnType<typeof createWriteStream>;
  private readonly mutex = new Mutex();
  private closed = false;

  private constructor(stream: ReturnType<typeof createWriteStream>) {
    this.stream = stream;
  }

  static async open(filePath: string): Promise<TraceWriter> {
    await mkdir(path.dirname(filePath), { recursive: true });
    const stream = createWriteStream(filePath, { flags: 'w', encoding: 'utf8' });
    return new TraceWriter(stream);
  }

  async append(record: TraceRecord): Promise<void> {
    await this.mutex.runExclusive(async () => {
      if (this.closed) throw new Error('Cannot write to closed trace writer');
      const snakeCaseRecord = toSnakeCaseDeep(record);
      const line = `${JSON.stringify(snakeCaseRecord)}\n`;
      if (!this.stream.write(line)) {
        await new Promise<void>((resolve, reject) => {
          this.stream.once('drain', resolve);
          this.stream.once('error', reject);
        });
      }
    });
  }

  async close(): Promise<void> {
    if (this.closed) return;
    this.closed = true;
    this.stream.end();
    await finished(this.stream);
  }
}

Step 3: Add --trace flag to CLI

In apps/cli/src/commands/eval/index.ts, add:

trace: flag({
  long: 'trace',
  description: 'Save full execution traces to .agentv/traces/',
}),

Pass it through rawOptions:

trace: args.trace,

Step 4: Thread trace flag through run-eval.ts

In run-eval.ts:

  • Add trace: boolean to NormalizedOptions
  • In normalizeOptions: trace: normalizeBoolean(rawOptions.trace)
  • Create trace writer when options.trace is true
  • Build TraceRecord from outputMessages and traceSummary in the onResult callback
  • Pass outputMessages alongside EvaluationResult (requires threading through or adding to the result temporarily)

The key challenge: onResult only receives an EvaluationResult, which does not contain outputMessages, so they must be captured at the orchestrator level.

Chosen approach: add outputMessages?: readonly OutputMessage[] to EvaluationResult in types.ts and have the orchestrator populate it at both result construction sites (the messages are already in scope there). Because the results writer serializes the full EvaluationResult via toSnakeCaseDeep, leaving outputMessages on the record would bloat the results JSONL; so in the onResult callback in run-eval.ts, write the trace record first, then strip outputMessages before handing the result to the output writer.

Implementation in orchestrator.ts:
Add outputMessages to the result object at both construction sites (~line 745 and error handling):

return {
  // ... existing fields
  traceSummary,
  outputMessages,  // NEW: included for trace persistence
};

Implementation in run-eval.ts:

onResult: async (result: EvaluationResult) => {
  // Write trace if enabled
  if (traceWriter && result.outputMessages) {
    await traceWriter.append(buildTraceRecord(result));
  }
  // Strip outputMessages before writing to results
  const { outputMessages: _, ...resultWithoutMessages } = result;
  await outputWriter.append(resultWithoutMessages as EvaluationResult);
},
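
buildTraceRecord is referenced above but not yet defined; a minimal sketch, assuming spans are flattened from per-message tool calls and top-level timing comes from traceSummary (both assumptions about the final record shape):

```typescript
// Hypothetical buildTraceRecord; minimal local types stand in for
// EvaluationResult and the TraceRecord/TraceSpan interfaces from Step 2.
interface MinimalResult {
  evalId: string;
  traceSummary?: { startTime?: string; endTime?: string; durationMs?: number };
  outputMessages?: readonly {
    toolCalls?: readonly {
      tool: string;
      startTime?: string;
      endTime?: string;
      durationMs?: number;
    }[];
  }[];
}

function buildTraceRecord(result: MinimalResult) {
  // One span per tool call, flattened across messages.
  const spans = (result.outputMessages ?? []).flatMap((message) =>
    (message.toolCalls ?? []).map((tc) => ({
      type: 'tool' as const,
      name: tc.tool,
      ...(tc.startTime ? { startTime: tc.startTime } : {}),
      ...(tc.endTime ? { endTime: tc.endTime } : {}),
      ...(tc.durationMs !== undefined ? { durationMs: tc.durationMs } : {}),
    })),
  );
  const summary = result.traceSummary;
  return {
    evalId: result.evalId,
    ...(summary?.startTime ? { startTime: summary.startTime } : {}),
    ...(summary?.endTime ? { endTime: summary.endTime } : {}),
    ...(summary?.durationMs !== undefined ? { durationMs: summary.durationMs } : {}),
    spans,
  };
}
```

TraceWriter.append then snake-cases this record, so evalId lands in the JSONL as eval_id.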

Step 5: Update EvaluationResult type

In packages/core/src/evaluation/types.ts, add:

export interface EvaluationResult {
  // ... existing fields
  readonly outputMessages?: readonly OutputMessage[];
}

Step 6: Update orchestrator to include outputMessages in result

In packages/core/src/evaluation/orchestrator.ts, at both result construction sites, add outputMessages.

Step 7: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 8: Commit

git add -A && git commit -m "feat: add --trace flag for persisting full execution traces to .agentv/traces/"

Task 7: Update evaluator-parser tests and add integration validation

Files:

  • Modify: packages/core/test/evaluation/loaders/evaluator-parser.test.ts

Step 1: Add tests for new tool_trajectory fields

Add tests verifying:

  • max_total_duration_ms is parsed from YAML
  • max_llm_calls is parsed from YAML
  • max_tool_calls is parsed from YAML
  • max_duration_ms on expected items is parsed (existing bug fix)
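
The threshold parsing those tests should exercise can be sketched standalone; parseAggregateThresholds is an illustrative stand-in mirroring the Task 5 Step 5 code, not the real evaluator-parser API:

```typescript
// Standalone sketch of the aggregate-threshold parsing the new tests cover:
// numeric snake_case YAML fields map to camelCase config fields, and
// non-numeric values are silently dropped.
function parseAggregateThresholds(raw: Record<string, unknown>) {
  const pick = (key: string): number | undefined =>
    typeof raw[key] === 'number' ? (raw[key] as number) : undefined;
  const maxTotalDurationMs = pick('max_total_duration_ms');
  const maxLlmCalls = pick('max_llm_calls');
  const maxToolCalls = pick('max_tool_calls');
  return {
    ...(maxTotalDurationMs !== undefined ? { maxTotalDurationMs } : {}),
    ...(maxLlmCalls !== undefined ? { maxLlmCalls } : {}),
    ...(maxToolCalls !== undefined ? { maxToolCalls } : {}),
  };
}
```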

Step 2: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All pass.

Step 3: Commit

git add -A && git commit -m "test: add evaluator-parser tests for tool_trajectory aggregate thresholds"

Task 8: Update examples and documentation

Files:

  • Modify: relevant example YAML files in examples/
  • Modify: apps/web/src/content/docs/ (docs site)
  • Modify: .claude/skills/agentv-eval-builder/ (skill files)

Step 1: Update examples

Add startTime/endTime to any example trace data. Add an example using aggregate thresholds:

evaluators:
  - name: efficiency-check
    type: tool_trajectory
    mode: in_order
    max_total_duration_ms: 10000
    max_llm_calls: 5
    max_tool_calls: 20
    expected:
      - tool: search
        max_duration_ms: 2000
      - tool: summarize

Step 2: Update docs and skills

Update documentation pages related to:

  • Trace format (startTime/endTime)
  • tool_trajectory evaluator (aggregate thresholds)
  • CLI reference (--trace flag)

Update eval-builder skill with new fields.

Step 3: Commit

git add -A && git commit -m "docs: update examples, docs, and skills for trace timestamps and aggregate thresholds"

Task 9: Final validation

Step 1: Full build + typecheck + lint + test

cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run typecheck && bun run lint && bun run test

Step 2: Fix any issues

Step 3: Final commit if needed
