
feat: Add startTime/endTime to trace types and persist traces to disk #172

@christso

Description


Trace Timestamps & Persistence Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Add startTime/endTime to trace types, add aggregate threshold fields to tool_trajectory evaluator, and persist full traces to disk via --trace flag.

Architecture: Enrich existing ToolCall, OutputMessage, ProviderResponse, and TraceSummary interfaces with startTime/endTime ISO 8601 fields. Update computeTraceSummary to derive timing from span boundaries. Add aggregate assertion fields (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator. Add --trace CLI flag that writes full outputMessages to .agentv/traces/ as JSONL. Since we have few users, replace timestamp directly with startTime (no soft deprecation).

Tech Stack: TypeScript 5.x, Bun, Vitest, cmd-ts (CLI), Zod (provider schemas)


Task 1: Add startTime/endTime to core type interfaces

Files:

  • Modify: packages/core/src/evaluation/providers/types.ts (ToolCall, OutputMessage, ProviderResponse)
  • Modify: packages/core/src/evaluation/trace.ts (TraceSummary, ExecutionMetrics)

Step 1: Update ToolCall interface

In packages/core/src/evaluation/providers/types.ts, replace timestamp with startTime/endTime on ToolCall:

export interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
  readonly output?: unknown;
  readonly id?: string;
  /** ISO 8601 start time */
  readonly startTime?: string;
  /** ISO 8601 end time */
  readonly endTime?: string;
  readonly durationMs?: number;
}

Step 2: Update OutputMessage interface

Replace timestamp with startTime/endTime on OutputMessage:

export interface OutputMessage {
  readonly role: string;
  readonly name?: string;
  readonly content?: unknown;
  readonly toolCalls?: readonly ToolCall[];
  /** ISO 8601 start time */
  readonly startTime?: string;
  /** ISO 8601 end time */
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly metadata?: Record<string, unknown>;
}

Step 3: Update ProviderResponse interface

Add startTime/endTime to ProviderResponse:

export interface ProviderResponse {
  readonly raw?: unknown;
  readonly usage?: JsonObject;
  readonly outputMessages?: readonly OutputMessage[];
  readonly tokenUsage?: ProviderTokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly startTime?: string;
  readonly endTime?: string;
}

Step 4: Update TraceSummary interface

In packages/core/src/evaluation/trace.ts, add startTime/endTime/llmCallCount to TraceSummary:

export interface TraceSummary {
  readonly eventCount: number;
  readonly toolNames: readonly string[];
  readonly toolCallsByName: Readonly<Record<string, number>>;
  readonly errorCount: number;
  readonly tokenUsage?: TokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly toolDurations?: Readonly<Record<string, readonly number[]>>;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly llmCallCount?: number;
}

Step 5: Update ExecutionMetrics interface

Add startTime/endTime to ExecutionMetrics:

export interface ExecutionMetrics {
  readonly tokenUsage?: TokenUsage;
  readonly costUsd?: number;
  readonly durationMs?: number;
  readonly startTime?: string;
  readonly endTime?: string;
}

Step 6: Fix all compile errors from timestamp→startTime rename

Search the entire codebase for references to the old timestamp field on ToolCall/OutputMessage and update to startTime. Key files:

  • packages/core/src/evaluation/providers/cli.ts — Zod schemas (rename timestamp → start_time, add end_time) and mapping
  • packages/core/src/evaluation/providers/pi-agent-sdk.ts — timestamp extraction
  • packages/core/src/evaluation/providers/pi-coding-agent.ts — timestamp extraction
  • packages/core/test/ — any test fixtures referencing timestamp
  • packages/core/src/evaluation/loaders/evaluator-parser.ts — if referenced
  • packages/eval/ — Zod schema for ToolCall/OutputMessage if defined there

Step 7: Build and fix all type errors

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build
Expected: Clean build with no errors.

Step 8: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All 390 tests pass (the new fields are optional, so existing tests remain valid).

Step 9: Commit

git add -A && git commit -m "feat: add startTime/endTime to ToolCall, OutputMessage, ProviderResponse, TraceSummary"

Task 2: Update computeTraceSummary to derive timing from spans

Files:

  • Modify: packages/core/src/evaluation/trace.ts
  • Create: packages/core/test/evaluation/trace-summary.test.ts

Step 1: Write failing tests

Create packages/core/test/evaluation/trace-summary.test.ts:

import { describe, expect, it } from 'bun:test';
import { computeTraceSummary } from '../../src/evaluation/trace.js';

describe('computeTraceSummary', () => {
  it('derives startTime/endTime from message boundaries', () => {
    const messages = [
      {
        role: 'assistant',
        startTime: '2024-01-15T09:00:00Z',
        endTime: '2024-01-15T09:00:02Z',
        toolCalls: [{ tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 }],
      },
      {
        role: 'assistant',
        startTime: '2024-01-15T09:00:02Z',
        endTime: '2024-01-15T09:00:05Z',
        toolCalls: [{ tool: 'fetch', startTime: '2024-01-15T09:00:03Z', endTime: '2024-01-15T09:00:04Z', durationMs: 1000 }],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.startTime).toBe('2024-01-15T09:00:00Z');
    expect(summary.endTime).toBe('2024-01-15T09:00:05Z');
    expect(summary.eventCount).toBe(2);
  });

  it('computes toolDurations from tool call durationMs', () => {
    const messages = [
      {
        role: 'assistant',
        toolCalls: [
          { tool: 'search', durationMs: 100 },
          { tool: 'search', durationMs: 200 },
          { tool: 'fetch', durationMs: 300 },
        ],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.toolDurations).toEqual({ fetch: [300], search: [100, 200] });
  });

  it('computes durationMs from startTime/endTime on tool calls when durationMs not provided', () => {
    const messages = [
      {
        role: 'assistant',
        toolCalls: [
          { tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01.500Z' },
        ],
      },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.toolDurations).toEqual({ search: [1500] });
  });

  it('counts llmCallCount from assistant messages', () => {
    const messages = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
      { role: 'tool' },
      { role: 'assistant', toolCalls: [{ tool: 'fetch' }] },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.llmCallCount).toBe(2);
  });

  it('handles messages with no timing data', () => {
    const messages = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];

    const summary = computeTraceSummary(messages);

    expect(summary.startTime).toBeUndefined();
    expect(summary.endTime).toBeUndefined();
    expect(summary.toolDurations).toBeUndefined();
    expect(summary.llmCallCount).toBe(1);
  });
});

Step 2: Run test to verify it fails

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: FAIL (computeTraceSummary doesn't accept full messages or return timing fields yet).

Step 3: Update OutputMessageLike and computeTraceSummary

In packages/core/src/evaluation/trace.ts, expand OutputMessageLike and rewrite computeTraceSummary:

interface OutputMessageLike {
  readonly role?: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly toolCalls?: readonly {
    readonly tool: string;
    readonly durationMs?: number;
    readonly startTime?: string;
    readonly endTime?: string;
  }[];
}

export function computeTraceSummary(messages: readonly OutputMessageLike[]): TraceSummary {
  const toolCallCounts: Record<string, number> = {};
  const toolDurationsMap: Record<string, number[]> = {};
  let totalToolCalls = 0;
  let llmCallCount = 0;
  let earliestStart: string | undefined;
  let latestEnd: string | undefined;
  let hasAnyDuration = false;

  for (const message of messages) {
    if (message.role === 'assistant') {
      llmCallCount++;
    }

    // Collect message-level timestamps for overall boundaries
    if (message.startTime) {
      if (!earliestStart || message.startTime < earliestStart) {
        earliestStart = message.startTime;
      }
    }
    if (message.endTime) {
      if (!latestEnd || message.endTime > latestEnd) {
        latestEnd = message.endTime;
      }
    }

    if (!message.toolCalls) continue;

    for (const toolCall of message.toolCalls) {
      toolCallCounts[toolCall.tool] = (toolCallCounts[toolCall.tool] ?? 0) + 1;
      totalToolCalls++;

      // Derive duration: prefer explicit durationMs, fall back to startTime/endTime
      let duration = toolCall.durationMs;
      if (duration === undefined && toolCall.startTime && toolCall.endTime) {
        duration = new Date(toolCall.endTime).getTime() - new Date(toolCall.startTime).getTime();
      }

      if (duration !== undefined) {
        hasAnyDuration = true;
        if (!toolDurationsMap[toolCall.tool]) {
          toolDurationsMap[toolCall.tool] = [];
        }
        toolDurationsMap[toolCall.tool].push(duration);
      }

      // Tool call timestamps also contribute to overall boundaries
      if (toolCall.startTime) {
        if (!earliestStart || toolCall.startTime < earliestStart) {
          earliestStart = toolCall.startTime;
        }
      }
      if (toolCall.endTime) {
        if (!latestEnd || toolCall.endTime > latestEnd) {
          latestEnd = toolCall.endTime;
        }
      }
    }
  }

  const toolNames = Object.keys(toolCallCounts).sort();

  return {
    eventCount: totalToolCalls,
    toolNames,
    toolCallsByName: toolCallCounts,
    errorCount: 0,
    ...(earliestStart ? { startTime: earliestStart } : {}),
    ...(latestEnd ? { endTime: latestEnd } : {}),
    ...(hasAnyDuration ? { toolDurations: toolDurationsMap } : {}),
    ...(llmCallCount > 0 ? { llmCallCount } : {}),
  };
}

Step 4: Update mergeExecutionMetrics to pass through startTime/endTime

export function mergeExecutionMetrics(
  summary: TraceSummary,
  metrics?: ExecutionMetrics,
): TraceSummary {
  if (!metrics) return summary;

  return {
    ...summary,
    tokenUsage: metrics.tokenUsage,
    costUsd: metrics.costUsd,
    durationMs: metrics.durationMs,
    // Provider-level startTime/endTime override message-derived ones
    ...(metrics.startTime ? { startTime: metrics.startTime } : {}),
    ...(metrics.endTime ? { endTime: metrics.endTime } : {}),
  };
}

Step 5: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: All new tests pass.

Step 6: Run full test suite

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All tests pass.

Step 7: Commit

git add -A && git commit -m "feat: derive startTime/endTime/toolDurations/llmCallCount in computeTraceSummary"

Task 3: Update orchestrator to pass startTime/endTime through mergeExecutionMetrics

Files:

  • Modify: packages/core/src/evaluation/orchestrator.ts

Step 1: Update both mergeExecutionMetrics call sites

In orchestrator.ts, at lines ~466-471 and ~634-639, add startTime/endTime from providerResponse:

const traceSummary = baseSummary
  ? mergeExecutionMetrics(baseSummary, {
      tokenUsage: providerResponse.tokenUsage,
      costUsd: providerResponse.costUsd,
      durationMs: providerResponse.durationMs,
      startTime: providerResponse.startTime,
      endTime: providerResponse.endTime,
    })
  : undefined;

Step 2: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 3: Commit

git add -A && git commit -m "feat: pass startTime/endTime from ProviderResponse through mergeExecutionMetrics"

Task 4: Update providers to populate startTime/endTime

Files:

  • Modify: packages/core/src/evaluation/providers/cli.ts (Zod schemas + mapping)
  • Modify: packages/core/src/evaluation/providers/pi-agent-sdk.ts
  • Modify: packages/core/src/evaluation/providers/pi-coding-agent.ts

Step 1: Update CLI provider Zod schemas

In cli.ts, update ToolCallSchema and OutputMessageInputSchema:

  • Rename timestamp → start_time (snake_case for external schema)
  • Add end_time
  • Map to startTime/endTime in the conversion code

Per the no-soft-deprecation decision in the Architecture note, do not keep timestamp as an alias in the Zod schema (e.g. via .or() or .transform()) — rename it to start_time outright and add end_time:

const ToolCallSchema = z.object({
  tool: z.string(),
  input: z.unknown().optional(),
  output: z.unknown().optional(),
  id: z.string().optional(),
  start_time: z.string().optional(),
  end_time: z.string().optional(),
  duration_ms: z.number().optional(),
});

const OutputMessageInputSchema = z.object({
  role: z.string(),
  name: z.string().optional(),
  content: z.unknown().optional(),
  tool_calls: z.array(ToolCallSchema).optional(),
  start_time: z.string().optional(),
  end_time: z.string().optional(),
  duration_ms: z.number().optional(),
  metadata: z.record(z.unknown()).optional(),
});

Update the mapping code that converts parsed Zod output into ToolCall/OutputMessage objects to use startTime/endTime.
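
The snake_case-to-camelCase conversion can look like the following sketch; the field names follow the schemas above, but toToolCall is a hypothetical helper, not the actual function in cli.ts:

```typescript
// Hypothetical mapping helper from parsed Zod output (snake_case) to the
// internal ToolCall shape (camelCase). Minimal local types for illustration.
interface ParsedToolCall {
  tool: string;
  input?: unknown;
  output?: unknown;
  id?: string;
  start_time?: string;
  end_time?: string;
  duration_ms?: number;
}

interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
  readonly output?: unknown;
  readonly id?: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
}

function toToolCall(parsed: ParsedToolCall): ToolCall {
  // Spread-conditionals keep absent optional fields off the object entirely.
  return {
    tool: parsed.tool,
    ...(parsed.input !== undefined ? { input: parsed.input } : {}),
    ...(parsed.output !== undefined ? { output: parsed.output } : {}),
    ...(parsed.id !== undefined ? { id: parsed.id } : {}),
    ...(parsed.start_time !== undefined ? { startTime: parsed.start_time } : {}),
    ...(parsed.end_time !== undefined ? { endTime: parsed.end_time } : {}),
    ...(parsed.duration_ms !== undefined ? { durationMs: parsed.duration_ms } : {}),
  };
}
```

The OutputMessage mapping follows the same shape, additionally mapping tool_calls through toToolCall.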

Step 2: Update Pi Agent SDK provider

In pi-agent-sdk.ts, change timestamp extraction to startTime:

// Where it currently sets timestamp, change to startTime
startTime: typeof msg.timestamp === 'number'
  ? new Date(msg.timestamp).toISOString()
  : typeof msg.timestamp === 'string'
    ? msg.timestamp
    : undefined,

Step 3: Update Pi Coding Agent provider

Same pattern as pi-agent-sdk.
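
Both Pi providers can share one conversion helper; the sketch below assumes the SDK emits timestamps as either epoch milliseconds or ISO strings (an assumption about the SDK payloads, and toIsoStartTime is a hypothetical name):

```typescript
// Hypothetical shared helper for pi-agent-sdk.ts / pi-coding-agent.ts.
// Assumes timestamps arrive as epoch milliseconds or ISO 8601 strings.
function toIsoStartTime(timestamp: unknown): string | undefined {
  if (typeof timestamp === 'number') return new Date(timestamp).toISOString();
  if (typeof timestamp === 'string') return timestamp;
  return undefined; // no timestamp available on this message
}
```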

Step 4: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 5: Commit

git add -A && git commit -m "feat: update providers to populate startTime/endTime"

Task 5: Add aggregate threshold fields to tool_trajectory evaluator

Files:

  • Modify: packages/core/src/evaluation/trace.ts (ToolTrajectoryEvaluatorConfig)
  • Modify: packages/core/src/evaluation/evaluators/tool-trajectory.ts (evaluation logic)
  • Modify: packages/core/src/evaluation/loaders/evaluator-parser.ts (YAML parsing)
  • Modify: packages/core/test/evaluation/tool-trajectory-evaluator.test.ts (tests)

Step 1: Write failing tests

Add to packages/core/test/evaluation/tool-trajectory-evaluator.test.ts:

describe('aggregate thresholds', () => {
  it('fails when max_total_duration_ms exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxTotalDurationMs: 1000,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, durationMs: 2000,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).toContainEqual(expect.stringContaining('total duration'));
  });

  it('passes when max_total_duration_ms not exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxTotalDurationMs: 5000,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, durationMs: 2000,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).not.toContainEqual(expect.stringContaining('total duration'));
  });

  it('fails when max_llm_calls exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxLlmCalls: 2,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: [{ tool: 'search' }] },
    ];
    const traceSummary: TraceSummary = {
      eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
      errorCount: 0, llmCallCount: 5,
    };
    const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
    expect(result.misses).toContainEqual(expect.stringContaining('LLM calls'));
  });

  it('fails when max_tool_calls exceeded', () => {
    const config: ToolTrajectoryEvaluatorConfig = {
      name: 'test',
      type: 'tool_trajectory',
      mode: 'any_order',
      minimums: { search: 1 },
      maxToolCalls: 5,
    };
    const evaluator = new ToolTrajectoryEvaluator({ config });
    const outputMessages: OutputMessage[] = [
      { role: 'assistant', toolCalls: Array.from({ length: 10 }, () => ({ tool: 'search' })) },
    ];
    const result = evaluator.evaluate(createContext({ outputMessages }));
    expect(result.misses).toContainEqual(expect.stringContaining('tool calls'));
  });
});

Step 2: Run tests to verify they fail

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: FAIL (fields don't exist on config type).

Step 3: Update ToolTrajectoryEvaluatorConfig

In packages/core/src/evaluation/trace.ts:

export interface ToolTrajectoryEvaluatorConfig {
  readonly name: string;
  readonly type: 'tool_trajectory';
  readonly mode: 'any_order' | 'in_order' | 'exact';
  readonly minimums?: Readonly<Record<string, number>>;
  readonly expected?: readonly ToolTrajectoryExpectedItem[];
  readonly weight?: number;
  readonly maxTotalDurationMs?: number;
  readonly maxLlmCalls?: number;
  readonly maxToolCalls?: number;
}

Step 4: Add aggregate checks to ToolTrajectoryEvaluator

In tool-trajectory.ts, add a private method that checks aggregate thresholds and call it from evaluate(). The method checks against traceSummary (for durationMs, llmCallCount) and against the extracted tool calls count. Violations count as misses in the scoring.

private checkAggregateThresholds(
  toolCallCount: number,
  traceSummary?: TraceSummary,
): { hits: string[]; misses: string[] } {
  const hits: string[] = [];
  const misses: string[] = [];

  if (this.config.maxTotalDurationMs !== undefined) {
    const actual = traceSummary?.durationMs;
    if (actual !== undefined) {
      if (actual <= this.config.maxTotalDurationMs) {
        hits.push(`total duration ${actual}ms within limit (max: ${this.config.maxTotalDurationMs}ms)`);
      } else {
        misses.push(`total duration ${actual}ms exceeded limit (max: ${this.config.maxTotalDurationMs}ms)`);
      }
    }
  }

  if (this.config.maxLlmCalls !== undefined) {
    const actual = traceSummary?.llmCallCount;
    if (actual !== undefined) {
      if (actual <= this.config.maxLlmCalls) {
        hits.push(`${actual} LLM calls within limit (max: ${this.config.maxLlmCalls})`);
      } else {
        misses.push(`${actual} LLM calls exceeded limit (max: ${this.config.maxLlmCalls})`);
      }
    }
  }

  if (this.config.maxToolCalls !== undefined) {
    if (toolCallCount <= this.config.maxToolCalls) {
      hits.push(`${toolCallCount} tool calls within limit (max: ${this.config.maxToolCalls})`);
    } else {
      misses.push(`${toolCallCount} tool calls exceeded limit (max: ${this.config.maxToolCalls})`);
    }
  }

  return { hits, misses };
}

Call this from evaluate() and merge the results into the final score. Aggregate threshold checks count toward the total assertion count for scoring.
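
The merge step in evaluate() could follow this sketch; the score formula (hits over total assertions) is an assumption about how the evaluator scores, and mergeAssertionResults is an illustrative name:

```typescript
// Illustrative merge of baseline trajectory checks with the aggregate
// threshold checks; assumes score = hits / (hits + misses).
function mergeAssertionResults(
  base: { hits: string[]; misses: string[] },
  aggregate: { hits: string[]; misses: string[] },
): { hits: string[]; misses: string[]; score: number } {
  const hits = [...base.hits, ...aggregate.hits];
  const misses = [...base.misses, ...aggregate.misses];
  const total = hits.length + misses.length;
  // An evaluator with no assertions at all is treated as a pass here.
  return { hits, misses, score: total === 0 ? 1 : hits.length / total };
}
```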

Step 5: Fix evaluator-parser.ts to parse new fields and maxDurationMs

In evaluator-parser.ts, within the tool_trajectory block (~line 347):

// Parse aggregate thresholds
const maxTotalDurationMs = typeof rawEvaluator.max_total_duration_ms === 'number'
  ? rawEvaluator.max_total_duration_ms : undefined;
const maxLlmCalls = typeof rawEvaluator.max_llm_calls === 'number'
  ? rawEvaluator.max_llm_calls : undefined;
const maxToolCalls = typeof rawEvaluator.max_tool_calls === 'number'
  ? rawEvaluator.max_tool_calls : undefined;

const config: ToolTrajectoryEvaluatorConfig = {
  name,
  type: 'tool_trajectory',
  mode,
  ...(minimums ? { minimums } : {}),
  ...(expected ? { expected } : {}),
  ...(weight !== undefined ? { weight } : {}),
  ...(maxTotalDurationMs !== undefined ? { maxTotalDurationMs } : {}),
  ...(maxLlmCalls !== undefined ? { maxLlmCalls } : {}),
  ...(maxToolCalls !== undefined ? { maxToolCalls } : {}),
};

Also fix the existing bug where maxDurationMs is not parsed from expected items (~line 325):

const maxDurationMs = typeof item.max_duration_ms === 'number' ? item.max_duration_ms : undefined;
expected.push({
  tool: item.tool,
  ...(args !== undefined ? { args } : {}),
  ...(maxDurationMs !== undefined ? { maxDurationMs } : {}),
});

Step 6: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: All pass.

Step 7: Run full suite

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 8: Commit

git add -A && git commit -m "feat: add aggregate thresholds (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator"

Task 6: Add --trace flag and TraceWriter for trace persistence

Files:

  • Create: apps/cli/src/commands/eval/trace-writer.ts
  • Modify: apps/cli/src/commands/eval/index.ts (add --trace flag)
  • Modify: apps/cli/src/commands/eval/run-eval.ts (thread trace flag, write traces)
  • Create: apps/cli/test/commands/eval/trace-writer.test.ts

Step 1: Write TraceWriter tests

Create apps/cli/test/commands/eval/trace-writer.test.ts:

import { describe, expect, it, afterEach } from 'bun:test';
import { existsSync, readFileSync, rmSync, mkdirSync } from 'node:fs';
import path from 'node:path';
import { TraceWriter } from '../../../src/commands/eval/trace-writer.js';

const TEMP_DIR = path.join(import.meta.dir, '.test-traces');

afterEach(() => {
  if (existsSync(TEMP_DIR)) rmSync(TEMP_DIR, { recursive: true });
});

describe('TraceWriter', () => {
  it('writes trace records as JSONL', async () => {
    mkdirSync(TEMP_DIR, { recursive: true });
    const filePath = path.join(TEMP_DIR, 'test.trace.jsonl');
    const writer = await TraceWriter.open(filePath);

    await writer.append({
      evalId: 'test-1',
      startTime: '2024-01-15T09:00:00Z',
      endTime: '2024-01-15T09:00:05Z',
      durationMs: 5000,
      spans: [
        { type: 'tool', name: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 },
      ],
    });

    await writer.close();

    const content = readFileSync(filePath, 'utf8').trim();
    const record = JSON.parse(content);
    expect(record.eval_id).toBe('test-1');
    expect(record.spans).toHaveLength(1);
  });
});

Step 2: Implement TraceWriter

Create apps/cli/src/commands/eval/trace-writer.ts:

import { createWriteStream } from 'node:fs';
import { mkdir } from 'node:fs/promises';
import path from 'node:path';
import { finished } from 'node:stream/promises';
import { Mutex } from 'async-mutex';
import { toSnakeCaseDeep } from '../../utils/case-conversion.js';

export interface TraceRecord {
  readonly evalId: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly spans: readonly TraceSpan[];
  readonly tokenUsage?: { readonly input: number; readonly output: number; readonly cached?: number };
  readonly costUsd?: number;
}

export interface TraceSpan {
  readonly type: 'tool';
  readonly name: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly input?: unknown;
  readonly output?: unknown;
}

export class TraceWriter {
  private readonly stream: ReturnType<typeof createWriteStream>;
  private readonly mutex = new Mutex();
  private closed = false;

  private constructor(stream: ReturnType<typeof createWriteStream>) {
    this.stream = stream;
  }

  static async open(filePath: string): Promise<TraceWriter> {
    await mkdir(path.dirname(filePath), { recursive: true });
    const stream = createWriteStream(filePath, { flags: 'w', encoding: 'utf8' });
    return new TraceWriter(stream);
  }

  async append(record: TraceRecord): Promise<void> {
    await this.mutex.runExclusive(async () => {
      if (this.closed) throw new Error('Cannot write to closed trace writer');
      const snakeCaseRecord = toSnakeCaseDeep(record);
      const line = `${JSON.stringify(snakeCaseRecord)}\n`;
      if (!this.stream.write(line)) {
        await new Promise<void>((resolve, reject) => {
          this.stream.once('drain', resolve);
          this.stream.once('error', reject);
        });
      }
    });
  }

  async close(): Promise<void> {
    if (this.closed) return;
    this.closed = true;
    this.stream.end();
    await finished(this.stream);
  }
}

Step 3: Add --trace flag to CLI

In apps/cli/src/commands/eval/index.ts, add:

trace: flag({
  long: 'trace',
  description: 'Save full execution traces to .agentv/traces/',
}),

Pass it through rawOptions:

trace: args.trace,

Step 4: Thread trace flag through run-eval.ts

In run-eval.ts:

  • Add trace: boolean to NormalizedOptions
  • In normalizeOptions: trace: normalizeBoolean(rawOptions.trace)
  • Create trace writer when options.trace is true
  • Build TraceRecord from outputMessages and traceSummary in the onResult callback
  • Pass outputMessages alongside EvaluationResult (requires threading through or adding to the result temporarily)

The key challenge: onResult only receives an EvaluationResult, which does not contain outputMessages, so they must be captured at the orchestrator level.

Chosen approach: add outputMessages?: readonly OutputMessage[] to EvaluationResult in types.ts and have the orchestrator populate it at both result construction sites (the messages are already in scope there). Because the results writer serializes the full EvaluationResult via toSnakeCaseDeep, leaving outputMessages on the record would bloat the results JSONL; so in the onResult callback in run-eval.ts, write the trace record first, then strip outputMessages before handing the result to the output writer.

Implementation in orchestrator.ts:
Add outputMessages to the result object at both construction sites (~line 745 and error handling):

return {
  // ... existing fields
  traceSummary,
  outputMessages,  // NEW: included for trace persistence
};

Implementation in run-eval.ts:

onResult: async (result: EvaluationResult) => {
  // Write trace if enabled
  if (traceWriter && result.outputMessages) {
    await traceWriter.append(buildTraceRecord(result));
  }
  // Strip outputMessages before writing to results
  const { outputMessages: _, ...resultWithoutMessages } = result;
  await outputWriter.append(resultWithoutMessages as EvaluationResult);
},
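
buildTraceRecord is referenced above but not yet defined; a minimal sketch, assuming spans are flattened from per-message tool calls and top-level timing comes from traceSummary (both assumptions about the final record shape):

```typescript
// Hypothetical buildTraceRecord; minimal local types stand in for
// EvaluationResult and the TraceRecord/TraceSpan interfaces from Step 2.
interface MinimalResult {
  evalId: string;
  traceSummary?: { startTime?: string; endTime?: string; durationMs?: number };
  outputMessages?: readonly {
    toolCalls?: readonly {
      tool: string;
      startTime?: string;
      endTime?: string;
      durationMs?: number;
    }[];
  }[];
}

function buildTraceRecord(result: MinimalResult) {
  // One span per tool call, flattened across messages.
  const spans = (result.outputMessages ?? []).flatMap((message) =>
    (message.toolCalls ?? []).map((tc) => ({
      type: 'tool' as const,
      name: tc.tool,
      ...(tc.startTime ? { startTime: tc.startTime } : {}),
      ...(tc.endTime ? { endTime: tc.endTime } : {}),
      ...(tc.durationMs !== undefined ? { durationMs: tc.durationMs } : {}),
    })),
  );
  const summary = result.traceSummary;
  return {
    evalId: result.evalId,
    ...(summary?.startTime ? { startTime: summary.startTime } : {}),
    ...(summary?.endTime ? { endTime: summary.endTime } : {}),
    ...(summary?.durationMs !== undefined ? { durationMs: summary.durationMs } : {}),
    spans,
  };
}
```

TraceWriter.append then snake-cases this record, so evalId lands in the JSONL as eval_id.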

Step 5: Update EvaluationResult type

In packages/core/src/evaluation/types.ts, add:

export interface EvaluationResult {
  // ... existing fields
  readonly outputMessages?: readonly OutputMessage[];
}

Step 6: Update orchestrator to include outputMessages in result

In packages/core/src/evaluation/orchestrator.ts, at both result construction sites, add outputMessages.

Step 7: Build and test

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.

Step 8: Commit

git add -A && git commit -m "feat: add --trace flag for persisting full execution traces to .agentv/traces/"

Task 7: Update evaluator-parser tests and add integration validation

Files:

  • Modify: packages/core/test/evaluation/loaders/evaluator-parser.test.ts

Step 1: Add tests for new tool_trajectory fields

Add tests verifying:

  • max_total_duration_ms is parsed from YAML
  • max_llm_calls is parsed from YAML
  • max_tool_calls is parsed from YAML
  • max_duration_ms on expected items is parsed (existing bug fix)
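
The threshold parsing those tests should exercise can be sketched standalone; parseAggregateThresholds is an illustrative stand-in mirroring the Task 5 Step 5 code, not the real evaluator-parser API:

```typescript
// Standalone sketch of the aggregate-threshold parsing the new tests cover:
// numeric snake_case YAML fields map to camelCase config fields, and
// non-numeric values are silently dropped.
function parseAggregateThresholds(raw: Record<string, unknown>) {
  const pick = (key: string): number | undefined =>
    typeof raw[key] === 'number' ? (raw[key] as number) : undefined;
  const maxTotalDurationMs = pick('max_total_duration_ms');
  const maxLlmCalls = pick('max_llm_calls');
  const maxToolCalls = pick('max_tool_calls');
  return {
    ...(maxTotalDurationMs !== undefined ? { maxTotalDurationMs } : {}),
    ...(maxLlmCalls !== undefined ? { maxLlmCalls } : {}),
    ...(maxToolCalls !== undefined ? { maxToolCalls } : {}),
  };
}
```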

Step 2: Run tests

Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All pass.

Step 3: Commit

git add -A && git commit -m "test: add evaluator-parser tests for tool_trajectory aggregate thresholds"

Task 8: Update examples and documentation

Files:

  • Modify: relevant example YAML files in examples/
  • Modify: apps/web/src/content/docs/ (docs site)
  • Modify: .claude/skills/agentv-eval-builder/ (skill files)

Step 1: Update examples

Add startTime/endTime to any example trace data. Add an example using aggregate thresholds:

evaluators:
  - name: efficiency-check
    type: tool_trajectory
    mode: in_order
    max_total_duration_ms: 10000
    max_llm_calls: 5
    max_tool_calls: 20
    expected:
      - tool: search
        max_duration_ms: 2000
      - tool: summarize

Step 2: Update docs and skills

Update documentation pages related to:

  • Trace format (startTime/endTime)
  • tool_trajectory evaluator (aggregate thresholds)
  • CLI reference (--trace flag)

Update eval-builder skill with new fields.

Step 3: Commit

git add -A && git commit -m "docs: update examples, docs, and skills for trace timestamps and aggregate thresholds"

Task 9: Final validation

Step 1: Full build + typecheck + lint + test

cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run typecheck && bun run lint && bun run test

Step 2: Fix any issues

Step 3: Final commit if needed
