Trace Timestamps & Persistence Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Add startTime/endTime to trace types, add aggregate threshold fields to tool_trajectory evaluator, and persist full traces to disk via --trace flag.
Architecture: Enrich existing ToolCall, OutputMessage, ProviderResponse, and TraceSummary interfaces with startTime/endTime ISO 8601 fields. Update computeTraceSummary to derive timing from span boundaries. Add aggregate assertion fields (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator. Add --trace CLI flag that writes full outputMessages to .agentv/traces/ as JSONL. Since we have few users, replace timestamp directly with startTime (no soft deprecation).
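For concreteness, one persisted trace line might look like the following sketch (field names assume the snake_case conversion described in Task 6; values are illustrative only):

```typescript
// Hypothetical example of one line in a .agentv/traces/ JSONL file, after
// the camelCase-to-snake_case conversion described in Task 6.
const exampleTraceLine = JSON.stringify({
  eval_id: 'example-1',
  start_time: '2024-01-15T09:00:00Z',
  end_time: '2024-01-15T09:00:05Z',
  duration_ms: 5000,
  spans: [
    {
      type: 'tool',
      name: 'search',
      start_time: '2024-01-15T09:00:00Z',
      end_time: '2024-01-15T09:00:01Z',
      duration_ms: 1000,
    },
  ],
});
console.log(exampleTraceLine);
```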
Tech Stack: TypeScript 5.x, Bun, Vitest, cmd-ts (CLI), Zod (provider schemas)
Task 1: Add startTime/endTime to core type interfaces
Files:
- Modify: packages/core/src/evaluation/providers/types.ts (ToolCall, OutputMessage, ProviderResponse)
- Modify: packages/core/src/evaluation/trace.ts (TraceSummary, ExecutionMetrics)
Step 1: Update ToolCall interface
In packages/core/src/evaluation/providers/types.ts, replace timestamp with startTime/endTime on ToolCall:
export interface ToolCall {
readonly tool: string;
readonly input?: unknown;
readonly output?: unknown;
readonly id?: string;
/** ISO 8601 start time */
readonly startTime?: string;
/** ISO 8601 end time */
readonly endTime?: string;
readonly durationMs?: number;
}
Step 2: Update OutputMessage interface
Replace timestamp with startTime/endTime on OutputMessage:
export interface OutputMessage {
readonly role: string;
readonly name?: string;
readonly content?: unknown;
readonly toolCalls?: readonly ToolCall[];
/** ISO 8601 start time */
readonly startTime?: string;
/** ISO 8601 end time */
readonly endTime?: string;
readonly durationMs?: number;
readonly metadata?: Record<string, unknown>;
}
Step 3: Update ProviderResponse interface
Add startTime/endTime to ProviderResponse:
export interface ProviderResponse {
readonly raw?: unknown;
readonly usage?: JsonObject;
readonly outputMessages?: readonly OutputMessage[];
readonly tokenUsage?: ProviderTokenUsage;
readonly costUsd?: number;
readonly durationMs?: number;
readonly startTime?: string;
readonly endTime?: string;
}
Step 4: Update TraceSummary interface
In packages/core/src/evaluation/trace.ts, add startTime/endTime/llmCallCount to TraceSummary:
export interface TraceSummary {
readonly eventCount: number;
readonly toolNames: readonly string[];
readonly toolCallsByName: Readonly<Record<string, number>>;
readonly errorCount: number;
readonly tokenUsage?: TokenUsage;
readonly costUsd?: number;
readonly durationMs?: number;
readonly toolDurations?: Readonly<Record<string, readonly number[]>>;
readonly startTime?: string;
readonly endTime?: string;
readonly llmCallCount?: number;
}
Step 5: Update ExecutionMetrics interface
Add startTime/endTime to ExecutionMetrics:
export interface ExecutionMetrics {
readonly tokenUsage?: TokenUsage;
readonly costUsd?: number;
readonly durationMs?: number;
readonly startTime?: string;
readonly endTime?: string;
}
Step 6: Fix all compile errors from timestamp→startTime rename
Search the entire codebase for references to the old timestamp field on ToolCall/OutputMessage and update to startTime. Key files:
- packages/core/src/evaluation/providers/cli.ts — Zod schemas (timestamp → start_time, add end_time) and mapping
- packages/core/src/evaluation/providers/pi-agent-sdk.ts — timestamp extraction
- packages/core/src/evaluation/providers/pi-coding-agent.ts — timestamp extraction
- packages/core/test/ — any test fixtures referencing timestamp
- packages/core/src/evaluation/loaders/evaluator-parser.ts — if referenced
- packages/eval/ — Zod schema for ToolCall/OutputMessage if defined there
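A recursive grep can drive the sweep. The snippet below demonstrates the command on a scratch directory so it is self-contained; in the repo, run the commented variant over the packages listed above:

```shell
# Self-contained demo of the sweep: one file still uses the old field name.
mkdir -p /tmp/rename-sweep && cd /tmp/rename-sweep
printf 'readonly timestamp?: string;\n' > old.ts
printf 'readonly startTime?: string;\n' > new.ts
grep -rln 'timestamp' .
# In the repo root, the equivalent sweep would be:
#   grep -rn 'timestamp' packages/core/src packages/core/test packages/eval
```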
Step 7: Build and fix all type errors
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build
Expected: Clean build with no errors.
Step 8: Run tests
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All 390 tests pass (the new fields are optional, so existing tests remain valid).
Step 9: Commit
git add -A && git commit -m "feat: add startTime/endTime to ToolCall, OutputMessage, ProviderResponse, TraceSummary"
Task 2: Update computeTraceSummary to derive timing from spans
Files:
- Modify: packages/core/src/evaluation/trace.ts
- Create: packages/core/test/evaluation/trace-summary.test.ts
Step 1: Write failing tests
Create packages/core/test/evaluation/trace-summary.test.ts:
import { describe, expect, it } from 'bun:test';
import { computeTraceSummary } from '../../src/evaluation/trace.js';
describe('computeTraceSummary', () => {
it('derives startTime/endTime from message boundaries', () => {
const messages = [
{
role: 'assistant',
startTime: '2024-01-15T09:00:00Z',
endTime: '2024-01-15T09:00:02Z',
toolCalls: [{ tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 }],
},
{
role: 'assistant',
startTime: '2024-01-15T09:00:02Z',
endTime: '2024-01-15T09:00:05Z',
toolCalls: [{ tool: 'fetch', startTime: '2024-01-15T09:00:03Z', endTime: '2024-01-15T09:00:04Z', durationMs: 1000 }],
},
];
const summary = computeTraceSummary(messages);
expect(summary.startTime).toBe('2024-01-15T09:00:00Z');
expect(summary.endTime).toBe('2024-01-15T09:00:05Z');
expect(summary.eventCount).toBe(2);
});
it('computes toolDurations from tool call durationMs', () => {
const messages = [
{
role: 'assistant',
toolCalls: [
{ tool: 'search', durationMs: 100 },
{ tool: 'search', durationMs: 200 },
{ tool: 'fetch', durationMs: 300 },
],
},
];
const summary = computeTraceSummary(messages);
expect(summary.toolDurations).toEqual({ fetch: [300], search: [100, 200] });
});
it('computes durationMs from startTime/endTime on tool calls when durationMs not provided', () => {
const messages = [
{
role: 'assistant',
toolCalls: [
{ tool: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01.500Z' },
],
},
];
const summary = computeTraceSummary(messages);
expect(summary.toolDurations).toEqual({ search: [1500] });
});
it('counts llmCallCount from assistant messages', () => {
const messages = [
{ role: 'assistant', toolCalls: [{ tool: 'search' }] },
{ role: 'tool' },
{ role: 'assistant', toolCalls: [{ tool: 'fetch' }] },
];
const summary = computeTraceSummary(messages);
expect(summary.llmCallCount).toBe(2);
});
it('handles messages with no timing data', () => {
const messages = [
{ role: 'assistant', toolCalls: [{ tool: 'search' }] },
];
const summary = computeTraceSummary(messages);
expect(summary.startTime).toBeUndefined();
expect(summary.endTime).toBeUndefined();
expect(summary.toolDurations).toBeUndefined();
expect(summary.llmCallCount).toBe(1);
});
});
Step 2: Run test to verify it fails
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: FAIL (computeTraceSummary doesn't accept full messages or return timing fields yet).
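One detail worth noting before implementing: ISO 8601 UTC timestamps sort lexicographically in chronological order as long as they share the same suffix and precision, so plain string comparison is enough to find the earliest start and latest end. A quick sanity check:

```typescript
// ISO 8601 UTC timestamps with identical format sort lexicographically in
// chronological order, so string comparison can find boundaries directly.
const a = '2024-01-15T09:00:00Z';
const b = '2024-01-15T09:00:05Z';
console.log(a < b); // true
// Same result as comparing parsed epoch milliseconds:
console.log(new Date(a).getTime() < new Date(b).getTime()); // true
```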
Step 3: Update OutputMessageLike and computeTraceSummary
In packages/core/src/evaluation/trace.ts, expand OutputMessageLike and rewrite computeTraceSummary:
interface OutputMessageLike {
readonly role?: string;
readonly startTime?: string;
readonly endTime?: string;
readonly toolCalls?: readonly {
readonly tool: string;
readonly durationMs?: number;
readonly startTime?: string;
readonly endTime?: string;
}[];
}
export function computeTraceSummary(messages: readonly OutputMessageLike[]): TraceSummary {
const toolCallCounts: Record<string, number> = {};
const toolDurationsMap: Record<string, number[]> = {};
let totalToolCalls = 0;
let llmCallCount = 0;
let earliestStart: string | undefined;
let latestEnd: string | undefined;
let hasAnyDuration = false;
for (const message of messages) {
if (message.role === 'assistant') {
llmCallCount++;
}
// Collect message-level timestamps for overall boundaries
if (message.startTime) {
if (!earliestStart || message.startTime < earliestStart) {
earliestStart = message.startTime;
}
}
if (message.endTime) {
if (!latestEnd || message.endTime > latestEnd) {
latestEnd = message.endTime;
}
}
if (!message.toolCalls) continue;
for (const toolCall of message.toolCalls) {
toolCallCounts[toolCall.tool] = (toolCallCounts[toolCall.tool] ?? 0) + 1;
totalToolCalls++;
// Derive duration: prefer explicit durationMs, fall back to startTime/endTime
let duration = toolCall.durationMs;
if (duration === undefined && toolCall.startTime && toolCall.endTime) {
duration = new Date(toolCall.endTime).getTime() - new Date(toolCall.startTime).getTime();
}
if (duration !== undefined) {
hasAnyDuration = true;
if (!toolDurationsMap[toolCall.tool]) {
toolDurationsMap[toolCall.tool] = [];
}
toolDurationsMap[toolCall.tool].push(duration);
}
// Tool call timestamps also contribute to overall boundaries
if (toolCall.startTime) {
if (!earliestStart || toolCall.startTime < earliestStart) {
earliestStart = toolCall.startTime;
}
}
if (toolCall.endTime) {
if (!latestEnd || toolCall.endTime > latestEnd) {
latestEnd = toolCall.endTime;
}
}
}
}
const toolNames = Object.keys(toolCallCounts).sort();
return {
eventCount: totalToolCalls,
toolNames,
toolCallsByName: toolCallCounts,
errorCount: 0,
...(earliestStart ? { startTime: earliestStart } : {}),
...(latestEnd ? { endTime: latestEnd } : {}),
...(hasAnyDuration ? { toolDurations: toolDurationsMap } : {}),
...(llmCallCount > 0 ? { llmCallCount } : {}),
};
}
Step 4: Update mergeExecutionMetrics to pass through startTime/endTime
export function mergeExecutionMetrics(
summary: TraceSummary,
metrics?: ExecutionMetrics,
): TraceSummary {
if (!metrics) return summary;
return {
...summary,
tokenUsage: metrics.tokenUsage,
costUsd: metrics.costUsd,
durationMs: metrics.durationMs,
// Provider-level startTime/endTime override message-derived ones
...(metrics.startTime ? { startTime: metrics.startTime } : {}),
...(metrics.endTime ? { endTime: metrics.endTime } : {}),
};
}
Step 5: Run tests
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/trace-summary.test.ts
Expected: All new tests pass.
Step 6: Run full test suite
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All tests pass.
Step 7: Commit
git add -A && git commit -m "feat: derive startTime/endTime/toolDurations/llmCallCount in computeTraceSummary"
Task 3: Update orchestrator to pass startTime/endTime through mergeExecutionMetrics
Files:
- Modify: packages/core/src/evaluation/orchestrator.ts
Step 1: Update both mergeExecutionMetrics call sites
In orchestrator.ts, at lines ~466-471 and ~634-639, add startTime/endTime from providerResponse:
const traceSummary = baseSummary
? mergeExecutionMetrics(baseSummary, {
tokenUsage: providerResponse.tokenUsage,
costUsd: providerResponse.costUsd,
durationMs: providerResponse.durationMs,
startTime: providerResponse.startTime,
endTime: providerResponse.endTime,
})
: undefined;
Step 2: Build and test
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.
Step 3: Commit
git add -A && git commit -m "feat: pass startTime/endTime from ProviderResponse through mergeExecutionMetrics"
Task 4: Update providers to populate startTime/endTime
Files:
- Modify: packages/core/src/evaluation/providers/cli.ts (Zod schemas + mapping)
- Modify: packages/core/src/evaluation/providers/pi-agent-sdk.ts
- Modify: packages/core/src/evaluation/providers/pi-coding-agent.ts
Step 1: Update CLI provider Zod schemas
In cli.ts, update ToolCallSchema and OutputMessageInputSchema:
- Rename timestamp → start_time (snake_case for external schema)
- Add end_time
- Map to startTime/endTime in the conversion code
Since we are not doing soft deprecation, do not keep timestamp as a backward-compat alias (no .or() or .transform()); just rename it to start_time and add end_time.
const ToolCallSchema = z.object({
tool: z.string(),
input: z.unknown().optional(),
output: z.unknown().optional(),
id: z.string().optional(),
start_time: z.string().optional(),
end_time: z.string().optional(),
duration_ms: z.number().optional(),
});
const OutputMessageInputSchema = z.object({
role: z.string(),
name: z.string().optional(),
content: z.unknown().optional(),
tool_calls: z.array(ToolCallSchema).optional(),
start_time: z.string().optional(),
end_time: z.string().optional(),
duration_ms: z.number().optional(),
metadata: z.record(z.unknown()).optional(),
});
Update the mapping code that converts parsed Zod output into ToolCall/OutputMessage objects to use startTime/endTime.
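A sketch of that conversion (the parsed input shape mirrors the Zod schema above; the function name and exact location in cli.ts are assumptions):

```typescript
interface ParsedToolCall {
  tool: string;
  input?: unknown;
  output?: unknown;
  id?: string;
  start_time?: string;
  end_time?: string;
  duration_ms?: number;
}

interface ToolCall {
  readonly tool: string;
  readonly input?: unknown;
  readonly output?: unknown;
  readonly id?: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
}

// Map the snake_case external schema onto the camelCase internal interface,
// omitting absent fields rather than emitting explicit `undefined` values.
function toToolCall(parsed: ParsedToolCall): ToolCall {
  return {
    tool: parsed.tool,
    ...(parsed.input !== undefined ? { input: parsed.input } : {}),
    ...(parsed.output !== undefined ? { output: parsed.output } : {}),
    ...(parsed.id !== undefined ? { id: parsed.id } : {}),
    ...(parsed.start_time !== undefined ? { startTime: parsed.start_time } : {}),
    ...(parsed.end_time !== undefined ? { endTime: parsed.end_time } : {}),
    ...(parsed.duration_ms !== undefined ? { durationMs: parsed.duration_ms } : {}),
  };
}
```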
Step 2: Update Pi Agent SDK provider
In pi-agent-sdk.ts, change timestamp extraction to startTime:
// Where it currently sets timestamp, change to startTime
startTime: typeof msg.timestamp === 'number'
? new Date(msg.timestamp).toISOString()
: typeof msg.timestamp === 'string'
? msg.timestamp
: undefined,
Step 3: Update Pi Coding Agent provider
Same pattern as pi-agent-sdk.
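The shared extraction logic in both providers can be factored as a small helper (a sketch; both files may equally keep it inline):

```typescript
// Shared normalization (sketch): provider events may carry a numeric epoch-ms
// timestamp or an ISO string; normalize both to an ISO 8601 string.
function toIsoStartTime(timestamp: unknown): string | undefined {
  if (typeof timestamp === 'number') return new Date(timestamp).toISOString();
  if (typeof timestamp === 'string') return timestamp;
  return undefined;
}

console.log(toIsoStartTime(1705309200000)); // 2024-01-15T09:00:00.000Z
console.log(toIsoStartTime('2024-01-15T09:00:00Z')); // passed through unchanged
console.log(toIsoStartTime(null)); // undefined
```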
Step 4: Build and test
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.
Step 5: Commit
git add -A && git commit -m "feat: update providers to populate startTime/endTime"
Task 5: Add aggregate threshold fields to tool_trajectory evaluator
Files:
- Modify: packages/core/src/evaluation/trace.ts (ToolTrajectoryEvaluatorConfig)
- Modify: packages/core/src/evaluation/evaluators/tool-trajectory.ts (evaluation logic)
- Modify: packages/core/src/evaluation/loaders/evaluator-parser.ts (YAML parsing)
- Modify: packages/core/test/evaluation/tool-trajectory-evaluator.test.ts (tests)
Step 1: Write failing tests
Add to packages/core/test/evaluation/tool-trajectory-evaluator.test.ts:
describe('aggregate thresholds', () => {
it('fails when max_total_duration_ms exceeded', () => {
const config: ToolTrajectoryEvaluatorConfig = {
name: 'test',
type: 'tool_trajectory',
mode: 'any_order',
minimums: { search: 1 },
maxTotalDurationMs: 1000,
};
const evaluator = new ToolTrajectoryEvaluator({ config });
const outputMessages: OutputMessage[] = [
{ role: 'assistant', toolCalls: [{ tool: 'search' }] },
];
const traceSummary: TraceSummary = {
eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
errorCount: 0, durationMs: 2000,
};
const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
expect(result.misses).toContainEqual(expect.stringContaining('total duration'));
});
it('passes when max_total_duration_ms not exceeded', () => {
const config: ToolTrajectoryEvaluatorConfig = {
name: 'test',
type: 'tool_trajectory',
mode: 'any_order',
minimums: { search: 1 },
maxTotalDurationMs: 5000,
};
const evaluator = new ToolTrajectoryEvaluator({ config });
const outputMessages: OutputMessage[] = [
{ role: 'assistant', toolCalls: [{ tool: 'search' }] },
];
const traceSummary: TraceSummary = {
eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
errorCount: 0, durationMs: 2000,
};
const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
expect(result.misses).not.toContainEqual(expect.stringContaining('total duration'));
});
it('fails when max_llm_calls exceeded', () => {
const config: ToolTrajectoryEvaluatorConfig = {
name: 'test',
type: 'tool_trajectory',
mode: 'any_order',
minimums: { search: 1 },
maxLlmCalls: 2,
};
const evaluator = new ToolTrajectoryEvaluator({ config });
const outputMessages: OutputMessage[] = [
{ role: 'assistant', toolCalls: [{ tool: 'search' }] },
];
const traceSummary: TraceSummary = {
eventCount: 1, toolNames: ['search'], toolCallsByName: { search: 1 },
errorCount: 0, llmCallCount: 5,
};
const result = evaluator.evaluate(createContext({ outputMessages, traceSummary }));
expect(result.misses).toContainEqual(expect.stringContaining('LLM calls'));
});
it('fails when max_tool_calls exceeded', () => {
const config: ToolTrajectoryEvaluatorConfig = {
name: 'test',
type: 'tool_trajectory',
mode: 'any_order',
minimums: { search: 1 },
maxToolCalls: 5,
};
const evaluator = new ToolTrajectoryEvaluator({ config });
const outputMessages: OutputMessage[] = [
{ role: 'assistant', toolCalls: Array.from({ length: 10 }, () => ({ tool: 'search' })) },
];
const result = evaluator.evaluate(createContext({ outputMessages }));
expect(result.misses).toContainEqual(expect.stringContaining('tool calls'));
});
});
Step 2: Run tests to verify they fail
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: FAIL (fields don't exist on config type).
Step 3: Update ToolTrajectoryEvaluatorConfig
In packages/core/src/evaluation/trace.ts:
export interface ToolTrajectoryEvaluatorConfig {
readonly name: string;
readonly type: 'tool_trajectory';
readonly mode: 'any_order' | 'in_order' | 'exact';
readonly minimums?: Readonly<Record<string, number>>;
readonly expected?: readonly ToolTrajectoryExpectedItem[];
readonly weight?: number;
readonly maxTotalDurationMs?: number;
readonly maxLlmCalls?: number;
readonly maxToolCalls?: number;
}
Step 4: Add aggregate checks to ToolTrajectoryEvaluator
In tool-trajectory.ts, add a private method that checks aggregate thresholds and call it from evaluate(). The method checks against traceSummary (for durationMs, llmCallCount) and against the extracted tool calls count. Violations count as misses in the scoring.
private checkAggregateThresholds(
toolCallCount: number,
traceSummary?: TraceSummary,
): { hits: string[]; misses: string[] } {
const hits: string[] = [];
const misses: string[] = [];
if (this.config.maxTotalDurationMs !== undefined) {
const actual = traceSummary?.durationMs;
if (actual !== undefined) {
if (actual <= this.config.maxTotalDurationMs) {
hits.push(`total duration ${actual}ms within limit (max: ${this.config.maxTotalDurationMs}ms)`);
} else {
misses.push(`total duration ${actual}ms exceeded limit (max: ${this.config.maxTotalDurationMs}ms)`);
}
}
}
if (this.config.maxLlmCalls !== undefined) {
const actual = traceSummary?.llmCallCount;
if (actual !== undefined) {
if (actual <= this.config.maxLlmCalls) {
hits.push(`${actual} LLM calls within limit (max: ${this.config.maxLlmCalls})`);
} else {
misses.push(`${actual} LLM calls exceeded limit (max: ${this.config.maxLlmCalls})`);
}
}
}
if (this.config.maxToolCalls !== undefined) {
if (toolCallCount <= this.config.maxToolCalls) {
hits.push(`${toolCallCount} tool calls within limit (max: ${this.config.maxToolCalls})`);
} else {
misses.push(`${toolCallCount} tool calls exceeded limit (max: ${this.config.maxToolCalls})`);
}
}
return { hits, misses };
}
Call this from evaluate() and merge the results into the final score. Aggregate threshold checks count toward the total assertion count for scoring.
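How the merge might look (a sketch; the actual result shape of evaluate() in tool-trajectory.ts may differ, and the scoring convention assumed here is score = hits / (hits + misses)):

```typescript
interface AssertionOutcome {
  readonly hits: readonly string[];
  readonly misses: readonly string[];
}

// Combine trajectory assertions with aggregate-threshold checks so that
// every threshold counts as one assertion toward the final score.
function mergeOutcomes(trajectory: AssertionOutcome, aggregate: AssertionOutcome) {
  const hits = [...trajectory.hits, ...aggregate.hits];
  const misses = [...trajectory.misses, ...aggregate.misses];
  const total = hits.length + misses.length;
  return { hits, misses, score: total === 0 ? 1 : hits.length / total };
}
```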
Step 5: Fix evaluator-parser.ts to parse new fields and maxDurationMs
In evaluator-parser.ts, within the tool_trajectory block (~line 347):
// Parse aggregate thresholds
const maxTotalDurationMs = typeof rawEvaluator.max_total_duration_ms === 'number'
? rawEvaluator.max_total_duration_ms : undefined;
const maxLlmCalls = typeof rawEvaluator.max_llm_calls === 'number'
? rawEvaluator.max_llm_calls : undefined;
const maxToolCalls = typeof rawEvaluator.max_tool_calls === 'number'
? rawEvaluator.max_tool_calls : undefined;
const config: ToolTrajectoryEvaluatorConfig = {
name,
type: 'tool_trajectory',
mode,
...(minimums ? { minimums } : {}),
...(expected ? { expected } : {}),
...(weight !== undefined ? { weight } : {}),
...(maxTotalDurationMs !== undefined ? { maxTotalDurationMs } : {}),
...(maxLlmCalls !== undefined ? { maxLlmCalls } : {}),
...(maxToolCalls !== undefined ? { maxToolCalls } : {}),
};
Also fix the existing bug where maxDurationMs is not parsed from expected items (~line 325):
const maxDurationMs = typeof item.max_duration_ms === 'number' ? item.max_duration_ms : undefined;
expected.push({
tool: item.tool,
...(args !== undefined ? { args } : {}),
...(maxDurationMs !== undefined ? { maxDurationMs } : {}),
});
Step 6: Run tests
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun test packages/core/test/evaluation/tool-trajectory-evaluator.test.ts
Expected: All pass.
Step 7: Run full suite
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.
Step 8: Commit
git add -A && git commit -m "feat: add aggregate thresholds (max_total_duration_ms, max_llm_calls, max_tool_calls) to tool_trajectory evaluator"
Task 6: Add --trace flag and TraceWriter for trace persistence
Files:
- Create: apps/cli/src/commands/eval/trace-writer.ts
- Modify: apps/cli/src/commands/eval/index.ts (add --trace flag)
- Modify: apps/cli/src/commands/eval/run-eval.ts (thread trace flag, write traces)
- Create: apps/cli/test/commands/eval/trace-writer.test.ts
Step 1: Write TraceWriter tests
Create apps/cli/test/commands/eval/trace-writer.test.ts:
import { describe, expect, it, afterEach } from 'bun:test';
import { existsSync, readFileSync, rmSync, mkdirSync } from 'node:fs';
import path from 'node:path';
import { TraceWriter } from '../../../src/commands/eval/trace-writer.js';
const TEMP_DIR = path.join(import.meta.dir, '.test-traces');
afterEach(() => {
if (existsSync(TEMP_DIR)) rmSync(TEMP_DIR, { recursive: true });
});
describe('TraceWriter', () => {
it('writes trace records as JSONL', async () => {
mkdirSync(TEMP_DIR, { recursive: true });
const filePath = path.join(TEMP_DIR, 'test.trace.jsonl');
const writer = await TraceWriter.open(filePath);
await writer.append({
evalId: 'test-1',
startTime: '2024-01-15T09:00:00Z',
endTime: '2024-01-15T09:00:05Z',
durationMs: 5000,
spans: [
{ type: 'tool', name: 'search', startTime: '2024-01-15T09:00:00Z', endTime: '2024-01-15T09:00:01Z', durationMs: 1000 },
],
});
await writer.close();
const content = readFileSync(filePath, 'utf8').trim();
const record = JSON.parse(content);
expect(record.eval_id).toBe('test-1');
expect(record.spans).toHaveLength(1);
});
});
Step 2: Implement TraceWriter
Create apps/cli/src/commands/eval/trace-writer.ts:
import { createWriteStream } from 'node:fs';
import { mkdir } from 'node:fs/promises';
import path from 'node:path';
import { finished } from 'node:stream/promises';
import { Mutex } from 'async-mutex';
import { toSnakeCaseDeep } from '../../utils/case-conversion.js';
export interface TraceRecord {
readonly evalId: string;
readonly startTime?: string;
readonly endTime?: string;
readonly durationMs?: number;
readonly spans: readonly TraceSpan[];
readonly tokenUsage?: { readonly input: number; readonly output: number; readonly cached?: number };
readonly costUsd?: number;
}
export interface TraceSpan {
readonly type: 'tool';
readonly name: string;
readonly startTime?: string;
readonly endTime?: string;
readonly durationMs?: number;
readonly input?: unknown;
readonly output?: unknown;
}
export class TraceWriter {
private readonly stream: ReturnType<typeof createWriteStream>;
private readonly mutex = new Mutex();
private closed = false;
private constructor(stream: ReturnType<typeof createWriteStream>) {
this.stream = stream;
}
static async open(filePath: string): Promise<TraceWriter> {
await mkdir(path.dirname(filePath), { recursive: true });
const stream = createWriteStream(filePath, { flags: 'w', encoding: 'utf8' });
return new TraceWriter(stream);
}
async append(record: TraceRecord): Promise<void> {
await this.mutex.runExclusive(async () => {
if (this.closed) throw new Error('Cannot write to closed trace writer');
const snakeCaseRecord = toSnakeCaseDeep(record);
const line = `${JSON.stringify(snakeCaseRecord)}\n`;
if (!this.stream.write(line)) {
await new Promise<void>((resolve, reject) => {
this.stream.once('drain', resolve);
this.stream.once('error', reject);
});
}
});
}
async close(): Promise<void> {
if (this.closed) return;
this.closed = true;
this.stream.end();
await finished(this.stream);
}
}
Step 3: Add --trace flag to CLI
In apps/cli/src/commands/eval/index.ts, add:
trace: flag({
long: 'trace',
description: 'Save full execution traces to .agentv/traces/',
}),
Pass it through rawOptions:
trace: args.trace,
Step 4: Thread trace flag through run-eval.ts
In run-eval.ts:
- Add trace: boolean to NormalizedOptions
- In normalizeOptions: trace: normalizeBoolean(rawOptions.trace)
- Create a trace writer when options.trace is true
- Build a TraceRecord from outputMessages and traceSummary in the onResult callback
- Pass outputMessages alongside EvaluationResult (requires threading through or adding to the result temporarily)
The key challenge: onResult only receives EvaluationResult, which doesn't contain outputMessages, so the messages must be captured at the orchestrator level.
Chosen approach: add an optional outputMessages?: readonly OutputMessage[] field to EvaluationResult in types.ts and always populate it at result construction (the orchestrator already has the messages in scope), avoiding the need to thread a new capture option through the runner. Because the results writer serializes the full EvaluationResult via toSnakeCaseDeep, leaving outputMessages on the record would bloat the results JSONL; so in the onResult callback in run-eval.ts, write the trace record first, then strip outputMessages before passing the result to the output writer.
Implementation in orchestrator.ts:
Add outputMessages to the result object at both construction sites (~line 745 and error handling):
return {
// ... existing fields
traceSummary,
outputMessages, // NEW: included for trace persistence
};
Implementation in run-eval.ts:
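The callback that follows references a buildTraceRecord helper; a minimal sketch of what it could do (the local Like interfaces and the evalId field are assumptions standing in for the real EvaluationResult shape; span fields follow the TraceRecord/TraceSpan interfaces from Step 2):

```typescript
interface ToolCallLike {
  readonly tool: string;
  readonly startTime?: string;
  readonly endTime?: string;
  readonly durationMs?: number;
  readonly input?: unknown;
  readonly output?: unknown;
}

interface ResultLike {
  readonly evalId: string;
  readonly outputMessages?: readonly { readonly toolCalls?: readonly ToolCallLike[] }[];
  readonly traceSummary?: {
    readonly startTime?: string;
    readonly endTime?: string;
    readonly durationMs?: number;
  };
}

// Flatten every tool call in the output messages into a 'tool' span and
// lift overall timing from the trace summary.
function buildTraceRecord(result: ResultLike) {
  const spans = (result.outputMessages ?? []).flatMap((message) =>
    (message.toolCalls ?? []).map((call) => ({
      type: 'tool' as const,
      name: call.tool,
      startTime: call.startTime,
      endTime: call.endTime,
      durationMs: call.durationMs,
      input: call.input,
      output: call.output,
    })),
  );
  return {
    evalId: result.evalId,
    startTime: result.traceSummary?.startTime,
    endTime: result.traceSummary?.endTime,
    durationMs: result.traceSummary?.durationMs,
    spans,
  };
}
```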
onResult: async (result: EvaluationResult) => {
// Write trace if enabled
if (traceWriter && result.outputMessages) {
await traceWriter.append(buildTraceRecord(result));
}
// Strip outputMessages before writing to results
const { outputMessages: _, ...resultWithoutMessages } = result;
await outputWriter.append(resultWithoutMessages as EvaluationResult);
},
Step 5: Update EvaluationResult type
In packages/core/src/evaluation/types.ts, add:
export interface EvaluationResult {
// ... existing fields
readonly outputMessages?: readonly OutputMessage[];
}
Step 6: Update orchestrator to include outputMessages in result
In packages/core/src/evaluation/orchestrator.ts, at both result construction sites, add outputMessages.
Step 7: Build and test
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run test
Expected: All pass.
Step 8: Commit
git add -A && git commit -m "feat: add --trace flag for persisting full execution traces to .agentv/traces/"
Task 7: Update evaluator-parser tests and add integration validation
Files:
- Modify: packages/core/test/evaluation/loaders/evaluator-parser.test.ts
Step 1: Add tests for new tool_trajectory fields
Add tests verifying:
- max_total_duration_ms is parsed from YAML
- max_llm_calls is parsed from YAML
- max_tool_calls is parsed from YAML
- max_duration_ms on expected items is parsed (existing bug fix)
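The real tests should go through the actual parser entry point in evaluator-parser.ts; the pattern they exercise is the typeof guard from Task 5 Step 5, shown here as a standalone demo:

```typescript
// Standalone demo of the typeof-number guards the parser uses: non-numeric
// values are dropped rather than coerced.
function parseAggregateThresholds(raw: Record<string, unknown>) {
  return {
    maxTotalDurationMs:
      typeof raw.max_total_duration_ms === 'number' ? raw.max_total_duration_ms : undefined,
    maxLlmCalls: typeof raw.max_llm_calls === 'number' ? raw.max_llm_calls : undefined,
    maxToolCalls: typeof raw.max_tool_calls === 'number' ? raw.max_tool_calls : undefined,
  };
}

console.log(parseAggregateThresholds({ max_total_duration_ms: 10000, max_llm_calls: '5' }));
```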
Step 2: Run tests
Run: cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run test
Expected: All pass.
Step 3: Commit
git add -A && git commit -m "test: add evaluator-parser tests for tool_trajectory aggregate thresholds"
Task 8: Update examples and documentation
Files:
- Modify: relevant example YAML files in examples/
- Modify: apps/web/src/content/docs/ (docs site)
- Modify: .claude/skills/agentv-eval-builder/ (skill files)
Step 1: Update examples
Add startTime/endTime to any example trace data. Add an example using aggregate thresholds:
evaluators:
- name: efficiency-check
type: tool_trajectory
mode: in_order
max_total_duration_ms: 10000
max_llm_calls: 5
max_tool_calls: 20
expected:
- tool: search
max_duration_ms: 2000
- tool: summarize
Step 2: Update docs and skills
Update documentation pages related to:
- Trace format (startTime/endTime)
- tool_trajectory evaluator (aggregate thresholds)
- CLI reference (--trace flag)
Update eval-builder skill with new fields.
Step 3: Commit
git add -A && git commit -m "docs: update examples, docs, and skills for trace timestamps and aggregate thresholds"
Task 9: Final validation
Step 1: Full build + typecheck + lint + test
cd /home/christso/projects/agentv_feat-172-trace-timestamps && bun run build && bun run typecheck && bun run lint && bun run test
Step 2: Fix any issues
Step 3: Final commit if needed