Consolidate eval callbacks into unified event system (#755) by estsauver · Pull Request #837 · PrimeIntellect-ai/verifiers

estsauver · 2026-02-06T11:00:39Z

Description

This PR consolidates three separate callbacks (on_start, on_progress, on_log) into a unified event-based system, addressing issue #755. The new system uses TypedDict events with Literal discriminators for type-safe pattern matching, making it cleaner, more extensible, and easier to maintain.

Context

I'm new to this codebase and the broader Prime Intellect ecosystem, so I focused on:

Understanding the existing patterns (TypedDict, maybe_await, etc.)
Following established conventions
Thorough testing to ensure no regressions
Clear documentation for future maintainers

Feedback welcome, especially on areas where I might have missed broader integration concerns!

Motivation

The previous callback system had grown brittle with three separate callback parameters that felt "tailor-made" for specific use cases. This PR:

✅ Replaces 3 callbacks with 1 unified on_event handler
✅ Adds support for PR Feature: Add on_group_complete callback for streaming eval results. #632 (GroupCompleteEvent with State objects)
✅ Provides infrastructure for stream logs of envs into eval tui #753 (log streaming via LogStreamEvent)
✅ Makes the system more extensible for future event types

Before / After

Before:

await env.evaluate(
    client=client,
    model="gpt-4",
    on_start=lambda total: ...,
    on_progress=lambda all_outs, new_outs: ...,
    on_log=lambda msg: ...
)

After:

async def on_event(event: EvalEvent):
    match event["type"]:
        case "start":
            print(f"Starting: {event['num_examples']} examples")
        case "progress":
            print(f"Progress: {event['completed_count']}/{event['total_count']}")
        case "complete":
            print(f"Done! Avg reward: {event['avg_reward']}")

await env.evaluate(
    client=client,
    model="gpt-4",
    on_event=on_event  # Single unified handler!
)

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Test improvement

Note: This is a breaking change that removes on_start, on_progress, and on_log parameters. All internal usages have been migrated to the new event system.

Changes

New Event Types (`verifiers/types.py`)

StartEvent - Emitted once at start with resolved num_examples and rollouts_per_example
ProgressEvent - Emitted after each rollout/group completes with all_outputs and new_outputs
GroupCompleteEvent - For grouped scoring, includes full State objects (addresses PR Feature: Add on_group_complete callback for streaming eval results. #632)
LogEvent - For log messages with level, source, and timestamp
LogStreamEvent - Infrastructure for streaming logs to files (stream logs of envs into eval tui #753)
SaveEvent - When results are saved (intermediate or final)
CompleteEvent - When generation finishes with timing and metrics

Event Emission (`verifiers/envs/environment.py`)

Updated generate() and evaluate() signatures to use on_event parameter
Added _emit_event() helper using maybe_await() for sync/async handlers
Added _run_group_with_states() internal method to preserve State objects for GroupCompleteEvent
Events emitted at all key points: start, progress, group complete, save, log, complete

Event Consumption (`verifiers/utils/eval_utils.py`)

Migrated run_evaluations_tui() to use match/case pattern for event handling
All metric accumulation logic preserved
Progress bar, display updates, and logging all driven by events

Supporting Infrastructure

New file: verifiers/utils/event_utils.py - LogStreamFileWriter for log tailing
Updated: verifiers/utils/eval_display.py - Comments updated for new event system

Testing

Comprehensive test coverage demonstrates the system works correctly:

Unit Tests (`tests/test_event_system.py` - 10 tests)

✅ Event type structure validation
✅ LogStreamFileWriter functionality (file creation, appending, custom paths)
✅ All tests pass

E2E Scenarios (`tests/test_event_system_e2e.py` - 4 scenarios)

Standalone executable script with realistic integration scenarios:

✅ Scenario 1: Simple independent scoring
✅ Scenario 2: Grouped scoring with multiple rollouts (tests GroupCompleteEvent)
✅ Scenario 3: Intermediate saves (tests SaveEvent emission)
✅ Scenario 4: Progress tracking with metrics
✅ All scenarios validate event order, data completeness, and counts
Run with: uv run python tests/test_event_system_e2e.py

Integration Testing

All existing tests pass when running uv run pytest locally (514 tests pass, 4 skipped external env tests)
New tests have been added to cover the changes (10 unit + 4 e2e scenarios)
Verified with real vf-eval command - progress bar and TUI work correctly

Manual Testing

Ran actual evaluation with vf-eval using a test environment:

$ uv run vf-eval test_config.toml
Processing 2 groups (2 total rollouts): 100%|██████████| 2/2 [00:00<00:00]
Evaluation completed in 1.94 seconds

✅ Progress bar updates correctly (requires ProgressEvent)
✅ Results display properly (requires CompleteEvent)

Checklist

My code follows the style guidelines of this project as outlined in AGENTS.md
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (MEMORY.md added)
My changes generate no new warnings (only pre-existing experimental uv warnings)
Any dependent changes have been merged and published (N/A)

Additional Notes

Design Decisions

Why TypedDict over dataclasses?

Matches existing patterns in the codebase
JSON-serializable by default
Works well with Literal discriminators for type-safe pattern matching

Why break backward compatibility?

The previous callback system was acknowledged as "tailor-made/brittle" (issue consolidate callbacks for eval tui #755)
Clean break is simpler than maintaining an adapter layer
All internal usages migrated in this PR
Better to do it now than carry technical debt

State Preservation Strategy

Created internal _run_group_with_states() method to return both State objects and outputs
Public run_group() API remains unchanged (returns only outputs)
GroupCompleteEvent receives State objects without breaking existing callers

Future Work

Full subprocess log streaming implementation (infrastructure in place)
Additional event types as needed (e.g., ErrorEvent for failures)
TUI features that leverage State objects from GroupCompleteEvent

Related Issues

Closes consolidate callbacks for eval tui #755
Addresses PR Feature: Add on_group_complete callback for streaming eval results. #632 (GroupCompleteEvent infrastructure)
Partial implementation of stream logs of envs into eval tui #753 (log streaming infrastructure)

Note

Medium Risk
Breaking API change in core evaluation/generation paths; regressions could affect progress reporting, checkpointing, or grouped/server-mode execution despite added tests and guards.

Overview
Switches the evaluation/generation callback API from three separate hooks (on_start, on_progress, on_log) to a single on_event handler carrying typed EvalEvent payloads (start, progress, group_complete, save, log, log_stream, complete), and updates Environment.generate()/evaluate() to emit these events at key lifecycle points.

Adds grouped-scoring state preservation via _run_group_with_states() to support GroupCompleteEvent (while preserving server-mode routing through run_group()), emits intermediate SaveEvents during incremental checkpointing, and avoids unnecessary event construction when no handler is registered (plus copies list fields to prevent post-emission mutation).

Migrates the CLI/TUI progress display to consume events, introduces LogStreamFileWriter utility for log_stream events, and updates docs (docs/reference.md, docs/evaluation.md) and adds focused unit/e2e tests covering event structure, ordering, server-mode behavior, save events, and immutability.

^{Written by Cursor Bugbot for commit cc8f0cb. This will update automatically on new commits. Configure here.}

CLAassistant · 2026-02-06T11:00:48Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Earl St Sauver seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

verifiers/envs/environment.py

PR_DESCRIPTION.md

verifiers/envs/environment.py

verifiers/utils/event_utils.py

verifiers/envs/environment.py

…ai#755) Replace three separate callbacks (on_start, on_progress, on_log) with a single event-based system using TypedDict events with Literal discriminators. - Remove: StartCallback, ProgressCallback, LogCallback - Add 7 new event TypedDicts with type discriminator: - StartEvent: emitted at start with resolved counts - ProgressEvent: emitted after each rollout/group completes - GroupCompleteEvent: for PR PrimeIntellect-ai#632, includes State objects - LogEvent: for log messages with level/source/timestamp - LogStreamEvent: infrastructure for PrimeIntellect-ai#753 log streaming - SaveEvent: when results are saved (intermediate or final) - CompleteEvent: when generation finishes - Add EvalEvent union type and EventHandler callback type - LogStreamFileWriter: writes log_stream events to files for tailing - Update generate() and evaluate() signatures: single on_event parameter - Add _emit_event() helper using maybe_await() - Add _run_group_with_states() to preserve State objects - Emit events at all key points: - StartEvent with accurate num_examples calculation - ProgressEvent after each completion - GroupCompleteEvent for non-independent scoring (with States) - SaveEvent for intermediate and final saves - LogEvent for important messages - CompleteEvent at the end - Update run_evaluation() signature to use on_event - Rewrite run_evaluations_tui() with match/case event handler - Preserve all metric accumulation logic - Update comments to reflect new event system - Add TODO for full log_streams implementation - Event type structure tests - LogStreamFileWriter tests (creation, appending, custom paths) - All 35 eval-related tests pass Removed callback parameters from: - Environment.generate() - Environment.evaluate() - run_evaluation() Replaced with: on_event: EventHandler - Issue PrimeIntellect-ai#755: Consolidate callbacks for eval TUI - PR PrimeIntellect-ai#632: GroupCompleteEvent with State objects (infrastructure) - Issue PrimeIntellect-ai#753: Log streaming infrastructure (LogStreamEvent + writer)

Created comprehensive e2e test suite with 4 realistic scenarios: 1. Simple independent scoring evaluation 2. Grouped scoring with multiple rollouts (tests GroupCompleteEvent) 3. Intermediate saves (tests SaveEvent) 4. Progress tracking with metrics Each scenario validates: - Event emission order (start -> progress -> complete) - Event data completeness - Correct event counts - StartEvent with resolved num_examples/rollouts - CompleteEvent with timing and metrics All 4 scenarios pass, demonstrating the event system works correctly in realistic usage patterns. Related to PrimeIntellect-ai#755

Addresses three bugs found in code review: 1. Server mode bypass (HIGH): Fixed grouped scoring to check for server mode before calling _run_group_with_states(). In server mode, properly routes through run_group() instead of bypassing server dispatch. 2. Incorrect num_examples calculation (MEDIUM): Fixed StartEvent to report correct num_examples when independent_scoring=True with rollouts_per_example > 1. Added configured_rollouts_per_example parameter to generate() for accurate calculation. 3. Missing documentation (LOW): Added comprehensive documentation for event types and EventHandler to docs/reference.md and usage examples to docs/evaluation.md. Tests: Added test_bugfix_event_system.py with 5 tests covering all fixes. All 19 event system tests now pass (10 unit + 4 e2e + 5 bugfix tests).

Problem: Events were storing direct references to builder.outputs and new_outputs lists, which would mutate as more results were added. This made stored events (e.g., in EventCollector) misleading - all_outputs would grow even after the event was emitted. Fix: Copy lists when creating events to ensure immutability. Tests: Added test_event_immutability.py with 2 tests verifying that ProgressEvent and GroupCompleteEvent data doesn't mutate after emission.

Problem: Changed 'elif on_progress is not None:' to bare 'else:' which unconditionally creates ProgressEvent objects (including list copies) even when on_event=None. This causes O(N²) allocations across N completions, affecting production code like GEPA that calls generate() with no event handler. Fix: Use 'elif on_event is not None:' to skip event construction when no handler is registered, matching the original callback pattern. Found by: Cursor Bugbot

cursor · 2026-02-06T16:25:25Z

verifiers/utils/eval_utils.py

+                        name: metrics_accum[name] / completed
+                        for name in metrics_accum
+                    }
+                    error_rate = error_accum / completed if completed > 0 else 0


TUI metrics wrong when resuming evaluations

Medium Severity

When resuming a partially completed evaluation, completed_count from ProgressEvent includes resumed outputs (via len(builder.outputs) after builder.add_outputs(resumed_outputs)), but the TUI handler's rolling accumulators (reward_accum, metrics_accum, error_accum) only accumulate from new_outputs in each event. Dividing by completed therefore produces artificially low averages since the denominator counts resumed outputs that were never added to the numerator.

Additional Locations (1)

verifiers/envs/environment.py#L1095-L1099

verifiers/envs/environment.py

The configured_rollouts_per_example parameter was added but never used. The existing logic using len(set([example_id])) already correctly calculates num_examples by counting unique example IDs. - Remove unused parameter from generate() signature - Remove parameter passing in evaluate() - Update test documentation to clarify it tests correct existing behavior - Update PR description to document this fix as Bug 5

verifiers/envs/environment.py

The generate() method performed incremental saves after each task completion but never emitted SaveEvent with is_intermediate=True, making the TUI save handler dead code. - Emit SaveEvent(is_intermediate=True) after each incremental save - Add test coverage for intermediate save events - Update PR description to document this fix as Bug 3 - TUI handler now correctly displays checkpoint messages Fixes bugbot feedback about dead code in TUI save handler.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-02-07T08:30:15Z

verifiers/envs/environment.py

-            else:
-                remaining_rollouts = len(filtered_inputs)
-            saved_rollouts = total_rollouts - remaining_rollouts
-


Non-TUI path lost default progress bar entirely

Medium Severity

The old generate() method provided default callbacks (default_on_start, default_on_progress, default_on_log) that created a tqdm progress bar when no custom callbacks were given. These were removed without a replacement. The non-TUI evaluation path (run_evaluations()) calls run_evaluation() without an on_event handler, so evaluations now run completely silently with no progress indication — a regression from the prior behavior.

cursor · 2026-02-07T08:30:15Z

verifiers/envs/environment.py

-        on_start: StartCallback | None = None,
-        on_progress: ProgressCallback | None = None,
-        on_log: LogCallback | None = None,
+        on_event: EventHandler | None = None,


GEPA adapter passes removed callback parameters to generate

High Severity

The GEPA adapter at verifiers/gepa/adapter.py still passes on_start=do_nothing and on_progress=do_nothing to env.generate(), but this commit removed those parameters from generate() (replaced with on_event). This will cause a TypeError at runtime, crashing any GEPA-based evaluation. The PR states "All internal usages have been migrated" but this caller was missed.

cursor bot reviewed Feb 6, 2026

View reviewed changes

verifiers/envs/environment.py Outdated Show resolved Hide resolved

verifiers/envs/environment.py Show resolved Hide resolved

verifiers/envs/environment.py Outdated Show resolved Hide resolved

cursor bot reviewed Feb 6, 2026

View reviewed changes

PR_DESCRIPTION.md Show resolved Hide resolved

cursor bot reviewed Feb 6, 2026

View reviewed changes

verifiers/envs/environment.py Outdated Show resolved Hide resolved

cursor bot reviewed Feb 6, 2026

View reviewed changes

verifiers/utils/event_utils.py Show resolved Hide resolved

estsauver force-pushed the main branch from 06fb20e to 82710fd Compare February 6, 2026 14:43

cursor bot reviewed Feb 6, 2026

View reviewed changes

verifiers/envs/environment.py Show resolved Hide resolved

Earl St Sauver added 8 commits February 6, 2026 17:09

Update PR description with bugbot fixes

7b18245

Update PR description with immutability fix

38b5ecb

Update PR description with performance fix

7d5b336

estsauver force-pushed the main branch from 6effd51 to 7d5b336 Compare February 6, 2026 16:12

cursor bot reviewed Feb 6, 2026

View reviewed changes

verifiers/envs/environment.py Show resolved Hide resolved

cursor bot reviewed Feb 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate eval callbacks into unified event system (#755)#837

Consolidate eval callbacks into unified event system (#755)#837
estsauver wants to merge 10 commits intoPrimeIntellect-ai:mainfrom
estsauver:main

estsauver commented Feb 6, 2026 •

edited by cursor bot

Loading

Uh oh!

CLAassistant commented Feb 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor bot Feb 6, 2026

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Feb 7, 2026

Uh oh!

cursor bot Feb 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

estsauver commented Feb 6, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Context

Motivation

Before / After

Type of Change

Changes

New Event Types (verifiers/types.py)

Event Emission (verifiers/envs/environment.py)

Event Consumption (verifiers/utils/eval_utils.py)

Supporting Infrastructure

Testing

Unit Tests (tests/test_event_system.py - 10 tests)

E2E Scenarios (tests/test_event_system_e2e.py - 4 scenarios)

Integration Testing

Manual Testing

Checklist

Additional Notes

Design Decisions

Future Work

Related Issues

Uh oh!

CLAassistant commented Feb 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor bot Feb 6, 2026

Choose a reason for hiding this comment

TUI metrics wrong when resuming evaluations

Uh oh!

Uh oh!

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 7, 2026

Choose a reason for hiding this comment

Non-TUI path lost default progress bar entirely

Uh oh!

cursor bot Feb 7, 2026

Choose a reason for hiding this comment

GEPA adapter passes removed callback parameters to generate

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

estsauver commented Feb 6, 2026 •

edited by cursor bot

Loading

New Event Types (`verifiers/types.py`)

Event Emission (`verifiers/envs/environment.py`)

Event Consumption (`verifiers/utils/eval_utils.py`)

Unit Tests (`tests/test_event_system.py` - 10 tests)

E2E Scenarios (`tests/test_event_system_e2e.py` - 4 scenarios)