Consolidate eval callbacks into unified event system (#755) #837
estsauver wants to merge 10 commits into PrimeIntellect-ai:main
Conversation
Earl St Sauver seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
…ai#755) Replace three separate callbacks (on_start, on_progress, on_log) with a single event-based system using TypedDict events with Literal discriminators.

Removed: StartCallback, ProgressCallback, LogCallback

Added 7 new event TypedDicts with a `type` discriminator:
- StartEvent: emitted at start with resolved counts
- ProgressEvent: emitted after each rollout/group completes
- GroupCompleteEvent: for PR PrimeIntellect-ai#632, includes State objects
- LogEvent: for log messages with level/source/timestamp
- LogStreamEvent: infrastructure for PrimeIntellect-ai#753 log streaming
- SaveEvent: when results are saved (intermediate or final)
- CompleteEvent: when generation finishes

Also:
- Add EvalEvent union type and EventHandler callback type
- Add LogStreamFileWriter: writes log_stream events to files for tailing
- Update generate() and evaluate() signatures: single on_event parameter
- Add _emit_event() helper using maybe_await()
- Add _run_group_with_states() to preserve State objects

Emit events at all key points:
- StartEvent with accurate num_examples calculation
- ProgressEvent after each completion
- GroupCompleteEvent for non-independent scoring (with States)
- SaveEvent for intermediate and final saves
- LogEvent for important messages
- CompleteEvent at the end

TUI and consumers:
- Update run_evaluation() signature to use on_event
- Rewrite run_evaluations_tui() with a match/case event handler
- Preserve all metric accumulation logic
- Update comments to reflect the new event system
- Add TODO for full log_streams implementation

Tests:
- Event type structure tests
- LogStreamFileWriter tests (creation, appending, custom paths)
- All 35 eval-related tests pass

Removed callback parameters from Environment.generate(), Environment.evaluate(), and run_evaluation(); replaced with `on_event: EventHandler`.

Related:
- Issue PrimeIntellect-ai#755: Consolidate callbacks for eval TUI
- PR PrimeIntellect-ai#632: GroupCompleteEvent with State objects (infrastructure)
- Issue PrimeIntellect-ai#753: Log streaming infrastructure (LogStreamEvent + writer)
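The commit message above describes TypedDict events with Literal discriminators. A minimal sketch of that pattern, assuming illustrative field names (the real schema lives in `verifiers/types.py` and may differ):

```python
from typing import Callable, Literal, TypedDict, Union

# Three of the seven events, with hypothetical fields beyond the summary above.
class StartEvent(TypedDict):
    type: Literal["start"]
    num_examples: int
    rollouts_per_example: int

class ProgressEvent(TypedDict):
    type: Literal["progress"]
    completed_count: int

class CompleteEvent(TypedDict):
    type: Literal["complete"]
    elapsed_seconds: float

# The union plus the Literal "type" field forms a tagged union: checking
# event["type"] lets a type checker narrow the event to a single member.
EvalEvent = Union[StartEvent, ProgressEvent, CompleteEvent]
EventHandler = Callable[[EvalEvent], None]
```

At runtime these are plain dicts, so emitting an event is just constructing a dict literal whose `type` key matches one discriminator.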
Created comprehensive e2e test suite with 4 realistic scenarios:
1. Simple independent scoring evaluation
2. Grouped scoring with multiple rollouts (tests GroupCompleteEvent)
3. Intermediate saves (tests SaveEvent)
4. Progress tracking with metrics

Each scenario validates:
- Event emission order (start -> progress -> complete)
- Event data completeness
- Correct event counts
- StartEvent with resolved num_examples/rollouts
- CompleteEvent with timing and metrics

All 4 scenarios pass, demonstrating the event system works correctly in realistic usage patterns. Related to PrimeIntellect-ai#755
Addresses three bugs found in code review:

1. Server mode bypass (HIGH): Fixed grouped scoring to check for server mode before calling _run_group_with_states(). In server mode, properly routes through run_group() instead of bypassing server dispatch.
2. Incorrect num_examples calculation (MEDIUM): Fixed StartEvent to report correct num_examples when independent_scoring=True with rollouts_per_example > 1. Added configured_rollouts_per_example parameter to generate() for accurate calculation.
3. Missing documentation (LOW): Added comprehensive documentation for event types and EventHandler to docs/reference.md and usage examples to docs/evaluation.md.

Tests: Added test_bugfix_event_system.py with 5 tests covering all fixes. All 19 event system tests now pass (10 unit + 4 e2e + 5 bugfix tests).
Problem: Events were storing direct references to builder.outputs and new_outputs lists, which would mutate as more results were added. This made stored events (e.g., in EventCollector) misleading: all_outputs would grow even after the event was emitted.

Fix: Copy lists when creating events to ensure immutability.

Tests: Added test_event_immutability.py with 2 tests verifying that ProgressEvent and GroupCompleteEvent data doesn't mutate after emission.
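A minimal illustration of the aliasing bug and the fix (field names follow this PR's summary; the real emission code is in `Environment.generate()`):

```python
def make_progress_event(all_outputs, new_outputs):
    # Copy the lists at emission time so a stored event is a snapshot,
    # not a live view of the builder's mutable output lists.
    return {
        "type": "progress",
        "all_outputs": list(all_outputs),  # copy, not a shared reference
        "new_outputs": list(new_outputs),
    }

outputs = ["rollout-1"]
event = make_progress_event(outputs, ["rollout-1"])
outputs.append("rollout-2")  # the builder keeps mutating after emission
assert event["all_outputs"] == ["rollout-1"]  # the snapshot is unaffected
```

Without the `list(...)` copies, the assertion would fail: the event's `all_outputs` would silently grow to two entries.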
Problem: Changed 'elif on_progress is not None:' to a bare 'else:', which unconditionally creates ProgressEvent objects (including list copies) even when on_event=None. This causes O(N²) allocations across N completions, affecting production code like GEPA that calls generate() with no event handler.

Fix: Use 'elif on_event is not None:' to skip event construction when no handler is registered, matching the original callback pattern.

Found by: Cursor Bugbot
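The guard is easy to sketch in isolation (names are illustrative, not the actual `generate()` internals): defer event construction behind the `None` check so that callers with no handler never pay for the list copies.

```python
def emit_progress(on_event, build_event):
    # Build the event (and its list copies) only when a handler exists,
    # keeping generate(on_event=None) at O(1) extra work per completion.
    if on_event is not None:
        on_event(build_event())

constructions = []

def build():
    constructions.append(1)  # record that an event object was allocated
    return {"type": "progress"}

emit_progress(None, build)
assert constructions == []       # no handler: nothing allocated
emit_progress(lambda e: None, build)
assert constructions == [1]      # handler present: event built exactly once
```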
```python
{
    name: metrics_accum[name] / completed
    for name in metrics_accum
}
error_rate = error_accum / completed if completed > 0 else 0
```
TUI metrics wrong when resuming evaluations
Medium Severity
When resuming a partially completed evaluation, completed_count from ProgressEvent includes resumed outputs (via len(builder.outputs) after builder.add_outputs(resumed_outputs)), but the TUI handler's rolling accumulators (reward_accum, metrics_accum, error_accum) only accumulate from new_outputs in each event. Dividing by completed therefore produces artificially low averages since the denominator counts resumed outputs that were never added to the numerator.
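One hedged way to fix this (the function and names below are illustrative, not the actual TUI handler) is to seed the rolling accumulators from the resumed outputs, so the numerator and the `completed` denominator cover the same set of rollouts:

```python
def average_reward(resumed_rewards, new_rewards):
    # Seed the accumulator with resumed outputs so dividing by the overall
    # completed count no longer understates the average on resume.
    reward_accum = sum(resumed_rewards)
    completed = len(resumed_rewards)
    for r in new_rewards:  # new_outputs arriving via ProgressEvent
        reward_accum += r
        completed += 1
    return reward_accum / completed if completed > 0 else 0.0

# 2 resumed rollouts (reward 1.0) + 2 new rollouts (reward 0.5)
assert average_reward([1.0, 1.0], [0.5, 0.5]) == 0.75
```

With the buggy version (accumulating only `new_rewards` but dividing by all four completions), the same inputs would report 0.25 instead of 0.75.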
The configured_rollouts_per_example parameter was added but never used. The existing logic using len(set([example_id])) already correctly calculates num_examples by counting unique example IDs.

- Remove unused parameter from generate() signature
- Remove parameter passing in evaluate()
- Update test documentation to clarify it tests correct existing behavior
- Update PR description to document this fix as Bug 5
The generate() method performed incremental saves after each task completion but never emitted SaveEvent with is_intermediate=True, making the TUI save handler dead code.

- Emit SaveEvent(is_intermediate=True) after each incremental save
- Add test coverage for intermediate save events
- Update PR description to document this fix as Bug 3
- TUI handler now correctly displays checkpoint messages

Fixes bugbot feedback about dead code in the TUI save handler.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
else:
    remaining_rollouts = len(filtered_inputs)
    saved_rollouts = total_rollouts - remaining_rollouts
```
Non-TUI path lost default progress bar entirely
Medium Severity
The old generate() method provided default callbacks (default_on_start, default_on_progress, default_on_log) that created a tqdm progress bar when no custom callbacks were given. These were removed without a replacement. The non-TUI evaluation path (run_evaluations()) calls run_evaluation() without an on_event handler, so evaluations now run completely silently with no progress indication — a regression from the prior behavior.
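A sketch of a default fallback handler that would restore progress output on the non-TUI path. The removed defaults used tqdm; a plain stream writer is used here to keep the sketch dependency-free, and all names and event fields are assumptions:

```python
import sys

def make_default_on_event(stream=sys.stderr):
    # Closure-held state stands in for the tqdm bar the old defaults created.
    state = {"total": 0}

    def on_event(event):
        if event["type"] == "start":
            state["total"] = event["num_examples"] * event["rollouts_per_example"]
        elif event["type"] == "progress":
            # \r overwrites the line in place, tqdm-style.
            stream.write(f"\r{event['completed_count']}/{state['total']}")
        elif event["type"] == "complete":
            stream.write("\ndone\n")

    return on_event
```

`run_evaluation()` could then fall back to `make_default_on_event()` when the caller passes no handler, matching the old behavior where progress was shown unless explicitly suppressed.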
```diff
-    on_start: StartCallback | None = None,
-    on_progress: ProgressCallback | None = None,
-    on_log: LogCallback | None = None,
+    on_event: EventHandler | None = None,
```
GEPA adapter passes removed callback parameters to generate
High Severity
The GEPA adapter at verifiers/gepa/adapter.py still passes on_start=do_nothing and on_progress=do_nothing to env.generate(), but this commit removed those parameters from generate() (replaced with on_event). This will cause a TypeError at runtime, crashing any GEPA-based evaluation. The PR states "All internal usages have been migrated" but this caller was missed.


Description
This PR consolidates three separate callbacks (`on_start`, `on_progress`, `on_log`) into a unified event-based system, addressing issue #755. The new system uses TypedDict events with Literal discriminators for type-safe pattern matching, making it cleaner, more extensible, and easier to maintain.

Context
I'm new to this codebase and the broader Prime Intellect ecosystem, so I focused on:
Feedback welcome, especially on areas where I might have missed broader integration concerns!
Motivation
The previous callback system had grown brittle, with three separate callback parameters that felt "tailor-made" for specific use cases. This PR consolidates them into a single `on_event` handler.

Before / After
Before:
After:
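The Before/After snippets did not survive extraction; here is a hedged reconstruction based on the signatures this PR describes (handler bodies and callback argument lists are illustrative):

```python
# Before: three separate callbacks (removed by this PR)
# await env.evaluate(
#     on_start=lambda num_examples, rollouts: ...,
#     on_progress=lambda completed, total: ...,
#     on_log=lambda message: ...,
# )

# After: a single handler receiving typed events
lines = []

def on_event(event):
    if event["type"] == "start":
        lines.append(f"{event['num_examples']} examples")
    elif event["type"] == "progress":
        lines.append(f"{event['completed_count']} done")

# await env.evaluate(on_event=on_event)
```

The single-parameter shape is what lets new event kinds be added later without changing the `evaluate()` signature again.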
Type of Change
Note: This is a breaking change that removes the `on_start`, `on_progress`, and `on_log` parameters. All internal usages have been migrated to the new event system.

Changes
New Event Types (`verifiers/types.py`)
- `StartEvent` with resolved `num_examples` and `rollouts_per_example`
- `ProgressEvent` with `all_outputs` and `new_outputs`
- `GroupCompleteEvent` with `State` objects (addresses PR #632, "Feature: Add on_group_complete callback for streaming eval results")

Event Emission (`verifiers/envs/environment.py`)
- Updated `generate()` and `evaluate()` signatures to use the `on_event` parameter
- Added `_emit_event()` helper using `maybe_await()` for sync/async handlers
- Added `_run_group_with_states()` internal method to preserve State objects for GroupCompleteEvent

Event Consumption (`verifiers/utils/eval_utils.py`)
- Rewrote `run_evaluations_tui()` to use a `match`/`case` pattern for event handling

Supporting Infrastructure
- `verifiers/utils/event_utils.py`: LogStreamFileWriter for log tailing
- `verifiers/utils/eval_display.py`: comments updated for the new event system
Comprehensive test coverage demonstrates the system works correctly:
Unit Tests (`tests/test_event_system.py`, 10 tests)

E2E Scenarios (`tests/test_event_system_e2e.py`, 4 scenarios)

Standalone executable script with realistic integration scenarios:

```
uv run python tests/test_event_system_e2e.py
```

Integration Testing
- Ran `uv run pytest` locally (514 tests pass, 4 skipped external env tests)
- Verified the `vf-eval` command: progress bar and TUI work correctly
Ran an actual evaluation with `vf-eval` using a test environment:

✅ Progress bar updates correctly (requires ProgressEvent)
✅ Results display properly (requires CompleteEvent)
Checklist
Additional Notes
Design Decisions
Why TypedDict over dataclasses?
Why break backward compatibility?
State Preservation Strategy
- Added a `_run_group_with_states()` method to return both State objects and outputs
- The public `run_group()` API remains unchanged (returns only outputs)

Future Work
Related Issues
Note
Medium Risk
Breaking API change in core evaluation/generation paths; regressions could affect progress reporting, checkpointing, or grouped/server-mode execution despite added tests and guards.
Overview
Switches the evaluation/generation callback API from three separate hooks (`on_start`, `on_progress`, `on_log`) to a single `on_event` handler carrying typed `EvalEvent` payloads (`start`, `progress`, `group_complete`, `save`, `log`, `log_stream`, `complete`), and updates `Environment.generate()`/`evaluate()` to emit these events at key lifecycle points.

Adds grouped-scoring state preservation via `_run_group_with_states()` to support `GroupCompleteEvent` (while preserving server-mode routing through `run_group()`), emits intermediate `SaveEvent`s during incremental checkpointing, and avoids unnecessary event construction when no handler is registered (plus copies list fields to prevent post-emission mutation).

Migrates the CLI/TUI progress display to consume events, introduces a `LogStreamFileWriter` utility for `log_stream` events, updates docs (`docs/reference.md`, `docs/evaluation.md`), and adds focused unit/e2e tests covering event structure, ordering, server-mode behavior, save events, and immutability.

Written by Cursor Bugbot for commit cc8f0cb.