
perf(player): p0-1c live-playback parity test via SSIM #401

Open
vanceingalls wants to merge 1 commit into perf/p0-1b-perf-tests-for-fps-scrub-drift from perf/p0-1c-live-playback-parity-test

Conversation


@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Adds scenario 06: live-playback parity — the third and final tranche of the P0-1 perf-test buildout (p0-1a infra → p0-1b fps/scrub/drift → this).

The scenario plays the gsap-heavy fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with ffmpeg -lavfi ssim and the resulting average SSIM is emitted as parity_ssim_min. Baseline gate: SSIM ≥ 0.95.

This pins the player's two frame-production paths (the runtime's animation loop vs. _trySyncSeek) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

Motivation

<hyperframes-player> produces frames two different ways:

  1. Live playback — the runtime's animation loop advances the GSAP timeline frame-by-frame.
  2. Synchronous seek (_trySyncSeek, landed in feat(player): synchronous seek() API with same-origin detection #397) — for same-origin embeds, the player calls into the iframe runtime's seek() directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

gsap-heavy is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.
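As a back-of-the-envelope model of why t=5.0s is a worst-case sample point, consider the sketch below. The stagger schedule here is an assumption for illustration, not the fixture's actual code: 60 tiles, each running a 4s tween, with start times spread evenly across the first 6s of the 10s composition.

```typescript
// Hypothetical model: count how many tiles are mid-tween at time t.
// The evenly-spread stagger schedule is an assumption, not the fixture's code.
function tilesMidFlight(t: number, tiles = 60, tweenS = 4, compS = 10): number {
  let midFlight = 0;
  for (let i = 0; i < tiles; i++) {
    const start = (i / tiles) * (compS - tweenS); // assumed staggered start time
    if (t > start && t < start + tweenS) midFlight++;
  }
  return midFlight;
}
```

Under this assumed schedule, `tilesMidFlight(5)` reports roughly two-thirds of the tiles mid-tween, which is the property the sample point relies on: many distinct, position-sensitive pixels in one frame.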

What changed

  • packages/player/tests/perf/scenarios/06-parity.ts — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
  • packages/player/tests/perf/index.ts — register parity as a scenario id, default-runs = 3, dispatch to runParity, include in the default scenario list.
  • packages/player/tests/perf/perf-gate.ts — extend PerfBaseline with paritySsimMin.
  • packages/player/tests/perf/baseline.json — add paritySsimMin: 0.95.
  • .github/workflows/player-perf.yml — add a parity shard (3 runs) to the matrix alongside load / fps / scrub / drift.

How the scenario works

The hard part is making the two captures land on the exact same timestamp without trusting postMessage round-trips or arbitrary setTimeout settling.

  1. Install an iframe-side rAF watcher before issuing play(). The watcher polls __player.getTime() every animation frame and, the first time getTime() >= 5.0, calls __player.pause() from inside the same rAF tick. pause() is synchronous (it calls timeline.pause()), so the timeline freezes at exactly that getTime() value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical T_actual for the run.
  2. Confirm isPlaying() === true via frame.waitForFunction before awaiting the watcher. Without this, the test can hang if play() hasn't kicked the timeline yet.
  3. Wait for paint — two requestAnimationFrame ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as packages/producer/src/parity-harness.ts.
  4. Screenshot the live frame — page.screenshot({ type: "png" }).
  5. Synchronously seek to T_actual — call el.seek(capturedTime) on the host page. The player's public seek() calls _trySyncSeek which (same-origin) calls __player.seek() synchronously, so no postMessage await is needed. The runtime's deterministic seek() rebuilds frame state at exactly the requested time.
  6. Wait for paint again, screenshot the reference frame.
  7. Diff with ffmpeg — ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -. ffmpeg writes per-channel + overall SSIM to stderr; we parse the All: value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
  8. Persist artifacts under tests/perf/results/parity/run-N/ (actual.png, reference.png, captured-time.txt) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing packages/player/tests/perf/results/ rule.
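Step 1's watcher can be sketched roughly as follows. getTime()/pause() mirror the runtime's __player API as described above; the injectable raf parameter is an addition here so the logic can be exercised outside a browser, and the real scenario code may differ.

```typescript
// Minimal sketch of the iframe-side rAF watcher (step 1).
type RuntimePlayer = { getTime(): number; pause(): void };

function waitForTime(
  player: RuntimePlayer,
  targetS: number,
  raf: (cb: () => void) => void = (cb) => (globalThis as any).requestAnimationFrame(cb),
): Promise<number> {
  return new Promise((resolve) => {
    const tick = () => {
      const t = player.getTime();
      if (t >= targetS) {
        // Pause in the SAME tick the threshold is crossed, so the timeline
        // freezes at exactly this getTime() value -- no postMessage round-trip.
        player.pause();
        resolve(t); // canonical T_actual for both captures
        return;
      }
      raf(tick);
    };
    raf(tick);
  });
}
```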

Aggregation

min() across runs, not mean: the worst observed run must clear the gate, so a single bad run can't be masked by averaging. Both per-run scores and the aggregate are logged.
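Steps 7–8's parsing plus this aggregation rule reduce to a few lines. The regex shape is an assumption based on ffmpeg's ssim filter log format (e.g. "[Parsed_ssim_0 @ 0x...] SSIM Y:0.998 (27.1) ... All:0.998 (27.5)"); the scenario's actual helpers may differ.

```typescript
// Parse the overall "All:" SSIM from ffmpeg's ssim-filter stderr output.
function parseSsim(stderr: string): number {
  const m = stderr.match(/All:\s*([\d.]+)/);
  if (!m) throw new Error("[scenario:parity] could not parse SSIM from ffmpeg stderr");
  // ffmpeg occasionally reports 1.000001 on identical inputs; clamp at 1.0.
  return Math.min(Number(m[1]), 1.0);
}

// Aggregate with min() so the worst run is the one the gate evaluates.
function aggregateParity(runScores: number[]): number {
  return Math.min(...runScores);
}
```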

Output metric

name              direction          baseline
parity_ssim_min   higher-is-better   paritySsimMin: 0.95

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.
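For reference, the commit message below notes that the 0.95 baseline combines with an allowed regression ratio of 0.1 to yield an effective gate of ≥ 0.855. That arithmetic reduces to a sketch like this (field and function names are assumptions; perf-gate.ts's actual shape may differ):

```typescript
// Effective gate for a higher-is-better metric: baseline minus the
// allowed regression headroom. Names here are hypothetical.
function effectiveGate(baseline: number, allowedRegressionRatio: number): number {
  return baseline * (1 - allowedRegressionRatio);
}

function passesHigherIsBetter(observed: number, baseline: number, ratio = 0.1): boolean {
  return observed >= effectiveGate(baseline, ratio);
}
```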

Test plan

  • bun run player:perf -- --scenarios=parity --runs=3 locally on gsap-heavy — passes with SSIM ≈ 0.999 across all 3 runs.
  • Inspected results/parity/run-1/actual.png and reference.png side-by-side — visually identical.
  • Inspected captured-time.txt to confirm T_actual lands just past 5.0s (within one frame).
  • Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
  • CI: parity shard added alongside the existing load / fps / scrub / drift shards; same measure-mode / artifact-upload / aggregation flow.
  • bunx oxlint and bunx oxfmt --check clean on the new scenario.

Stack

This is the top of the perf stack:

  1. feat(core): add emitPerformanceMetric bridge for runtime telemetry #393 perf/x-1-emit-performance-metric — performance.measure() emission
  2. perf(player): share PLAYER_STYLES via adoptedStyleSheets #394 perf/p1-1-share-player-styles-via-adopted-stylesheets — adopted stylesheets
  3. perf(player): scope MutationObserver to composition hosts #395 perf/p1-2-scope-media-mutation-observer — scoped MutationObserver
  4. perf(player): coalesce _mirrorParentMediaTime writes #396 perf/p1-4-coalesce-mirror-parent-media-time — coalesce currentTime writes
  5. feat(player): synchronous seek() API with same-origin detection #397 perf/p3-1-sync-seek-same-origin — synchronous seek path (the path this PR pins)
  6. perf(player): srcdoc composition switching for studio #398 perf/p3-2-srcdoc-composition-switching — srcdoc switching
  7. perf(player): p0-1a perf test infra + composition-load smoke test #399 perf/p0-1a-perf-test-infra — server, runner, perf-gate, CI
  8. perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift #400 perf/p0-1b-perf-tests-for-fps-scrub-drift — fps / scrub / drift scenarios
  9. perf(player): p0-1c live-playback parity test via SSIM #401 perf/p0-1c-live-playback-parity-test ← you are here

With this PR landed the perf harness covers all five proposal scenarios: load, fps, scrub, drift, parity.

@jrusso1020 left a comment

The scenario design itself is excellent — play the gsap-heavy fixture, freeze it mid-animation via an in-iframe rAF watcher that calls __player.pause() in the same tick, screenshot, _trySyncSeek back to the frozen timestamp and screenshot a reference, then SSIM-diff with ffmpeg. That's the right shape to pin the two frame-production paths (live animation vs. sync seek) to each other, and the comments explaining why the pause has to happen in the same tick the watcher condition matches make the intent clear.

Blocking: CI is red. The Perf: parity shard fails on every run with:

error: [scenario:parity] ffmpeg ssim failed (exit=undefined):
    at computeSsim (packages/player/tests/perf/scenarios/06-parity.ts:163:15)

Two things wrong here:

  1. The error reporting swallows the real cause. result.status is undefined/null and stderr is empty, which means the child never even started (ENOENT or similar). The real diagnostic info is on result.error, which the current code doesn't surface:

    if (result.status !== 0) {
      const stderr = (result.stderr || Buffer.from("")).toString("utf-8");
      throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
    }

    Please change to something like:

    if (result.error) {
      throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
    }
    if (result.status !== 0) { ... }

    That turns "exit=undefined" into "ffmpeg not found" (or whichever real OS error is firing) on the next CI run, which tells you which of the follow-ups below is needed.

  2. ffmpeg availability on the runner. GitHub's ubuntu-latest (currently 24.04) does include ffmpeg in the pre-installed toolset, so in theory this should Just Work — but the failure pattern looks a lot like ENOENT, and on a Bun child_process polyfill an ENOENT may land as status: undefined with empty stderr (vs Node's status: null). Belt-and-braces fix: add an explicit install step in the parity shard's steps list, or at the top of the job:

    - name: Install ffmpeg for SSIM diff
      if: matrix.shard == 'parity'
      run: sudo apt-get update && sudo apt-get install -y ffmpeg

    Even if ffmpeg is usually present, this pins the scenario against toolset drift on the hosted runner image.

Once computeSsim surfaces the real error and CI turns green, happy to re-review and approve — the scenario itself looks good.
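Putting the suggested ordering together, the spawn-and-check portion of computeSsim could look like this sketch. The ffmpeg argument list is taken from the PR description; the function name and error-message wording are assumptions, not the actual file's contents.

```typescript
import { spawnSync } from "node:child_process";

// Run ffmpeg's ssim filter and return its stderr (where SSIM is logged).
// Check `result.error` (spawn failure, e.g. ENOENT) BEFORE `result.status`
// (non-zero exit), so a missing binary surfaces as a real OS error.
function runFfmpegSsim(refPng: string, actualPng: string, ffmpegBin = "ffmpeg"): string {
  const result = spawnSync(ffmpegBin, [
    "-hide_banner", "-i", refPng, "-i", actualPng, "-lavfi", "ssim", "-f", "null", "-",
  ]);
  if (result.error) {
    // Child never started: ENOENT, EACCES, etc.
    throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
  }
  if (result.status !== 0) {
    const stderr = result.stderr?.toString("utf-8") ?? "";
    throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
  }
  return result.stderr.toString("utf-8");
}
```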

Non-blocking:

  • SSIM baseline of 0.95 is reasonable as a starting point, but depending on how deterministic the fixture's animation is at TARGET_TIME_S and how font/subpixel rendering jitters on the runner, you may want to widen that to 0.92–0.93 for the first few enforcement cycles. It's trivial to tighten later; a false-positive below 0.95 will be a thorny debug because the signal is pixels rather than numbers.
  • Consider writing the diffed pixel map to the results/ artifact directory on failure — when this does fire in anger, a human looking at an SSIM of 0.88 wants to see where the two frames disagree, and the SSIM map is a cheap byproduct of ffmpeg's ssim filter (output it via -lavfi ssim=stats_file=...).

James Russo

@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 4129ab2 to 9542991 on April 22, 2026 00:43
@vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 111e128 to 306c164 on April 22, 2026 00:57
@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 9542991 to c918563 on April 22, 2026 00:57

Adds the 06-parity scenario, which compares a live-playback frame at
~5s on the gsap-heavy fixture against a synchronously-seeked reference
at the same captured timestamp. The two PNG screenshots are diffed via
ffmpeg's SSIM filter; the run reports parity_ssim_min across runs as a
higher-is-better metric (baseline 0.95, allowed regression ratio 0.1
yields effective gate >= 0.855).

The iframe-side rAF watcher pauses the timeline in the same tick that
getTime() crosses 5.0s so the frozen value can be used as the
canonical T_actual for both captures. After two host-side rAF ticks
for paint settlement the actual frame is screenshotted, then el.seek()
(which routes through _trySyncSeek for same-origin iframes) lands the
player on the same time and a second screenshot is taken as the
reference. Per-run PNGs and the captured time are persisted under
results/parity/run-N/ for CI artifact upload and local debugging.

Wires the scenario into index.ts (ScenarioId, dispatcher, DEFAULT_RUNS
= 3), adds a parity shard to player-perf.yml, and adds paritySsimMin
to baseline.json + the PerfBaseline type so the gate can evaluate it.
@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from c918563 to 83d15bb on April 22, 2026 01:38
@vanceingalls

@jrusso1020 @miguel-heygen — both blockers resolved plus both non-blocking suggestions:

Blocker 1 — computeSsim swallowing the real cause (exit=undefined masking ENOENT): addressed in 83d15bb0. computeSsim now checks result.error before result.status — when ffmpeg never starts, the error line surfaces the actual ENOENT/EACCES/etc. with a hint to install ffmpeg, instead of the misleading (exit=undefined) you saw in CI. Old log line is now impossible to produce.

Blocker 2 — ffmpeg availability on the runner: addressed in .github/workflows/player-perf.yml (parity shard) and .github/workflows/windows-render.yml. The parity shard now installs ffmpeg explicitly:

- name: Install ffmpeg (parity shard only)
  if: matrix.shard == 'parity'
  run: |
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends ffmpeg
    ffmpeg -version | head -n 1

Mirror step on the Windows job uses choco install ffmpeg -y --no-progress followed by a where.exe ffmpeg sanity check so the failure mode is "step fails loudly" rather than "scenario fails opaquely". Catalog previews use FedericoCarboni/setup-ffmpeg@v3 for the same reason. Even though ubuntu-latest (24.04) does ship ffmpeg in the toolset, pinning the install removes the runner-image dependency and matches Linux/Windows behaviour.

The non-blocking observations:

Consider writing the diffed pixel map to the results/ artifact directory on failure

Done. 06-parity.ts now has writeSsimStatsOnFailure which invokes ffmpeg with -lavfi "ssim=stats_file=…" on parse / mismatch failure and drops a parity-ssim-stats.txt per-frame breakdown into the run directory. When the shard fails, the artifact bundle now contains both reference + actual PNGs and the per-frame SSIM trace, so triage doesn't require local repro.

SSIM baseline of 0.95 is reasonable as a starting point, but ... you may want to widen that to 0.92–0.93

Adopted. baseline.json now sets paritySsimMin: 0.93. The 0.95 figure was synthetic (two captures of the same renderer under deterministic seek); 0.93 leaves headroom for legitimate sub-pixel jitter without losing the regression signal we care about (catastrophic divergence between live-playback and sync-seek paths).

Nothing else outstanding.
