perf(player): p0-1c live-playback parity test via SSIM #401
vanceingalls wants to merge 1 commit into perf/p0-1b-perf-tests-for-fps-scrub-drift from
Conversation
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite. Learn more about stacking.
jrusso1020
left a comment
The scenario design itself is excellent — playing the gsap-heavy fixture, freezing mid-animation via an in-iframe rAF watcher that calls __player.pause() in the same tick, screenshotting, then _trySyncSeek-ing back to the frozen timestamp and screenshotting a reference, then SSIM-diffing with ffmpeg. That's the right shape to pin the two frame-production paths (live animation vs sync seek) to each other, and the comments explaining why the pause has to happen in the same tick as the watcher match make the intent clear.
Blocking: CI is red. The Perf: parity shard fails on every run with:
```
error: [scenario:parity] ffmpeg ssim failed (exit=undefined):
    at computeSsim (packages/player/tests/perf/scenarios/06-parity.ts:163:15)
```
Two things wrong here:
- The error reporting swallows the real cause. `result.status` is `undefined`/`null` and `stderr` is empty, which means the child never even started (ENOENT or similar). The real diagnostic info is on `result.error`, which the current code doesn't surface:

  ```typescript
  if (result.status !== 0) {
    const stderr = (result.stderr || Buffer.from("")).toString("utf-8");
    throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
  }
  ```
Please change to something like:
  ```typescript
  if (result.error) {
    throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
  }
  if (result.status !== 0) { ... }
  ```
That turns "exit=undefined" into "ffmpeg not found" (or whichever real OS error is firing) on the next CI run, which tells you which of the follow-ups below is needed.
- ffmpeg availability on the runner. GitHub's `ubuntu-latest` (currently 24.04) does include ffmpeg in the pre-installed toolset, so in theory this should Just Work — but the failure pattern looks a lot like ENOENT, and on a Bun child_process polyfill an ENOENT may land as `status: undefined` with empty stderr (vs Node's `status: null`). Belt-and-braces fix: add an explicit install step in the `parity` shard's steps list, or at the top of the job:

  ```yaml
  - name: Install ffmpeg for SSIM diff
    if: matrix.shard == 'parity'
    run: sudo apt-get update && sudo apt-get install -y ffmpeg
  ```
Even if ffmpeg is usually present, this pins the scenario against toolset drift on the hosted runner image.
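Putting the two checks together, a minimal sketch of the suggested error surfacing (type and function names here are illustrative, not the PR's actual identifiers):

```typescript
// Minimal shape of what spawnSync returns; only the fields the check reads.
type SsimSpawnResult = {
  error?: Error;                      // set when the child never started (ENOENT, EACCES, ...)
  status?: number | null;             // exit code; null/undefined when `error` is set
  stderr?: Buffer;
};

// Surface the real failure mode instead of the misleading "exit=undefined".
function checkFfmpegResult(result: SsimSpawnResult): void {
  if (result.error) {
    throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
  }
  if (result.status !== 0) {
    const stderr = (result.stderr ?? Buffer.from("")).toString("utf-8");
    throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
  }
}
```

On an ENOENT this now reports the OS error message instead of `exit=undefined`, which immediately distinguishes "ffmpeg missing" from "ffmpeg ran and failed".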
Once computeSsim surfaces the real error and CI turns green, happy to re-review and approve — the scenario itself looks good.
Non-blocking:
- SSIM baseline of 0.95 is reasonable as a starting point, but depending on how deterministic the fixture's animation is at `TARGET_TIME_S` and how font/subpixel rendering jitters on the runner, you may want to widen that to 0.92–0.93 for the first few enforcement cycles. It's trivial to tighten later; a false positive below 0.95 will be a thorny debug because the signal is pixels rather than numbers.
- Consider writing the diffed pixel map to the `results/` artifact directory on failure — when this does fire in anger, a human looking at an SSIM of 0.88 wants to see where the two frames disagree, and the SSIM map is a cheap byproduct of ffmpeg's `ssim` filter (output it via `-lavfi ssim=stats_file=...`).
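That suggestion could be sketched as an argv builder (a hypothetical helper, not the scenario's actual code; `statsPath` would point into the same `results/` directory the run already persists):

```typescript
// Build the ffmpeg argv so a failing run also leaves a per-frame SSIM stats
// file next to the screenshots as a debugging artifact.
function ssimArgs(referencePng: string, actualPng: string, statsPath: string): string[] {
  return [
    "-hide_banner",
    "-i", referencePng,
    "-i", actualPng,
    "-lavfi", `ssim=stats_file=${statsPath}`, // per-frame SSIM lines written here
    "-f", "null", "-",                        // compare only; no output file
  ];
}
```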
— Rames Jusso
Adds the 06-parity scenario, which compares a live-playback frame at ~5s on the gsap-heavy fixture against a synchronously-seeked reference at the same captured timestamp. The two PNG screenshots are diffed via ffmpeg's SSIM filter; the run reports parity_ssim_min across runs as a higher-is-better metric (baseline 0.95, allowed regression ratio 0.1 yields effective gate >= 0.855). The iframe-side rAF watcher pauses the timeline in the same tick that getTime() crosses 5.0s so the frozen value can be used as the canonical T_actual for both captures. After two host-side rAF ticks for paint settlement the actual frame is screenshotted, then el.seek() (which routes through _trySyncSeek for same-origin iframes) lands the player on the same time and a second screenshot is taken as the reference. Per-run PNGs and the captured time are persisted under results/parity/run-N/ for CI artifact upload and local debugging. Wires the scenario into index.ts (ScenarioId, dispatcher, DEFAULT_RUNS = 3), adds a parity shard to player-perf.yml, and adds paritySsimMin to baseline.json + the PerfBaseline type so the gate can evaluate it.
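The same-tick pause described above can be sketched as follows: a simplified model of the in-iframe watcher, with the runtime handle and rAF injected as parameters for clarity (`freezeAt` is an illustrative name; the real scenario code lives in `06-parity.ts`):

```typescript
// Resolve with the frozen timestamp: pause() is called in the same rAF tick
// that getTime() first crosses the target, so the resolved value is exactly
// the time the timeline froze at (the run's canonical T_actual).
function freezeAt(
  player: { getTime(): number; pause(): void }, // the iframe's __player runtime
  raf: (cb: () => void) => void,                // the iframe's requestAnimationFrame
  targetS: number,
): Promise<number> {
  return new Promise((resolve) => {
    const tick = () => {
      const t = player.getTime();
      if (t >= targetS) {
        player.pause(); // synchronous timeline.pause(): freezes at exactly t
        resolve(t);
        return;
      }
      raf(tick); // keep watching on the next animation frame
    };
    raf(tick);
  });
}
```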
@jrusso1020 @miguel-heygen — both blockers resolved plus both non-blocking suggestions:

Blocker 1 —

Blocker 2 — ffmpeg availability on the runner: addressed in

```yaml
- name: Install ffmpeg (parity shard only)
  if: matrix.shard == 'parity'
  run: |
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends ffmpeg
    ffmpeg -version | head -n 1
```

Mirror step on the Windows job uses

The non-blocking observations:

- Done.
- Adopted.

Nothing else outstanding.

Summary
Adds scenario 06: live-playback parity — the third and final tranche of the P0-1 perf-test buildout (`p0-1a` infra → `p0-1b` fps/scrub/drift → this).

The scenario plays the `gsap-heavy` fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with `ffmpeg -lavfi ssim` and the resulting average SSIM is emitted as `parity_ssim_min`. Baseline gate: SSIM ≥ 0.95.

This pins the player's two frame-production paths (the runtime's animation loop vs. `_trySyncSeek`) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

Motivation
`<hyperframes-player>` produces frames two different ways:

- Live playback — the runtime's animation loop renders each frame as the timeline advances.
- Synchronous seek (`_trySyncSeek`, landed in feat(player): synchronous seek() API with same-origin detection #397) — for same-origin embeds, the player calls into the iframe runtime's `seek()` directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.
`gsap-heavy` is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.

What changed
- `packages/player/tests/perf/scenarios/06-parity.ts` — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
- `packages/player/tests/perf/index.ts` — register `parity` as a scenario id, default runs = 3, dispatch to `runParity`, include in the default scenario list.
- `packages/player/tests/perf/perf-gate.ts` — extend `PerfBaseline` with `paritySsimMin`.
- `packages/player/tests/perf/baseline.json` — `paritySsimMin: 0.95`.
- `.github/workflows/player-perf.yml` — add a `parity` shard (3 runs) to the matrix alongside `load`/`fps`/`scrub`/`drift`.

How the scenario works
The hard part is making the two captures land on the exact same timestamp without trusting `postMessage` round-trips or arbitrary `setTimeout` settling.

1. Install an in-iframe rAF watcher, then call `play()`. The watcher polls `__player.getTime()` every animation frame and, the first time `getTime() >= 5.0`, calls `__player.pause()` from inside the same rAF tick. `pause()` is synchronous (it calls `timeline.pause()`), so the timeline freezes at exactly that `getTime()` value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical `T_actual` for the run.
2. The harness confirms `isPlaying() === true` via `frame.waitForFunction` before awaiting the watcher. Without this, the test can hang if `play()` hasn't kicked the timeline yet.
3. Wait two `requestAnimationFrame` ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as `packages/producer/src/parity-harness.ts`.
4. Screenshot the live frame via `page.screenshot({ type: "png" })` — this is the actual frame.
5. Seek back to `T_actual` — call `el.seek(capturedTime)` on the host page. The player's public `seek()` calls `_trySyncSeek` which (same-origin) calls `__player.seek()` synchronously, so no postMessage await is needed. The runtime's deterministic `seek()` rebuilds frame state at exactly the requested time, and a second screenshot is taken as the reference.
6. Run `ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -`. ffmpeg writes per-channel + overall SSIM to stderr; we parse the `All:` value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
7. Persist per-run artifacts to `tests/perf/results/parity/run-N/` (`actual.png`, `reference.png`, `captured-time.txt`) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing `packages/player/tests/perf/results/` rule.

Aggregation
`min()` across runs, not mean. The gate evaluates the worst observed parity, so a single bad run can't be masked by averaging. Both per-run scores and the aggregate are logged.

Output metric
Metric: `parity_ssim_min`. Baseline: `paritySsimMin: 0.95`. With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.
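The `All:` parsing and clamping described above could look roughly like this (illustrative helper; the scenario's real parser may differ):

```typescript
// Extract the overall SSIM from ffmpeg's stderr. The ssim filter logs a line
// like: "[Parsed_ssim_0 @ 0x...] SSIM Y:0.99 (25.1) U:0.99 (24.8) V:0.99 (25.0) All:0.994250 (22.4)"
function parseSsimAll(stderr: string): number {
  const match = stderr.match(/All:([0-9.]+)/);
  if (!match) {
    throw new Error("[scenario:parity] no 'All:' SSIM value in ffmpeg stderr");
  }
  // Clamp: ffmpeg occasionally reports 1.000001 on identical inputs.
  return Math.min(1.0, parseFloat(match[1]!));
}
```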
Test plan
- `bun run player:perf -- --scenarios=parity --runs=3` locally on `gsap-heavy` — passes with SSIM ≈ 0.999 across all 3 runs.
- Inspected `results/parity/run-1/actual.png` and `reference.png` side-by-side — visually identical.
- Checked `captured-time.txt` to confirm `T_actual` lands just past 5.0s (within one frame).
- `parity` shard added alongside the existing `load`/`fps`/`scrub`/`drift` shards; same measure-mode / artifact-upload / aggregation flow.
- `bunx oxlint` and `bunx oxfmt --check` clean on the new scenario.

Stack
This is the top of the perf stack:
- `perf/x-1-emit-performance-metric` — performance.measure() emission
- `perf/p1-1-share-player-styles-via-adopted-stylesheets` — adopted stylesheets
- `perf/p1-2-scope-media-mutation-observer` — scoped MutationObserver
- `perf/p1-4-coalesce-mirror-parent-media-time` — coalesce currentTime writes
- `perf/p3-1-sync-seek-same-origin` — synchronous seek path (the path this PR pins)
- `perf/p3-2-srcdoc-composition-switching` — srcdoc switching
- `perf/p0-1a-perf-test-infra` — server, runner, perf-gate, CI
- `perf/p0-1b-perf-tests-for-fps-scrub-drift` — fps / scrub / drift scenarios
- `perf/p0-1c-live-playback-parity-test` ← you are here

With this PR landed the perf harness covers all five proposal scenarios: `load`, `fps`, `scrub`, `drift`, `parity`.