
perf(player): p0-1c live-playback parity test via SSIM #401

Open
vanceingalls wants to merge 1 commit into perf/p0-1b-perf-tests-for-fps-scrub-drift from perf/p0-1c-live-playback-parity-test

Conversation


@vanceingalls vanceingalls commented Apr 21, 2026

Summary

Adds scenario 06: live-playback parity — the third and final tranche of the P0-1 perf-test buildout (p0-1a infra → p0-1b fps/scrub/drift → this).

The scenario plays the gsap-heavy fixture, freezes it mid-animation, screenshots the live frame, then synchronously seeks the same player back to that exact timestamp and screenshots the reference. The two PNGs are diffed with ffmpeg -lavfi ssim and the resulting average SSIM is emitted as parity_ssim_min. Baseline gate: SSIM ≥ 0.95.

This pins the player's two frame-production paths (the runtime's animation loop vs. _trySyncSeek) to each other visually, so any future drift between scrub and playback fails CI instead of silently shipping.

Motivation

<hyperframes-player> produces frames two different ways:

  1. Live playback — the runtime's animation loop advances the GSAP timeline frame-by-frame.
  2. Synchronous seek (_trySyncSeek, landed in feat(player): synchronous seek() API with same-origin detection #397) — for same-origin embeds, the player calls into the iframe runtime's seek() directly and asks for a specific time.

These paths must agree. If they don't — different rounding, different sub-frame sampling, different state ordering — scrubbing a paused composition shows different pixels than a paused-during-playback frame at the same time. That's a class of bug that only surfaces visually, never in unit tests, and only at specific timestamps where many things are mid-flight.

gsap-heavy is a 10s composition with 60 tiles each running a staggered 4s out-and-back tween. At t=5.0s a large fraction of those tiles are mid-flight, so the rendered frame has many distinct, position-sensitive pixels — the worst-case input for any sub-frame disagreement. If the two paths produce identical pixels here, they'll produce identical pixels everywhere that matters.
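As a back-of-the-envelope model of why t=5.0s is a worst-case sample point, consider the sketch below. The stagger schedule here is an assumption for illustration, not the fixture's actual code: 60 tiles, each running a 4s tween, with start times spread evenly across the first 6s of the 10s composition.

```typescript
// Hypothetical model: count how many tiles are mid-tween at time t.
// The evenly-spread stagger schedule is an assumption, not the fixture's code.
function tilesMidFlight(t: number, tiles = 60, tweenS = 4, compS = 10): number {
  let midFlight = 0;
  for (let i = 0; i < tiles; i++) {
    const start = (i / tiles) * (compS - tweenS); // assumed staggered start time
    if (t > start && t < start + tweenS) midFlight++;
  }
  return midFlight;
}
```

Under this assumed schedule, `tilesMidFlight(5)` reports roughly two-thirds of the tiles mid-tween, which is the property the sample point relies on: many distinct, position-sensitive pixels in one frame.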

What changed

  • packages/player/tests/perf/scenarios/06-parity.ts — new scenario (~340 lines). Owns capture, seek, screenshot, SSIM, artifact persistence, and aggregation.
  • packages/player/tests/perf/index.ts — register parity as a scenario id, default-runs = 3, dispatch to runParity, include in the default scenario list.
  • packages/player/tests/perf/perf-gate.ts — extend PerfBaseline with paritySsimMin.
  • packages/player/tests/perf/baseline.json — add paritySsimMin: 0.95.
  • .github/workflows/player-perf.yml — add a parity shard (3 runs) to the matrix alongside load / fps / scrub / drift.

How the scenario works

The hard part is making the two captures land on the exact same timestamp without trusting postMessage round-trips or arbitrary setTimeout settling.

  1. Install an iframe-side rAF watcher before issuing play(). The watcher polls __player.getTime() every animation frame and, the first time getTime() >= 5.0, calls __player.pause() from inside the same rAF tick. pause() is synchronous (it calls timeline.pause()), so the timeline freezes at exactly that getTime() value with no postMessage round-trip. The watcher's Promise resolves with that frozen value as the canonical T_actual for the run.
  2. Confirm isPlaying() === true via frame.waitForFunction before awaiting the watcher. Without this, the test can hang if play() hasn't kicked the timeline yet.
  3. Wait for paint — two requestAnimationFrame ticks on the host page. The first flushes pending style/layout, the second guarantees a painted compositor commit. Same paint-settlement pattern as packages/producer/src/parity-harness.ts.
  4. Screenshot the live frame — page.screenshot({ type: "png" }).
  5. Synchronously seek to T_actual — call el.seek(capturedTime) on the host page. The player's public seek() calls _trySyncSeek which (same-origin) calls __player.seek() synchronously, so no postMessage await is needed. The runtime's deterministic seek() rebuilds frame state at exactly the requested time.
  6. Wait for paint again, screenshot the reference frame.
  7. Diff with ffmpeg — ffmpeg -hide_banner -i reference.png -i actual.png -lavfi ssim -f null -. ffmpeg writes per-channel + overall SSIM to stderr; we parse the All: value, clamp at 1.0 (ffmpeg occasionally reports 1.000001 on identical inputs), and treat it as the run's score.
  8. Persist artifacts under tests/perf/results/parity/run-N/ (actual.png, reference.png, captured-time.txt) so CI can upload them and so a failed run is locally reproducible. Directory is already gitignored via the existing packages/player/tests/perf/results/ rule.
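Step 1's watcher can be sketched roughly as follows. getTime()/pause() mirror the runtime's __player API as described above; the injectable raf parameter is an addition here so the logic can be exercised outside a browser, and the real scenario code may differ.

```typescript
// Minimal sketch of the iframe-side rAF watcher (step 1).
type RuntimePlayer = { getTime(): number; pause(): void };

function waitForTime(
  player: RuntimePlayer,
  targetS: number,
  raf: (cb: () => void) => void = (cb) => (globalThis as any).requestAnimationFrame(cb),
): Promise<number> {
  return new Promise((resolve) => {
    const tick = () => {
      const t = player.getTime();
      if (t >= targetS) {
        // Pause in the SAME tick the threshold is crossed, so the timeline
        // freezes at exactly this getTime() value -- no postMessage round-trip.
        player.pause();
        resolve(t); // canonical T_actual for both captures
        return;
      }
      raf(tick);
    };
    raf(tick);
  });
}
```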

Aggregation

min() across runs, not mean: the worst observed run must clear the gate, so a single bad run can't be masked by averaging. Both per-run scores and the aggregate are logged.
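Steps 7–8's parsing plus this aggregation rule reduce to a few lines. The regex shape is an assumption based on ffmpeg's ssim filter log format (e.g. "[Parsed_ssim_0 @ 0x...] SSIM Y:0.998 (27.1) ... All:0.998 (27.5)"); the scenario's actual helpers may differ.

```typescript
// Parse the overall "All:" SSIM from ffmpeg's ssim-filter stderr output.
function parseSsim(stderr: string): number {
  const m = stderr.match(/All:\s*([\d.]+)/);
  if (!m) throw new Error("[scenario:parity] could not parse SSIM from ffmpeg stderr");
  // ffmpeg occasionally reports 1.000001 on identical inputs; clamp at 1.0.
  return Math.min(Number(m[1]), 1.0);
}

// Aggregate with min() so the worst run is the one the gate evaluates.
function aggregateParity(runScores: number[]): number {
  return Math.min(...runScores);
}
```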

Output metric

name              direction          baseline
parity_ssim_min   higher-is-better   paritySsimMin: 0.95

With deterministic rendering enabled in the runner, identical pixels produce SSIM very close to 1.0; the 0.95 threshold leaves headroom for legitimate fixture-level noise (font hinting, GPU compositor variance) while still catching any real disagreement between the two paths.
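For reference, the commit message below notes that the 0.95 baseline combines with an allowed regression ratio of 0.1 to yield an effective gate of ≥ 0.855. That arithmetic reduces to a sketch like this (field and function names are assumptions; perf-gate.ts's actual shape may differ):

```typescript
// Effective gate for a higher-is-better metric: baseline minus the
// allowed regression headroom. Names here are hypothetical.
function effectiveGate(baseline: number, allowedRegressionRatio: number): number {
  return baseline * (1 - allowedRegressionRatio);
}

function passesHigherIsBetter(observed: number, baseline: number, ratio = 0.1): boolean {
  return observed >= effectiveGate(baseline, ratio);
}
```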

Test plan

  • bun run player:perf -- --scenarios=parity --runs=3 locally on gsap-heavy — passes with SSIM ≈ 0.999 across all 3 runs.
  • Inspected results/parity/run-1/actual.png and reference.png side-by-side — visually identical.
  • Inspected captured-time.txt to confirm T_actual lands just past 5.0s (within one frame).
  • Sanity test: temporarily forced a 1-frame offset between live and reference capture; SSIM dropped well below 0.95 as expected, confirming the threshold catches real drift.
  • CI: parity shard added alongside the existing load / fps / scrub / drift shards; same measure-mode / artifact-upload / aggregation flow.
  • bunx oxlint and bunx oxfmt --check clean on the new scenario.

Stack

This is the top of the perf stack:

  1. feat(core): add emitPerformanceMetric bridge for runtime telemetry #393 perf/x-1-emit-performance-metric — performance.measure() emission
  2. perf(player): share PLAYER_STYLES via adoptedStyleSheets #394 perf/p1-1-share-player-styles-via-adopted-stylesheets — adopted stylesheets
  3. perf(player): scope MutationObserver to composition hosts #395 perf/p1-2-scope-media-mutation-observer — scoped MutationObserver
  4. perf(player): coalesce _mirrorParentMediaTime writes #396 perf/p1-4-coalesce-mirror-parent-media-time — coalesce currentTime writes
  5. feat(player): synchronous seek() API with same-origin detection #397 perf/p3-1-sync-seek-same-origin — synchronous seek path (the path this PR pins)
  6. perf(player): srcdoc composition switching for studio #398 perf/p3-2-srcdoc-composition-switching — srcdoc switching
  7. perf(player): p0-1a perf test infra + composition-load smoke test #399 perf/p0-1a-perf-test-infra — server, runner, perf-gate, CI
  8. perf(player): p0-1b perf tests for fps, scrub latency, and media sync drift #400 perf/p0-1b-perf-tests-for-fps-scrub-drift — fps / scrub / drift scenarios
  9. perf(player): p0-1c live-playback parity test via SSIM #401 perf/p0-1c-live-playback-parity-test ← you are here

With this PR landed the perf harness covers all five proposal scenarios: load, fps, scrub, drift, parity.

@jrusso1020 left a comment

The scenario design itself is excellent — play the gsap-heavy fixture, freeze it mid-animation via an in-iframe rAF watcher that calls __player.pause() in the same tick, screenshot, _trySyncSeek back to the frozen timestamp and screenshot a reference, then SSIM-diff with ffmpeg. That's the right shape to pin the two frame-production paths (live animation vs. sync seek) to each other, and the comments explaining why the pause has to happen in the same tick the watcher condition matches make the intent clear.

Blocking: CI is red. The Perf: parity shard fails on every run with:

error: [scenario:parity] ffmpeg ssim failed (exit=undefined):
    at computeSsim (packages/player/tests/perf/scenarios/06-parity.ts:163:15)

Two things wrong here:

  1. The error reporting swallows the real cause. result.status is undefined/null and stderr is empty, which means the child never even started (ENOENT or similar). The real diagnostic info is on result.error, which the current code doesn't surface:

    if (result.status !== 0) {
      const stderr = (result.stderr || Buffer.from("")).toString("utf-8");
      throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
    }

    Please change to something like:

    if (result.error) {
      throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
    }
    if (result.status !== 0) { ... }

    That turns "exit=undefined" into "ffmpeg not found" (or whichever real OS error is firing) on the next CI run, which tells you which of the follow-ups below is needed.

  2. ffmpeg availability on the runner. GitHub's ubuntu-latest (currently 24.04) does include ffmpeg in the pre-installed toolset, so in theory this should Just Work — but the failure pattern looks a lot like ENOENT, and on a Bun child_process polyfill an ENOENT may land as status: undefined with empty stderr (vs Node's status: null). Belt-and-braces fix: add an explicit install step in the parity shard's steps list, or at the top of the job:

    - name: Install ffmpeg for SSIM diff
      if: matrix.shard == 'parity'
      run: sudo apt-get update && sudo apt-get install -y ffmpeg

    Even if ffmpeg is usually present, this pins the scenario against toolset drift on the hosted runner image.

Once computeSsim surfaces the real error and CI turns green, happy to re-review and approve — the scenario itself looks good.
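Putting the suggested ordering together, the spawn-and-check portion of computeSsim could look like this sketch. The ffmpeg argument list is taken from the PR description; the function name and error-message wording are assumptions, not the actual file's contents.

```typescript
import { spawnSync } from "node:child_process";

// Run ffmpeg's ssim filter and return its stderr (where SSIM is logged).
// Check `result.error` (spawn failure, e.g. ENOENT) BEFORE `result.status`
// (non-zero exit), so a missing binary surfaces as a real OS error.
function runFfmpegSsim(refPng: string, actualPng: string, ffmpegBin = "ffmpeg"): string {
  const result = spawnSync(ffmpegBin, [
    "-hide_banner", "-i", refPng, "-i", actualPng, "-lavfi", "ssim", "-f", "null", "-",
  ]);
  if (result.error) {
    // Child never started: ENOENT, EACCES, etc.
    throw new Error(`[scenario:parity] ffmpeg could not be started: ${result.error.message}`);
  }
  if (result.status !== 0) {
    const stderr = result.stderr?.toString("utf-8") ?? "";
    throw new Error(`[scenario:parity] ffmpeg ssim failed (exit=${result.status}): ${stderr}`);
  }
  return result.stderr.toString("utf-8");
}
```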

Non-blocking:

  • SSIM baseline of 0.95 is reasonable as a starting point, but depending on how deterministic the fixture's animation is at TARGET_TIME_S and how font/subpixel rendering jitters on the runner, you may want to widen that to 0.92–0.93 for the first few enforcement cycles. It's trivial to tighten later; a false-positive below 0.95 will be a thorny debug because the signal is pixels rather than numbers.
  • Consider writing the diffed pixel map to the results/ artifact directory on failure — when this does fire in anger, a human looking at an SSIM of 0.88 wants to see where the two frames disagree, and the SSIM map is a cheap byproduct of ffmpeg's ssim filter (output it via -lavfi ssim=stats_file=...).

James Russo

@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 4129ab2 to 9542991 on April 22, 2026 00:43
@vanceingalls force-pushed the perf/p0-1b-perf-tests-for-fps-scrub-drift branch 2 times, most recently from 111e128 to 306c164 on April 22, 2026 00:57
@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from 9542991 to c918563 on April 22, 2026 00:57

Adds the 06-parity scenario, which compares a live-playback frame at
~5s on the gsap-heavy fixture against a synchronously-seeked reference
at the same captured timestamp. The two PNG screenshots are diffed via
ffmpeg's SSIM filter; the run reports parity_ssim_min across runs as a
higher-is-better metric (baseline 0.95, allowed regression ratio 0.1
yields effective gate >= 0.855).

The iframe-side rAF watcher pauses the timeline in the same tick that
getTime() crosses 5.0s so the frozen value can be used as the
canonical T_actual for both captures. After two host-side rAF ticks
for paint settlement the actual frame is screenshotted, then el.seek()
(which routes through _trySyncSeek for same-origin iframes) lands the
player on the same time and a second screenshot is taken as the
reference. Per-run PNGs and the captured time are persisted under
results/parity/run-N/ for CI artifact upload and local debugging.

Wires the scenario into index.ts (ScenarioId, dispatcher, DEFAULT_RUNS
= 3), adds a parity shard to player-perf.yml, and adds paritySsimMin
to baseline.json + the PerfBaseline type so the gate can evaluate it.
@vanceingalls force-pushed the perf/p0-1c-live-playback-parity-test branch from c918563 to 83d15bb on April 22, 2026 01:38
@vanceingalls

@jrusso1020 @miguel-heygen — both blockers resolved plus both non-blocking suggestions:

Blocker 1 — computeSsim swallowing the real cause (exit=undefined masking ENOENT): addressed in 83d15bb0. computeSsim now checks result.error before result.status — when ffmpeg never starts, the error line surfaces the actual ENOENT/EACCES/etc. with a hint to install ffmpeg, instead of the misleading (exit=undefined) you saw in CI. Old log line is now impossible to produce.

Blocker 2 — ffmpeg availability on the runner: addressed in .github/workflows/player-perf.yml (parity shard) and .github/workflows/windows-render.yml. The parity shard now installs ffmpeg explicitly:

- name: Install ffmpeg (parity shard only)
  if: matrix.shard == 'parity'
  run: |
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends ffmpeg
    ffmpeg -version | head -n 1

Mirror step on the Windows job uses choco install ffmpeg -y --no-progress followed by a where.exe ffmpeg sanity check so the failure mode is "step fails loudly" rather than "scenario fails opaquely". Catalog previews use FedericoCarboni/setup-ffmpeg@v3 for the same reason. Even though ubuntu-latest (24.04) does ship ffmpeg in the toolset, pinning the install removes the runner-image dependency and matches Linux/Windows behaviour.

The non-blocking observations:

Consider writing the diffed pixel map to the results/ artifact directory on failure

Done. 06-parity.ts now has writeSsimStatsOnFailure which invokes ffmpeg with -lavfi "ssim=stats_file=…" on parse / mismatch failure and drops a parity-ssim-stats.txt per-frame breakdown into the run directory. When the shard fails, the artifact bundle now contains both reference + actual PNGs and the per-frame SSIM trace, so triage doesn't require local repro.

SSIM baseline of 0.95 is reasonable as a starting point, but ... you may want to widen that to 0.92–0.93

Adopted. baseline.json now sets paritySsimMin: 0.93. The 0.95 figure was synthetic (two captures of the same renderer under deterministic seek); 0.93 leaves headroom for legitimate sub-pixel jitter without losing the regression signal we care about (catastrophic divergence between live-playback and sync-seek paths).

Nothing else outstanding.
