
Add GPU CI: e2e tests and benchmarks#64

Open
ryan-williams wants to merge 25 commits into main from gpu-ci

Conversation


ryan-williams (Collaborator) commented Jan 20, 2026

Summary

  • Add deterministic e2e training test (tests/e2e_train.py) with platform-specific expected values
  • Add GPU e2e workflow (gpu-e2e.yml) using ec2-gha on EC2 g6.xlarge (NVIDIA L4), runs on PRs targeting main
  • Add GPU benchmark workflow (gpu-benchmark.yml) with configurable model size, weekly schedule, and manual dispatch
  • Add WandB logging to benchmark: logs training metrics, model config, dataset version, and instance type
  • Add scripts/s3_sync.py: reusable S3 data sync with size filtering and deterministic dataset hashing
  • Add gen-expected.yml for regenerating expected values on GHA runners (macOS, Ubuntu)
  • Add e2e_training_demo.ipynb notebook with training visualization
  • Add .github/workflows/README.md documenting all workflows
  • Rename tests/electrai/ → tests/test_electrai/ to fix import shadowing
  • Add --gradient-checkpoint flag to e2e_train.py for large models on limited VRAM
  • Benchmark summary includes linked commit SHA for traceability

Passing Runs

Required Setup

Secrets:

  • GH_SA_TOKEN — GitHub PAT for runner registration
  • WANDB_API_KEY — WandB API key (optional, for benchmark logging; currently Ryan's personal key)

IAM/OIDC:

  • Trust policy configured in Open-Athena/ops for ec2-gha OIDC authentication

Screenshots

EC2 instances
EC2 tags
GitHub runners
WandB dashboard

Test plan

  • GPU e2e test passes on EC2 g6.xlarge
  • CPU e2e test passes on EC2
  • GPU benchmark completes with production-size model (32ch/16 blocks, 128^3 grids)
  • gen-expected.yml generates correct values on macOS and Ubuntu
  • Expected values verified across all 3 platforms (darwin-arm64, linux, linux-gpu)
  • WandB run logged with correct metadata (project, dataset version, instance type)
  • scripts/s3_sync.py downloads correct samples and generates deterministic dataset hash

@ryan-williams force-pushed the gpu-ci branch 2 times, most recently from 8c08388 to 3823ccc (January 20, 2026 20:17)
@ryan-williams force-pushed the gpu-ci branch 3 times, most recently from 6940d9f to 0ac2a42 (February 7, 2026 01:44)
@ryan-williams changed the title from "Add GPU e2e testing with lambda-gha" to "Add GPU CI: e2e tests and benchmarks" (Feb 17, 2026)
@ryan-williams force-pushed the gpu-ci branch 4 times, most recently from 64d822a to cdcbcc9 (February 19, 2026 16:04)
Reproducible end-to-end training test: small model (8 channels, 2
residual blocks) on 5 MP samples, verifies final validation loss
matches platform-specific expected values.

- `tests/e2e_train.py`: CLI with `--gpu`, `--epochs`, `--update-expected`
- `tests/expected_values.json`: per-platform expected losses
- `tests/test_e2e_train.py`: pytest wrapper
- `examples/e2e_training_demo.ipynb`: notebook with training plots
- Rename `tests/electrai/` → `tests/test_electrai/` to avoid shadowing
  the installed package when running scripts directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
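The per-platform expected-loss check this commit describes can be sketched roughly as follows. This is an illustrative assumption, not the actual implementation: the platform keys match the ones named elsewhere in this PR, but the helper names, tolerance, and loss values are made up.

```python
import math
import platform


def platform_key(use_gpu: bool) -> str:
    # Hypothetical mapping to the keys used in tests/expected_values.json
    if platform.system() == "Darwin":
        return "darwin-arm64"
    return "linux-gpu" if use_gpu else "linux"


def check_final_loss(final_val_loss: float, expected: dict, key: str,
                     rel_tol: float = 1e-6) -> bool:
    # Deterministic training should reproduce the stored loss to a tight tolerance
    return math.isclose(final_val_loss, expected[key], rel_tol=rel_tol)


expected = {"darwin-arm64": 0.1234, "linux": 0.1236, "linux-gpu": 0.1237}  # made-up values
assert check_final_loss(0.1236, expected, "linux")
```

A `--update-expected` mode would simply write the observed loss back under the current platform key instead of asserting.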
@ryan-williams force-pushed the gpu-ci branch 3 times, most recently from 470eb7e to 52b81a3 (February 19, 2026 19:30)
GPU CI workflows (EC2 via ec2-gha):
- `gpu-e2e.yml`: deterministic test on GPU + CPU, triggers on PRs to main
- `gpu-benchmark.yml`: larger model (32ch/16 blocks) benchmark
- `gen-expected.yml`: regenerate expected values on macOS + Linux
- `README.md`: documents workflow responsibilities and expected values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ryan-williams and others added 4 commits February 20, 2026 19:47
`tests/e2e_train.py`:
- `--wandb-project` (`-W`): enable WandB logging with project name, tags
  (platform, channels, blocks, git SHA), and CI metadata
- `--max-file-size` (`-M`): skip input files larger than N MB to avoid
  GPU OOM on large grids (e.g. 25MB safe for L4 24GB VRAM)
- Deterministic mode disabled when WandB active (benchmark mode)

`gpu-benchmark.yml`:
- `max_file_size` input (default 25MB): filters during S3 sync via
  `s3api` query, avoiding download of oversized samples
- `wandb_project` input: passes `WANDB_API_KEY` from secrets, adds
  WandB link to step summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
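The size filter described above might look something like this sketch. It operates on (name, size) pairs rather than the real S3 listing or filesystem, and the names and the MB convention are assumptions; the actual workflow filters via an `s3api` query before download.

```python
import sys


def filter_by_size(entries, max_mb):
    """Keep (name, size_bytes) entries at or under max_mb MB; log skips to stderr."""
    max_bytes = max_mb * 1024 * 1024
    kept = []
    for name, size in entries:
        if size > max_bytes:
            # Oversized grids risk GPU OOM (e.g. on an L4's 24 GB VRAM), so skip them
            print(f"skipping {name}: {size / 2**20:.1f} MB > {max_mb} MB", file=sys.stderr)
        else:
            kept.append(name)
    return kept


samples = [("mp-1.chgcar", 10 * 2**20), ("mp-2.chgcar", 40 * 2**20)]
print(filter_by_size(samples, 25))  # the 40 MB sample is skipped
```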
- Add `scripts/s3_sync.py`: standalone `uv run --script` CLI that syncs
  CHGCAR samples from S3, filters by file size, creates map file, and
  writes `DATASET_HASH` to `$GITHUB_OUTPUT` when in CI
- Replace ~30 lines of bash in `gpu-benchmark.yml` with script call
- Auto-compute `dataset_version` as MD5 hash of sorted sample IDs
  (in both `s3_sync.py` and `e2e_train.py` as fallback)
- Log per-sample file-size filtering to stderr in `e2e_train.py`
- Mirror S3 data under `data/s3/` (gitignored)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
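The deterministic dataset hash can be illustrated like this. Only "MD5 hash of sorted sample IDs" is from the commit message; the exact canonicalization (newline join, UTF-8 encoding) is an assumption.

```python
import hashlib


def dataset_hash(sample_ids) -> str:
    """MD5 over the sorted sample IDs: order-independent, so the same
    set of samples always yields the same dataset version string."""
    canonical = "\n".join(sorted(sample_ids))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Order of discovery doesn't matter:
assert dataset_hash(["mp-2", "mp-1"]) == dataset_hash(["mp-1", "mp-2"])
```

Because the hash depends only on which samples are present, any change to the synced sample set produces a new `dataset_version`, which is what makes benchmark runs comparable in WandB.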
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-williams marked this pull request as ready for review February 24, 2026 13:46
ryan-williams and others added 3 commits February 24, 2026 09:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds GPU-focused CI workflows (E2E determinism + benchmarks) and supporting tooling/tests to make model training runs reproducible and traceable across platforms/instances.

Changes:

  • Added EC2-backed GPU workflows for deterministic E2E training and scheduled/dispatchable benchmarks (with optional WandB logging and a summary report).
  • Added deterministic E2E training harness + expected-values artifacts, plus an S3 sync helper to build benchmark datasets reproducibly.
  • Added/expanded unit tests around Zarr conversion, S3 Zarr data loading, and SRGAN model components; added a demo notebook.

Reviewed changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 7 comments.

Show a summary per file:

  • tests/test_electrai/zarr_conversion/test_convert_to_zarr.py: New tests for CHGCAR loading and Zarr conversion.
  • tests/test_electrai/model/test_srgan_layernorm_pbc.py: New comprehensive unit tests for SRGAN/PBC components.
  • tests/test_electrai/dataloader/test_mp_zarr_s3_data.py: New unit tests for local/S3 Zarr dataset reader and splitting.
  • tests/test_e2e_train.py: Pytest wrapper that runs the deterministic training script in a subprocess.
  • tests/expected_values.json: Platform-specific expected loss values for the deterministic E2E check.
  • tests/e2e_train.py: Deterministic E2E training runner; supports GPU, filtering, and WandB logging.
  • scripts/s3_sync.py: CLI helper to sync/prepare S3 samples and produce a deterministic dataset hash.
  • .github/workflows/gpu-e2e.yml: EC2-backed workflow for GPU + CPU deterministic E2E training checks.
  • .github/workflows/gpu-benchmark.yml: EC2-backed GPU benchmark workflow (scheduled + manual) with optional CPU comparison and WandB integration.
  • .github/workflows/gen-expected.yml: Generates expected-values artifacts on GitHub-hosted macOS/Linux runners.
  • .github/workflows/README.md: Documentation for workflows and expected-values generation.
  • specs/04-ci-wandb.md: Spec/design doc for CI → WandB tracking and regression detection ideas.
  • pyproject.toml: Adds notebook execution dependencies + updates Ruff per-file ignores.
  • examples/e2e_training_demo.ipynb: Demo notebook showing deterministic training and visualization.
  • .gitignore: Ignores S3 mirror directory (data/s3/).


- Gate WandB logging on `WANDB_API_KEY` presence in `gpu-benchmark.yml`
- Move notebook deps (`nbconvert`, `ipykernel`, `papermill`) to `[notebooks]` optional extra
- Skip `gpu-e2e.yml` EC2 job on fork PRs (secrets unavailable)
- Fix `get_platform()` to use resolved accelerator, not `--gpu` flag
- Document S3 `ListObjectsV2` lexicographic ordering guarantee in `s3_sync.py`
- Fix stale WandB statement in `specs/04-ci-wandb.md`
- Update `.github/workflows/README.md` with current defaults and features

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.



- Fix `if final_train_loss:` → `is not None` (0.0 would be falsy)
- Use `$EPOCHS` input in CPU baseline step instead of hardcoded `5`
- Use env vars for S3 sync step inputs (script injection hardening)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
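The first fix guards against a classic Python pitfall: a loss of exactly 0.0 is falsy, so a truthiness check would silently treat it as missing. A minimal illustration:

```python
final_train_loss = 0.0  # a legitimate (if unlikely) loss value

# Buggy: a truthiness check skips the logging branch for 0.0, not just for None
logged_buggy = bool(final_train_loss)

# Correct: only a genuinely absent value is skipped
logged_fixed = final_train_loss is not None

assert logged_buggy is False
assert logged_fixed is True
```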

ryan-williams commented Mar 5, 2026

Re: re-review #3 — addressed in 2120663:

  • `if final_train_loss:` → `if final_train_loss is not None:`: fixed
  • CPU baseline `--epochs 5`: now uses `$EPOCHS` from the workflow input
  • `none`/`false`/`disabled` in CLI: skipping; the workflow handles it, and it's an unlikely edge case

@ryan-williams requested a review from Copilot March 5, 2026 22:00
@ryan-williams

@claude another review pls


claude bot commented Mar 5, 2026

Claude finished @ryan-williams's task in 3m 5s.


Code Review (re-review #4)

  • Gathering context
  • Reviewing changed files
  • Posting review feedback

All issues from the previous reviews are confirmed fixed. I found one unresolved issue and two minor nits.


Unresolved: # fmt: off in scripts/s3_sync.py still unclosed

scripts/s3_sync.py:40 — The Copilot review flagged that # fmt: off at line 40 is never closed, disabling formatting for the entire rest of the file. The response said "Replaced with # fmt: skip on the one long line. Fixed in 264f256." But looking at the actual diff in 264f256, only the import style changed (`import click` → `from click import command, option`); the # fmt: off was left in place and no # fmt: on or # fmt: skip was added.

e2e_train.py has the same decorator block pattern and correctly closes it with # fmt: on before def main( (line 77), which passes pre-commit hooks. The same fix applies here:

# scripts/s3_sync.py — add `# fmt: on` after the last decorator:
@option("-v", "--verbose", is_flag=True, help="Verbose output")
# fmt: on   <-- add this line
def main(

[Fix this →](https://claude.ai/code?q=In%20scripts%2Fs3_sync.py%20on%20branch%20gpu-ci%2C%20add%20%60%23%20fmt%3A%20on%60%20on%20a%20new%20line%20after%20line%2047%20(%40option(%22-v%22%2C%20%22--verbose%22%2C%20...))%20and%20before%20%60def%20main(%60%20on%20line%2048%2C%20to%20close%20the%20%60%23%20fmt%3A%20off%60%20block%20at%20line%2040%20and%20match%20the%20pattern%20used%20in%20e2e_train.py&repo=Quantum-Accelerators/electrai&branch=gpu-ci)


Minor: gen-expected.yml merge step uses bare open() without with

.github/workflows/gen-expected.yml:89-90 — The inline Python script uses open() without a with statement:

darwin_arm64 = json.load(open('tmp/darwin-arm64/expected_values.json'))
linux = json.load(open('tmp/linux/expected_values.json'))

Files are left open until GC. In this short-lived script context it's harmless, but inconsistent with the with open(...) as f: pattern used for the write on line 97. Trivial fix:

with open('tmp/darwin-arm64/expected_values.json') as f:
    darwin_arm64 = json.load(f)
with open('tmp/linux/expected_values.json') as f:
    linux = json.load(f)

Minor: specs/ci-wandb.md architecture section references gpu-weekly.yml

specs/ci-wandb.md:45 — The "Architecture" section includes a YAML snippet showing name: Weekly GPU Training with a workflow called gpu-weekly.yml (the original design name), while the actual implementation is gpu-benchmark.yml. Since this is a spec doc it's somewhat expected to retain the original design, but it may confuse future readers. A brief comment noting the actual filename would help.


Confirmed fixes from re-review #3

  • ✅ `if final_train_loss:` → `if final_train_loss is not None:` (tests/e2e_train.py:385)
  • ✅ CPU baseline `--epochs 5` → `--epochs "$EPOCHS"` (gpu-e2e.yml:96)
  • ✅ `none`/`false`/`disabled` CLI escape hatch: intentionally skipped

Summary

The PR is in excellent shape — four rounds of review have resulted in a very solid, well-tested CI addition. The unclosed # fmt: off in s3_sync.py is the only substantive remaining item (and is a straightforward one-line fix). The rest are cosmetic. Ready to merge once the # fmt: off is closed.


Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.



Avoids script injection risk by keeping all expression interpolations
in `env:` block rather than inline in `run:` script. Also fix stale
`gpu-weekly.yml` reference in spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-williams

Thanks for the re-review! Responses:

# fmt: off in s3_sync.py: This was already fixed in 2120663 (the commit after 264f256 that you reviewed). The # fmt: off block was replaced with a targeted # fmt: skip on the one long --max-file-size line. Ruff's RUF028 doesn't allow # fmt: on between decorators and function def, so # fmt: skip is the right approach here.
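For reference, the placement being discussed looks like this. A stand-in `option` decorator replaces click so the snippet is self-contained, and the flag text is illustrative; the point is that `# fmt: skip` suppresses formatting for one line and is valid between decorators, whereas Ruff's RUF028 rejects `# fmt: on` in that position.

```python
def option(*param_decls, **attrs):
    """Stand-in for click.option, purely for illustration."""
    def decorator(f):
        return f
    return decorator


# `# fmt: skip` applies only to the long decorator line it sits on;
# no block-level # fmt: off / # fmt: on pair is needed.
@option("-M", "--max-file-size", type=int, default=25, help="Skip input files larger than this many MB")  # fmt: skip
def main(max_file_size: int = 25) -> int:
    return max_file_size


assert main() == 25
```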

bare open() in gen-expected.yml: Acknowledged, but this is a 3-line inline Python snippet in a CI step — the process exits immediately after. Skipping.

gpu-weekly.yml in spec: Fixed in 4d0c3a9 — updated to gpu-benchmark.yml.

@ryan-williams requested a review from Copilot March 6, 2026 13:39
@ryan-williams

@claude another review pls


claude bot commented Mar 6, 2026

Claude finished @ryan-williams's task in 2m 14s.


Code Review (re-review #5)

  • Gathering context
  • Reviewing changed files
  • Posting review feedback

All issues from the previous four rounds of review have been confirmed fixed. The PR is ready to merge. Full verification below.


Confirmed fixes from re-review #4

  • `# fmt: off` in scripts/s3_sync.py: replaced with a targeted `# fmt: skip` on the long `--max-file-size` line (line 42). No unclosed formatter block remains.
  • `${{ }}` interpolations in benchmark summary step: `GPU_TIME`, `CPU_TIME`, `WANDB_RUN_URL`, and `github.token` all moved to the `env:` block (lines 211–214); no direct interpolation in the `run:` script.
  • specs/ci-wandb.md: `gpu-weekly.yml` reference removed; architecture section updated.

Minor: bare open() in gen-expected.yml:89-90

Ryan has explicitly accepted this (short-lived inline script, process exits immediately). No action needed — flagging only for traceability.


Summary

Four rounds of review have addressed every substantive issue: the --epochs mismatch, dead LossTracker class, import time placement, hardcoded WandB entity, fork-PR guard, val_loss None guard, artifact download instructions, # fmt: off block, env-var security hardening, and CPU baseline epoch count. The codebase is clean, well-tested, and well-documented.

Ready to merge.


Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.



Read `tests/expected_values.json` before merging so that keys from
other workflows (e.g. `linux-gpu` from `gpu-e2e.yml`) aren't dropped.
Also use `with` statements for file reads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.



Without it, `tests/expected_values.json` doesn't exist and the merge
step can't preserve `linux-gpu` keys from the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated no new comments.




@forklady42 forklady42 left a comment


Overall, looking good but I noticed a couple of divergences from the current model config.


# Hydra-style model config for LightningGenerator (uses hydra.utils.instantiate)
cfg.model = {
"_target_": "electrai.model.srgan_layernorm_pbc.GeneratorResNet",

electrai.model.resunet.ResUNet3D is now our default model.

cfg.dataset_name = "mp"
cfg.data_path = data_path or str(repo_root / "data/MP/chgcars/input")
cfg.label_path = label_path or str(repo_root / "data/MP/chgcars/label")
cfg.map_path = map_path or str(repo_root / "data/MP/map/map_sample.json.gz")

Is this supposed to match our current config? We've shifted patterns since this PR started 😅: we now rely on a txt file (similar to the map_path) with data and label directories under the same root path.

echo(f"Train samples: {len(train_data)}, Val samples: {len(test_data)}")

def dict_collate(batch):
"""Collate tuples from MPDataset into dicts for LightningGenerator."""

For my own clarity, MPDataset isn't a class, right? I initially assumed it was and looked around for it.

ryan-williams and others added 2 commits March 9, 2026 17:44
ResUNet3D is now the default model. Update hydra config dict,
regenerate darwin-arm64 expected values, remove stale linux/linux-gpu
values (to be regenerated via CI). Also fix `MPDataset` → `RhoData`
in docstring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace `--data-path`/`--label-path`/`--map-path` with `--data-root`
- Use `RhoRead` LightningDataModule (filelist + data/label dirs)
- Rename `data/MP/chgcars/input/` → `data/` to match `RhoRead` layout
- Add `mp_filelist.txt` for sample IDs
- Update `s3_sync.py` to output `data/` dir + filelist (not `input/` + json.gz map)
- Update benchmark workflow `DATA_ARGS`
- Regenerate darwin-arm64 expected values (linux/linux-gpu pending CI)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
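For illustration, the `RhoRead`-style layout this commit describes could be materialized like the sketch below. The directory and file names follow the commit message (`data/`, `mp_filelist.txt`); the helper itself and the `label/` sibling directory are assumptions, not code from the PR.

```python
from pathlib import Path


def build_data_root(root: Path, sample_ids) -> Path:
    """Create a RhoRead-style layout: data/ and label/ dirs under one
    root, plus mp_filelist.txt listing the sample IDs (sorted for
    deterministic ordering). Returns the filelist path."""
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "label").mkdir(parents=True, exist_ok=True)
    filelist = root / "mp_filelist.txt"
    filelist.write_text("\n".join(sorted(sample_ids)) + "\n")
    return filelist


# Usage (hypothetical sample IDs):
# build_data_root(Path("data/s3"), ["mp-1234", "mp-5678"])
```

Under this layout, a single `--data-root` flag is enough for the training script to locate inputs, labels, and the filelist.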