
Add GPU CI: e2e tests and benchmarks#64

Open
ryan-williams wants to merge 25 commits into main from gpu-ci

Conversation


ryan-williams (Collaborator) commented Jan 20, 2026

Summary

  • Add deterministic e2e training test (tests/e2e_train.py) with platform-specific expected values
  • Add GPU e2e workflow (gpu-e2e.yml) using ec2-gha on EC2 g6.xlarge (NVIDIA L4), runs on PRs targeting main
  • Add GPU benchmark workflow (gpu-benchmark.yml) with configurable model size, weekly schedule, and manual dispatch
  • Add WandB logging to benchmark: logs training metrics, model config, dataset version, and instance type
  • Add scripts/s3_sync.py: reusable S3 data sync with size filtering and deterministic dataset hashing
  • Add gen-expected.yml for regenerating expected values on GHA runners (macOS, Ubuntu)
  • Add e2e_training_demo.ipynb notebook with training visualization
  • Add .github/workflows/README.md documenting all workflows
  • Rename tests/electrai/ → tests/test_electrai/ to fix import shadowing
  • Add --gradient-checkpoint flag to e2e_train.py for large models on limited VRAM
  • Benchmark summary includes linked commit SHA for traceability

Passing Runs

Required Setup

Secrets:

  • GH_SA_TOKEN — GitHub PAT for runner registration
  • WANDB_API_KEY — WandB API key (optional, for benchmark logging; currently Ryan's personal key)

IAM/OIDC:

  • Trust policy configured in Open-Athena/ops for ec2-gha OIDC authentication

Screenshots

EC2 instances
EC2 tags
GitHub runners
WandB dashboard

Test plan

  • GPU e2e test passes on EC2 g6.xlarge
  • CPU e2e test passes on EC2
  • GPU benchmark completes with production-size model (32ch/16 blocks, 128^3 grids)
  • gen-expected.yml generates correct values on macOS and Ubuntu
  • Expected values verified across all 3 platforms (darwin-arm64, linux, linux-gpu)
  • WandB run logged with correct metadata (project, dataset version, instance type)
  • scripts/s3_sync.py downloads correct samples and generates deterministic dataset hash

@ryan-williams force-pushed the gpu-ci branch 2 times, most recently from 8c08388 to 3823ccc (January 20, 2026 20:17)
@ryan-williams force-pushed the gpu-ci branch 3 times, most recently from 6940d9f to 0ac2a42 (February 7, 2026 01:44)
@ryan-williams changed the title from "Add GPU e2e testing with lambda-gha" to "Add GPU CI: e2e tests and benchmarks" (Feb 17, 2026)
@ryan-williams force-pushed the gpu-ci branch 4 times, most recently from 64d822a to cdcbcc9 (February 19, 2026 16:04)
Reproducible end-to-end training test: small model (8 channels, 2
residual blocks) on 5 MP samples, verifies final validation loss
matches platform-specific expected values.

- `tests/e2e_train.py`: CLI with `--gpu`, `--epochs`, `--update-expected`
- `tests/expected_values.json`: per-platform expected losses
- `tests/test_e2e_train.py`: pytest wrapper
- `examples/e2e_training_demo.ipynb`: notebook with training plots
- Rename `tests/electrai/` → `tests/test_electrai/` to avoid shadowing
  the installed package when running scripts directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
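The per-platform expected-loss check this commit describes can be sketched roughly as follows. This is an illustrative assumption, not the actual implementation: the platform keys match the ones named elsewhere in this PR, but the helper names, tolerance, and loss values are made up.

```python
import math
import platform


def platform_key(use_gpu: bool) -> str:
    # Hypothetical mapping to the keys used in tests/expected_values.json
    if platform.system() == "Darwin":
        return "darwin-arm64"
    return "linux-gpu" if use_gpu else "linux"


def check_final_loss(final_val_loss: float, expected: dict, key: str,
                     rel_tol: float = 1e-6) -> bool:
    # Deterministic training should reproduce the stored loss to a tight tolerance
    return math.isclose(final_val_loss, expected[key], rel_tol=rel_tol)


expected = {"darwin-arm64": 0.1234, "linux": 0.1236, "linux-gpu": 0.1237}  # made-up values
assert check_final_loss(0.1236, expected, "linux")
```

A `--update-expected` mode would simply write the observed loss back under the current platform key instead of asserting.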
@ryan-williams force-pushed the gpu-ci branch 3 times, most recently from 470eb7e to 52b81a3 (February 19, 2026 19:30)
GPU CI workflows (EC2 via ec2-gha):
- `gpu-e2e.yml`: deterministic test on GPU + CPU, triggers on PRs to main
- `gpu-benchmark.yml`: larger model (32ch/16 blocks) benchmark
- `gen-expected.yml`: regenerate expected values on macOS + Linux
- `README.md`: documents workflow responsibilities and expected values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ryan-williams and others added 4 commits February 20, 2026 19:47
`tests/e2e_train.py`:
- `--wandb-project` (`-W`): enable WandB logging with project name, tags
  (platform, channels, blocks, git SHA), and CI metadata
- `--max-file-size` (`-M`): skip input files larger than N MB to avoid
  GPU OOM on large grids (e.g. 25MB safe for L4 24GB VRAM)
- Deterministic mode disabled when WandB active (benchmark mode)

`gpu-benchmark.yml`:
- `max_file_size` input (default 25MB): filters during S3 sync via
  `s3api` query, avoiding download of oversized samples
- `wandb_project` input: passes `WANDB_API_KEY` from secrets, adds
  WandB link to step summary

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
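The size filter described above might look something like this sketch. It operates on (name, size) pairs rather than the real S3 listing or filesystem, and the names and the MB convention are assumptions; the actual workflow filters via an `s3api` query before download.

```python
import sys


def filter_by_size(entries, max_mb):
    """Keep (name, size_bytes) entries at or under max_mb MB; log skips to stderr."""
    max_bytes = max_mb * 1024 * 1024
    kept = []
    for name, size in entries:
        if size > max_bytes:
            # Oversized grids risk GPU OOM (e.g. on an L4's 24 GB VRAM), so skip them
            print(f"skipping {name}: {size / 2**20:.1f} MB > {max_mb} MB", file=sys.stderr)
        else:
            kept.append(name)
    return kept


samples = [("mp-1.chgcar", 10 * 2**20), ("mp-2.chgcar", 40 * 2**20)]
print(filter_by_size(samples, 25))  # the 40 MB sample is skipped
```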
- Add `scripts/s3_sync.py`: standalone `uv run --script` CLI that syncs
  CHGCAR samples from S3, filters by file size, creates map file, and
  writes `DATASET_HASH` to `$GITHUB_OUTPUT` when in CI
- Replace ~30 lines of bash in `gpu-benchmark.yml` with script call
- Auto-compute `dataset_version` as MD5 hash of sorted sample IDs
  (in both `s3_sync.py` and `e2e_train.py` as fallback)
- Log per-sample file-size filtering to stderr in `e2e_train.py`
- Mirror S3 data under `data/s3/` (gitignored)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
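The deterministic dataset hash can be illustrated like this. Only "MD5 hash of sorted sample IDs" is from the commit message; the exact canonicalization (newline join, UTF-8 encoding) is an assumption.

```python
import hashlib


def dataset_hash(sample_ids) -> str:
    """MD5 over the sorted sample IDs: order-independent, so the same
    set of samples always yields the same dataset version string."""
    canonical = "\n".join(sorted(sample_ids))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


# Order of discovery doesn't matter:
assert dataset_hash(["mp-2", "mp-1"]) == dataset_hash(["mp-1", "mp-2"])
```

Because the hash depends only on which samples are present, any change to the synced sample set produces a new `dataset_version`, which is what makes benchmark runs comparable in WandB.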
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-williams marked this pull request as ready for review February 24, 2026 13:46
ryan-williams and others added 3 commits February 24, 2026 09:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR adds GPU-focused CI workflows (E2E determinism + benchmarks) and supporting tooling/tests to make model training runs reproducible and traceable across platforms/instances.

Changes:

  • Added EC2-backed GPU workflows for deterministic E2E training and scheduled/dispatchable benchmarks (with optional WandB logging and a summary report).
  • Added deterministic E2E training harness + expected-values artifacts, plus an S3 sync helper to build benchmark datasets reproducibly.
  • Added/expanded unit tests around Zarr conversion, S3 Zarr data loading, and SRGAN model components; added a demo notebook.

Reviewed changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 7 comments.

Show a summary per file:

  • tests/test_electrai/zarr_conversion/test_convert_to_zarr.py: New tests for CHGCAR loading and Zarr conversion.
  • tests/test_electrai/model/test_srgan_layernorm_pbc.py: New comprehensive unit tests for SRGAN/PBC components.
  • tests/test_electrai/dataloader/test_mp_zarr_s3_data.py: New unit tests for local/S3 Zarr dataset reader and splitting.
  • tests/test_e2e_train.py: Pytest wrapper that runs the deterministic training script in a subprocess.
  • tests/expected_values.json: Platform-specific expected loss values for the deterministic E2E check.
  • tests/e2e_train.py: Deterministic E2E training runner; supports GPU, filtering, and WandB logging.
  • scripts/s3_sync.py: CLI helper to sync/prepare S3 samples and produce a deterministic dataset hash.
  • .github/workflows/gpu-e2e.yml: EC2-backed workflow for GPU + CPU deterministic E2E training checks.
  • .github/workflows/gpu-benchmark.yml: EC2-backed GPU benchmark workflow (scheduled + manual) with optional CPU comparison and WandB integration.
  • .github/workflows/gen-expected.yml: Generates expected-values artifacts on GitHub-hosted macOS/Linux runners.
  • .github/workflows/README.md: Documentation for workflows and expected-values generation.
  • specs/04-ci-wandb.md: Spec/design doc for CI → WandB tracking and regression detection ideas.
  • pyproject.toml: Adds notebook execution dependencies + updates Ruff per-file ignores.
  • examples/e2e_training_demo.ipynb: Demo notebook showing deterministic training and visualization.
  • .gitignore: Ignores S3 mirror directory (data/s3/).


- Gate WandB logging on `WANDB_API_KEY` presence in `gpu-benchmark.yml`
- Move notebook deps (`nbconvert`, `ipykernel`, `papermill`) to `[notebooks]` optional extra
- Skip `gpu-e2e.yml` EC2 job on fork PRs (secrets unavailable)
- Fix `get_platform()` to use resolved accelerator, not `--gpu` flag
- Document S3 `ListObjectsV2` lexicographic ordering guarantee in `s3_sync.py`
- Fix stale WandB statement in `specs/04-ci-wandb.md`
- Update `.github/workflows/README.md` with current defaults and features

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.



- Fix `if final_train_loss:` → `is not None` (0.0 would be falsy)
- Use `$EPOCHS` input in CPU baseline step instead of hardcoded `5`
- Use env vars for S3 sync step inputs (script injection hardening)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
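The first fix guards against a classic Python pitfall: a loss of exactly 0.0 is falsy, so a truthiness check would silently treat it as missing. A minimal illustration:

```python
final_train_loss = 0.0  # a legitimate (if unlikely) loss value

# Buggy: a truthiness check skips the logging branch for 0.0, not just for None
logged_buggy = bool(final_train_loss)

# Correct: only a genuinely absent value is skipped
logged_fixed = final_train_loss is not None

assert logged_buggy is False
assert logged_fixed is True
```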

ryan-williams commented Mar 5, 2026

Re: re-review #3 — addressed in 2120663:

  • `if final_train_loss:` → `if final_train_loss is not None:`: fixed
  • CPU baseline `--epochs 5`: now uses `$EPOCHS` from the workflow input
  • `none`/`false`/`disabled` in CLI: skipping; the workflow handles it, and it's an unlikely edge case

@ryan-williams requested a review from Copilot March 5, 2026 22:00
@ryan-williams

@claude another review pls


claude bot commented Mar 5, 2026

Claude finished @ryan-williams's task in 3m 5s.


Code Review (re-review #4)

  • Gathering context
  • Reviewing changed files
  • Posting review feedback

All issues from the previous reviews are confirmed fixed. I found one unresolved issue and two minor nits.


Unresolved: # fmt: off in scripts/s3_sync.py still unclosed

scripts/s3_sync.py:40 — The Copilot review flagged that # fmt: off at line 40 is never closed, disabling formatting for the entire rest of the file. The response said "Replaced with # fmt: skip on the one long line. Fixed in 264f256." But looking at the actual diff in 264f256, only the import style changed (`import click` → `from click import command, option`); the # fmt: off was left in place and no # fmt: on or # fmt: skip was added.

e2e_train.py has the same decorator block pattern and correctly closes it with # fmt: on before def main( (line 77), which passes pre-commit hooks. The same fix applies here:

# scripts/s3_sync.py — add `# fmt: on` after the last decorator:
@option("-v", "--verbose", is_flag=True, help="Verbose output")
# fmt: on   <-- add this line
def main(

[Fix this →](https://claude.ai/code?q=In%20scripts%2Fs3_sync.py%20on%20branch%20gpu-ci%2C%20add%20%60%23%20fmt%3A%20on%60%20on%20a%20new%20line%20after%20line%2047%20(%40option(%22-v%22%2C%20%22--verbose%22%2C%20...))%20and%20before%20%60def%20main(%60%20on%20line%2048%2C%20to%20close%20the%20%60%23%20fmt%3A%20off%60%20block%20at%20line%2040%20and%20match%20the%20pattern%20used%20in%20e2e_train.py&repo=Quantum-Accelerators/electrai&branch=gpu-ci)


Minor: gen-expected.yml merge step uses bare open() without with

.github/workflows/gen-expected.yml:89-90 — The inline Python script uses open() without a with statement:

darwin_arm64 = json.load(open('tmp/darwin-arm64/expected_values.json'))
linux = json.load(open('tmp/linux/expected_values.json'))

Files are left open until GC. In this short-lived script context it's harmless, but inconsistent with the with open(...) as f: pattern used for the write on line 97. Trivial fix:

with open('tmp/darwin-arm64/expected_values.json') as f:
    darwin_arm64 = json.load(f)
with open('tmp/linux/expected_values.json') as f:
    linux = json.load(f)

Minor: specs/ci-wandb.md architecture section references gpu-weekly.yml

specs/ci-wandb.md:45 — The "Architecture" section includes a YAML snippet showing name: Weekly GPU Training with a workflow called gpu-weekly.yml (the original design name), while the actual implementation is gpu-benchmark.yml. Since this is a spec doc it's somewhat expected to retain the original design, but it may confuse future readers. A brief comment noting the actual filename would help.


Confirmed fixes from re-review #3

  • ✅ `if final_train_loss:` → `if final_train_loss is not None:` (tests/e2e_train.py:385)
  • ✅ CPU baseline `--epochs 5` → `--epochs "$EPOCHS"` (gpu-e2e.yml:96)
  • ✅ `none`/`false`/`disabled` CLI escape hatch: intentionally skipped

Summary

The PR is in excellent shape — four rounds of review have resulted in a very solid, well-tested CI addition. The unclosed # fmt: off in s3_sync.py is the only substantive remaining item (and is a straightforward one-line fix). The rest are cosmetic. Ready to merge once the # fmt: off is closed.


Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.



Avoids script injection risk by keeping all expression interpolations
in `env:` block rather than inline in `run:` script. Also fix stale
`gpu-weekly.yml` reference in spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ryan-williams

Thanks for the re-review! Responses:

# fmt: off in s3_sync.py: This was already fixed in 2120663 (the commit after 264f256 that you reviewed). The # fmt: off block was replaced with a targeted # fmt: skip on the one long --max-file-size line. Ruff's RUF028 doesn't allow # fmt: on between decorators and function def, so # fmt: skip is the right approach here.
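For reference, the placement being discussed looks like this. A stand-in `option` decorator replaces click so the snippet is self-contained, and the flag text is illustrative; the point is that `# fmt: skip` suppresses formatting for one line and is valid between decorators, whereas Ruff's RUF028 rejects `# fmt: on` in that position.

```python
def option(*param_decls, **attrs):
    """Stand-in for click.option, purely for illustration."""
    def decorator(f):
        return f
    return decorator


# `# fmt: skip` applies only to the long decorator line it sits on;
# no block-level # fmt: off / # fmt: on pair is needed.
@option("-M", "--max-file-size", type=int, default=25, help="Skip input files larger than this many MB")  # fmt: skip
def main(max_file_size: int = 25) -> int:
    return max_file_size


assert main() == 25
```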

bare open() in gen-expected.yml: Acknowledged, but this is a 3-line inline Python snippet in a CI step — the process exits immediately after. Skipping.

gpu-weekly.yml in spec: Fixed in 4d0c3a9 — updated to gpu-benchmark.yml.

@ryan-williams requested a review from Copilot March 6, 2026 13:39
@ryan-williams

@claude another review pls


claude bot commented Mar 6, 2026

Claude finished @ryan-williams's task in 2m 14s.


Code Review (re-review #5)

  • Gathering context
  • Reviewing changed files
  • Posting review feedback

All issues from the previous four rounds of review have been confirmed fixed. The PR is ready to merge. Full verification below.


Confirmed fixes from re-review #4

  • `# fmt: off` in scripts/s3_sync.py: replaced with a targeted `# fmt: skip` on the long `--max-file-size` line (line 42). No unclosed formatter block remains.
  • `${{ }}` interpolations in benchmark summary step: `GPU_TIME`, `CPU_TIME`, `WANDB_RUN_URL`, and `github.token` all moved to the `env:` block (lines 211–214); no direct interpolation in the `run:` script.
  • specs/ci-wandb.md: `gpu-weekly.yml` reference removed; architecture section updated.

Minor: bare open() in gen-expected.yml:89-90

Ryan has explicitly accepted this (short-lived inline script, process exits immediately). No action needed — flagging only for traceability.


Summary

Four rounds of review have addressed every substantive issue: the --epochs mismatch, dead LossTracker class, import time placement, hardcoded WandB entity, fork-PR guard, val_loss None guard, artifact download instructions, # fmt: off block, env-var security hardening, and CPU baseline epoch count. The codebase is clean, well-tested, and well-documented.

Ready to merge.


Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 1 comment.



Read `tests/expected_values.json` before merging so that keys from
other workflows (e.g. `linux-gpu` from `gpu-e2e.yml`) aren't dropped.
Also use `with` statements for file reads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated 2 comments.



Without it, `tests/expected_values.json` doesn't exist and the merge
step can't preserve `linux-gpu` keys from the repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 15 changed files in this pull request and generated no new comments.




@forklady42 forklady42 left a comment


Overall, looking good but I noticed a couple of divergences from the current model config.


# Hydra-style model config for LightningGenerator (uses hydra.utils.instantiate)
cfg.model = {
"_target_": "electrai.model.srgan_layernorm_pbc.GeneratorResNet",

electrai.model.resunet.ResUNet3D is now our default model.

cfg.dataset_name = "mp"
cfg.data_path = data_path or str(repo_root / "data/MP/chgcars/input")
cfg.label_path = label_path or str(repo_root / "data/MP/chgcars/label")
cfg.map_path = map_path or str(repo_root / "data/MP/map/map_sample.json.gz")

Is this supposed to match our current config? We've shifted patterns since this PR started 😅: we now rely on a txt file (similar to the map_path) with data and label directories under the same root path.

echo(f"Train samples: {len(train_data)}, Val samples: {len(test_data)}")

def dict_collate(batch):
"""Collate tuples from MPDataset into dicts for LightningGenerator."""

For my own clarity, MPDataset isn't a class, right? I initially assumed it was and looked around for it.

ryan-williams and others added 2 commits March 9, 2026 17:44
ResUNet3D is now the default model. Update hydra config dict,
regenerate darwin-arm64 expected values, remove stale linux/linux-gpu
values (to be regenerated via CI). Also fix `MPDataset` → `RhoData`
in docstring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace `--data-path`/`--label-path`/`--map-path` with `--data-root`
- Use `RhoRead` LightningDataModule (filelist + data/label dirs)
- Rename `data/MP/chgcars/input/` → `data/` to match `RhoRead` layout
- Add `mp_filelist.txt` for sample IDs
- Update `s3_sync.py` to output `data/` dir + filelist (not `input/` + json.gz map)
- Update benchmark workflow `DATA_ARGS`
- Regenerate darwin-arm64 expected values (linux/linux-gpu pending CI)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
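For illustration, the `RhoRead`-style layout this commit describes could be materialized like the sketch below. The directory and file names follow the commit message (`data/`, `mp_filelist.txt`); the helper itself and the `label/` sibling directory are assumptions, not code from the PR.

```python
from pathlib import Path


def build_data_root(root: Path, sample_ids) -> Path:
    """Create a RhoRead-style layout: data/ and label/ dirs under one
    root, plus mp_filelist.txt listing the sample IDs (sorted for
    deterministic ordering). Returns the filelist path."""
    (root / "data").mkdir(parents=True, exist_ok=True)
    (root / "label").mkdir(parents=True, exist_ok=True)
    filelist = root / "mp_filelist.txt"
    filelist.write_text("\n".join(sorted(sample_ids)) + "\n")
    return filelist


# Usage (hypothetical sample IDs):
# build_data_root(Path("data/s3"), ["mp-1234", "mp-5678"])
```

Under this layout, a single `--data-root` flag is enough for the training script to locate inputs, labels, and the filelist.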