
fix(build): remove numpy from dependency blacklist, rename to SIZE_PROHIBITIVE_PACKAGES #263

Open
deanq wants to merge 4 commits into main from deanq/ae-2410-fix-cpu-endpoint-missing-numpy

Conversation

@deanq (Member) commented Mar 10, 2026

Summary

  • Rename BASE_IMAGE_PACKAGES to SIZE_PROHIBITIVE_PACKAGES to reflect the actual constraint (500 MB tarball limit, not base image contents)
  • Remove numpy from the blacklist -- it was being stripped from CPU endpoint build artifacts where python-slim has no pre-installed packages
  • Keep torch, torchvision, torchaudio, triton (CUDA-specific, too large for tarball, pre-installed in GPU images)

Root cause: The blacklist was defined by what the GPU base image ships, not by physical size constraints. This silently broke CPU endpoints that declared numpy as a dependency.

Companion PRs:

Test plan

  • make quality-check passes
  • Verify numpy is no longer in SIZE_PROHIBITIVE_PACKAGES
  • Verify torch ecosystem packages are still excluded
  • Build a CPU endpoint with dependencies=["numpy"] and confirm numpy is in the tarball
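The two constant-level checks in the test plan can be sketched as plain assertions. This is a hypothetical mirror of the real unit tests: the constant name comes from this PR, but the module path is omitted here and the set contents are the four packages the summary says are kept.

```python
# Sketch of the test plan's constant checks (assumed set contents, per the PR summary).
SIZE_PROHIBITIVE_PACKAGES = {"torch", "torchvision", "torchaudio", "triton"}

# numpy must reach CPU tarballs, so it may no longer be auto-excluded
assert "numpy" not in SIZE_PROHIBITIVE_PACKAGES

# the CUDA-specific torch ecosystem stays excluded
for pkg in ("torch", "torchvision", "torchaudio", "triton"):
    assert pkg in SIZE_PROHIBITIVE_PACKAGES
```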

The dependency blacklist was defined by what the GPU base image ships
(torch, numpy, triton, etc.), which silently broke CPU endpoints using
python-slim. Numpy and similar packages aren't pre-installed in slim
images, so excluding them caused runtime ImportErrors.

Rename BASE_IMAGE_PACKAGES to SIZE_PROHIBITIVE_PACKAGES and remove numpy.
The blacklist now contains only packages that exceed the 500 MB tarball
limit (torch ecosystem + triton), which are CUDA-specific and never
needed by CPU endpoints.
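A minimal sketch of the build-time exclusion step this commit describes. The set contents match the PR; filter_dependencies is a hypothetical name for the helper, and the version-specifier stripping is an assumption, not the real implementation in build.py.

```python
# Hypothetical sketch of the auto-exclusion filter after this change.
SIZE_PROHIBITIVE_PACKAGES = {"torch", "torchvision", "torchaudio", "triton"}

def filter_dependencies(declared: list[str]) -> tuple[list[str], list[str]]:
    """Split declared deps into (bundled, auto-excluded)."""
    bundled, excluded = [], []
    for dep in declared:
        # Compare on the bare package name, ignoring extras and version pins
        name = dep.split("[")[0].split("=")[0].split("<")[0].split(">")[0].strip().lower()
        (excluded if name in SIZE_PROHIBITIVE_PACKAGES else bundled).append(dep)
    return bundled, excluded

# numpy now survives into the tarball; torch is still stripped
bundled, excluded = filter_dependencies(["numpy", "torch==2.3.0"])
```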
Copilot AI left a comment

Pull request overview

This PR updates the build-time “auto-exclude” package list to reflect tarball size constraints (rather than GPU base-image contents), and stops stripping numpy from CPU build artifacts.

Changes:

  • Renames BASE_IMAGE_PACKAGES to SIZE_PROHIBITIVE_PACKAGES and updates associated messaging.
  • Removes numpy from the auto-excluded set while keeping the torch/CUDA ecosystem exclusions.
  • Updates unit tests to reflect the new constant name and expected numpy behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

  • src/runpod_flash/cli/commands/build.py: renames and redefines the auto-exclusion set and updates exclusion messaging in code comments and docstrings.
  • tests/unit/cli/commands/test_build.py: updates tests to use the renamed constant and ensures numpy is no longer auto-filtered.


deanq added 2 commits March 9, 2026 20:42
ResourceDiscovery only found LB endpoints (ep = Endpoint(...) +
@ep.get/post) and @Remote patterns, missing the QB pattern where
@endpoint(...) decorates a function/class directly. This caused
--auto-provision to skip all queue-based endpoints.

- Add _is_endpoint_direct_decorator() AST check for @endpoint(...)
- Record decorated function/class name (not variable) for QB pattern
- Extract resource_config from __remote_config__ on wrapped functions
- Add 6 tests covering GPU/CPU QB, class, directory scan, mixed
- Update stale assertion from "Auto-excluded base image packages" to "Auto-excluded size-prohibitive packages"
- Fix "Numpy" casing to "NumPy" in test docstring
- Reword SIZE_PROHIBITIVE_PACKAGES comment to focus on size constraints, not runtime assumptions
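The QB discovery check these commits describe (an AST test for @endpoint(...) applied directly to a function or class, recording the decorated name rather than a variable) can be sketched with Python's ast module. This is an illustrative approximation; the real _is_endpoint_direct_decorator in ResourceDiscovery may differ.

```python
# Hypothetical sketch of the direct-decorator check described in the commit.
import ast

def is_endpoint_direct_decorator(node: ast.AST) -> bool:
    """True if a function/class is decorated with @endpoint(...)."""
    if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return False
    for dec in node.decorator_list:
        # Matches @endpoint(...): a Call whose func is the bare name "endpoint"
        if (isinstance(dec, ast.Call)
                and isinstance(dec.func, ast.Name)
                and dec.func.id == "endpoint"):
            return True
    return False

src = '''
@endpoint(gpu=False)
def handler(job):
    return job
'''
tree = ast.parse(src)
# Record the decorated function's own name, not an assigned variable
found = [n.name for n in ast.walk(tree) if is_endpoint_direct_decorator(n)]
```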
@runpod-Henrik (Contributor) left a comment

QA Results — PR #263 + flash-examples #42

What changed

  • CPU workers declaring numpy as a dependency now receive it in the build artifact. Previously numpy was silently stripped, causing import failures at runtime on CPU instances where the base image has no pre-installed packages.
  • @Endpoint(...) used directly on functions and classes is now correctly discovered by flash run, flash build, and flash deploy.

What works

  • CPU worker with dependencies=["numpy"] (deploys, numpy importable, correct output returned): PASS — numpy=2.4.3, mean/std/median all correct
  • flash build --exclude numpy (user exclusion flag still works after the rename): PASS — numpy absent from tarball, other deps present
  • flash run discovers QB-decorated workers (@Endpoint on functions) alongside existing workers: PASS — both mixed_worker endpoints appear in the dev server alongside 4 existing workers
  • CPU regression (02_cpu_worker, 03_mixed_workers, autoscaling, CPU LB): PASS — all 4 pass

What was not tested

  • GPU worker with dependencies=["numpy"]: numpy is no longer excluded from GPU tarballs as a side effect of this fix. The GPU base image has numpy pre-installed, so the bundled copy should be ignored at runtime, but this path has no E2E coverage. Risk: Medium
  • GPU regression: Risk: Low — no GPU-specific code paths changed

Verdict

Pass for the scenarios tested. The fix works for its stated purpose. The one untested path (GPU + numpy) is a behaviour change without coverage — risk is low given GPU base image precedence, but worth a note.

The server.py codegen imported LB config variables by their raw name
(e.g. "api"). When multiple files exported the same variable name,
later imports overwrote earlier ones, causing GPU LB endpoints to
dispatch to the wrong resource (CPU worker image instead of GPU).

Config variables are now imported with unique aliases derived from the
resource name (_cfg_{resource_name}). Also passes endpoint dependencies
through lb_execute to the stub so the remote worker installs them.
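The aliasing scheme this commit describes can be sketched as follows: each LB config variable is imported under a unique name derived from its module path, so two files exporting the same name (e.g. "api") no longer collide in the generated server.py. make_config_alias and render_import are hypothetical names, and the exact sanitization rule is an assumption based on the _cfg_ aliases quoted later in this thread.

```python
# Hypothetical sketch of unique-alias generation for LB config imports.
import re

def make_config_alias(module_path: str) -> str:
    """Turn a module path into a valid, per-file-unique Python identifier."""
    return "_cfg_" + re.sub(r"\W", "_", module_path)

def render_import(module_path: str, var_name: str) -> str:
    """Emit the codegen import line; each file gets its own alias."""
    alias = make_config_alias(module_path)
    return f"from {module_path} import {var_name} as {alias}"

# Two files both exporting "api" now import under distinct names
line = render_import("workers.gpu_lb", "api")
```

Each generated route then dispatches through its own alias, so a GPU LB route can no longer pick up a CPU file's config simply because both exported the same variable name.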
@runpod-Henrik (Contributor) left a comment

QA Update — commit 6bfac3e

Bug found and fixed: When flash run is invoked at the root of flash-examples with multiple LB workers exporting the same config variable name (api), the generated server overwrote the first import with the second, causing GPU LB endpoints to provision and dispatch through the CPU resource config instead.

Scenario: GET /03_advanced_workers/05_load_balancer/gpu_lb/info → was provisioning live-03_05_load_balancer_cpu-fb instead of live-03_05_load_balancer_gpu-fb.

Fix verified: Generated server.py now imports each config variable with a unique alias (_cfg__03_advanced_workers_05_load_balancer_gpu_lb, _cfg__03_advanced_workers_05_load_balancer_cpu_lb). All GPU LB routes dispatch through the GPU config, all CPU LB routes through the CPU config. Confirmed at root of flash-examples with all workers loaded.

This scenario was missing from the original test plan, which exercised flash run only in a single subdirectory where no two files share a variable name; root-level multi-LB dispatch correctness (not just discovery) was not covered.

@runpod-Henrik (Contributor) commented

QA Update — LB config alias fix

What changed: When flash run is started at the project root with multiple load-balancer files that export the same variable name, the generated dev server previously imported them under a shared name — the last import overwrote earlier ones, causing GPU LB routes to dispatch against the CPU resource config. Fixed by aliasing each LB config import uniquely per file.

Tested (manual E2E, flash run at project root):

  • GPU LB routes dispatch to the GPU resource, not the CPU resource
  • CPU LB routes dispatch to the CPU resource, unaffected by the fix
  • Queue-based endpoints alongside load-balancers continue to work correctly
  • Cross-worker load-balancer (LB calling QB workers) dispatches via the correct config
  • flash deploy with both CPU and GPU LB files provisions two distinct endpoints — no collision in the deploy path

Not tested: Deployed LB workers end-to-end (GPU endpoint not live during testing; CPU LB end-to-end covered in prior testing on this PR).

Verdict: Pass. No regressions found. The fix correctly scopes the alias to the dev server path — the deploy path is unaffected.
