Add GPU Support for rlaunch multi#495

Draft
sivonxay wants to merge 7 commits into materialsproject:main from sivonxay:multi_gpu
Conversation

@sivonxay
Contributor

There is currently no way to distribute GPUs among fireworks when running small jobs in parallel on one system.

An example: On NERSC, you get exclusive access to one Perlmutter node with 4 A100 GPUs. If you were to run 4 fireworks that each require 1 GPU, using rlaunch multi 4, each firework would be responsible for determining which GPUs to run on. Most Python code will default to checking CUDA_VISIBLE_DEVICES and taking either the first GPU or all of them, resulting in oversubscription that leads to poor performance or an error.

I don't believe this implementation would work for systems with non-NVIDIA/CUDA GPUs. I believe AMD devices require setting the HIP_VISIBLE_DEVICES environment variable instead, but I don't have access to any system with multiple AMD GPUs to test that.

This might not be the best way to implement this, but it does raise the question of whether there is a need for a more general way to distribute non-CPU devices (GPUs, TPUs) among sub-jobs.
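For reference, the general idea can be sketched as follows: partition the visible GPU ids into disjoint slices, and give each parallel sub-process an environment whose CUDA_VISIBLE_DEVICES contains only its own slice. This is a hypothetical illustration of the approach, not the code in this PR; the helper names `partition_gpus` and `env_for_subjob` are invented for the sketch.

```python
import os

def partition_gpus(gpu_ids, num_jobs, gpus_per_job=1):
    """Split a list of GPU id strings into disjoint slices, one per sub-job."""
    if len(gpu_ids) < num_jobs * gpus_per_job:
        raise ValueError("not enough GPUs for the requested sub-jobs")
    return [gpu_ids[i * gpus_per_job:(i + 1) * gpus_per_job]
            for i in range(num_jobs)]

def env_for_subjob(slice_ids):
    """Build an environment in which a sub-job sees only its own GPUs."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(slice_ids)
    return env

# Example: 4 GPUs, 4 one-GPU sub-jobs (as in the Perlmutter scenario above).
slices = partition_gpus(["0", "1", "2", "3"], num_jobs=4)
envs = [env_for_subjob(s) for s in slices]
```

Each launcher sub-process would then be spawned with its own `env` (e.g. via `subprocess.Popen(..., env=env)`), so the code running inside it can safely take every GPU it sees without oversubscribing. An AMD variant would presumably set HIP_VISIBLE_DEVICES the same way, as noted below.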

@computron computron added the stale Stale/abandoned PRs and issues label Mar 5, 2026
