Fix AssertionError: no app_id collisions expected when scheduling JobGroups locally #404
What does this PR do?
This PR fixes an AssertionError: no app_id collisions expected crash that occurs when submitting a JobGroup (multiple bundled tasks) using the local executor (executor="none").
Why is it needed?
Root Cause:
When scheduling a JobGroup locally, the previous logic iterated through the executables but failed to maintain a distinct dryrun_info for each task. Specifically, the dryrun_info variable was overwritten in the loop, causing the scheduler to attempt to register multiple tasks with conflicting or identical states. This triggered the collision check in the underlying torchx local scheduler.
Error Traceback:
Changes
nemo_run/run/torchx_backend/schedulers/local.py (or relevant file path).
Maintain a distinct dryrun_info for each executable within a JobGroup instead of overwriting the variable.
Test Plan
I have verified this fix using the reproduction script provided in the issue.
Created a JobGroup with multiple tasks.
Ran it with executor="none".
Confirmed that no AssertionError is raised.
Fixes #403
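For reviewers who want a feel for the overwrite pattern described under Root Cause, here is a minimal, self-contained sketch. It does not use the real nemo_run or torchx code: ToyDryrunInfo, ToyLocalScheduler, schedule_group_buggy and schedule_group_fixed are hypothetical names invented for illustration, and the actual scheduler logic may differ in its details. It only shows how reusing a single dryrun_info variable across tasks can trip a collision check of this kind, and how keeping one record per task avoids it.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDryrunInfo:
    """Hypothetical stand-in for a per-task dry-run record."""
    app_id: str
    command: str


@dataclass
class ToyLocalScheduler:
    """Hypothetical scheduler that rejects duplicate app_ids."""
    apps: dict = field(default_factory=dict)

    def schedule(self, info: ToyDryrunInfo) -> str:
        # Toy version of the kind of check named in the error message.
        assert info.app_id not in self.apps, "no app_id collisions expected"
        self.apps[info.app_id] = info
        return info.app_id


def schedule_group_buggy(scheduler, group_name, commands):
    # Bug pattern: the same variable is reassigned on every iteration,
    # so only the last task's info survives the loop and is then
    # submitted once per task.
    dryrun_info = None
    for index, command in enumerate(commands):
        dryrun_info = ToyDryrunInfo(f"{group_name}-{index}", command)
    return [scheduler.schedule(dryrun_info) for _ in commands]


def schedule_group_fixed(scheduler, group_name, commands):
    # Fix pattern: keep one distinct dryrun_info per task.
    infos = [
        ToyDryrunInfo(f"{group_name}-{index}", command)
        for index, command in enumerate(commands)
    ]
    return [scheduler.schedule(info) for info in infos]
```

In this toy model, calling schedule_group_buggy with two or more commands raises the assertion on the second submission, while schedule_group_fixed registers each task under its own app_id.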