@bzantium
What does this PR do?

This PR fixes a crash with "AssertionError: no app_id collisions expected" that occurs when submitting a JobGroup (multiple bundled tasks) using the local executor (executor="none").

Why is it needed?

Root Cause:
When scheduling a JobGroup locally, the previous logic iterated through the executables but did not keep a distinct dryrun_info for each task. The dryrun_info variable was overwritten on every loop iteration, so the scheduler tried to register multiple tasks with conflicting or identical state. This tripped the app_id collision check in the underlying torchx local scheduler.
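As a rough illustration of this failure mode, here is a minimal sketch. ToyScheduler, ToyDryrunInfo, and schedule_group_buggy are hypothetical stand-ins invented for this example, not the nemo_run or torchx code. The sketch assumes the app ID is fixed when the dry-run is created, so scheduling the same dry-run twice reuses the ID.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyDryrunInfo:
    """Hypothetical stand-in for a scheduler's dry-run result."""
    app_id: str
    cmd: str


@dataclass
class ToyScheduler:
    """Hypothetical scheduler that rejects a repeated app ID."""
    _apps: dict = field(default_factory=dict)

    def dryrun(self, name: str, cmd: str) -> ToyDryrunInfo:
        # The ID is fixed at dry-run time, so reusing a dry-run reuses the ID.
        return ToyDryrunInfo(app_id=f"{name}-{uuid.uuid4().hex[:8]}", cmd=cmd)

    def schedule(self, dryrun_info: ToyDryrunInfo) -> str:
        assert dryrun_info.app_id not in self._apps, "no app_id collisions expected"
        self._apps[dryrun_info.app_id] = dryrun_info.cmd
        return dryrun_info.app_id


def schedule_group_buggy(scheduler: ToyScheduler, name: str, cmds: list) -> list:
    """Overwrites dryrun_info in the loop, then schedules the last one repeatedly."""
    dryrun_info = None
    for cmd in cmds:
        dryrun_info = scheduler.dryrun(name, cmd)  # earlier values are lost
    # Every task is scheduled from the same dry-run, so the second call collides.
    return [scheduler.schedule(dryrun_info) for _ in cmds]
```

With two or more commands, the first schedule call registers the app and the second one raises the AssertionError, even though each dry-run received its own uuid4 suffix.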

Error Traceback:

  File ".../nemo_run/run/torchx_backend/schedulers/local.py", line 106, in schedule
    app_id = super().schedule(dryrun_info=dryrun_info)
  File ".../torchx/schedulers/local_scheduler.py", line 791, in schedule
    app_id not in self._apps
AssertionError: no app_id collisions expected since uuid4 suffix is used

Changes

  • Updated the scheduling logic in nemo_run/run/torchx_backend/schedulers/local.py (or relevant file path).
  • The code now correctly generates and uses unique dryrun_info for each executable within a JobGroup instead of overwriting the variable.
  • This ensures that each task is submitted with a unique App ID, preventing collisions.
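The pattern described in these changes can be sketched as follows. This is an assumption-laden illustration, not the actual diff: ToyScheduler and schedule_group are hypothetical names, and the real scheduler classes carry far more state.

```python
import uuid


class ToyScheduler:
    """Hypothetical scheduler; not the torchx or nemo_run API."""

    def __init__(self):
        self._apps = {}

    def dryrun(self, name, cmd):
        # Each dry-run carries its own uuid4-suffixed app ID.
        return {"app_id": f"{name}-{uuid.uuid4().hex[:8]}", "cmd": cmd}

    def schedule(self, dryrun_info):
        app_id = dryrun_info["app_id"]
        assert app_id not in self._apps, "no app_id collisions expected"
        self._apps[app_id] = dryrun_info["cmd"]
        return app_id


def schedule_group(scheduler, name, cmds):
    """Create one dry-run per executable and schedule each exactly once."""
    dryrun_infos = [scheduler.dryrun(name, cmd) for cmd in cmds]
    return [scheduler.schedule(info) for info in dryrun_infos]
```

Keeping the dry-runs in a list, rather than in a single variable reassigned inside the loop, means each executable is scheduled from its own dry-run and therefore its own app ID.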

Test Plan

I have verified this fix using the reproduction script provided in the issue.

  1. Created a JobGroup with multiple tasks.
  2. Ran the experiment with executor="none".
  3. Confirmed that all tasks were scheduled and executed without the AssertionError.

Fixes #403

Development

Successfully merging this pull request may close these issues.

`AssertionError: no app_id collisions expected` when scheduling JobGroup with multiple executables (Local Executor)
