
[AutoTuner] Fix flaky resume check test #3966

Open
harsh-kumar-patwa wants to merge 1 commit into The-OpenROAD-Project:master from harsh-kumar-patwa:fix-resume-check-test


@harsh-kumar-patwa
Contributor

Summary

  • Replaces the fixed time.sleep(120) in the resume check test with a polling approach using Ray Tune's ExperimentAnalysis to detect when trials complete
  • Adds a managed_process context manager for safe subprocess cleanup and a stop_ray_cluster helper that retries until Ray shuts down cleanly
  • Re-enables the resume check test in test_autotuner.sh
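
For readers unfamiliar with the pattern, here is a minimal sketch of polling with a deadline instead of a fixed sleep. It is not the code from this PR: `wait_until` and its parameters are hypothetical names, and the predicate is left abstract. In the actual test the predicate would presumably wrap a status check built on Ray Tune's ExperimentAnalysis.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration only. Returns True if the
    condition was met, False if the deadline passed first.
    """
    # A monotonic clock is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Compared with `time.sleep(120)`, the wait ends as soon as the condition holds on fast machines, and on slow machines the timeout gives an explicit failure rather than a race.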

Closes #3005

This addresses the review feedback from the draft PR #3070 by @vvbandeira. The main improvements over that draft are:

  • Uses ExperimentAnalysis to check experiment status instead of a fixed sleep, making the test reliable across different hardware
  • Properly cleans up subprocesses on failure using a context manager
  • Ensures the Ray cluster is fully stopped before resuming
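
The cleanup ideas above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the body of `managed_process` here is a guess at what such a context manager might do, `run_until_success` is a hypothetical stand-in for the retry logic in `stop_ray_cluster`, and treating a zero exit code as "shut down cleanly" is an assumption.

```python
import subprocess
import sys
import time
from contextlib import contextmanager


@contextmanager
def managed_process(cmd, grace_seconds=5.0, **popen_kwargs):
    """Start `cmd` and make sure it is not left running when the block exits."""
    proc = subprocess.Popen(cmd, **popen_kwargs)
    try:
        yield proc
    finally:
        # Runs on normal exit and when the block raises.
        if proc.poll() is None:
            proc.terminate()  # ask politely first
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # force termination if the request is ignored
                proc.wait()


def run_until_success(cmd, attempts=5, delay=1.0):
    """Re-run `cmd` until it exits with status 0 (assumed success criterion)."""
    for attempt in range(attempts):
        if subprocess.run(cmd).returncode == 0:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Stand-in for a long-running tuning job.
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    with managed_process(sleeper) as proc:
        pass  # the test body would go here
    assert proc.poll() is not None  # child has been reaped
```

A call such as `run_until_success(["ray", "stop"])` would be one way to use the second helper before resuming; whether the PR retries on exit code or on some other signal is not shown in this description.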

Test plan

  • CI passes the re-enabled resume check test (tools.AutoTuner.test.resume_check.ResumeCheck.test_tune_resume)
  • Verify no regressions in other AutoTuner tests

@harsh-kumar-patwa
Contributor Author

@luarss @vvbandeira @maliberty This PR fixes the flaky resume check test that has been disabled since issue #3005. I reviewed the feedback on the draft PR #3070 and used a different approach based on Ray Tune's ExperimentAnalysis to poll for experiment status instead of using a fixed sleep. Would appreciate your review.

Replace the fixed time.sleep(120) with ExperimentAnalysis-based polling
to reliably detect when trials complete before stopping the initial run.
This addresses the flakiness reported in issue The-OpenROAD-Project#3005 and the review
feedback from draft PR The-OpenROAD-Project#3070.

Key changes:
- Use Ray Tune ExperimentAnalysis to poll experiment status instead of
  fixed sleep
- Add managed_process context manager for safe subprocess cleanup
- Add stop_ray_cluster helper that retries until Ray shuts down cleanly
- Re-enable the resume check test in test_autotuner.sh

Signed-off-by: Harsh <harshkumar3446@gmail.com>
Signed-off-by: Harsh Kumar <harshkumar3446@gmail.com>
@luarss luarss added the autotuner Flow autotuner label Mar 8, 2026
@luarss
Contributor

luarss commented Mar 9, 2026

Tests seem to be failing

@maliberty maliberty requested a review from luarss March 9, 2026 15:16

Development

Successfully merging this pull request may close these issues.

Re-enable AutoTuner ResumeCheck tests
