-
Notifications
You must be signed in to change notification settings - Fork 536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Jobs] Move task retry logic to correct branch in stream_logs_by_id
#4407
base: master
Are you sure you want to change the base?
Conversation
@cblmemo PTAL, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @andylizf ! mostly lgtm. Left one discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andylizf for fixing this! This looks mostly good to me. Could we add a smoke test for this new fix, so we can catch any regression in the future?
sky/jobs/utils.py
Outdated
@@ -384,8 +384,29 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str: | |||
job_statuses = backend.get_job_status(handle, stream_logs=False) | |||
job_status = list(job_statuses.values())[0] | |||
assert job_status is not None, 'No job found.' | |||
assert task_id is not None, job_id | |||
if job_status == job_lib.JobStatus.FAILED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be a few different statuses that can trigger the retry, including FAILED_SETUP
and FAILED
. Should we add both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree we should add both FAILED
and FAILED_SETUP
as retry triggers since setup failure is also a user program failure.
Also, The comment mentions FAILED_SETUP needs to be after FAILED - does this ordering affect our retry logic implementation?
# The job setup failed (only in effect if --detach-setup is set). It
# needs to be placed after the `FAILED` state, so that the status
# set by our generated ray program will not be overwritten by
# ray's job status (FAILED).
# This is for a better UX, so that the user can find out the reason
# of the failure quickly.
FAILED_SETUP = 'FAILED_SETUP'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! lgtm. should be good to go after adding smoke test.
@cblmemo The smoke test passed, PTAL, thanks! |
Co-authored-by: Tian Xia <[email protected]>
Fixes #4250
The task retry logic in
stream_logs_by_id
was placed in the wrong branch of the if-else structure. It was in theelse
branch ofif task_id < num_tasks - 1 and follow
, which meant it only triggered when the last task failed, rather than when each task failed.Move the retry logic to be checked immediately after we detect a job failure, using
JobStatus.FAILED
which stands for user code failure, before handling the task progression logic.Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
Manual Test
For a job with task retry enabled:
Before fix
The job immediately terminates after first failure without retrying.
After Fix
The job properly retries the failed task once before terminating.