[Core] Fix jobs longer than 12 days (#3460)
* Fix jobs longer than 12 days

* format

* address comment
Michaelvll authored Apr 22, 2024
1 parent 1bdcd01 commit 19df9d3
Showing 1 changed file with 8 additions and 1 deletion.
9 changes: 8 additions & 1 deletion sky/backends/cloud_vm_ray_backend.py
@@ -268,7 +268,14 @@ def get_or_fail(futures, pg) -> List[int]:
     \"\"\"Wait for tasks, if any fails, cancel all unready.\"\"\"
     returncodes = [1] * len(futures)
     # Wait for 1 task to be ready.
-    ready, unready = ray.wait(futures)
+    ready = []
+    # Keep invoking ray.wait if ready is empty. This is because
+    # ray.wait with timeout=None will only wait for 10**6 seconds,
+    # which will cause tasks running for more than 12 days to return
+    # before becoming ready. (Such tasks are common in serving jobs.)
+    # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846
+    while not ready:
+        ready, unready = ray.wait(futures)
     idx = futures.index(ready[0])
     returncodes[idx] = ray.get(ready[0])
     while unready:
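
The retry pattern in this change can be illustrated without Ray. The sketch below is an analogy only: it uses the standard library's `concurrent.futures.wait` with an artificially small timeout to stand in for the 10**6-second cap described in the comment above. The helper name `wait_for_first`, the `slow_task` function, and the 0.05-second cap are hypothetical and are not part of SkyPilot or Ray.

```python
import concurrent.futures
import time

# The cap described in the commit comment, converted to days.
CAP_SECONDS = 10**6
cap_days = CAP_SECONDS / 86400  # about 11.57 days, hence "12 days"


def wait_for_first(futures, timeout_cap):
    """Hypothetical helper: block until at least one future is done.

    A single capped wait may return with nothing ready, so keep
    calling it until the `done` set is non-empty.
    """
    done = set()
    while not done:
        done, not_done = concurrent.futures.wait(
            futures,
            timeout=timeout_cap,
            return_when=concurrent.futures.FIRST_COMPLETED,
        )
    return done, not_done


def slow_task():
    # Stand-in for a long-running job; outlives the cap used below.
    time.sleep(0.3)
    return 0


with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    futures = [pool.submit(slow_task)]
    # One wait with a 0.05 s cap would return before the task finishes;
    # the loop inside wait_for_first retries until it is actually done.
    done, not_done = wait_for_first(futures, timeout_cap=0.05)
    returncode = next(iter(done)).result()
```

Without the loop, code that indexes the first ready item (as `ready[0]` does in the diff) would fail on an empty result once the cap elapses.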