[Bug] Redis Cleanup Job blocks deletion if GCS didn't start #1725
Comments
It seems to be the right time to make the Redis cleanup a best-effort task (#1557 (comment)) cc @rueian
FWIW, you can also let the user toggle between strict and best-effort with a flag.
Hi @smit-kiri, could you let me know what error message you saw on the Redis Cleanup job? Or did it just exit with code 1 without any message?
I think it will fail immediately instead of retrying, as the key doesn't exist in Redis. You can see the details in the "Note" section of this PR description: #1592 (comment).
Yeah, so I wonder whether we could have another option: treating key-not-found as success?
That is better behavior. It may require a change in Ray. Would you mind opening an issue in the Ray repository?
That makes sense too! Does the job retry deleting if it fails the first time, though? The Redis instance might not be available on the first attempt, but it might be after a couple of minutes.
Yes, it will retry several times if your Redis has not started yet. The detailed behavior is covered in PR #1757, which can solve your current issue. @kevin85421 I will also open a PR later to make the cleanup job work in a best-effort manner by default in any case.
Closed by #1766
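For readers following the thread, here is a minimal sketch of the two ideas discussed above: a flag that toggles between strict and best-effort cleanup, and treating a missing key as success. It is not KubeRay's or Ray's actual code. `KeyNotFound`, `cleanup`, `delete_key`, and `strict` are hypothetical names chosen for illustration, and the choice to treat a missing key as success even in strict mode is an assumption.

```python
class KeyNotFound(Exception):
    """Hypothetical error: the cluster's key is absent from external storage."""


def cleanup(delete_key, strict=False):
    """Return True if resource deletion may proceed after the cleanup attempt.

    delete_key: zero-argument callable that performs the cleanup (assumed).
    strict: when True, a failure other than "key not found" blocks deletion.
    """
    try:
        delete_key()
    except KeyNotFound:
        # Nothing was ever written (e.g. GCS never started), so there is
        # nothing to clean up; count this as success in both modes.
        return True
    except Exception:
        # Best-effort mode lets deletion continue; strict mode blocks it.
        return not strict
    return True
```

With this shape, the case reported in this issue (the key was never created because GCS did not start) would no longer block deletion in either mode.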
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When trying to delete a RayService where the head node hadn't spun up, the Redis Cleanup job runs into an error and the pod is stuck in an Error state. The RayCluster resource doesn't get deleted until the job finishes successfully.

I would expect the job to retry a few times with a backoff and still delete the resource if it doesn't succeed after a set number of retries.
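The expected behavior could look roughly like the sketch below: retry the cleanup a bounded number of times with exponential backoff, then report the outcome so the caller can delete the resource either way. This only illustrates the request and is not how the operator implements it. `cleanup_with_backoff`, `attempt`, `max_attempts`, and `base_delay` are hypothetical names, and the defaults are arbitrary.

```python
import time


def cleanup_with_backoff(attempt, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt() until it succeeds or max_attempts is reached.

    Returns (succeeded, attempts_made). The delay doubles after each
    failure: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for i in range(max_attempts):
        try:
            attempt()
            return True, i + 1
        except Exception:
            # Do not sleep after the final failed attempt.
            if i < max_attempts - 1:
                sleep(base_delay * 2 ** i)
    return False, max_attempts
```

The caller would then remove the RayCluster regardless of the first return value, and log a warning when it is False so leftover Redis data can be cleaned up manually.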
Reproduction script
Create a RayService resource and delete it while the head pod is still in Pending or ContainerCreating stage.
Anything else
No response
Are you willing to submit a PR?