[Bug] Redis Cleanup Job blocks deletion if GCS didn't start #1725

Closed
1 of 2 tasks
smit-kiri opened this issue Dec 8, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@smit-kiri

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When trying to delete a RayService where the head node hadn't spun up, the Redis Cleanup job runs into an error and the pod is stuck in an Error state. The RayCluster resource doesn't get deleted until the job finishes successfully.

I would expect the job to retry a few times with a backoff and still delete the resource if it doesn't succeed after a set number of retries.
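The expected behavior could be sketched roughly as below. This is a minimal Python illustration, not KubeRay's actual implementation: `cleanup_with_retries`, `max_retries`, and `base_delay` are hypothetical names, and `cleanup` stands in for whatever the Redis cleanup job runs.

```python
import time
from typing import Callable


def cleanup_with_retries(
    cleanup: Callable[[], None],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `cleanup`, retrying with exponential backoff on failure.

    Returns True if an attempt succeeded and False if all attempts failed.
    In both cases the caller can go on to delete the resource.
    """
    for attempt in range(max_retries + 1):
        try:
            cleanup()
            return True
        except Exception:
            if attempt == max_retries:
                break  # retries exhausted; stop blocking deletion
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return False
```

The key point is that a `False` result is reported rather than raised, so a cleanup that never succeeds no longer keeps the RayCluster from being deleted.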

Reproduction script

Create a RayService resource and delete it while the head pod is still in the Pending or ContainerCreating stage.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@smit-kiri smit-kiri added the bug Something isn't working label Dec 8, 2023
@kevin85421
Member

It seems to be the right time to make the Redis cleanup a best-effort task (#1557 (comment)) cc @rueian

@smit-kiri
Author

FWIW, you can also let the user toggle between strict and best-effort with a flag
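As a rough sketch of what such a toggle might look like (the names `CleanupPolicy` and `may_delete_cluster` are made up for illustration and do not correspond to any existing KubeRay flag):

```python
from enum import Enum


class CleanupPolicy(Enum):
    """Hypothetical user-facing setting for the Redis cleanup job."""

    STRICT = "strict"            # deletion waits for a successful cleanup
    BEST_EFFORT = "best-effort"  # deletion proceeds even if cleanup failed


def may_delete_cluster(cleanup_succeeded: bool, policy: CleanupPolicy) -> bool:
    """Decide whether the cluster resource may be deleted."""
    if policy is CleanupPolicy.BEST_EFFORT:
        return True
    return cleanup_succeeded
```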

@rueian
Contributor

rueian commented Dec 15, 2023

Hi @smit-kiri,

Could you let me know what error message you saw on the Redis Cleanup job? Or did it just exit with code 1 without any message?

@kevin85421
Member

@rueian

I think it will fail immediately instead of retrying, as the key doesn't exist in Redis. You can see the details in the "Note" section of this PR description: #1592 (comment).

@rueian
Contributor

rueian commented Dec 15, 2023

> @rueian
>
> I think it will fail immediately instead of retrying, as the key doesn't exist in Redis. You can see the details in the "Note" section of this PR description: #1592 (comment).

Yeah, so I wonder: could we have another option, treating key not found as success?

@kevin85421
Member

> treating key not found as success?

This is a better behavior. This may require a change in Ray. Would you mind opening an issue in the Ray repository?
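To illustrate the idea only: the sketch below uses an in-memory dict in place of Redis, and `KeyNotFoundError`, `delete_storage_key`, and `cleanup_tolerating_missing_key` are hypothetical names, not Ray or KubeRay APIs.

```python
class KeyNotFoundError(Exception):
    """Hypothetical error for a storage key that is absent from the store."""


def delete_storage_key(store: dict, key: str) -> None:
    """Stand-in for the real cleanup call; raises if the key is missing."""
    if key not in store:
        raise KeyNotFoundError(key)
    del store[key]


def cleanup_tolerating_missing_key(store: dict, key: str) -> str:
    """Delete `key`, treating a missing key as a successful no-op."""
    try:
        delete_storage_key(store, key)
    except KeyNotFoundError:
        # GCS never wrote the key, so there is nothing left to clean up.
        return "already-absent"
    return "deleted"
```

Both return values count as success; only other errors (for example, Redis being unreachable) would still need a retry.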

@smit-kiri
Author

> Yeah, so I wonder: could we have another option, treating key not found as success?

That makes sense too! Does the job retry the deletion if it fails the first time, though? The Redis instance might not be available on the first attempt, but it might be after a couple of minutes.

@rueian
Contributor

rueian commented Dec 17, 2023

> > Yeah, so I wonder: could we have another option, treating key not found as success?
>
> That makes sense too! Does the job retry the deletion if it fails the first time, though? The Redis instance might not be available on the first attempt, but it might be after a couple of minutes.

Yes, it will retry several times if your Redis has not started yet. The detailed behavior is covered in PR #1757, which can solve your current issue.

@kevin85421 I will also open a PR later to make the cleanup job work in a best-effort manner by default in any case.

@kevin85421
Member

Closed by #1766


3 participants