[Bug] Readiness and liveness probes failing when applying ray-service.sample.yaml file #2269
You can check Step 9 for more details: https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-9-why-1-worker-pod-isnt-ready.

If you are interested in contributing to Ray or KubeRay, you can open a PR to add a new entry to Ray's documentation and then add the link to Step 9 in that section. You can ping me for review.
Should I raise a PR on the repo?
Instead of updating the YAML, I would prefer to update Step 4 to explain why the readiness probe failure is expected behavior.
I have raised a PR for RayService troubleshooting. Can you please check it?
Hello, I have seen this error while following the steps on Deploy on Kubernetes:

Normal   Created    2m26s                 kubelet  Created container ray-worker
Normal   Started    2m26s                 kubelet  Started container ray-worker
Warning  Unhealthy  48s (x19 over 2m13s)  kubelet  Readiness probe failed: success

I have tried the suggested solution, but it seems to work only for that use case. I am still figuring out what is going on. The logs show me this:

$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n -c ray-worker
error: error from server (NotFound): pods "rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n" not found in namespace "kuberay"
ubuntu@ip-172-31-14-240:~$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-rjnsd -c ray-worker
[2024-11-19 03:55:22,190 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2024-11-19 03:55:23,193 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2024-11-19 03:55:22,088 INFO scripts.py:926 -- Local node IP: 10.1.34.83
2024-11-19 03:55:24,199 SUCC scripts.py:939 -- --------------------
2024-11-19 03:55:24,199 SUCC scripts.py:940 -- Ray runtime started.
2024-11-19 03:55:24,199 SUCC scripts.py:941 -- --------------------
2024-11-19 03:55:24,199 INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-11-19 03:55:24,199 INFO scripts.py:944 -- ray stop
2024-11-19 03:55:24,199 INFO scripts.py:952 -- --block
2024-11-19 03:55:24,199 INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-11-19 03:55:24,199 INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

I have done some basic searching on the message; I will update this comment with the progress I make.

Update #1 [13:27 11/19/2024]: It seems that tests of the samples are being skipped (#2475).
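The "Readiness probe failed" event above can be dug into a bit further with a couple of kubectl commands. This is a hedged sketch, not something from the thread: the pod name is the one quoted in the logs above, and the port 52365 / `api/local_raylet_healthz` path is the Ray dashboard-agent default that KubeRay-style probes typically hit; adjust both if your setup differs.

```shell
# Hypothetical debugging commands, assuming kubectl access to the cluster.
# The pod name is the example from this thread; substitute your own.
POD="rayservice-sample-raycluster-dq4cs-small-group-worker-rjnsd"

# Show kubelet events, including the "Unhealthy ... Readiness probe failed"
# entries. "|| true" keeps the sketch from aborting if the pod is gone or
# kubectl is unavailable.
kubectl -n kuberay describe pod "$POD" | grep -B1 -A2 "Unhealthy" || true

# Query the health endpoint from inside the container (port 52365 is the
# default dashboard-agent port; adjust if your cluster overrides it):
kubectl -n kuberay exec "$POD" -c ray-worker -- \
  wget -q -O- http://localhost:52365/api/local_raylet_healthz || true
```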
I ran into this very problem, but the root cause was that I was running the x86_64 images on an M1 Mac. Things worked after I switched to the aarch64 images of Ray.
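For anyone wanting to rule out the image/CPU architecture mismatch described above, a minimal sketch follows. The kubectl jsonpath query assumes you have cluster access (it is not from this thread); `uname -m` works anywhere.

```shell
# Print the local CPU architecture: "arm64" on Apple Silicon,
# "x86_64" on typical Intel/AMD machines.
uname -m

# With cluster access, compare against what each node's kubelet reports.
# "|| true" lets the sketch run even where kubectl is unavailable.
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}' \
  || true
```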
@drZoid Interesting. Thanks for bringing this up. In my case, this is all being executed on x86_64.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
When following the setup tutorial, Step 3 points to an incorrect 'ray-service.sample.yaml'. When applying that file, the worker node crashes and the logs suggest that the readiness/liveness probes failed. The expected behaviour is as follows:
But in reality:
Reproduction script
Followed this tutorial
YAML file
Anything else
No response
Are you willing to submit a PR?