
[Bug] Readiness and liveness probes failing when applying ray-service.sample.yaml file #2269

Open
YASHY2K opened this issue Jul 24, 2024 · 9 comments
Labels
bug Something isn't working docs Improvements or additions to documentation rayservice

Comments

@YASHY2K

YASHY2K commented Jul 24, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When following the setup tutorial, Step 3 points to an incorrect 'ray-service.sample.yaml'. When applying that file, the worker pod crashes and the logs suggest that the readiness/liveness probe failed. The expected behaviour is as follows:

rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5 1/1 Running 0 3m52s
rayservice-sample-raycluster-6mj28-head-x77h4 1/1 Running 0 3m52s

But in reality:

rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5 0/1 Running 0
rayservice-sample-raycluster-6mj28-head-x77h4 1/1 Running 0
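
To see why the kubelet marks the worker pod unready, the probe definition and the recent pod events can be inspected directly. This is a generic diagnostic sketch (the pod name is the one from the output above; substitute your own):

$ kubectl describe pod rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5 | grep -A 5 Readiness
$ kubectl get events --field-selector involvedObject.name=rayservice-sample-raycluster-6mj28-worker-small-group-kg4v5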

Reproduction script

Followed this tutorial
YAML file

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@YASHY2K YASHY2K added bug Something isn't working triage labels Jul 24, 2024
@kevin85421
Member

If you are interested in contributing to Ray or KubeRay, you can open a PR to add a new issue in Ray's documentation and then add the link to step 9 in that section.

@kevin85421
Member

You can ping me for review.

@kevin85421 kevin85421 added rayservice docs Improvements or additions to documentation and removed triage labels Jul 27, 2024
@YASHY2K
Author

YASHY2K commented Jul 29, 2024

Should I raise a PR on the repo?
The only change I have made is to the RayService sample YAML file; it shouldn't break anything.

@kevin85421
Member

Instead of updating the YAML, I would prefer to update Step 4 to explain why the readiness probe failure is expected behavior.
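
For readers who hit this, a quick way to see exactly what the injected probes check (and therefore why they can fail while the cluster and its Serve applications are still starting) is to print them from the pod spec. This is a generic sketch, not a command from the docs; it assumes the Ray container is the first container in the worker pod:

$ kubectl get pod <worker-pod-name> -o jsonpath='{.spec.containers[0].readinessProbe}'
$ kubectl get pod <worker-pod-name> -o jsonpath='{.spec.containers[0].livenessProbe}'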

@YASHY2K
Author

YASHY2K commented Jul 29, 2024

I have raised a PR for RayService Troubleshooting. Can you please check it?

@frivas-at-navteca

frivas-at-navteca commented Nov 19, 2024

Hello, I have seen this error while following the steps in the Deploy on Kubernetes guide.

  Normal   Created    2m26s                 kubelet            Created container ray-worker
  Normal   Started    2m26s                 kubelet            Started container ray-worker
  Warning  Unhealthy  48s (x19 over 2m13s)  kubelet            Readiness probe failed: success

I have tried the suggested solution, but it seems to work only for that use case.

I am still figuring out what is going on. The logs show me this:

$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n -c ray-worker
error: error from server (NotFound): pods "rayservice-sample-raycluster-dq4cs-small-group-worker-jsc7n" not found in namespace "kuberay"
ubuntu@ip-172-31-14-240:~$ klo -n kuberay rayservice-sample-raycluster-dq4cs-small-group-worker-rjnsd -c ray-worker
[2024-11-19 03:55:22,190 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2024-11-19 03:55:23,193 W 8 8] global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
2024-11-19 03:55:22,088	INFO scripts.py:926 -- Local node IP: 10.1.34.83
2024-11-19 03:55:24,199	SUCC scripts.py:939 -- --------------------
2024-11-19 03:55:24,199	SUCC scripts.py:940 -- Ray runtime started.
2024-11-19 03:55:24,199	SUCC scripts.py:941 -- --------------------
2024-11-19 03:55:24,199	INFO scripts.py:943 -- To terminate the Ray runtime, run
2024-11-19 03:55:24,199	INFO scripts.py:944 --   ray stop
2024-11-19 03:55:24,199	INFO scripts.py:952 -- --block
2024-11-19 03:55:24,199	INFO scripts.py:953 -- This command will now block forever until terminated by a signal.
2024-11-19 03:55:24,199	INFO scripts.py:956 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

I have done some basic searching on the message global_state_accessor.cc:465: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?, but I have not been able to tell whether it is the cause of the error or just a harmless warning.

I will be updating this with the progress I make.

Update # 1 [13:27 11/19/2024]: It seems that tests of the samples are being skipped #2475
Update # 2 [13:09 11/22/2024]: I increased the resources (CPU and memory), but it doesn't seem to show any improvement. However, when I test the application it works perfectly. One more thing I tested was using Bitnami's image, and I got a different error in the logs. As it is a warning and everything seems to work correctly, I am giving other more pressing matters priority and will get back to this.
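
One way to check whether the GCS warning above points at a real connectivity problem is to run Ray's health-check command from inside the worker container against the head's GCS port. This is a generic sketch rather than something from the thread; the pod name and head service name are placeholders:

$ kubectl exec -n kuberay <worker-pod-name> -c ray-worker -- ray health-check --address <head-service-name>:6379
$ echo $?   # 0 means the worker can reach a healthy GCS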

@drZoid

drZoid commented Dec 12, 2024

I ran into this very problem, but the root cause was that I was running the x86_64 images on an M1 Mac. Things worked after I switched to the aarch64 images of Ray.
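
For anyone on Apple silicon, the fix above amounts to pointing the Ray containers in the sample YAML at an arm64 build of the image. A minimal sketch of the relevant fields, assuming an aarch64 tag exists for your Ray version (check Docker Hub for the exact tag; the head container name is assumed to follow the sample's convention):

# In ray-service.sample.yaml, under both headGroupSpec and workerGroupSpecs:
        containers:
          - name: ray-head   # ray-worker in the worker group
            image: rayproject/ray:2.9.0-aarch64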

@frivas-at-navteca

@drZoid Interesting, thanks for bringing this up. In my case this is all being executed on x86_64.
