fix: fix too short timeout causing cascading failures #4133

morotti · 2025-10-17T16:25:38Z

Why are these changes needed?

Hello,

The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.

In addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested in CI (hopefully your test suite will run once the PR is opened ? :) )
- This PR has been tested in production

The 2-second timeout on liveness probes is way too short. It causes cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods. In addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.

Signed-off-by: morotti <[email protected]>

morotti · 2025-10-17T16:28:58Z

Screenshot from production, this is your ray software failing in production ;)

You can see the Ray pod has been up for a few hours and is failing checks sporadically because the timeout is too short.

andrewsykim · 2025-10-17T17:53:23Z

I've encountered similar issues in the past from my testing, see #2355

We also increased the exec probe timeout for the Head pod to 5s, so I am also open to increasing it to 5s in worker pods #2353

andrewsykim · 2025-10-17T17:54:32Z

Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (#2360); however, it's blocked on Ray unifying its health check endpoints: ray-project/ray#56204
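For illustration, an HTTP-based probe could look roughly like the sketch below. This is a hypothetical fragment, not the content of #2360: the path and port are taken from the raylet health URL used in the sample manifest's exec command, and the timeout value is only an example. Note that an httpGet probe checks the HTTP status code of a single endpoint, whereas the current exec command greps the response body of both the raylet and GCS endpoints, which is presumably why a unified endpoint is a prerequisite.

```yaml
# Hypothetical sketch of an HTTP liveness probe; values are illustrative,
# not KubeRay defaults.
livenessProbe:
  httpGet:
    # Assumption: same raylet health endpoint the exec probe queries via wget.
    path: /api/local_raylet_healthz
    port: 52365
  # The kubelet issues the request itself, so no process is spawned
  # inside the container for each check.
  timeoutSeconds: 2
```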

andrewsykim · 2025-10-17T17:55:19Z

ray-operator/config/samples/ray-cluster.auth.yaml

              - bash
              - -c
-              - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
+              - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success


It's not enough to increase the wget timeout; you also need to increase the probe timeout (timeoutSeconds) in the container's probe config.

btw, this is just an example manifest, you need to update the controller logic if you want to change default behavior
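As a minimal sketch of the point about the probe config, the fragment below pairs the wget timeout with the Kubernetes probe's own timeoutSeconds. The command is copied from the diff above; the timeoutSeconds value is an assumption chosen for illustration, not a KubeRay default, and this is a manifest-level example only, so it does not change what the controller injects.

```yaml
# Illustrative only: shows the two timeouts that have to agree.
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
  # The kubelet applies this limit to the whole command. If it stays at 2,
  # the probe is failed after 2 seconds regardless of the -T 10 passed to wget.
  timeoutSeconds: 10
```

Because the two wget calls run sequentially, the worst-case duration of the command can exceed a single wget timeout, so the probe-level value is the one that ultimately decides when the check fails.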

andrewsykim reviewed Oct 17, 2025

kevin85421 assigned andrewsykim Oct 18, 2025
