@morotti morotti commented Oct 17, 2025

Why are these changes needed?

Hello,

The 2 second timeout on liveness probes is way too short. It is causing cascading failures when the container is busy and cannot reply immediately. This is especially bad if you have CPU limits configured on the Ray pods, which restrict how much CPU the container can use.

In addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. You should NEVER have a timeout below 5 seconds in any production software.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested in CI (hopefully your test suite will run once the PR is opened ? :) )
    • This PR has been tested in production

the 2 second timeout on liveness probes is way too short. it is causing cascading failures when the container is busy and cannot reply immediately. this is especially bad if you have cpu limits configured on the ray pods.

in addition to that, TCP takes 2-3 seconds to detect a lost packet and retry. you should NEVER have a timeout below 5 seconds in any production software.

Signed-off-by: morotti <[email protected]>

morotti commented Oct 17, 2025

Screenshot from production, this is your ray software failing in production ;)

You can see the Ray pod has been up for a few hours and is failing checks sporadically because the timeout is too short.

image

andrewsykim (Member) commented

I've encountered similar issues in the past from my testing, see #2355

We also increased the exec probe timeout for Head pod to 5s, so I am also open to increasing to 5s in worker pods #2353

andrewsykim (Member) commented

Longer term, we really need to remove the dependency on exec probes. I believe that once we are using HTTP probes, we can use shorter timeouts with significantly better reliability. There's a PR for using HTTP probes (#2360); however, it's blocked on Ray unifying health check endpoints: ray-project/ray#56204
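
For illustration only, an HTTP-based probe in a pod spec might look roughly like the sketch below. It reuses the raylet health path and port that appear in the exec command in this PR; the timing values are assumptions, not what #2360 proposes. An `httpGet` probe only inspects the HTTP status code and targets a single endpoint, so it cannot reproduce the `grep success` body check or the chained raylet and GCS checks of the exec probe as-is.

```yaml
# Hypothetical sketch, not the manifest from #2360.
livenessProbe:
  httpGet:
    # Path and port taken from the exec command in this PR.
    path: /api/local_raylet_healthz
    port: 52365
  # Illustrative values (assumptions), chosen only to show the fields.
  timeoutSeconds: 2
  periodSeconds: 5
  failureThreshold: 3
```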

  - bash
  - -c
- - wget -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
+ - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
Member

it's not enough to increase the wget timeout, you also need to increase the probe timeout (timeoutSeconds) in the container's probe config
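
To make that point concrete, here is a minimal sketch of an exec liveness probe where both timeouts are set. The command is the one from the diff above; the `timeoutSeconds`, `periodSeconds`, and `failureThreshold` values are illustrative assumptions, not the project's defaults. Kubernetes defaults `timeoutSeconds` to 1, and on clusters that enforce exec probe timeouts the probe is failed once that limit is hit, whatever `-T` value wget was given.

```yaml
# Hypothetical sketch of a container probe; values are assumptions.
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - wget -T 10 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success && wget -T 10 -q -O- http://localhost:8443/api/gcs_healthz | grep success
  # Probe-level limit for the whole command. It should be at least as long
  # as the time the chained wget calls are allowed to take (illustrative value).
  timeoutSeconds: 20
  periodSeconds: 30
  failureThreshold: 3
```

Because the two wget calls run one after the other, the probe-level timeout has to cover their combined worst case rather than a single `-T` value.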

Member

btw, this is just an example manifest, you need to update the controller logic if you want to change default behavior
