[🐛 Bug]: Nodes stuck in DRAINING #2654
Comments
@farioas, thank you for creating this issue. We will troubleshoot it as soon as we can.

Info for maintainers: triage this issue by using labels.

- If information is missing, add a helpful comment and then the corresponding label.
- If the issue is a question, add the corresponding label.
- If the issue is valid but there is no time to troubleshoot it, consider adding the corresponding label.
- If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable label.
- After troubleshooting the issue, please add the corresponding label.

Thank you!
Is this a Deployment or a Job?
It's a Job.
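For context, this refers to the chart's KEDA scaling type; with job-based scaling, KEDA creates one Node pod per queued session. A minimal sketch of the relevant values, assuming the chart's documented autoscaling options:

```yaml
# Minimal sketch: job-based KEDA autoscaling in the selenium-grid chart.
# With scalingType "job", KEDA creates a ScaledJob so each queued session
# gets its own Node pod, which is expected to drain after the session ends.
autoscaling:
  enabled: true
  scalingType: job      # the alternative is "deployment" (ScaledObject)
```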
I think this is expected behavior. That message is there to inform you that max-sessions has been reached and the Node status is switching from UP to DRAINING. The Node will be drained once the current session is completed.
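For anyone hitting this later, the drain-after-one-session behavior is something the Node itself is configured to do. Below is a minimal sketch of a Node container wired for it; the pod and service names are illustrative, while SE_DRAIN_AFTER_SESSION_COUNT and the event-bus variables are standard docker-selenium node settings:

```yaml
# Minimal sketch (names are illustrative): a Node that drains itself after
# exactly one session. SE_DRAIN_AFTER_SESSION_COUNT corresponds to the Grid
# Node's --drain-after-session-count option, so after the first session
# finishes the Node goes UP -> DRAINING and then shuts down.
apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-chrome        # illustrative pod name
spec:
  containers:
    - name: node-chrome
      image: selenium/node-chrome:4.28.1-20250202
      env:
        - name: SE_EVENT_BUS_HOST
          value: selenium-hub       # illustrative Hub service name
        - name: SE_EVENT_BUS_PUBLISH_PORT
          value: "4442"
        - name: SE_EVENT_BUS_SUBSCRIBE_PORT
          value: "4443"
        - name: SE_DRAIN_AFTER_SESSION_COUNT
          value: "1"
```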
Probably the Node could not reach the Hub status endpoint to double-check the NodeId there. Can you exec into a Pod and run printenv to dump all container env vars so I can check further?
I already tried to fight this strange behavior of the Hub, where it kills itself because of:
This worked for a while:
After a Hub restart, old Nodes (that were provisioned but hadn't received a task) do not reconnect to the grid, while KEDA still sees all the provisioned Nodes and does nothing, so we end up with an overfilled queue and idle provisioned Nodes. For now I was able to work around this by switching to distributed mode with Redis as an external data store.
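For reference, that workaround (fully distributed components plus Redis as the external session-map store) is expressed through chart values roughly like the following. Only `isolateComponents` is taken from the chart's documented top-level options; the externalDatastore keys are illustrative assumptions, so check the chart's values.yaml for the exact structure in 0.39.x:

```yaml
# Illustrative sketch only. `isolateComponents` is a documented chart
# option; the externalDatastore keys below are assumptions meant to show
# the shape of the workaround, not the exact schema of the chart.
isolateComponents: true            # run Router/Distributor/SessionMap/etc. as separate Deployments
components:
  sessionMap:
    externalDatastore:
      enabled: true                # assumed key: back the SessionMap with an external store
      backend: redis               # assumed key: use Redis instead of in-memory session storage
```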
That is the liveness probe in the Hub; it will fail once the queue is up but the number of sessions stays at 0 for a while until
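To make that concrete, the probe is conceptually something like the exec probe below: it asks the Grid's GraphQL endpoint for the session count and the queue size, and fails when requests keep queueing while no sessions start. This is a simplified sketch, not the chart's actual probe script; the thresholds and the assumption that curl and jq are available in the Hub image are illustrative.

```yaml
# Simplified sketch of the idea behind the Hub liveness check (not the
# chart's actual script). The GraphQL fields sessionCount and
# sessionQueueSize are part of the Grid's GraphQL schema; thresholds and
# tooling (curl, jq) are assumptions for illustration.
livenessProbe:
  exec:
    command:
      - bash
      - -c
      - |
        RESPONSE=$(curl -s -X POST http://localhost:4444/graphql \
          -H 'Content-Type: application/json' \
          -d '{"query":"{ grid { sessionCount, sessionQueueSize } }"}')
        SESSIONS=$(echo "$RESPONSE" | jq -r '.data.grid.sessionCount')
        QUEUE=$(echo "$RESPONSE" | jq -r '.data.grid.sessionQueueSize')
        # Fail the probe when requests are queued but nothing is running;
        # after failureThreshold consecutive failures the Hub is restarted.
        if [ "$QUEUE" -gt 0 ] && [ "$SESSIONS" -eq 0 ]; then exit 1; fi
        exit 0
  initialDelaySeconds: 60
  periodSeconds: 60
  failureThreshold: 3
```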
I also discovered a scenario where the Hub gets restarted (e.g. it took over 1 minute to become fully ready). In the meantime, Node pods are scaled up and start registration, the
Even with isolated components, the Distributor still restarted:
@VietND96 If you're interested, I can send you the logs from the Distributor and Router for the past few hours to your email for troubleshooting.
Can you also share the values file that was used to deploy the setup (just the set of configs that differ from the default values.yaml)?
I just saw that your issue seems similar to #2655, where the Distributor and Router get rebooted.
I sent my logs to your email. Here's my override of the default values.yaml:
webdriver init snippet:
What happened?
I'm using KEDA with drain after session count = 1.
After session execution the Node gets stuck and won't deprovision.
At the same time I see these messages in the grid:
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
EKS
Docker Selenium version (image tag)
4.28.1-20250202
Selenium Grid chart version (chart version)
0.39.2