
[🐛 Bug]: Nodes stuck in DRAINING #2654

Open

farioas opened this issue Feb 12, 2025 · 13 comments

@farioas commented Feb 12, 2025

What happened?

I'm using KEDA with drain after session count = 1.

After the session finishes, the node gets stuck in DRAINING and won't deprovision.
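For context, a minimal sketch of how this drain setting is typically wired up in the chart's node values, assuming the SE_DRAIN_AFTER_SESSION_COUNT environment variable (which maps to the node's --drain-after-session-count flag; the exact key layout may differ per chart version):

chromeNode:
  extraEnvironmentVariables:
    # Node drains itself after completing this many sessions
    # (assumption: this env var is honored by the node startup scripts)
    - name: SE_DRAIN_AFTER_SESSION_COUNT
      value: "1"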

At the same time, I see these messages in the grid:

2025-02-12 17:50:57,374 DEBUG [Probe.Liveness] - Session Queue Size: 60, Session Count: 0, Max Session: 0
2025-02-12 17:50:57,375 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

Command used to start Selenium Grid with Docker (or Kubernetes)

Helm chart 0.39.2

Relevant log output

16:31:38.418 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:43.421 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:46.218 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
16:31:48.445 INFO [LocalNode.newSession] - Session created by the Node. Id: d712a9e4ffc7956978a907188bd423cd, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 133.0.6943.53, chrome: {chromedriverVersion: 133.0.6943.53 (9a80935019b0..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:40295}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://selenium-grid.XXXX..., se:cdpVersion: 133.0.6943.53, se:containerName: selenium-grid-selenium-node..., se:noVncPort: 7900, se:vnc: wss://selenium-grid.appx.hu..., se:vncEnabled: true, se:vncLocalAddress: ws://10.1.118.22:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
2025-02-12 16:32:13,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:18,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:23,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:28,931 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:33,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:36,334 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:43,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:48,730 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:53,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:58,129 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:03,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:07,628 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:13,355 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:19,381 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:23,336 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:29,363 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:33,359 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:39,389 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:43,332 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:49,359 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:53,346 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:56,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:03,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:09,031 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:13,347 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:17,729 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:23,327 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:26,732 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:33,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:36,728 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:43,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:46,835 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:53,354 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:57,727 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:03,372 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:06,731 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:13,325 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:16,430 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:23,345 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:26,432 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:33,304 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:37,327 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:43,334 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:46,734 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:53,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:56,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:06,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:13,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:17,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:23,302 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:26,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:33,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:38,034 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:43,320 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:47,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:53,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:56,927 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:03,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:06,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:13,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:17,236 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:23,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:28,130 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:33,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:37,428 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:43,362 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:46,930 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:53,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:56,526 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:07,530 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:13,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:17,226 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:23,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:26,527 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:33,343 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:38,127 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:43,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:48,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:53,301 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:57,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:03,310 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:39:06,336 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:13,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING

Operating System

EKS

Docker Selenium version (image tag)

4.28.1-20250202

Selenium Grid chart version (chart version)

0.39.2


@farioas, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then add the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96 (Member) commented

Is this a Deployment or a Job?

@farioas (Author) commented Feb 13, 2025

It's a Job.

@VietND96 (Member) commented

I think this is expected behavior.
The message

16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.

informs that max-sessions has been reached and the Node status is switching from UP to DRAINING. The Node will be drained once the current session completes.
I am not sure whether the Node still had a session ongoing at this point. I will probably update the probe to also print the number of ongoing sessions together with the Node status.
Regarding the message

Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.

Most likely the Node could not reach the Hub status endpoint to double-check the NodeId there. Can you run printenv in one of the Node pods and share all container env vars for me to check further?

@farioas (Author) commented Feb 18, 2025

I have already tried to work around this strange Hub behavior, where it kills itself because of:

It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

This worked for a while:

hub:
  livenessProbe:
    failureThreshold: 50

After a Hub restart, old nodes (that were provisioned but hadn't received a task) do not reconnect to the grid, yet KEDA still sees all the provisioned nodes and does nothing, so we end up with an overfilled queue and idle provisioned nodes.

For now, I was able to resolve this behavior by switching to distributed mode with Redis as an external data store (see the sketch below).
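For reference, the distributed-mode switch in the chart values looks roughly like this; a sketch reusing the same keys that appear in the full override shared later in this thread:

isolateComponents: true

redis:
  enabled: true

components:
  sessionMap:
    # Back the session map with Redis instead of in-memory state
    externalDatastore:
      enabled: true
      backend: redis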

@VietND96 (Member) commented

That is the liveness probe in the Hub: it fails once the queue is non-empty but the session count stays at 0 for a while, until failureThreshold is reached.
From your description, I understand the issue is that the Hub gets restarted. After the restart, existing Nodes cannot register back to the Hub, so no Node is attached to the Hub even though plenty of Node pods are still running. That confuses the KEDA scaler into thinking there are already enough nodes, so no further scaling happens.

@VietND96 (Member) commented

I also discovered a scenario where the Hub gets restarted and takes a while to come back (e.g., over 1 minute to become fully ready). In the meantime, Node pods are scaled up and start registration. The register-period decides how long a Node will keep trying to register with the Distributor; after that, the Node will not attempt to register again. In the Helm chart, if the Node startup probe is enabled, the probe period is used as that register period (around 60s as I see). If the Hub comes back after this period, the Node pod will keep running and do nothing, which leads to the scenario above where autoscaling works incorrectly.
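A possible mitigation, sketched below: lengthen the node's registration window so nodes keep retrying until a slowly restarting Hub is back. This assumes SE_OPTS is passed through to the node's start command (--register-period and --register-cycle are standard Grid node options; the values are illustrative):

chromeNode:
  extraEnvironmentVariables:
    # Retry registration for up to 5 minutes, once every 10 seconds
    # (illustrative values; SE_OPTS pass-through is an assumption)
    - name: SE_OPTS
      value: "--register-period 300 --register-cycle 10"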

@farioas (Author) commented Feb 18, 2025

Even with isolated components, the Distributor still gets restarted:

2025-02-17 03:10:22,642 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:11:22,650 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:12:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:12:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:13:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:13:22,663 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:14:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:14:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:15:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:15:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:16:22,669 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:16:22,670 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:17:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:17:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:18:22,653 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:18:22,654 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:19:22,660 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:19:22,661 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:20:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:20:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:21:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:21:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:22:22,682 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:22:22,683 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:23:22,668 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:23:22,669 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:24:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:24:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:25:22,671 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:25:22,672 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:26:22,712 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:26:22,713 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:27:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:27:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:28:22,684 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:28:22,685 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:29:22,674 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:29:22,675 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:30:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:30:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:31:22,705 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:31:22,707 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:32:22,656 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:32:22,657 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:33:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:33:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:34:22,667 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:34:22,668 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:35:22,676 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:35:22,677 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:36:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:36:22,687 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:37:22,692 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:37:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:38:22,670 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:38:22,671 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:39:22,697 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:39:22,699 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:40:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:40:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:41:22,693 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:41:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2025-02-17 03:41:23,517 WARN received SIGTERM indicating exit request
2025-02-17 03:41:23,518 INFO waiting for selenium-grid-distributor to die
2025-02-17 03:41:24,519 WARN stopped: selenium-grid-distributor (terminated by SIGTERM)

@farioas (Author) commented Feb 20, 2025

Today it happened again, and I observed strange resource consumption in the Distributor and Router.
The Distributor consumed 2 cores:
[screenshot: Distributor CPU usage]
The Router consumed 6 cores:
[screenshot: Router CPU usage]

At the same time, KEDA provisioned 200 pods, which is expected for nightly testing:

2025-02-20T04:17:03Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 200}

But none of these pods were utilized by the Hub.

Router logs:

Feb 20, 2025 3:41:11 AM org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler execute
WARNING: Unable to execute request: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:424)
	at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54)
	at org.openqa.selenium.grid.security.AddSecretFilter.lambda$apply$0(AddSecretFilter.java:40)
	at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55)
	at org.openqa.selenium.grid.sessionqueue.remote.RemoteNewSessionQueue.addToQueue(RemoteNewSessionQueue.java:110)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.lambda$new$0(NewSessionQueue.java:68)
	at org.openqa.selenium.remote.http.Route$TemplatizedRoute.handle(Route.java:192)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.execute(NewSessionQueue.java:128)
	at org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:87)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.router.Router.execute(Router.java:89)
	at org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$NestedRoute.handle(Route.java:270)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.security.BasicAuthenticationFilter.lambda$apply$0(BasicAuthenticationFilter.java:54)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:49)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2874)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	... 3 more

Distributor logs:

Exception in thread "Local Distributor - Session Creation" org.openqa.selenium.WebDriverException: Command timed out after 1 minute(s)
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.127-135.201.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at jdk.internal.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
	at org.openqa.selenium.remote.ErrorCodec.decode(ErrorCodec.java:167)
	at org.openqa.selenium.grid.web.Values.get(Values.java:47)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.makeRequest(RemoteSessionMap.java:118)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.add(RemoteSessionMap.java:82)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor.newSession(LocalDistributor.java:598)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.handleNewSessionRequest(LocalDistributor.java:861)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.lambda$run$1(LocalDistributor.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@farioas (Author) commented Feb 20, 2025

@VietND96 If you're interested, I can send you the Distributor and Router logs from the past few hours by email for troubleshooting.

@VietND96 (Member) commented

Can you also share the values file that was used to deploy the setup (just the set of configs that differ from the default values.yaml)?
And also the code where you initialize RemoteWebDriver (including all browser args and capabilities) so I can reproduce?

@VietND96 (Member) commented

I just noticed that your issue seems similar to #2655, where the Distributor and Router get rebooted.

@farioas (Author) commented Feb 20, 2025

I sent my logs to your email.

Here's my override of the default values.yaml:

global:
  seleniumGrid:
    stdoutProbeLog: true
    nodesImageTag: nightly # Pull the latest stable chrome image all the time (other options: dev | beta)

ingress:
  hostname: my.app.host
  nginx:
    useHttp2: false
  tls:
    - secretName: my.app.host-tls-secret
      hosts:
        - my.app.host

autoscaling:
  patchObjectFinalizers:
    enabled: false # doesn't work with argocd
  annotations: ! # remove helm hooks
  enableWithExistingKEDA: true
  scaledJobOptions:
    scalingStrategy:
      strategy: default
  scaledOptions:
    minReplicaCount: 0
    maxReplicaCount: 200
    pollingInterval: 10

chromeNode:
  enabled: true
  annotations:
    karpenter.sh/do-not-disrupt: "true" # protect from karpenter evict
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: {}
  extraEnvironmentVariables:
    - name: SE_BROWSER_ARGS_DISABLE_FEATURES
      value: "--disable-features=OptimizationGuideModelDownloading,OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints"
    - name: SE_BROWSER_ARGS_DISABLE_SEARCH_ENGINE
      value: "--disable-search-engine-choice-screen"
    - name: SE_JAVA_OPTS
      value: "-Dselenium.LOGGER.level=SEVERE -Dwebdriver.chrome.whitelistedIps="
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "1800"
    - name: SE_NODE_ENABLE_MANAGED_DOWNLOADS
      value: "true"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    unsafeSsl: 'false'
    platformName: "Linux"

firefoxNode:
  enabled: true
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: { }
  extraEnvironmentVariables:
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    platformName: "Linux"

edgeNode:
  enabled: false

redis:
  enabled: true

isolateComponents: true
components:
  sessionMap:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
    externalDatastore:
      enabled: true
      backend: redis
  sessionQueue:
    extraEnvironmentVariables:
      - name: SE_SESSION_REQUEST_TIMEOUT
        value: "1800"
  router:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
  distributor:
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu:
        memory: "8Gi"

basicAuth:
  enabled: true
  username: user
  password: 123456
  embeddedUrl: true

WebDriver init snippet:

<!-- pom.xml dependency -->
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>5.9.2</version>
    <scope>test</scope>
</dependency>
// Imports assumed for a self-contained snippet
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import io.github.bonigarcia.wdm.WebDriverManager;

ChromeOptions chromeOpts = new ChromeOptions();
chromeOpts.setCapability("browserName", "chrome");
chromeOpts.setCapability("platformName", "Linux");

try {
    // Route downloads to the shared emptyDir volume mounted in the node pod
    Map<String, Object> prefsMap = new HashMap<>();
    prefsMap.put("profile.default_content_settings.popups", 0);
    prefsMap.put("download.default_directory", "/home/seluser/Downloads/");
    chromeOpts.setExperimentalOption("prefs", prefsMap);

    // WebDriverManager creates a RemoteWebDriver against the grid URL
    WebDriver driver = WebDriverManager.chromedriver()
            .remoteAddress(new URL(gridUrl))
            .capabilities(chromeOpts)
            .create();

    driver.manage().window().maximize();
    // FileHandler is a project-specific helper for managed downloads
    FileHandler.setCurrentDownloadFolderName(driver, downloadFolderName);
    FileHandler.setCurrentDownloadFolderPath(driver, downloadFilepath);

    return driver;

} catch (MalformedURLException e) {
    throw new RuntimeException(e);
}
