
[🐛 Bug]: Nodes stuck in DRAINING #2654

Open

farioas opened this issue Feb 12, 2025 · 13 comments

@farioas commented Feb 12, 2025

What happened?

I'm using KEDA with drain after session count = 1.

After the session finishes, the node gets stuck in DRAINING and won't deprovision.
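For context, a minimal sketch of how this drain setting is typically wired up in the chart's node values, assuming the SE_DRAIN_AFTER_SESSION_COUNT environment variable (which maps to the node's --drain-after-session-count flag; the exact key layout may differ per chart version):

chromeNode:
  extraEnvironmentVariables:
    # Node drains itself after completing this many sessions
    # (assumption: this env var is honored by the node startup scripts)
    - name: SE_DRAIN_AFTER_SESSION_COUNT
      value: "1"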

At the same time, I see these messages in the grid:

2025-02-12 17:50:57,374 DEBUG [Probe.Liveness] - Session Queue Size: 60, Session Count: 0, Max Session: 0
2025-02-12 17:50:57,375 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

Command used to start Selenium Grid with Docker (or Kubernetes)

Helm chart 0.39.2

Relevant log output

16:31:38.418 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:43.421 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:46.218 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
16:31:48.445 INFO [LocalNode.newSession] - Session created by the Node. Id: d712a9e4ffc7956978a907188bd423cd, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 133.0.6943.53, chrome: {chromedriverVersion: 133.0.6943.53 (9a80935019b0..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:40295}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://selenium-grid.XXXX..., se:cdpVersion: 133.0.6943.53, se:containerName: selenium-grid-selenium-node..., se:noVncPort: 7900, se:vnc: wss://selenium-grid.appx.hu..., se:vncEnabled: true, se:vncLocalAddress: ws://10.1.118.22:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
2025-02-12 16:32:13,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:18,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:23,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:28,931 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:33,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:36,334 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:43,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:48,730 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:53,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:58,129 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:03,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:07,628 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:13,355 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:19,381 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:23,336 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:29,363 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:33,359 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:39,389 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:43,332 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:49,359 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:53,346 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:56,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:03,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:09,031 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:13,347 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:17,729 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:23,327 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:26,732 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:33,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:36,728 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:43,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:46,835 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:53,354 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:57,727 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:03,372 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:06,731 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:13,325 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:16,430 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:23,345 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:26,432 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:33,304 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:37,327 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:43,334 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:46,734 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:53,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:56,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:06,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:13,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:17,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:23,302 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:26,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:33,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:38,034 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:43,320 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:47,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:53,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:56,927 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:03,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:06,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:13,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:17,236 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:23,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:28,130 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:33,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:37,428 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:43,362 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:46,930 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:53,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:56,526 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:07,530 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:13,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:17,226 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:23,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:26,527 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:33,343 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:38,127 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:43,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:48,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:53,301 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:57,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:03,310 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:39:06,336 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:13,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING

Operating System

EKS

Docker Selenium version (image tag)

4.28.1-20250202

Selenium Grid chart version (chart version)

0.39.2


@farioas, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then add the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96 (Member) commented

Is this a Deployment or a Job?

@farioas (Author) commented Feb 13, 2025

It's a Job.

@VietND96 (Member) commented

I think this is expected behavior.
The message

16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.

informs that max-sessions has been reached and the Node status is switching from UP to DRAINING. The Node will be drained once the current session completes.
I am not sure whether the Node still had a session ongoing at this point. I will probably update the probe to also print the number of ongoing sessions together with the Node status.
Regarding the message

Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.

Most likely the Node could not reach the Hub status endpoint to double-check the NodeId there. Can you run printenv in one of the Node pods and share all container env vars for me to check further?

@farioas (Author) commented Feb 18, 2025

I have already tried to work around this strange Hub behavior, where it kills itself because of:

It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

This worked for a while:

hub:
  livenessProbe:
    failureThreshold: 50

After a Hub restart, old nodes (that were provisioned but hadn't received a task) do not reconnect to the grid, yet KEDA still sees all the provisioned nodes and does nothing, so we end up with an overfilled queue and idle provisioned nodes.

For now, I was able to resolve this behavior by switching to distributed mode with Redis as an external data store (see the sketch below).
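For reference, the distributed-mode switch in the chart values looks roughly like this; a sketch reusing the same keys that appear in the full override shared later in this thread:

isolateComponents: true

redis:
  enabled: true

components:
  sessionMap:
    # Back the session map with Redis instead of in-memory state
    externalDatastore:
      enabled: true
      backend: redis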

@VietND96 (Member) commented

That is the liveness probe in the Hub: it fails once the queue is non-empty but the session count stays at 0 for a while, until failureThreshold is reached.
From your description, I understand the issue is that the Hub gets restarted. After the restart, existing Nodes cannot register back to the Hub, so no Node is attached to the Hub even though plenty of Node pods are still running. That confuses the KEDA scaler into thinking there are already enough nodes, so no further scaling happens.

@VietND96 (Member) commented

I also discovered a scenario where the Hub gets restarted and takes a while to come back (e.g., over 1 minute to become fully ready). In the meantime, Node pods are scaled up and start registration. The register-period decides how long a Node will keep trying to register with the Distributor; after that, the Node will not attempt to register again. In the Helm chart, if the Node startup probe is enabled, the probe period is used as that register period (around 60s as I see). If the Hub comes back after this period, the Node pod will keep running and do nothing, which leads to the scenario above where autoscaling works incorrectly.
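A possible mitigation, sketched below: lengthen the node's registration window so nodes keep retrying until a slowly restarting Hub is back. This assumes SE_OPTS is passed through to the node's start command (--register-period and --register-cycle are standard Grid node options; the values are illustrative):

chromeNode:
  extraEnvironmentVariables:
    # Retry registration for up to 5 minutes, once every 10 seconds
    # (illustrative values; SE_OPTS pass-through is an assumption)
    - name: SE_OPTS
      value: "--register-period 300 --register-cycle 10"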

@farioas (Author) commented Feb 18, 2025

Even with isolated components, the Distributor still gets restarted:

2025-02-17 03:10:22,642 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:11:22,650 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:12:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:12:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:13:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:13:22,663 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:14:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:14:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:15:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:15:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:16:22,669 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:16:22,670 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:17:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:17:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:18:22,653 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:18:22,654 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:19:22,660 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:19:22,661 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:20:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:20:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:21:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:21:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:22:22,682 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:22:22,683 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:23:22,668 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:23:22,669 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:24:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:24:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:25:22,671 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:25:22,672 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:26:22,712 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:26:22,713 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:27:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:27:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:28:22,684 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:28:22,685 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:29:22,674 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:29:22,675 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:30:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:30:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:31:22,705 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:31:22,707 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:32:22,656 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:32:22,657 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:33:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:33:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:34:22,667 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:34:22,668 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:35:22,676 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:35:22,677 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:36:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:36:22,687 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:37:22,692 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:37:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:38:22,670 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:38:22,671 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:39:22,697 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:39:22,699 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:40:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:40:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:41:22,693 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:41:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2025-02-17 03:41:23,517 WARN received SIGTERM indicating exit request
2025-02-17 03:41:23,518 INFO waiting for selenium-grid-distributor to die
2025-02-17 03:41:24,519 WARN stopped: selenium-grid-distributor (terminated by SIGTERM)

@farioas (Author) commented Feb 20, 2025

Today it happened again, and I observed strange resource consumption in the Distributor and Router.
The Distributor consumed 2 cores:
[screenshot: Distributor CPU usage]
The Router consumed 6 cores:
[screenshot: Router CPU usage]

At the same time, KEDA provisioned 200 pods, which is expected for nightly testing:

2025-02-20T04:17:03Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 200}

But none of these pods were utilized by the Hub.

Router logs:

Feb 20, 2025 3:41:11 AM org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler execute
WARNING: Unable to execute request: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:424)
	at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54)
	at org.openqa.selenium.grid.security.AddSecretFilter.lambda$apply$0(AddSecretFilter.java:40)
	at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55)
	at org.openqa.selenium.grid.sessionqueue.remote.RemoteNewSessionQueue.addToQueue(RemoteNewSessionQueue.java:110)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.lambda$new$0(NewSessionQueue.java:68)
	at org.openqa.selenium.remote.http.Route$TemplatizedRoute.handle(Route.java:192)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.execute(NewSessionQueue.java:128)
	at org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:87)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.router.Router.execute(Router.java:89)
	at org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$NestedRoute.handle(Route.java:270)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.security.BasicAuthenticationFilter.lambda$apply$0(BasicAuthenticationFilter.java:54)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:49)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2874)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	... 3 more

Distributor logs:

Exception in thread "Local Distributor - Session Creation" org.openqa.selenium.WebDriverException: Command timed out after 1 minute(s)
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.127-135.201.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at jdk.internal.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
	at org.openqa.selenium.remote.ErrorCodec.decode(ErrorCodec.java:167)
	at org.openqa.selenium.grid.web.Values.get(Values.java:47)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.makeRequest(RemoteSessionMap.java:118)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.add(RemoteSessionMap.java:82)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor.newSession(LocalDistributor.java:598)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.handleNewSessionRequest(LocalDistributor.java:861)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.lambda$run$1(LocalDistributor.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@farioas (Author) commented Feb 20, 2025

@VietND96 If you're interested, I can send you the Distributor and Router logs from the past few hours by email for troubleshooting.

@VietND96 (Member) commented

Can you also share the values file that was used to deploy the setup (just the set of configs that differ from the default values.yaml)?
And also the code where you initialize RemoteWebDriver (including all browser args and capabilities) so I can reproduce?

@VietND96 (Member) commented

I just noticed that your issue seems similar to #2655, where the Distributor and Router get rebooted.

@farioas (Author) commented Feb 20, 2025

I sent my logs to your email.

Here's my override of the default values.yaml:

global:
  seleniumGrid:
    stdoutProbeLog: true
    nodesImageTag: nightly # Pull the latest stable chrome image all the time (other options: dev | beta)

ingress:
  hostname: my.app.host
  nginx:
    useHttp2: false
  tls:
    - secretName: my.app.host-tls-secret
      hosts:
        - my.app.host

autoscaling:
  patchObjectFinalizers:
    enabled: false # doesn't work with argocd
  annotations: ! # remove helm hooks
  enableWithExistingKEDA: true
  scaledJobOptions:
    scalingStrategy:
      strategy: default
  scaledOptions:
    minReplicaCount: 0
    maxReplicaCount: 200
    pollingInterval: 10

chromeNode:
  enabled: true
  annotations:
    karpenter.sh/do-not-disrupt: "true" # protect from karpenter evict
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: {}
  extraEnvironmentVariables:
    - name: SE_BROWSER_ARGS_DISABLE_FEATURES
      value: "--disable-features=OptimizationGuideModelDownloading,OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints"
    - name: SE_BROWSER_ARGS_DISABLE_SEARCH_ENGINE
      value: "--disable-search-engine-choice-screen"
    - name: SE_JAVA_OPTS
      value: "-Dselenium.LOGGER.level=SEVERE -Dwebdriver.chrome.whitelistedIps="
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "1800"
    - name: SE_NODE_ENABLE_MANAGED_DOWNLOADS
      value: "true"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    unsafeSsl: 'false'
    platformName: "Linux"

firefoxNode:
  enabled: true
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: { }
  extraEnvironmentVariables:
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    platformName: "Linux"

edgeNode:
  enabled: false

redis:
  enabled: true

isolateComponents: true
components:
  sessionMap:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
    externalDatastore:
      enabled: true
      backend: redis
  sessionQueue:
    extraEnvironmentVariables:
      - name: SE_SESSION_REQUEST_TIMEOUT
        value: "1800"
  router:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
  distributor:
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu:
        memory: "8Gi"

basicAuth:
  enabled: true
  username: user
  password: 123456
  embeddedUrl: true

WebDriver init snippet:

<!-- pom.xml dependency -->
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>5.9.2</version>
    <scope>test</scope>
</dependency>
// Imports assumed for a self-contained snippet
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import io.github.bonigarcia.wdm.WebDriverManager;

ChromeOptions chromeOpts = new ChromeOptions();
chromeOpts.setCapability("browserName", "chrome");
chromeOpts.setCapability("platformName", "Linux");

try {
    // Route downloads to the shared emptyDir volume mounted in the node pod
    Map<String, Object> prefsMap = new HashMap<>();
    prefsMap.put("profile.default_content_settings.popups", 0);
    prefsMap.put("download.default_directory", "/home/seluser/Downloads/");
    chromeOpts.setExperimentalOption("prefs", prefsMap);

    // WebDriverManager creates a RemoteWebDriver against the grid URL
    WebDriver driver = WebDriverManager.chromedriver()
            .remoteAddress(new URL(gridUrl))
            .capabilities(chromeOpts)
            .create();

    driver.manage().window().maximize();
    // FileHandler is a project-specific helper for managed downloads
    FileHandler.setCurrentDownloadFolderName(driver, downloadFolderName);
    FileHandler.setCurrentDownloadFolderPath(driver, downloadFilepath);

    return driver;

} catch (MalformedURLException e) {
    throw new RuntimeException(e);
}
