Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow initial connection #1200

Open
oliver-sanders opened this issue Jan 10, 2023 · 22 comments
Open

slow initial connection #1200

oliver-sanders opened this issue Jan 10, 2023 · 22 comments
Labels
bug Something isn't working investigation
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Jan 10, 2023

Updates

  • Issue where the websocket connection is closed before it is succseefully opened.
  • Observed in Firefox, Edge and Chrome so far with the following messages:
    • WebSocket connection to '' failed: WebSocket is closed before the connection is established.
    • Firefox can’t establish a connection to the server at ws://.
  • Observed on Linux and Windows.
  • The websocket gets closed with error code 1006 and a blank reason. This is the same result as killing the UIS.
  • There is a related timeout issue which can present similarly, add debugging into the subscriptions-transport-ws code to tell the difference.
  • These failed requests do not reach the Tornado websocket implementation.

OP

Sometimes it takes a long time for the UI to connect with the backend.

It seems to get there in the end but it often takes in excess of 45 seconds to establish the subscription connection. The issue occurs randomly when the page is loaded and causes the no-connection error box hangs around.

Whilst we are waiting for the connection this message appears multiple times (Firefox):

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

Which suggests multiple failed attempts at connection, however, after jamming some logging into the UIS (backend server) I can see no evidence of connection attempts (at least none which actually get into the websocket code). The UIS responds quickly once it receives the request.

It's not clear what's causing this, however, it is worth noting that we are using a deprecated framework for subscriptions, we should try moving to something more modern and see if the problem goes away by itself - #1028.

@oliver-sanders oliver-sanders added the bug Something isn't working label Jan 10, 2023
@oliver-sanders oliver-sanders added this to the 1.5.0 milestone Jan 10, 2023
@MetRonnie MetRonnie modified the milestones: 1.5.0, 1.6.0 Jan 25, 2023
@oliver-sanders oliver-sanders modified the milestones: 1.6.0, 2.0.0 Apr 24, 2023
@MetRonnie
Copy link
Member

MetRonnie commented Apr 28, 2023

I don't think I have ever encountered this on Chrome

@oliver-sanders
Copy link
Member Author

oliver-sanders commented May 2, 2023

Try FF/VDI, the lag can be up to one minute.

@oliver-sanders
Copy link
Member Author

I've replicated this with offline mode on the Vue 3 branch :(

@MetRonnie
Copy link
Member

Assuming this is just an issue on VDI, what could be the cause?

@oliver-sanders
Copy link
Member Author

I'm assuming this is a general issue exacerbated by a laggy platform, you can see multiple connection attempts in the console.

@hjoliver
Copy link
Member

(What's VDI - virtual desktop?)

@oliver-sanders
Copy link
Member Author

I've done some investigation into this but have come out empty handed.

I've also tried migrating from graphql-transport-ws to graphql-ws, however, this did not fix the issue.

So far the error has only been reported in RedHat builds of Firefox where it presents like so:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

This error will persist across multiple retries, it may last for over one minute before clearing.

This happens randomly on page load, sometimes it connects straight away.

@MetRonnie MetRonnie modified the milestones: 2.1.0, 2.2.0 Sep 7, 2023
@kinow
Copy link
Member

kinow commented Oct 16, 2023

P.s. I used to see this when developing the UI. Never investigated the cause though. Sometimes ffox would keep showing the red icon at the top, showing that I was offline and there would be those js messages.

Other times it was after I opened the browser the first time. It happened with just that five workflow, or with multiple workflows (i.e probably not the amount of data).

However, when I used that fake uiserver (offline mode or random tests in python + pre made request/response) I would not get that.

So my guess would be something between browser, uis, and first xhr and/or wss request. Maybe setting a debugger in some parts of uis graphene or web request handling code that could be busy (handling an old request, waiting for some background task?) would replicate the issue...

@oliver-sanders oliver-sanders modified the milestones: 2.2.0, 2.3.0 Nov 2, 2023
@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 13, 2023

This report on Bugzilla looks very similar:

Websocket ssl connection can be very slow to connect in Firefox

https://bugzilla.mozilla.org/show_bug.cgi?id=1478314

As does this issue:

Firefox takes multiple attempts to connect to websocket resulting in slow UI loads

fluidd-core/fluidd#931

There a bunch of other similar sounding reports out there with varying levels of details. Here are some summarised statements pulled out of these reports:

  • WSS/HTTPS works, WS/HTTP doesn't (x2).
  • Regular cert works, self-signed cert doesn't.
  • Local host IP works, local host domain doesn't.
  • IPv4 works, IPV6 doesn't (x3).
  • Earlier versions of push.js works, later versions don't, (we aren't using push.js).

I've replicated the issue with:

  • HTTP/WS and HTTPS/WSS (self-signed).
  • localhost and 127.0.0.1.

@oliver-sanders
Copy link
Member Author

Issue now replicated on:

  • Firefox (Linux)
  • Firefox (Windows)
  • Chrome (Windows)
  • Edge (Windows)

Chromium error:

WebSocket connection to '<URL>' failed: WebSocket is closed before the connection is established.

Gecko error:

 Firefox can’t establish a connection to the server at ws://localhost:8888/cylc/subscriptions.

@oliver-sanders
Copy link
Member Author

Connection Timeout

After some experimenting, I've seen a couple cases where connections failed on connection timeout. The timeout starts at 1s and increases with each failure. This diff bumps the default 1s connection timeout up to 10s:

diff --git a/src/graphql/index.js b/src/graphql/index.js
index cc252c96..141ee0ef 100644
--- a/src/graphql/index.js
+++ b/src/graphql/index.js
@@ -74,7 +74,9 @@ export function getCylcHeaders () {
 export function createSubscriptionClient (wsUrl, options = {}, wsImpl = null) {
   const opts = Object.assign({
     reconnect: true,
-    lazy: false
+    lazy: false,
+    minTimeout: 10000, // default 1000
+    // timeout: 300000 // default 30000
   }, options)
   const subscriptionClient = new SubscriptionClient(wsUrl, opts, wsImpl)
   // these are the available hooks in the subscription client lifecycle

Which appears to clear connection timeout issues, however, this would appear to be a distraction which does not solve the main issue.

Connection closed before it's established

  • Cylc-ui uses Apollo Client for GraphQL.
  • We use it with subscriptions-transport-ws to provide websocket support.
  • subscriptions-transport-ws package creates and manages the websocket connections.
  • The issue occurs when the websocket's onclose handler (defined by subscriptions-transport-ws) gets called.
  • The onclose method does not appear to be called by our code or subscriptions-transport-ws suggesting it's invoked by the websocket implementation (debugger call stack not especially helpful here)?

Call stack (note line numbers shifted slightly by diff):

Screenshot from 2023-11-20 11-04-27

onclose first argument
close { target: WebSocket, isTrusted: true, wasClean: false, code: 1006, reason: "", srcElement: WebSocket, currentTarget: WebSocket, eventPhase: 2, bubbles: false, cancelable: false, … }
​
bubbles: false
​
cancelBubble: false
​
cancelable: false
​
code: 1006
​
composed: false
​
currentTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​​
binaryType: "blob"
​​
bufferedAmount: 0
​​
extensions: ""
​​
onclose: function onclose(r, i, o)​​
onerror: function onerror(r)​​
onmessage: function onmessage(r)​​
onopen: function onopen()
​​
protocol: ""
​​
readyState: 3
​​
url: "wss://localhost:8888/cylc/subscriptions"
​​
<prototype>: WebSocketPrototype { close: close(), send: send(), url: Getter, … }
​
defaultPrevented: false
​
eventPhase: 2
​
explicitOriginalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
isTrusted: true
​
originalTarget: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
reason: ""
​
returnValue: true
​
srcElement: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
target: WebSocket { url: "wss://localhost:8888/cylc/subscriptions", readyState: 3, bufferedAmount: 0, … }
​
timeStamp: 6989
​
type: "close"
​
wasClean: false
​
<get isTrusted()>: function isTrusted()
​
<prototype>: CloseEventPrototype { wasClean: Getter, code: Getter, reason: Getter, … }

@MetRonnie
Copy link
Member

Looks like I can reproduce the problem by using a custom network profile in Chrome with latency of ≥ 3 secs! This might explain why you in particular keep seeing this, as you are plagued by bad ping times?

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 20, 2023

I think that would cause the connection timeout issue I mentioned above which I think presents in the same way, though should go away after enough repeats (timeouts willing). (note ping time issues shouldn't affect connections to local servers).

To tell the difference, I've been jamming debugging into subscriptions-transport-ws/dist/client.js.

This bit handles connection timeout issues:

438     SubscriptionClient.prototype.checkMaxConnectTimeout = function () {
439         var _this = this;
440         this.clearMaxConnectTimeout();
441         var duration = this.maxConnectTimeGenerator.duration()
442         this.maxConnectTimeoutId = setTimeout(function () {
443             if (_this.status !== _this.wsImpl.OPEN) {
444                 _this.reconnecting = true;
445                 console.warn(`Closed by max connect timeout: ${duration}`)                                                     
446                 _this.close(false, true);
447             }    
448         }, duration);
449     };         

But this one is the real issue:

483         this.client.onclose = function (code, reason, wasClean) {
484             console.log('# onclose', code, reason, wasClean)
485             if (!_this.closedByUser) {
486                 // TODO: this would appear to be the source of the problem!
487                 console.warn('Closed by onclose')
488                 debugger
489                 _this.close(false, false);
490             }    
491         };    

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Nov 20, 2023

From testing with Dave on Friday (who seldom notices the issue), we found evidence of this issue in the console more times than not, i.e. the issue hist most of the time, but isn't noticed unless it persists for long enough. Worth checking the console for these messages.

@MetRonnie

This comment was marked as off-topic.

@oliver-sanders

This comment was marked as off-topic.

@MetRonnie

This comment was marked as off-topic.

@oliver-sanders

This comment was marked as off-topic.

@oliver-sanders
Copy link
Member Author

From manual inspection, the websocket error code is 1006 which is:

Reserved for situations where the connection was closed abnormally without sending a close frame when a status code was expected.

This appears to mean that the connection was closed unexpectedly (no close message), but I'm not sure how it's determining that, is ping-pong established this early on.

Note, the reason field is blank. This exactly matches the response you get when you stop a running server.

@kinow
Copy link
Member

kinow commented Mar 30, 2024

@oliver-sanders ' in this issue show what I can now easily reproduce locally 🎉

I was reading an article on JS and performance for Autosubmit UI, when I noticed the author commented something interesting about browser settings that control the maximum number of connections to a server. The author also mentioned about this being a problem to WebSockets, which is why some people used a service workers to have a single client connecting to the server.

Anyhoo, I opened Firefox, then about:config, and searched for websocket. It returned only 12, with a couple of entries for devtools (these can be discarded).

Everything was working fine in the UI, and network.websocket.max-connections was set to 200 by default. Then I opened a few tabs, reduced it to 30, opened a few more tabs with the Cylc GUI, and voilà

image

I started closing my tabs and... eventually it connected.

image

I can't spend much more time on this issue, but I think

  • this is a plausible explanation, as that reflects the scenario I had when the issue happened.... I was working on several tabs in a browser,
  • the bug happened with complex and simple experiments
  • the bug would go away in some seconds, or if I closed other tabs or the browser and opened it again

Open questions (that I wish I had time to keep digging 🤓)

  • I tried monitoring the connections with lsof and netstat but I got different numbers than 30. I was expecting to find a way to filter by process + server (socket, domain name, etc.).
  • When a WebSocket is closed, does it mean the browser connection (or its object in some pool?) is discarded? When the maxCount variable is decreased?
  • Any way to monitor that number and find a better error message, maybe somewhere in the stack trace it says "Reached Maximum Connections" or something?
  • There are other settings like delays for reconnect, timeouts, would any of that help?
  • Is there anything else on the server side (tornado/jupyter/graphql-ws that could be throttling the max connections as well? i.e. firefox/browser causes the issue, but maybe other parts of the system do that too?, or even linux 🤷‍♂️ )
  • I only know how to manage websockets connections in Ffox, but there should be a similar setting in Chrome/Chromium/Edge/Safari/Opera/etc.

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

Possible solution:

  • Use a SharedWorker (and this SO) in the UI to reuse the WebSockets connection
    • Does it work with current Cylc arch? (I'm using Cylc for small experiments but not following the dev closely)
    • Any security issues when two tabs in the same browser have different users logged in (e.g. Firefox container tabs)?
  • Use some other type of Service Worker?
  • Ask users to manage the number of max connections, especially if they monitor multi experiments in multi tabs, maybe update the UI error message too...
  • Replace WebSocket by something else? (LongPolling, WebRTC, something from HTTP/3? 🤷‍♂️ )

I don't have this issue as we are not using WebSockets in the Autosubmit UI, so this is as far as my curiosity took me 🙂 , hope that helps.

Bruno

@oliver-sanders oliver-sanders modified the milestones: 2.5.0, 2.6.0 Apr 24, 2024
@ColemanTom
Copy link

Just commenting (and following) that I see this a lot too. But, I haven't seen this error message mentioned, so I'll add it (it always happens after the "can't establish a connection to the server" line.

The connection to wss://..../cylc/subscriptions was interrupted while the page was loading.

@ColemanTom
Copy link

So if anyone is experiencing this a lot, and if you have a lot of tabs, go to about:config and increase that number (assuming you consider it safe as the server & client memory would be affected).

It isn't just if you have many tabs. I have a single tab open and this happens. The page eventually loads, but it is slow.

@oliver-sanders oliver-sanders modified the milestones: 2.6.0, 2.7.0 Oct 9, 2024
@oliver-sanders oliver-sanders modified the milestones: 2.7.0, 2.8.0 Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working investigation
Projects
None yet
Development

No branches or pull requests

5 participants