Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SockJS is not emitting 'close' event #127

Open
maxguzenski opened this issue Jul 6, 2013 · 19 comments
Open

SockJS is not emitting 'close' event #127

maxguzenski opened this issue Jul 6, 2013 · 19 comments

Comments

@maxguzenski
Copy link

I have a chat (facebook style chat) than during the day it gets over 3000 open sockets (online users).

But during the night, I have a maximum of 400 users online, yet the chat (with sockjs) accuses that it still has, more or less, 2000 open sockets.

If I restart the server (browser will automatically reconnect) than my chat server back to having only 400 sockets (which proves that 2000 connections were incorrect).

In developing this bug does not exist, only in production ...

What can I do to find out what's going on?

@Incisive
Copy link

Incisive commented Jul 6, 2013

How are you quantifying the number of open sockets / connected users? Have you checked the number of TCP connections open on the server level?

The reason it could only exist in production is you may not be replicating the user behavior / browser that is causing the incorrect socket count.

-Ryan

@maxguzenski
Copy link
Author

I'll check the number of TCP connections next time...

to quantify (a very small peace):

var sockCount = {};

io.on('connection', function(conn) {
   var username  = null;

   conn.on('data', function(jsonData) {
      var data = JSON.parse(jsonData || "{}");

      if (data.channel === 'aim:auth' && helper.validateUserToken(data.username, data.token)) {
         username = data.username;         
         sockCount[data.username] = (sockCount[data.username] || 0) + 1;
         conn.write(helper.toJSON({channel: 'ready', message: "ok"}));
         return;
      }

      if (!username) {
         conn.close(42, 'not authentificated');
         return;
      }

      if (data.channel !== 'aim:auth') {
         conn.emit(data.channel, {original: data, user: username});
      }
   });

   conn.on('close', function() {
      if (username && sockCount[username] && typeof sockCount[username] === 'number') {
         sockCount[username] = sockCount[username] - 1;
         if (sockCount[username] < 1) delete sockCount[username];
      }
   });
});

So, when I print "sockCount", I have something like:

user1, 1 open socket
user2, 10 open sockets
user3, 3 open sockets

total user: 1600, total sockets: 3000 ... that, I sure know, is not true.

@Incisive
Copy link

Incisive commented Jul 6, 2013

I've found from my experiences using this in production that incrementing counters on connect / close events isn't the most accurate way.

On my applications that I've wanted an accurate count of users online I have always implemented a heartbeat system. Basically the user sends back a identification packet every x seconds and I store it off in a database like MongoDB (which allows me to expire documents after x seconds) using an upsert. So whenever a user pings back the server will update the heartbeat entry, or create a new one if it's been cleaned up.

From there you can simply query the row to get a count of users online, and even go further to get which types of users are online etc.

@maxguzenski
Copy link
Author

"close events isn't the most accurate way."

But what happens? just there moments that 'close' event doesn't execute? Is there someway to reproduce this issue on development side?

@Incisive
Copy link

Incisive commented Jul 6, 2013

I'm really not sure, but from what I've seen it doesn't always fire. Have you ever tried using a non-websocket capable browser that has to fall-back to xhr-polling?

@maxguzenski
Copy link
Author

OK, now is the moment that my chat has more then a real online users.

total users: 2304, total sockets: 4370

sysctl

"sysctl" => {
  'net.core.somaxconn'       => 16384,
  'net.core.rmem_max'       => 16777216,
  'net.core.wmem_max'      => 16777216,
  'net.ipv4.tcp_rmem'         => "4096 87380 16777216",
  'net.ipv4.tcp_wmem'        => "4096 65536 16777216",
  'net.ipv4.tcp_syncookies' => 1,
  'file-max'                        => 999999
}

netstat -a | wc -l => 2110
lsof -p 346 | wc -l => 1850

AFTER A CHAT RESTART (browser re-connect automatically in 10s ... I waited 15min)
total users: 1145, total sockets: 1331

netstat -a | wc -l => 2717
lsof -p 3707 | wc -l => 1732

@maxguzenski
Copy link
Author

but anyway, this is something that SockJS SHOULD manage internally. It SHOULD manage this heartbeat.

@mrjoes
Copy link
Member

mrjoes commented Jul 8, 2013

Out of curiosity, can you do breakdown by transports used for all opened
connections? Can you tell if it is problem with specific fallback protocol?

One of the possibilities - polling transports use server-side session and
they will keep it alive for few seconds between requests. So, there can be
situation when there are less TCP connections than number of connected
SockJS clients.

On Mon, Jul 8, 2013 at 10:22 AM, Maximiliano Guzenski <
[email protected]> wrote:

but anyway, this is something that SockJS SHOULD manage internally. It
SHOULD manage this heartbeat.


Reply to this email directly or view it on GitHubhttps://github.com//issues/127#issuecomment-20589326
.

@maxguzenski
Copy link
Author

Now is 5am at Brazil, it is not a issue by few seconds, it is a issue in hours or forever...

90% of my user use a up to data Chrome version, I pretty shore that is not a protocol specific issue. But I'll do more tests.

but maybe, is a issue at my code, I'll share it in gist

@mrjoes
Copy link
Member

mrjoes commented Jul 8, 2013

No problem, just send your findings when you have them.

I used sockjs-node (and sockjs-tornado) and did not experience problem
you're having.

I'll try to help, but I don't even know why it happens.

Serge.

On Mon, Jul 8, 2013 at 10:35 AM, Maximiliano Guzenski <
[email protected]> wrote:

Now is 5am at Brazil, it is not a issue by few seconds, it is a issue in
hours or forever...

90% of my user use a up to data Chrome version, I pretty shore that is not
a protocol specific issue. But I'll do more tests.

but maybe, is a issue at my code, I'll share it in gist


Reply to this email directly or view it on GitHubhttps://github.com//issues/127#issuecomment-20589715
.

@maxguzenski
Copy link
Author

there is it... I removed some code (not related to this issue)
https://gist.github.com/maxguzenski/5946916

@yurynix
Copy link

yurynix commented Jul 8, 2013

My input on this:

Users length: 1426
Connections count: 1909
{ 'xhr-streaming': 197,
'xhr-polling': 83,
websocket: 1626,
'jsonp-polling': 3,
eventsource: 0,
htmlfile: 0 }

netstat ESTABLISHED sockets: 4066 (i'm behind HAProxy, so devide by 2)

The difference between users count to connection count is due some users connected in multiple tabs / browsers which is ok in my scenario.

@maxguzenski what version of sockjs / node you are using? have you seen #99 you have sockets stuck in some other states?

@maxguzenski
Copy link
Author

sockjs-node = 0.3.7 (up to date version)
sockjs-client = 0.3.4 (up to date version)

But now, I was thinking... I did have this issue a few months ago. In the last couple months I update nodejs (from 0.10.something to 0.10.12) and socketjs from 0.3.4 to 0.3.7 ... I'll try sockJS 0.3.4 again, before downgrade nodejs

[updated]
Sorry, I always use sockJS 0.3.7, since this chat is in production... The only thing that really changes was nodejs version.

@maxguzenski
Copy link
Author

Ok, I'm debugging it... now I'm logging everything.
I've 1000 connections right now, and at least one hang in state 3 (closed), readable: false and writable: false (websocket protocol, IE10)... and looks like that conn.on('close', ..) wasnt execute.

[updated]
My sockJS app has 4 sockets on state "closed" that didn't emite "close" event.
In this way, it will take between 700 and 900 "hanging connections" tomorrow morning

@maxguzenski
Copy link
Author

Hi
In the end of day, my SockJS webchat had more then 100 socks in state 'closed', but it had not fire 'closed' event. This bug occurred less than 5% of all the connections that I had during the day.

To work around this problem, I just made:

setInterval(function() {
   var connsIds = Object.keys(conns);
   connsIds.each(function(connId) { 
      if (conns[connId].readyState === 3) delete conns[connId];
   }
}, 10*60*1000);

@yurynix
Copy link

yurynix commented Jul 11, 2013

After seeing that issue, I've checked my app, it runs behind SSL haproxy and all the incoming connections go threw it.

I've kill -9 haproxy now, and waited few minutes, even though all the tcp connections on the OS level were dropped, sockjs indeed have not emited close events correctly, and i had following stats:
Connections count: 861
Protocol counters: { 'jsonp-polling': 0,
'xhr-streaming': 0,
websocket: 861,
'xhr-polling': 0,
htmlfile: 0,
eventsource: 0 }

Might be a problem with faye implementation

@arjunmehta
Copy link

This may be related to the Chrome issue: https://code.google.com/p/chromium/issues/detail?id=76358 but likely part of a larger connectivity/network issue that is causing the close event to be emitted unreliably.

In any case, I have tested my application using Firefox, Safari and Chrome, and all of them display this symptom:

If I close the lid of my machine to put my computer to sleep, the connections are not properly closed, and they remain indefinitely improperly terminated in about 95% of the cases.

It is really unpredictable... sometimes these connections remain in limbo in readyState 3 (most common case) without emitting the close event, sometimes they remain in a readyState 1, also thus not executing the close event, and occasionally, for some reason, they happen to emit the close event. But this is rare.

If I trace the contents of my connection map, these connections build up over time, and, as mentioned, lead to a major memory leak/block.

Examining the transport.js code, it seems like there are potentially some conflicting operations that are occurring when a connection "times out". The session, as well as the event listeners for close and end are removed by the time the methods that emit them are supposed to be executed.

My hunch is that some of the calls (ie. didClose, didAbort, tearDown, unregister etc.) are all being called linearly as if they were blocking, when they should be called asynchronously using callbacks to ensure a consistent order of operations on the connection closing processes.

Could someone confirm the reason why callbacks are not used at all in the transports.js script?

@airs0urce
Copy link

I still have this issue. I check for connections list and readyState is CLOSED. I can only reproduce it on production server. I'm using not sock.js, but http://npmjs.com/faye-websocket (0.11.0). As sock.js using faye-websocket too, may be this issue somewhere inside the lib. I implemented woraround from @maxguzenski and now it works

@airs0urce
Copy link

@maxguzenski Did you stick with that workaround or found actual issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants