Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

ocharles · 2014-07-17T13:55:08Z

This is a bit of weird bug, but I'm seeing pings from websockets-snap being sent to an open postgres connection. I previously thought this was a bug with postgresql-simple, as you can read about here.

My guess is summarized in this IRC log:

11:27 AM <ocharles> Ok, so here's something *really* weird
11:27 AM <ocharles> We're using websockets-snap to upgrade a Snap connection to websockets
11:27 AM <ocharles> But occasionally, the server sent ping is ending up in postgresqls socket
11:27 AM <ocharles> Has anyone got any idea wtf could be going on there?
11:30 AM <ocharles> My feeling is the connection is getting closed, but websockets-snap is still trying to send a ping out - and it happens that a new socket has opened to postgresql
11:30 AM <gcollins> do two or more threads have access to the same fd here?
11:30 AM <ocharles> jaspervdj: I do notice that "forkPingThread" ignores *all* exceptions and is otherwise never killed - that is also a bit suspect
11:31 AM <ocharles> gcollins: Hard for me to say - my app is not dealing with Fds and Sockets - it's just pulling postgresql connections out of a data-pool and sending data back over a websocket
11:32 AM <gcollins> point is if someone closes your fd on you then some other thread can get the same fd back from the os
11:32 AM <ocharles> what is "my thread" here? the thread serving the websocket?
11:32 AM <gcollins> either that or the postgres one
11:32 AM <ocharles> because that's my assumption - the websocket closes, the ping is still running, but before the ping sends a postgresql connection is opened on a new web request
11:32 AM <gcollins> probably the websocket
11:32 AM <gcollins> yes that is the obvious conclusion
11:32 AM <gcollins> gotta go
11:33 AM <ocharles> but the ping is sent to snap's iteratee - somewhat suggesting it's a bug in snap
11:34 AM <ocharles> Ok, it looks like I can actually disable that ping thread, so I will try that
11:36 AM <gcollins> ocharles: not really -- if snap kills your thread but you ignore the threadKilled and keep using the socket ….
11:36 AM <ocharles> it doesn't do that it turns out - it just terminates with () on any exception
11:36 AM <ocharles> http://hackage.haskell.org/package/websockets-snap-0.8.2.1/docs/src/Network-WebSockets-Snap.html see forkPingThread at the bottom
11:37 AM <ocharles> but I can disable that, so i'll deploy with that disabled for a day and see how the logs look...
12:23 PM <lpsmith> ocharles, that's a bit disturbing, as actually there's a project I was really planning to use threepenny-gui and postgresql-simple on :-(
12:24 PM <ocharles> indeed
12:24 PM <lpsmith> so that's a bug that will be affecting me in the near future.
12:24 PM <ocharles> You can turn the pings off
12:24 PM <lpsmith> Thanks for finding it anyhow =)
12:24 PM <lpsmith> I'll need pings
12:25 PM <ocharles> Right, but you could do that yourself
12:25 PM <ocharles> (engine-io which I'm using has its own ping system)
12:25 PM <lpsmith> I suppose
12:25 PM <ocharles> but yes, this is worrying and does need to be fixed
12:25 PM <ocharles> hopefully you won't encounter it, because I will have fixed it :)
12:25 PM <lpsmith> =)
12:26 PM <lpsmith> ocharles, well, maybe I'm wrong, but my *impression* is that once a fd is closed, it's not re-used until the file descriptors wrap around.
12:27 PM <lpsmith> So your hypothesis doesn't jibe with my (admittedly very tenuous) understanding of how file descriptors work on that count.
12:27 PM <ocharles> i have no idea. gcollins above suggests that a different thread could do it
12:27 PM <ocharles> i have no idea how fd allocation works
12:27 PM <ocharles> but... time for a burrito!
12:27 PM <lpsmith> :)
12:34 PM <lpsmith> ocharles, actually going async 0.4.3.0 was to better support database heartbeats on an observer thread...
12:35 PM <lpsmith> basically, if I never hear from the database, how do I know the connection is alive? Well I was issuing SELECT 1 query every 15-30 seconds I didn't hear from the database, which helped but the timeouts didn't work.
12:38 PM <lpsmith> ocharles, yeah I guess that might make sense, that the "next" file descriptor is a thread-local variable.
12:38 PM <lpsmith> or rather, the next file descriptor to start searching from.
12:50 PM <lpsmith> So I learned something about file descriptor allocation anyway, that my assumptions aren't entirely correct, at least for multithreaded apps.

The text was updated successfully, but these errors were encountered:

lpsmith · 2014-07-17T14:35:04Z

Yeah, it turns out my "tenuous" understanding of how file descriptor indexes are allocated was in fact completely wrong. File descriptor indexes are regularly reused, and Linux and Solaris at least seem to assign the lowest available descriptor index whenever a new descriptor is created.

So I feel pretty confident that this is indeed a use-after-close fault.

dzhus · 2014-10-06T08:09:23Z

Why doesn't websockets-snap use lockingInputStream?

jaspervdj · 2015-01-02T11:49:58Z

This should be fixed in the latest release.

elliottt · 2015-02-10T19:59:32Z

I think this is the same bug that I was observing when I made pull request #13. Though my pull request made the problem less frequent, it didn't solve the problem. After some investigation, I think that my problem stemmed from using receive instead of the receiveDataMessage, which doesn't do the necessary socket bookkeeping to eventually put the socket into the Unavailable state. The result of this is that while my server app was exiting successfully, it was doing so in a way that left the socket available. As the ping thread is never terminated by the ServerAppDone exception, which I was never able to observe being caught in copyIterateeToMVar, it would eventually fire again writing the ping to a file descriptor that was assumed free and reused during a future request.

In the short term, switching to using receiveDataMessage and catching a ConnectionException seems to fix my problem, but ultimately it seems like the ping thread just needs to be killed explicitly when the server app exits. I've made changes in the Galois fork of the repo, as commit GaloisInc/websockets-snap@5c991e0, though I'm not convinced that the ping thread will die if the server thread throws an exception that bubbles all the way out.

ocharles · 2015-02-10T20:14:05Z

I am definitely seeing errors in PostgreSQL's log, as before - so something is still going wrong - but unfortunately I haven't been able to put much time into it. As far as I can tell, I have disabled websockets own ping system entirely, and am only using receiveDataMessage and sendTextData.

jaspervdj · 2015-02-10T20:55:20Z

@elliottt I will merge in your patch tomorrow, although I think I will rewrite it a bit and put into a different part of the code. Thanks!

elliottt · 2015-02-10T22:07:41Z

@ocharles, you're still seeing the error even with the ping thread disabled? If so, that totally blows my theory :)

jaspervdj · 2015-02-14T14:43:34Z

@elliottt I think I found the issue. Due to a terrible oversight on my side there was no real sharing of information of when the parse end or write end were closed -- both had to detect it separately. Would you be willing to test the latest master branch?

elliottt · 2015-02-14T17:14:25Z

Absolutely! I'll test today. Thanks!

On Sat, Feb 14, 2015, 06:43 Jasper Van der Jeugt [email protected]
wrote:

@elliottt https://github.com/elliottt I think I found the issue. Due to
a terrible oversight on my side there was no real sharing of information of
when the parse end or write end were closed -- both had to detect it
separately. Would you be willing to test the latest master branch?

—
Reply to this email directly or view it on GitHub
#10 (comment)
.

elliottt · 2015-02-14T18:34:42Z

Everything looks fine here, but I'm testing with a smaller dataset than before. I can give you more definitive results on Monday.

ocharles · 2015-02-14T18:55:32Z

I can also test this in production on Monday and see if I still get the same random error messages from PostgreSQL

jaspervdj · 2015-02-24T23:01:06Z

Any updates on this?

jaspervdj · 2015-02-24T23:03:57Z

This possible fix is included in websockets-0.9.3.0, by the way.

ocharles · 2015-02-25T11:21:29Z

Added an item to this weeks to-do list to have a look at this.

ocharles · 2015-05-08T10:01:39Z

Unfortunately, I can still see

May 08 10:51:43 api postgres[1197]: FATAL:  invalid frontend message type 129

in our PostgreSQL logs which suggests this is still happening.

lpsmith · 2015-05-16T12:40:03Z

Ahh, apparently, returning the lowest available descriptor index is something the POSIX standard specifies. (Why? That honestly doesn't make sense to me...)

So yeah, use-after-close is a real problem on anything that conforms to that dictum.

creswick · 2015-05-21T15:04:33Z

@jaspervdj Is the 0.9.3.0 release available somewhere? (I don't see it on hackage, or a tag on the repo.)

ocharles · 2015-05-21T15:09:16Z

@creswick 0.9.3.0 is on Hackage, as is 0.9.4.0

creswick · 2015-05-21T15:57:29Z

@ocharles am I just looking in the wrong place? I don't see them here: http://hackage.haskell.org/package/websockets-snap

ocharles · 2015-05-21T15:58:57Z

The versions are for websockets, and websockets-snap can build with the latest versions

creswick · 2015-05-21T16:11:28Z

ah, thanks!

I am still seeing the problem with websockets-0.9.4.0, unfortunately.

lpsmith mentioned this issue Aug 6, 2014

Make it safe to use inotify descriptors after they have been closed. lpsmith/linux-inotify#1

Closed

henrylaxen mentioned this issue Sep 8, 2014

how to exit runWebSocketSnap #12

Closed

jaspervdj closed this as completed Jan 2, 2015

jaspervdj reopened this Feb 24, 2015

moesenle mentioned this issue Apr 21, 2019

Ensure that ping threads are closed when the ServerApp finishes. #25

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

ocharles commented Jul 17, 2014

lpsmith commented Jul 17, 2014

dzhus commented Oct 6, 2014

jaspervdj commented Jan 2, 2015

elliottt commented Feb 10, 2015

ocharles commented Feb 10, 2015

jaspervdj commented Feb 10, 2015

elliottt commented Feb 10, 2015

jaspervdj commented Feb 14, 2015

elliottt commented Feb 14, 2015

elliottt commented Feb 14, 2015

ocharles commented Feb 14, 2015

jaspervdj commented Feb 24, 2015

jaspervdj commented Feb 24, 2015

ocharles commented Feb 25, 2015

ocharles commented May 8, 2015

lpsmith commented May 16, 2015

creswick commented May 21, 2015

ocharles commented May 21, 2015

creswick commented May 21, 2015

ocharles commented May 21, 2015

creswick commented May 21, 2015

Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

Comments

ocharles commented Jul 17, 2014

lpsmith commented Jul 17, 2014

dzhus commented Oct 6, 2014

jaspervdj commented Jan 2, 2015

elliottt commented Feb 10, 2015

ocharles commented Feb 10, 2015

jaspervdj commented Feb 10, 2015

elliottt commented Feb 10, 2015

jaspervdj commented Feb 14, 2015

elliottt commented Feb 14, 2015

elliottt commented Feb 14, 2015

ocharles commented Feb 14, 2015

jaspervdj commented Feb 24, 2015

jaspervdj commented Feb 24, 2015

ocharles commented Feb 25, 2015

ocharles commented May 8, 2015

lpsmith commented May 16, 2015

creswick commented May 21, 2015

ocharles commented May 21, 2015

creswick commented May 21, 2015

ocharles commented May 21, 2015

creswick commented May 21, 2015