Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pings are occasionally being sent to non-Snap sockets (concrete example: PostgreSQL) #10

Open
ocharles opened this issue Jul 17, 2014 · 21 comments

Comments

@ocharles
Copy link

This is a bit of weird bug, but I'm seeing pings from websockets-snap being sent to an open postgres connection. I previously thought this was a bug with postgresql-simple, as you can read about here.

My guess is summarized in this IRC log:

11:27 AM <ocharles> Ok, so here's something *really* weird
11:27 AM <ocharles> We're using websockets-snap to upgrade a Snap connection to websockets
11:27 AM <ocharles> But occasionally, the server sent ping is ending up in postgresqls socket
11:27 AM <ocharles> Has anyone got any idea wtf could be going on there?
11:30 AM <ocharles> My feeling is the connection is getting closed, but websockets-snap is still trying to send a ping out - and it happens that a new socket has opened to postgresql
11:30 AM <gcollins> do two or more threads have access to the same fd here?
11:30 AM <ocharles> jaspervdj: I do notice that "forkPingThread" ignores *all* exceptions and is otherwise never killed - that is also a bit suspect
11:31 AM <ocharles> gcollins: Hard for me to say - my app is not dealing with Fds and Sockets - it's just pulling postgresql connections out of a data-pool and sending data back over a websocket
11:32 AM <gcollins> point is if someone closes your fd on you then some other thread can get the same fd back from the os
11:32 AM <ocharles> what is "my thread" here? the thread serving the websocket?
11:32 AM <gcollins> either that or the postgres one
11:32 AM <ocharles> because that's my assumption - the websocket closes, the ping is still running, but before the ping sends a postgresql connection is opened on a new web request
11:32 AM <gcollins> probably the websocket
11:32 AM <gcollins> yes that is the obvious conclusion
11:32 AM <gcollins> gotta go
11:33 AM <ocharles> but the ping is sent to snap's iteratee - somewhat suggesting it's a bug in snap
11:34 AM <ocharles> Ok, it looks like I can actually disable that ping thread, so I will try that
11:36 AM <gcollins> ocharles: not really -- if snap kills your thread but you ignore the threadKilled and keep using the socket ….
11:36 AM <ocharles> it doesn't do that it turns out - it just terminates with () on any exception
11:36 AM <ocharles> http://hackage.haskell.org/package/websockets-snap-0.8.2.1/docs/src/Network-WebSockets-Snap.html see forkPingThread at the bottom
11:37 AM <ocharles> but I can disable that, so i'll deploy with that disabled for a day and see how the logs look...
12:23 PM <lpsmith> ocharles, that's a bit disturbing, as actually there's a project I was really planning to use threepenny-gui and postgresql-simple on :-(
12:24 PM <ocharles> indeed
12:24 PM <lpsmith> so that's a bug that will be affecting me in the near future.
12:24 PM <ocharles> You can turn the pings off
12:24 PM <lpsmith> Thanks for finding it anyhow =)
12:24 PM <lpsmith> I'll need pings
12:25 PM <ocharles> Right, but you could do that yourself
12:25 PM <ocharles> (engine-io which I'm using has its own ping system)
12:25 PM <lpsmith> I suppose
12:25 PM <ocharles> but yes, this is worrying and does need to be fixed
12:25 PM <ocharles> hopefully you won't encounter it, because I will have fixed it :)
12:25 PM <lpsmith> =)
12:26 PM <lpsmith> ocharles, well, maybe I'm wrong, but my *impression* is that once a fd is closed, it's not re-used until the file descriptors wrap around.
12:27 PM <lpsmith> So your hypothesis doesn't jibe with my (admittedly very tenuous) understanding of how file descriptors work on that count.
12:27 PM <ocharles> i have no idea. gcollins above suggests that a different thread could do it
12:27 PM <ocharles> i have no idea how fd allocation works
12:27 PM <ocharles> but... time for a burrito!
12:27 PM <lpsmith> :)
12:34 PM <lpsmith> ocharles, actually going async 0.4.3.0 was to better support database heartbeats on an observer thread...
12:35 PM <lpsmith> basically, if I never hear from the database, how do I know the connection is alive? Well I was issuing SELECT 1 query every 15-30 seconds I didn't hear from the database, which helped but the timeouts didn't work.
12:38 PM <lpsmith> ocharles, yeah I guess that might make sense, that the "next" file descriptor is a thread-local variable.
12:38 PM <lpsmith> or rather, the next file descriptor to start searching from.
12:50 PM <lpsmith> So I learned something about file descriptor allocation anyway, that my assumptions aren't entirely correct, at least for multithreaded apps.
@lpsmith
Copy link

lpsmith commented Jul 17, 2014

Yeah, it turns out my "tenuous" understanding of how file descriptor indexes are allocated was in fact completely wrong. File descriptor indexes are regularly reused, and Linux and Solaris at least seem to assign the lowest available descriptor index whenever a new descriptor is created.

So I feel pretty confident that this is indeed a use-after-close fault.

@dzhus
Copy link
Contributor

dzhus commented Oct 6, 2014

Why doesn't websockets-snap use lockingInputStream?

@jaspervdj
Copy link
Owner

This should be fixed in the latest release.

@elliottt
Copy link
Contributor

I think this is the same bug that I was observing when I made pull request #13. Though my pull request made the problem less frequent, it didn't solve the problem. After some investigation, I think that my problem stemmed from using receive instead of the receiveDataMessage, which doesn't do the necessary socket bookkeeping to eventually put the socket into the Unavailable state. The result of this is that while my server app was exiting successfully, it was doing so in a way that left the socket available. As the ping thread is never terminated by the ServerAppDone exception, which I was never able to observe being caught in copyIterateeToMVar, it would eventually fire again writing the ping to a file descriptor that was assumed free and reused during a future request.

In the short term, switching to using receiveDataMessage and catching a ConnectionException seems to fix my problem, but ultimately it seems like the ping thread just needs to be killed explicitly when the server app exits. I've made changes in the Galois fork of the repo, as commit GaloisInc/websockets-snap@5c991e0, though I'm not convinced that the ping thread will die if the server thread throws an exception that bubbles all the way out.

@ocharles
Copy link
Author

I am definitely seeing errors in PostgreSQL's log, as before - so something is still going wrong - but unfortunately I haven't been able to put much time into it. As far as I can tell, I have disabled websockets own ping system entirely, and am only using receiveDataMessage and sendTextData.

@jaspervdj
Copy link
Owner

@elliottt I will merge in your patch tomorrow, although I think I will rewrite it a bit and put into a different part of the code. Thanks!

@elliottt
Copy link
Contributor

@ocharles, you're still seeing the error even with the ping thread disabled? If so, that totally blows my theory :)

@jaspervdj
Copy link
Owner

@elliottt I think I found the issue. Due to a terrible oversight on my side there was no real sharing of information of when the parse end or write end were closed -- both had to detect it separately. Would you be willing to test the latest master branch?

@elliottt
Copy link
Contributor

Absolutely! I'll test today. Thanks!

On Sat, Feb 14, 2015, 06:43 Jasper Van der Jeugt [email protected]
wrote:

@elliottt https://github.com/elliottt I think I found the issue. Due to
a terrible oversight on my side there was no real sharing of information of
when the parse end or write end were closed -- both had to detect it
separately. Would you be willing to test the latest master branch?


Reply to this email directly or view it on GitHub
#10 (comment)
.

@elliottt
Copy link
Contributor

Everything looks fine here, but I'm testing with a smaller dataset than before. I can give you more definitive results on Monday.

@ocharles
Copy link
Author

I can also test this in production on Monday and see if I still get the same random error messages from PostgreSQL

@jaspervdj
Copy link
Owner

Any updates on this?

@jaspervdj
Copy link
Owner

This possible fix is included in websockets-0.9.3.0, by the way.

@jaspervdj jaspervdj reopened this Feb 24, 2015
@ocharles
Copy link
Author

Added an item to this weeks to-do list to have a look at this.

@ocharles
Copy link
Author

ocharles commented May 8, 2015

Unfortunately, I can still see

May 08 10:51:43 api postgres[1197]: FATAL:  invalid frontend message type 129

in our PostgreSQL logs which suggests this is still happening.

@lpsmith
Copy link

lpsmith commented May 16, 2015

Ahh, apparently, returning the lowest available descriptor index is something the POSIX standard specifies. (Why? That honestly doesn't make sense to me...)

So yeah, use-after-close is a real problem on anything that conforms to that dictum.

@creswick
Copy link

@jaspervdj Is the 0.9.3.0 release available somewhere? (I don't see it on hackage, or a tag on the repo.)

@ocharles
Copy link
Author

@creswick 0.9.3.0 is on Hackage, as is 0.9.4.0

@creswick
Copy link

@ocharles am I just looking in the wrong place? I don't see them here: http://hackage.haskell.org/package/websockets-snap

@ocharles
Copy link
Author

The versions are for websockets, and websockets-snap can build with the latest versions

@creswick
Copy link

ah, thanks!

I am still seeing the problem with websockets-0.9.4.0, unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants