
Blocked indefinitely in copyIterateeToMVar #14

Open
wereHamster opened this issue Aug 23, 2015 · 11 comments

@wereHamster

I have a strange bug which appears to cause the EL.head in copyIterateeToMVar to block indefinitely if the application is compiled with -threaded. I used print-style debugging to track that down: a print statement placed just before EL.head is executed, while one placed just after is not.
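The bracketing described above can be sketched generically like this (a minimal sketch, not the actual websockets-snap code; `traced` is a hypothetical helper):

```haskell
import System.IO (hFlush, hPutStrLn, stderr)

-- Hypothetical helper: bracket a suspect blocking call with flushed trace
-- output, so the last line seen on stderr tells you which call never
-- returned. In the real code, the bracketed action would be the EL.head
-- call inside copyIterateeToMVar.
traced :: String -> IO a -> IO a
traced label action = do
  hPutStrLn stderr ("before " ++ label) >> hFlush stderr
  x <- action
  hPutStrLn stderr ("after " ++ label) >> hFlush stderr
  pure x

main :: IO ()
main = do
  n <- traced "EL.head" (pure (42 :: Int))  -- stand-in for the blocking call
  print n
```

Flushing stderr matters here: without hFlush, buffered output can make it look like a print before the call never ran.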

I was not able to reproduce the bug on my Mac OS X development machine. The proxy runs on a Linux server, with GHC 7.10.1.

The application is a rather simple websocket proxy: it is a snap server which accepts requests from web browsers and forwards the messages to one of the available backend servers (and messages from backend servers back to the correct clients).

The symptoms are that the server stops forwarding messages to the backend server, but messages from the backend server do make their way through the proxy to the browser. This happens only after the first message from the backend server passes through to the web browser. Until then the proxy is accepting messages just fine. A typical session looks like this ([C]lient, [P]roxy, [B]ackend):

C -> P -> B
C -> P -> B
[ 5 seconds pass when no messages pass either way ]
C <- P <- B   -- A burst of many messages from the backend to the client
C -> P        -- The proxy never processes this message
C <- P <- B   -- Periodic messages every ~1-5 seconds
C -> P        -- This one is also lost, as are all further messages
...

However, if the client continuously sends messages through the proxy (say, every 100ms), then the bug sometimes disappears (but not always).

What's even more interesting is that the bug appears to depend on the pattern of messages that the backend sends to the client. I have different backends; for some the bug shows up in 100% of the attempts, and for others never, so there is a strong correlation between the backend used and the bug. The message pattern is different for each backend.

I suspect the problem is somewhere in Snap (or in how websockets-snap interacts with Snap), because:

  • The TCP connection between the web browser and the proxy is up and functioning correctly. The messages from the backend servers are successfully sent through the proxy back to the web browser.
  • The web browser is sending messages to the server: the Chrome dev console indicates that.
  • The receive buffer on the server is empty (checked with netstat), so the proxy (at least the Snap part of it) is reading the data from the Linux kernel socket into application memory.
@wereHamster
Author

I have discovered another deadlock, this time I can actually reproduce it quite reliably on my local machine (Mac).

https://gist.github.com/wereHamster/fbc3e53837007a5d6f8c43c0ecf7317e

It's essentially a websocket server and client in one. The client connects to the server. The server starts sending messages (with increasing size) to the client.

Compile with stack build and then run with stack exec wsbug.

The number you see is the length of the message that was sent. It increases by 10 in each iteration.

Quite often (>75% of the runs), the output stops after less than a second. When it stops, it's because the server is not sending any data; it is blocked in WS.sendTextData.
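For readers who don't want to open the gist, the repro has roughly this shape (a sketch only, not the gist itself; it assumes the plain Network.WebSockets API rather than websockets-snap, and the host/port are made up):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import qualified Data.Text as T
import qualified Network.WebSockets as WS

main :: IO ()
main = do
  -- server: send messages whose length grows by 10 each iteration
  _ <- forkIO $ WS.runServer "127.0.0.1" 9160 $ \pending -> do
    conn <- WS.acceptRequest pending
    let loop n = do
          WS.sendTextData conn (T.replicate n "x")
          print n            -- the number printed is the message length
          loop (n + 10)
    loop 10
  threadDelay 100000         -- give the server a moment to bind
  -- client: read and discard everything the server sends
  WS.runClient "127.0.0.1" 9160 "/" $ \conn ->
    forever (void (WS.receiveData conn :: IO T.Text))
```

The symptom in the gist is that this pipeline stalls even though both sides should be able to make progress indefinitely.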

@wereHamster
Author

I can reproduce it reliably under Mac, but not under Linux. Should I file it as a bug with GHC (IO manager)?

@jaspervdj
Owner

I think we should try to reduce the example to something much smaller first -- you can't expect GHC devs to read through library code. I'll see if I can spot something fishy. I don't currently have a Mac though.

@wereHamster
Author

That would mean working through snap internals, then io-streams, then network :( I know the sending side is blocked inside the send(2) syscall, that much I was able to track down.

@wereHamster
Author

I tried to recreate roughly the same server/client setup with just Network.Socket and System.IO.Streams, but was not able to reproduce the issue.

Through ThreadScope I was able to find out that the sending thread is blocked on an MVar.

If I remove the lock in send' in Network.WebSockets.Stream.makeStream, then the problem goes away (albeit in a strange way); more on that below.

ThreadScope

This trace is with the lock in makeStream/send' removed!

[ThreadScope screenshot, 2017-09-11: per-thread activity trace]

You see almost 100% activity in the first half of each green 'block'. That's both the sending and receiving side doing their job. Eventually the sending side (thread 4) is blocked on an MVar, and the activity slightly decreases. From then on the receiving thread (thread 6) is mostly active, until it, too, blocks on an MVar.

With the lock in makeStream/send' present, activity never resumes and the process is stuck forever. With the lock gone, there is a roughly 2+ second delay and then the whole cycle repeats: a period of 100% activity, thread 4 blocks, then a period of slightly lower activity until thread 6 blocks, then a 2+ second gap. Just when the activity resumes, threads 10 and 12 are woken up; I don't know what they do, maybe some internal timers in Snap or the GHC RTS? Note that the gap in activity is not consistent: it is at least 2s but can be much longer, as seen in the screenshot.

@jaspervdj
Owner

Are there still similar issues when you use a recent Snap (where we don't need the iteratee/MVar trick anymore)?

@wereHamster
Author

I compiled the code against stackage nightly 2017-08-25, which contains snap-{core,server}-1.0.3.0. Is that recent enough?

@jaspervdj
Owner

Yeah, that snapshot should no longer contain copyIterateeToMVar (websockets-snap >= 0.10).

@wereHamster
Author

I'm confused, though, about which MVar the writing thread is blocking on. Since I removed the one in makeStream/send', there must be some other MVar inside websockets or snap along the path of writing to a network socket (only one thread in my Main.hs does the writing, after I removed the code that spawns the ping thread from websockets-snap's pendingOnAccept). I have not yet been able to track down where it is, however.

@wereHamster
Author

Next insight:

Inside the websocket handler in the server, at the end I loop forever, reading from the connection and discarding the data (i.e. forever (void $ WS.receiveData conn)). If I replace that with a plain threadDelay, the deadlock doesn't happen. In this test case I can do that because I know the client will never send any data. This behavior mirrors what I see in my real-world application: a server which at the beginning sends a lot of data to the browser, while the browser doesn't send anything back. So it looks like there is an MVar that is shared between receiveData and sendTextData, and receiveData takes it and doesn't release it.
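The suspected failure mode can be demonstrated with base alone (a sketch of the hypothesis, not the actual websockets internals): one lock shared by the receive and send paths, where the receiver grabs the lock and then blocks waiting for input, starving the sender until the "socket" produces something.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
import System.Timeout (timeout)

main :: IO ()
main = do
  lock  <- newMVar ()                    -- shared between "receive" and "send"
  input <- newEmptyMVar :: IO (MVar ())  -- stands in for the network socket
  -- receiver: holds the shared lock while blocked waiting for input
  _ <- forkIO $ withMVar lock $ \_ -> takeMVar input
  threadDelay 10000                      -- let the receiver take the lock
  r1 <- timeout 100000 (withMVar lock (\_ -> putStrLn "sent"))
  print r1                               -- Nothing: the sender is starved
  putMVar input ()                       -- incoming data wakes the receiver...
  threadDelay 10000                      -- ...which then releases the lock
  r2 <- timeout 100000 (withMVar lock (\_ -> putStrLn "sent"))
  print r2                               -- Just (): the sender gets through
```

This also matches the observation above that a threadDelay instead of receiveData avoids the deadlock: a sleeping thread holds no lock.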

@wereHamster
Author

I noticed that I can prevent this deadlock by sending a ping along with each data frame, forcing the remote side to respond with a pong and thereby waking the local receiver thread up.
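The workaround could look something like this (`sendWithPing` is a hypothetical wrapper; `WS.sendPing` is the websockets function that sends a ping control frame):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Network.WebSockets as WS

-- Hypothetical wrapper for the workaround: piggyback a ping on every data
-- frame. The remote side answers with a pong, which gives the local
-- receiver thread something to read and wakes it up, so it releases
-- whatever it is holding.
sendWithPing :: WS.Connection -> T.Text -> IO ()
sendWithPing conn msg = do
  WS.sendTextData conn msg
  WS.sendPing conn ("wakeup" :: T.Text)
```

This is a mitigation rather than a fix: it adds a round trip per frame and only masks the shared-MVar behavior described above.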
