Processes with unterminated TCP connections are non reusable. #2592
Comments
@CamJN @FooBarWidget I notice you are both maintainers of this repo. Would you mind taking a look at this issue and explaining how this might be blocking processes? This impacts this particular customer's usage of the AWS SDK for Ruby.
Hi @mullermp and @tubsandcans, Passenger doesn't inherently interact with how the app deals with TCP sockets, so I don't really see how Passenger can be the cause here. However, you can use this trick to debug things: when a process is stuck, send SIGQUIT to it, and it'll dump the backtraces of all threads. This will allow you to see whether it's blocked on anything. If you still suspect it's Passenger-related, please provide the output of
Hi @FooBarWidget, thanks for replying! The issue here is that I cannot safely reproduce this error, as it only happens in production under heavy load. Reproducing it would require letting my production service crash, which I'm not willing to do. All I know is that Passenger had no problem reusing its supervised processes before they had any TCP sockets in CLOSE_WAIT (this behavior being introduced by the aws-sdk gem). Since then, I have monkey-patched this gem to destroy sockets instead of leaving them in CLOSE_WAIT. Everything has been working as it did before the aws-sdk integration, so there is definitely some interplay between Passenger and supervised processes that hold CLOSE_WAIT sockets.
The only thing I can think of is that, by keeping so many sockets open, you reach the Ruby process's file descriptor limit. Sometimes, when Passenger's Ruby side accepts a new request, it may have to open a new socket connection (= new file descriptor). That would fail if the limit had already been reached at that point. But if that's the case, then you should see error messages in your web server error log file.
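To check the file descriptor theory from inside the application, a minimal Ruby sketch along these lines could be logged periodically. The helper name `fd_usage` is hypothetical, and counting entries under `/proc/self/fd` (Linux) or `/dev/fd` (macOS/BSD) is an assumption about the host OS; it is not something Passenger provides.

```ruby
# Hypothetical helper: report how many file descriptors this process has open
# compared with its soft limit (RLIMIT_NOFILE).
def fd_usage
  soft_limit, _hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)

  # Assumption: one of these per-process fd directories exists on the host.
  fd_dir = ["/proc/self/fd", "/dev/fd"].find { |dir| Dir.exist?(dir) }

  # The count includes the descriptor briefly opened to list the directory,
  # so treat it as an approximation.
  open_fds = fd_dir ? Dir.children(fd_dir).size : nil

  { open: open_fds, soft_limit: soft_limit }
end

if __FILE__ == $PROGRAM_NAME
  usage = fd_usage
  puts "open fds: #{usage[:open].inspect}, soft limit: #{usage[:soft_limit]}"
end
```

If the open count sits close to the soft limit while sockets pile up in CLOSE_WAIT, that would be consistent with the explanation above.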
And since you can't safely reproduce this issue, the next best thing you can do is to automatically run diagnostics the next time the issue does happen. You could create a cron job that sends SIGQUIT to all your application processes once every few minutes, so that if any of them does freeze, then at least you have backtraces.
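As a rough sketch of what such a cron job could run, the Ruby script below sends SIGQUIT to a list of PIDs given on the command line. The function name `send_quit` is hypothetical, and how the PIDs are collected (for example from the `passenger-status` output) is deliberately left out, since that depends on the setup.

```ruby
# Hypothetical helper: send SIGQUIT to each PID and report what happened.
# Note: per the comment above, Passenger-managed Ruby processes dump thread
# backtraces on SIGQUIT; an ordinary process is typically terminated by it,
# so only pass PIDs of Passenger application processes.
def send_quit(pids)
  pids.each_with_object({}) do |pid, results|
    Process.kill("QUIT", pid)
    results[pid] = :sent
  rescue Errno::ESRCH
    # The process no longer exists.
    results[pid] = :not_found
  rescue Errno::EPERM
    # The process belongs to another user and we lack permission.
    results[pid] = :not_permitted
  end
end

if __FILE__ == $PROGRAM_NAME
  pids = ARGV.map { |arg| Integer(arg) }
  send_quit(pids).each { |pid, status| puts "#{pid}: #{status}" }
end
```

Run with no arguments, the script does nothing, which makes it safe to dry-run before wiring it into cron.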
@tubsandcans Did you investigate @FooBarWidget's response?
@mullermp I don't have a reasonable way to test it. I'd have to roll back the patch, wait until our main production service falls over, and then hopefully gather enough debug information to find a solution. Neither I nor my boss is willing to do that. I was hoping someone would give me a concrete answer about whether CLOSE_WAIT sockets in supervised processes affect the Passenger supervisor's ability to reuse them. I think our long-term solution is to abandon Passenger; it's a form of tech debt, unfortunately.
I don't have a dog in that fight, but I would recommend just using Puma, as it's pretty much the default for Rails applications. You should not see issues with the Ruby SDK's connection pooling.
In our multi-process Passenger+Nginx service, we are noticing a gradual reduction in available RAM up until the point where the service crashes. This only began happening after integrating a feature that reads from and writes to AWS S3, the gem for which implements a connection pool by leaving used connections in the CLOSE_WAIT state for future resumption.
The problem, however, is that once a Passenger RubyApp instance/process is linked to one of these CLOSE_WAIT TCP connections, it appears to no longer have requests routed to it, thus locking up system resources. This causes Passenger to eventually create many more RubyApp instances than our max process count, which inevitably crashes the app.
I brought this issue up with the ruby-aws-sdk gem maintainers, and I am able to work around it by monkey-patching that gem. However, it seems worth investigating this from the perspective of Passenger, which is possibly unable to reuse processes that hold CLOSE_WAIT TCP connections.
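To confirm which processes actually hold CLOSE_WAIT sockets, a Linux-only sketch like the following may help. It assumes the `/proc/net/tcp` layout (state code `08` means CLOSE_WAIT, socket inode in the tenth column) and covers IPv4 only; IPv6 sockets are listed in `/proc/net/tcp6`. Both helper names are hypothetical.

```ruby
# Linux kernel state code for CLOSE_WAIT in /proc/net/tcp.
CLOSE_WAIT_STATE = "08"

# Hypothetical helper: return the socket inodes of all CLOSE_WAIT entries
# found in text formatted like /proc/net/tcp.
def close_wait_inodes(proc_net_tcp_text)
  proc_net_tcp_text.each_line.drop(1).filter_map do |line|
    fields = line.split
    next if fields.length < 10

    fields[9].to_i if fields[3] == CLOSE_WAIT_STATE
  end
end

# Hypothetical helper: return the socket inodes held open by a process,
# read from the symlinks in /proc/<pid>/fd (requires permission to read them).
def socket_inodes_for(pid)
  Dir.glob("/proc/#{pid}/fd/*").filter_map do |fd_path|
    File.readlink(fd_path)[/\Asocket:\[(\d+)\]\z/, 1]&.to_i
  rescue SystemCallError
    nil # the descriptor was closed or is unreadable
  end
end

if __FILE__ == $PROGRAM_NAME && ARGV.any?
  pid = Integer(ARGV[0])
  held = socket_inodes_for(pid) & close_wait_inodes(File.read("/proc/net/tcp"))
  puts "PID #{pid} holds #{held.size} CLOSE_WAIT socket(s)"
end
```

Comparing this count across RubyApp processes that still receive requests and those that do not would show whether CLOSE_WAIT sockets really correlate with the stalled processes.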