Processes with unterminated TCP connections are non reusable. #2592

Open
tubsandcans opened this issue Feb 19, 2025 · 8 comments

Comments

@tubsandcans

In our multi-process Passenger + Nginx service, available RAM gradually declines until the service crashes. This began only after we integrated a feature that reads from and writes to AWS S3. The gem it uses maintains a connection pool by leaving used connections in the CLOSE_WAIT state for future resumption.

The problem, however, is that once a Passenger RubyApp instance/process is linked to one of these CLOSE_WAIT TCP connections, it appears to no longer have requests routed to it, thus locking up system resources. This causes Passenger to eventually create many more RubyApp instances than our max process count, which inevitably crashes the app.

I raised this with the ruby-aws-sdk gem maintainers, and I can work around it by monkey-patching that gem. However, it seems worth investigating from Passenger's side, since Passenger may not be able to reuse processes that hold CLOSE_WAIT TCP connections.
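
One way to check how many sockets are sitting in CLOSE_WAIT on a Linux host is to parse `/proc/net/tcp`, where the fourth column holds the socket state as a hex code and `08` denotes CLOSE_WAIT. The following is a minimal sketch under that assumption. The helper name `count_close_wait` is hypothetical and is not part of Passenger or the AWS SDK, and the count is system-wide rather than per process.

```ruby
# Hex code the Linux kernel uses for CLOSE_WAIT in /proc/net/tcp.
CLOSE_WAIT_STATE = "08"

# Hypothetical helper: counts CLOSE_WAIT entries in text formatted
# like /proc/net/tcp (one header line, then one socket per line).
def count_close_wait(proc_net_tcp_text)
  proc_net_tcp_text.each_line.drop(1).count do |line|
    fields = line.split
    # Columns: sl, local_address, rem_address, st, ...
    fields[3] == CLOSE_WAIT_STATE
  end
end

if __FILE__ == $PROGRAM_NAME
  # Assumption: a Linux host; the files are skipped if they are not readable.
  %w[/proc/net/tcp /proc/net/tcp6].each do |path|
    next unless File.readable?(path)
    puts "#{path}: #{count_close_wait(File.read(path))} socket(s) in CLOSE_WAIT"
  end
end
```

Logging this number periodically alongside the process count would show whether CLOSE_WAIT sockets actually accumulate before the service degrades.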

@mullermp

mullermp commented Mar 3, 2025

@CamJN @FooBarWidget I notice you are both maintainers of this repo. Would you mind taking a look at this issue and explaining how this might be blocking processes? It affects this particular customer's usage of the AWS SDK for Ruby.

@FooBarWidget
Member

Hi @mullermp and @tubsandcans, Passenger doesn't inherently interact with how the app deals with TCP sockets, so I don't really see how Passenger can be the cause here. However, you can use this trick to debug things: when a process is stuck, send SIGQUIT to it, and it'll dump the backtraces of all threads. This will allow you to see whether it's blocked on anything.
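
To illustrate what such a dump contains, here is a minimal Ruby sketch of a signal handler that prints every live thread's backtrace. It is not Passenger's actual handler. The helper name `dump_thread_backtraces` is hypothetical, and the sketch traps SIGUSR2 instead of SIGQUIT on the assumption that SIGUSR2 is otherwise unused, so it does not replace the SIGQUIT behavior described above.

```ruby
# Hypothetical helper: writes the backtrace of every live thread to `io`.
def dump_thread_backtraces(io = $stderr)
  Thread.list.each do |thread|
    io.puts "Thread #{thread.object_id} (status: #{thread.status.inspect})"
    # backtrace can be nil if the thread finished in the meantime.
    (thread.backtrace || ["<no backtrace available>"]).each do |frame|
      io.puts "    #{frame}"
    end
    io.puts
  end
end

# Assumption: SIGUSR2 is not used for anything else in this process.
Signal.trap("USR2") { dump_thread_backtraces }
```

With this loaded, `kill -USR2 <pid>` would print the backtraces to the process's standard error. A thread stuck in a socket read or waiting on a lock would show up with that call at the top of its backtrace.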

If you still suspect it's Passenger-related, please provide the output of passenger-status during a problematic time. It would be even more helpful if you can provide a reproducible case.

@tubsandcans
Author

Hi @FooBarWidget, thanks for replying! The issue here is that I cannot safely reproduce this error, as it only happens in production under heavy load. Reproducing it would require letting my production service crash, which I'm not willing to do.

All I know is that Passenger had no problem re-using its supervised processes before they had any TCP sockets in CLOSE_WAIT (this behavior being introduced by the aws-sdk gem). Since then, I have monkey-patched this gem to not leave sockets in CLOSE_WAIT and instead destroy them. Everything has been working as it had before the aws-sdk integration, so there is definitely some interplay going on between Passenger and its supervised processes with CLOSE_WAIT sockets.

@FooBarWidget
Member

The only thing I can think of is that, by keeping so many sockets open, you reach the Ruby process's file descriptor limit. Sometimes, when Passenger's Ruby side accepts a new request, it may have to open a new socket connection (= new file descriptor). That would fail if the limit was already reached at that point. But if that's the case, then you should see error messages in your web server error log file.
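
The file descriptor hypothesis can be checked from inside the application. The sketch below compares the number of open descriptors with the soft `RLIMIT_NOFILE` limit of the current process. The helper name `fd_usage` is hypothetical, and the sketch assumes a system that lists open descriptors under `/proc/self/fd` (Linux) or `/dev/fd` (macOS, BSD).

```ruby
# Hypothetical helper: returns [open_fd_count, soft_limit] for this process.
def fd_usage
  # Linux exposes open descriptors under /proc/self/fd; macOS and BSDs use /dev/fd.
  fd_dir = File.directory?("/proc/self/fd") ? "/proc/self/fd" : "/dev/fd"
  soft_limit, _hard_limit = Process.getrlimit(Process::RLIMIT_NOFILE)
  # The count includes the descriptor briefly opened to list the directory.
  open_count = Dir.children(fd_dir).size
  [open_count, soft_limit]
end

open_count, soft_limit = fd_usage
warn "open file descriptors: #{open_count} of #{soft_limit} (soft limit)"
```

Logging these two numbers periodically, for example from a background thread, would show whether pooled sockets push a process toward its limit before it stops receiving requests.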

@FooBarWidget
Member

And since you can't safely reproduce this issue, then the next best thing you can do is to automatically run diagnostics next time the issue does happen. You could create a cron job that sends SIGQUIT to all your application processes once every few minutes, so that if any of them does freeze, then at least you have backtraces.
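
A script that such a cron job might run could look like the sketch below. It assumes the PIDs of the application processes are passed as command-line arguments; how they are collected is left open here. The helper name `send_signal` is hypothetical.

```ruby
# Hypothetical helper: sends `signal` to each PID and reports what happened.
def send_signal(pids, signal = "QUIT")
  pids.each_with_object({}) do |pid, results|
    begin
      Process.kill(signal, pid)
      results[pid] = :sent
    rescue Errno::ESRCH
      results[pid] = :gone    # the process no longer exists
    rescue Errno::EPERM
      results[pid] = :denied  # not permitted to signal this process
    end
  end
end

if __FILE__ == $PROGRAM_NAME
  # Assumption: application process PIDs are given as arguments.
  pids = ARGV.map { |arg| Integer(arg) }
  send_signal(pids).each { |pid, result| puts "#{pid}: #{result}" }
end
```

Because the signal only triggers a backtrace dump, as described earlier in the thread, running this every few minutes should leave a trail of backtraces in the logs that can be inspected after a process freezes.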

@mullermp

@tubsandcans Did you investigate @FooBarWidget's response?

@tubsandcans
Author

@mullermp I don't have a reasonable way to test it. I'd have to roll back the patch and wait until our main production service falls over and then hopefully gather enough debug information to find a solution.

Neither I nor my boss is willing to do that. I was hoping someone would give me a concrete answer about whether CLOSE_WAIT sockets in supervised processes affect the Passenger supervisor's ability to re-use them. I think our long-term solution is to abandon Passenger. Unfortunately, it's a form of tech debt.

@mullermp

I don't have a dog in that fight, but I would recommend just using Puma, as it's pretty much the default for Rails applications. You should not see issues with the Ruby SDK's connection pooling.
