Broken Pipe and EOFErrors #119
Comments
Yes, AFAIK, sidekiq does set that. Also, I don't have the code to sidekiq-ent handy, but perhaps it's a problem similar to sidekiq/sidekiq#5834. I'll have a look tomorrow to see if I can spot something fishy. FYI @mperham
Question: is the vast majority coming from ...?
Seems to be all over the place.
Ok, so I had a look but couldn't spot anything wrong. I really fail to see what could have made the difference on your setup. Is the Sidekiq upgrade the only thing you shipped?
I'd say the very first step would be to double-check that reconnects do happen on your setup. I'd suggest forking the gem to add some logs here: redis-client/lib/redis_client.rb, lines 642 to 644 in ae29d65.
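(As an alternative to forking the gem, a rough sketch of logging (re)connections through redis-client's middleware interface, per the instrumentation section of the README; the module name and log wording are placeholders.)

```ruby
module ConnectionLoggingMiddleware
  # Called whenever the client (re)establishes a connection, so a burst of
  # these log lines right before the errors is a hint that reconnects happen.
  def connect(redis_config)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    connection = super
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    warn "redis-client connected to #{redis_config.host}:#{redis_config.port} in #{(elapsed * 1000).round(1)}ms"
    connection
  end
end

RedisClient.register(ConnectionLoggingMiddleware)
```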
I suspect it's fine, but you never know. Then I assume you use some third-party Redis provider? It's possible that they're failing over or things like that. Any pattern in the errors? Is it a constant stream, or are there spikes? If so, you may need to try exponential backoff retries: https://github.com/redis-rb/redis-client#exponential-backoff
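(For reference, a minimal sketch of the exponential backoff configuration described in that README section; the host and sleep durations are placeholders.)

```ruby
require "redis-client"

# Each entry in reconnect_attempts is the sleep (in seconds) before that retry,
# so three growing entries give up to three reconnect attempts with backoff.
redis_config = RedisClient.config(
  host: "redis.example.com",            # placeholder host
  reconnect_attempts: [0.05, 0.25, 1.0]
)
redis = redis_config.new_client
redis.call("PING")
```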
Any updates on this issue? I'm experiencing the same after upgrading sidekiq to 7.1.2 (which relies on redis-client 0.14.1). I'm deploying on Heroku-22.
I wonder if this could have been solved by #126. But that does require a forking setup in which |
I've also been getting a lot of
I'm seeing a similar behaviour too (both error types). More info in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17158#note_1659816466. We are fairly confident the upgrade caused the incident but unsure of exactly why. We did a rollback of the release containing the sidekiq 7 upgrade and the problem (redis errors + sidekiq apdex issues) went away. We subsequently re-deployed the release, but this time without the sidekiq 7 upgrade commit, and the metrics were all fine.
@slai11 I'd love to get to the bottom of that issue, but without experiencing it myself, it's pretty much impossible. If you figure out a repro, please share it; I'll be happy to squash that bug.
We've been seeing the same issue: Redis 6.x with TLS, redis-client 0.18, connection_pool 2.4.1. It happens to all our Redis connections, including caching and Sidekiq. I've tried messing with a few things, and I've tried to reproduce it, but so far no luck. This comment seems the closest I could find to an explanation, so maybe it could help to link it here: mperham/connection_pool#177 (comment).
I believe we're also getting the same here with redis-client 0.18.0 and connection_pool 2.4.1. We're also using TLS. It's only happening with a job that uses activejob-uniqueness and redlock though:
sidekiq/sidekiq#5929 (comment) provides more info about this case. Basically, it happens with no load on the sidekiq daemon, and some threads in sidekiq (some, not all of them) get a timeout error. redis.rb had no such issues with previous sidekiq <= 6.
I'm not sure yet if I reproduced this specific issue, but today I set up a Google Cloud Memorystore instance with Redis 6.2.13 and TLS enabled. Then I ran:

```ruby
require 'redis-client'

redis_config = RedisClient.config(host: 'redacted-redis-hostname', port: 6378, ssl: true, password: 'redacted-password')
redis = redis_config.new_client
redis.call("PING")
=> "PONG"
```

I waited an hour or so, and sent another PING, but this time I got the EPIPE:

```
irb(main):015:0> redis.call("PING")
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection.rb:76:in `rescue in write': Broken pipe (RedisClient::ConnectionError)
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection.rb:73:in `write'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/connection_mixin.rb:30:in `call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:256:in `block (2 levels) in call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/middlewares.rb:16:in `call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:255:in `block in call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:675:in `ensure_connected'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:254:in `call'
	from (irb):15:in `<main>'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/irb-1.3.5/exe/irb:11:in `<top (required)>'
	from /opt/gitlab/embedded/bin/irb:23:in `load'
	from /opt/gitlab/embedded/bin/irb:23:in `<main>'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/openssl-3.1.0/lib/openssl/buffering.rb:415:in `syswrite_nonblock': Broken pipe (Errno::EPIPE)
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/openssl-3.1.0/lib/openssl/buffering.rb:415:in `write_nonblock'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection/buffered_io.rb:64:in `block in write'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection/buffered_io.rb:63:in `loop'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection/buffered_io.rb:63:in `write'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/ruby_connection.rb:74:in `write'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/connection_mixin.rb:30:in `call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:256:in `block (2 levels) in call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/middlewares.rb:16:in `call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:255:in `block in call'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:675:in `ensure_connected'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:254:in `call'
	from (irb):15:in `<main>'
	from /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/irb-1.3.5/exe/irb:11:in `<top (required)>'
	from /opt/gitlab/embedded/bin/irb:23:in `load'
	from /opt/gitlab/embedded/bin/irb:23:in `<main>'
```

Naturally, I was curious if Google's 10-minute VPC timeout was kicking in, but the issue didn't seem to occur if I attempted a PING after 10 minutes. In addition, I verified with Wireshark that TCP keepalives appear to be sent and acked every 15 seconds, so it doesn't appear that the TCP connection is going idle. I also checked that the Redis ... I'm not sure why the socket appears to be closing like this, so I'll continue to investigate.
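(One more hedged check that could complement the Wireshark capture: ask the server for its idle-client timeout, assuming the managed instance permits `CONFIG`; `redis` is the client created in the snippet above.)

```ruby
# "timeout" is the number of seconds after which Redis closes idle client
# connections; 0 means the server never closes idle clients on its own.
redis.call("CONFIG", "GET", "timeout")
```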
Redis 6.2.8 did get it after all: redis/redis#11607
Sidekiq threads send ...
In Sidekiq 7, calls to `Sidekiq.redis` check out a connection from an internal connection pool instead of the connection pool used by the job processors. If that connection is checked out but not used, it's possible that the heartbeat thread won't keep the connection alive before the Redis server client timeout. To avoid this, update the checks to return earlier if the Redis connection is not needed. This might help reduce intermittent `Broken pipe` errors as reported in redis-rb/redis-client#119.
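(To illustrate the "return earlier" idea from that commit message, a hypothetical sketch; `work_pending?` and the key name are invented for the example.)

```ruby
# Avoid touching Sidekiq's internal Redis pool when there is nothing to do,
# so no connection gets checked out only to sit idle past the server timeout.
def report_stats
  return unless work_pending? # cheap, non-Redis predicate evaluated first

  Sidekiq.redis do |conn|
    conn.call("INCR", "stats:reports") # only check out a connection when needed
  end
end
```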
I've done some lengthy investigation on this issue as it relates to Sidekiq, and at the moment I don't believe this is a redis-client problem. I suspect some of the issues may be caused by:
I reproduced broken pipe and EOFErrors by setting the Redis ... Sidekiq 6 used a single connection pool and added 5 additional connections as leeway. However, as described in https://github.com/sidekiq/sidekiq/blob/main/docs/capsule.md#redis-pools, Sidekiq 7 created:
For those of you having trouble:
It's important to determine where this is happening first, because a call to ... For example, I was seeing these broken pipe errors from within the heartbeat thread (line numbers may be slightly off due to debugging), until I realized that ...

```
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/job.rb:20:in `should_enque?'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/job.rb:39:in `test_and_enque_for_time!'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/poller.rb:36:in `enqueue_job'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/poller.rb:23:in `block in enqueue'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/poller.rb:22:in `each'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-cron-1.9.1/lib/sidekiq/cron/poller.rb:22:in `enqueue'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/patch/sidekiq_poller.rb:9:in `block (2 levels) in enqueue'
/opt/gitlab/embedded/service/gitlab-rails/gems/gitlab-safe_request_store/lib/gitlab/safe_request_store.rb:66:in `enabling_request_store'
/opt/gitlab/embedded/service/gitlab-rails/gems/gitlab-safe_request_store/lib/gitlab/safe_request_store.rb:59:in `ensure_request_store'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/patch/sidekiq_poller.rb:8:in `block in enqueue'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/activesupport-7.0.8/lib/active_support/execution_wrapper.rb:92:in `wrap'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/activesupport-7.0.8/lib/active_support/reloader.rb:72:in `block in wrap'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/activesupport-7.0.8/lib/active_support/execution_wrapper.rb:92:in `wrap'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/activesupport-7.0.8/lib/active_support/reloader.rb:71:in `wrap'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/patch/sidekiq_poller.rb:7:in `enqueue'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/scheduled.rb:99:in `block in start'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/component.rb:10:in `watchdog'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/component.rb:19:in `block in safe_thread'

/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:672:in `ensure_connected'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client.rb:408:in `pipelined'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/redis-client-0.18.0/lib/redis_client/decorator.rb:51:in `pipelined'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:153:in `block in ❤'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/config.rb:166:in `block in redis'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:110:in `block (2 levels) in with'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `handle_interrupt'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:109:in `block in with'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `handle_interrupt'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:106:in `with'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/config.rb:163:in `redis'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/component.rb:28:in `redis'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:149:in `❤'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:99:in `beat'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:91:in `block in start_heartbeat'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:90:in `loop'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/launcher.rb:90:in `start_heartbeat'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/component.rb:10:in `watchdog'
/opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sidekiq-7.1.6/lib/sidekiq/component.rb:19:in `block in safe_thread'
```

In Sidekiq 6, the repeated ...
This may also just be due to instrumentation changes. Looking at the stack traces posted above, I see that you are using Datadog.
For the old Redis < 5 client, it just patches ...
Good point. I should have mentioned that, though I noticed many of the stack traces above suggest the problem is happening on the internal pool. That makes me wonder whether the heartbeat thread isn't keeping these internal connections alive properly.

As you mentioned somewhere else, a thread will tend to get the same connection that it used previously. I've validated this by printing out the ... In Sidekiq 6, the connection pool was touched by all job threads, so I suspect the likelihood of idling on any one connection was zero.

@mperham Should there be an option to make the heartbeat thread iterate through all connections? Right now it might be acceptable to let the Redis client reconnect gracefully, but users that have a Redis ...

I think it would also help if ...
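(The truncated "printing out the ..." above presumably refers to some connection identifier; a hypothetical way to observe per-thread connection reuse would be something like the following.)

```ruby
# Illustrative only: log which client object each thread checks out.
# Seeing the same object_id repeatedly for a thread suggests it keeps
# getting the same (possibly long-idle) connection back from the pool.
Sidekiq.redis do |conn|
  Sidekiq.logger.info("thread=#{Thread.current.object_id} redis_client=#{conn.object_id}")
end
```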
I should also mention the default ...
I don't follow.
My mistake. For some reason I thought the retry logic was in Sidekiq, but it's in ... To reiterate, I think we're likely seeing more of these ...
The problem is that it's difficult to determine whether these errors are just noise, or whether they're indicative of a more serious problem.
Right. I'll try to find some time to look at this instrumentation and see if we can hide non-final errors from it. Unless someone beats me to it, of course.
I've seen notes that Redis server has issues with OpenSSL 3. I've seen issues with Datadog's and New Relic's Redis instrumentation with the redis -> redis-client migration. It's difficult to know which bug or component is at fault. If you are seeing mysterious errors, all I can advise at this point is to upgrade everything possibly related, including Redis server itself and any gems visible in the backtrace.
@byroot I was looking into this and noticed that ... Based on my understanding, retrying the command in ... Since the instrumentation middlewares are reporting the errors, could we pass useful information like the attempt count to `middleware.call` (see redis-client/lib/redis_client.rb, line 677 in d1b9284)? The middleware authors could then handle the errors accordingly with this new information, e.g. rescue `ConnectionError` and ignore it if `attempts` is 0.
```diff
   def measure_round_trip_delay
-    ensure_connected do |connection|
-      @middlewares.call(["PING"], config) do
+    ensure_connected do |connection, attempts|
+      @middlewares.call(["PING"], config, attempts) do
         connection.measure_round_trip_delay
       end
     end
```
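(A sketch of how an instrumentation middleware could use that extra argument if the proposal above were adopted; the `attempts` parameter does not exist in released redis-client, and `report_to_apm` is a made-up placeholder.)

```ruby
module QuietTracingMiddleware
  # `attempts` is the hypothetical extra argument from the diff above.
  def call(command, config, attempts = nil)
    super
  rescue RedisClient::ConnectionError => error
    # Skip reporting first-attempt connection errors, since redis-client may
    # transparently retry them; always re-raise so that retry can happen.
    report_to_apm(error) unless attempts == 0
    raise
  end
end

RedisClient.register(QuietTracingMiddleware)
```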
We were seeing the increase, and I was able to go from an error trace to viewing the error in the APM view. From there, I could see a lot of Sidekiq job traces that had been affected with the same issue ID, and see that the status was "OK". This leads me to believe this theory is correct. Thanks for suggesting it. 😄
I opened DataDog/dd-trace-rb#3820; I'll try to work with the Datadog people to make the necessary changes so that these transient errors can be either ignored or marked as retried.
@davidgm0 I am having the same issue with redis-client 0.22 and activejob-uniqueness. Did you get to the bottom of this?
We recently switched from Sidekiq 6.5.8 to 7.1 (where there was a change to use `redis-client`) and have noticed a significant increase in errors, all coming from RedisClient. All errors follow one of three forms: ...

#116 seemed possibly related, but it looks like Sidekiq already defaults to having `reconnect_attempts` set to 1, so maybe a dead end there.

I've struggled to figure out what the actual cause is. Any ideas what direction to head on this?
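(If more aggressive reconnects turn out to help, a hedged sketch of overriding that default from the application side, assuming Sidekiq forwards this option through to RedisClient.config as the `reconnect_attempts` default mentioned above suggests; the URL env var is a placeholder.)

```ruby
# Assumption: Sidekiq passes redis options such as reconnect_attempts through
# to RedisClient.config. An array means one retry per entry, sleeping that
# many seconds before each attempt (i.e. a manual backoff schedule).
Sidekiq.configure_server do |config|
  config.redis = { url: ENV["REDIS_URL"], reconnect_attempts: [0.1, 0.5, 1.0] }
end
```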