[BUG] socket_timeout parameter has no effect if the link is broken between redis instance and the client. Outage towards a slot range for several minutes. #579

Open
jzkiss opened this issue Jul 5, 2024 · 0 comments

jzkiss commented Jul 5, 2024

Describe the bug
The socket_timeout parameter has no effect when the link between a Redis instance and the client is broken. The result is an outage for the affected slot range that lasts several minutes.

To Reproduce

  • The async client is used in cluster mode against a Redis cluster with 3 masters and 3 slaves.
  • Generate continuous traffic.
  • While the traffic is running, pick one master Redis instance and apply the following iptables rule in the container/VM of that server:
    iptables -A OUTPUT -p tcp --sport redis_port -s redis_ip -j DROP
  • Kill that Redis server (kill -9 redis_server_pid). A new master is then elected for that slot range.

Expected behavior
Once socket_timeout has elapsed, redis-plus-plus detects the broken connection, discovers the newly elected master, and redirects traffic to that master.

Actual result: the old connection keeps being used, TCP packets are retransmitted continuously, and users get no response for the given slot range for several minutes.

Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12

Additional context
Correction proposal:

  • Introduce a new per-connection property: last_response_received.
  • Failure detection: whenever a request is sent on a connection, check that (t_now - last_response_received) is below socket_timeout + TOLERATION (a few milliseconds). If it is not, treat the connection as failed.
  • When a failure is detected, be careful with CLUSTER SLOTS: do not send it to the problematic master instance, select another node instead (I have seen issues like that).
  • Check whether mastership has changed and reconnect to the new master if needed. Abort the ongoing requests and report the failure to the redis-plus-plus user, who should then retransmit the request, possibly after some delay.
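
The detection step proposed above could look roughly like the sketch below. It is only an illustration in plain C++ and not redis-plus-plus code: ConnectionHealth, on_request_sent, on_response_received and is_suspect are hypothetical names. One assumption goes beyond the proposal: silence is only counted while at least one request is unanswered, so that an idle connection is not reported as broken.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical per-connection bookkeeping (not part of redis-plus-plus).
class ConnectionHealth {
public:
    using Clock = std::chrono::steady_clock;
    using Millis = std::chrono::milliseconds;

    ConnectionHealth(Millis socket_timeout, Millis toleration, Clock::time_point now)
        : limit_(socket_timeout + toleration),
          last_response_received_(now),
          first_unanswered_sent_(now),
          pending_(0) {
        assert(socket_timeout.count() > 0 && toleration.count() >= 0);
    }

    // Call when a request is written to the connection.
    void on_request_sent(Clock::time_point now) {
        if (pending_ == 0) {
            first_unanswered_sent_ = now;  // silence starts counting here
        }
        ++pending_;
    }

    // Call when any reply arrives on the connection.
    void on_response_received(Clock::time_point now) {
        last_response_received_ = now;
        if (pending_ > 0) {
            --pending_;
        }
    }

    // True when requests are outstanding and nothing has come back for
    // longer than socket_timeout + toleration.
    bool is_suspect(Clock::time_point now) const {
        if (pending_ == 0) {
            return false;  // assumption: an idle connection is not a failure
        }
        const auto silent_since =
            std::max(last_response_received_, first_unanswered_sent_);
        return now - silent_since > limit_;
    }

private:
    Millis limit_;
    Clock::time_point last_response_received_;
    Clock::time_point first_unanswered_sent_;
    std::size_t pending_;
};
```

A caller would check is_suspect() before (or right after) sending each request. When it returns true, the caller would abort the pending requests, refresh the slot mapping through a different node, and reconnect, as described in the list above.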

As a workaround, we started a guard timer and reset the AsyncRedisClient on timeout to force redis-plus-plus to discover mastership changes and broken links. However, this workaround caused issues of its own, see:
#577
#578
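
For reference, the guard timer part of the workaround is roughly the pattern sketched below. This is a simplified illustration and not our production code. It assumes the async call hands back a std::future, get_with_guard is a hypothetical helper name, and the reset of the client on expiry is left to the caller. std::optional requires C++17.

```cpp
#include <chrono>
#include <future>
#include <optional>

// Hypothetical helper: wait for a reply for at most `guard`.
// Returns std::nullopt when the guard timer expires, so the caller can
// decide to reset the client and retry.
template <typename T>
std::optional<T> get_with_guard(std::future<T>& reply,
                                std::chrono::milliseconds guard) {
    if (reply.wait_for(guard) != std::future_status::ready) {
        return std::nullopt;  // no reply within the guard interval
    }
    return reply.get();  // rethrows if the request failed with an exception
}
```

When the helper returns std::nullopt, the caller would tear down and recreate the client object before retransmitting the request. That reset step is what led to the issues referenced above.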
