- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.6k
Don't deplete all the startup nodes after ConnectionError/TimeoutError #3697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Don't deplete all the startup nodes after ConnectionError/TimeoutError #3697
Conversation
… or TimeoutError against all nodes, rather keep one around so that retry algorithm has at least one node to work with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the cluster command execution logic to avoid removing the last remaining startup node on connection or timeout failures, ensuring retries can still proceed.
- Preserve one startup node when all others fail and wrap the original exception in a RedisClusterException
- Only remove failed nodes if more than one startup node remains
- Re-raise the appropriate exception after forcing a cluster layout reinitialization
Comments suppressed due to low confidence (2)
redis/asyncio/cluster.py:824
- [nitpick] The error message could be more descriptive and grammatically clear, e.g., 'Unable to connect to Redis Cluster: connection or timeout errors on all startup nodes'.
                        'Connection or Timeout Errors across all startup nodes'
redis/asyncio/cluster.py:820
- Add a unit test covering the scenario where only one startup node remains to ensure it isn't removed and the correct RedisClusterException is raised with the original cause.
                if len(self.nodes_manager.startup_nodes) == 1:
| ce = RedisClusterException( | ||
| 'Redis Cluster cannot be connected. ' | ||
| 'Connection or Timeout Errors across all startup nodes' | ||
| ) | ||
| ce.__cause__ = e | ||
| e = ce | 
    
      
    
      Copilot
AI
    
    
    
      Jul 14, 2025 
    
  
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Reassigning the caught exception variable e to a new exception can be confusing; consider raising the new RedisClusterException directly or using a separate variable name for clarity.
| ce = RedisClusterException( | |
| 'Redis Cluster cannot be connected. ' | |
| 'Connection or Timeout Errors across all startup nodes' | |
| ) | |
| ce.__cause__ = e | |
| e = ce | |
| raise RedisClusterException( | |
| 'Redis Cluster cannot be connected. ' | |
| 'Connection or Timeout Errors across all startup nodes' | |
| ) from e | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eoghanmurray , Hello!
Thanks for the PR!
We also encountered a similar error when redis removes all startup nodes from the pool and goes into an endless retreat.
Would you like to see the comments from Copilot?
I want to see the merged PR as soon as possible :)
| Hi @eoghanmurray, thank you for your PR! | 
Don't deplete all the startup nodes after ConnectionError/TimeoutError against all nodes, rather keep one around so that retry algorithm has at least one node to work with
Description of change
See bug report #3693