Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry the grpc connection when there's an error #503

Merged
merged 5 commits into from
Feb 23, 2024

Conversation

akondapuram
Copy link
Contributor

@akondapuram akondapuram commented Feb 9, 2024

This PR aims to fix the hot-looping problem described in Issue#502

The issue here was when the xDS-server closed the stream, the xDS-client tried to NACK the previous response and it went berserk in a hot-loop trying to fetch the configuration updates from the closed stream. This happens because the sotw.isConnErr doesn't return true in this case when the server signals the client with the following error message
rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR.

Here's the sotw.isConnErr() for reference.

So the idea of this PR is to retry the connection whenever there is an error trying to fetch the config, instead of just expecting and handling just a few error codes.

Also, added exponential backoff for retrying the connection attempts.

@akondapuram
Copy link
Contributor Author

@mattklein123 @renuka-fernando Requesting your review on this PR. Thank you.
cc: @ramaraochavali

Signed-off-by: alekhya.kondapuram <[email protected]>
Signed-off-by: alekhya.kondapuram <[email protected]>
@akondapuram akondapuram marked this pull request as ready for review February 16, 2024 02:49
Comment on lines +118 to +119
p.retryGrpcConn()
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we retry gRPC conn only for connection errors?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we run Xds Server behind Envoy, during pod shutdowns/server enforced max connection age, the client gets RESET frame like "rpc error: code = Internal desc = stream terminated by RST_STREAM with error code: NO_ERROR" which is not treated as connection failure (It is RESET). Envoy -> XDS control plane does not just retry connection failures but retries with a backup on any error for the same reason https://github.com/envoyproxy/envoy/blob/49425f55aa9212a64b3390909160c41dc22ff349/source/extensions/config_subscription/grpc/grpc_stream.h#L50

This PR just mimicks the Envoy behaviour

@akondapuram
Copy link
Contributor Author

Thank you for the approval @renuka-fernando
@mattklein123 Could you please take a look when you get some time? Thank you!

@ramaraochavali
Copy link

@mattklein123 can you please take a look and merge this if this looks good to you?

@mattklein123 mattklein123 self-assigned this Feb 22, 2024
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add documentation in the README.

src/provider/xds_grpc_sotw_provider.go Outdated Show resolved Hide resolved
@akondapuram
Copy link
Contributor Author

Thank you for your review @mattklein123. Updated per comments. Please take a second look when you're free. Thanks!

README.md Outdated Show resolved Hide resolved
Signed-off-by: alekhya.kondapuram <[email protected]>
@mattklein123 mattklein123 merged commit 19f2079 into envoyproxy:main Feb 23, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants