Add retry manager to reduce RateLimitExceeded errors #2010
Conversation
/hold Until:
Code Coverage Diff
Force-pushed from 0ec72fc to b9d7bd2
Code largely looks good; currently manually testing the changes. Will report back with lgtm if no issues surface from manual tests.
Did not run into any issues manually testing. Will apply lgtm label after Connor's feedback is addressed. I've got one non-blocking idea and question.
Idea: worth considering a helper function to apply the retryer instead of passing the retryer option inline for each EC2 API call, e.g.:
// withRetryer wraps an EC2 operation so the given adaptive retryer is applied on
// each call. Type parameters keep the operation's concrete input/output types.
func withRetryer[In any, Out any](
    fn func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error),
    retryer *retry.AdaptiveMode,
) func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error) {
    return func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error) {
        // Apply the per-API retryer by appending an options override to this call.
        optFns = append(optFns, func(o *ec2.Options) {
            o.Retryer = retryer
        })
        return fn(ctx, input, optFns...)
    }
}
and then we can use it like this:
response, err := withRetryer(c.ec2.CreateVolume, c.rm.createVolumeRetryer)(ctx, requestInput)
This way, retryer logic is centralized.
Question:
Currently, each EC2 API call has its own retryer field in the retryManager struct - is there a requirement for doing so, i.e. is that level of granularity necessary, given that a subset of API calls may have similar retry requirements?
We need each mutating EC2 API call to have its own retryer field because the SDK throttles at the level of the retryer object, not by API name. While default AWS accounts share request tokens across all mutating EC2 APIs, if a customer raises the limit for a particular mutating API (e.g. CreateVolume), that API's token bucket is no longer shared. If all calls shared the same retryer, a CreateVolume call hitting RateLimitExceeded would also cause AttachVolume calls to be throttled by the client-side adaptive retryer, even though the two APIs no longer share the same AWS request token bucket (due to the raised limits). Hope that is clearer.
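For illustration only, here is a minimal sketch of how per-API adaptive retryers could be laid out. The createVolumeRetryer field name matches the usage example above; the other field names and the constructor are assumptions rather than the PR's exact code, and retry refers to the AWS SDK for Go v2 package github.com/aws/aws-sdk-go-v2/aws/retry:

    // One adaptive-mode retryer per mutating EC2 API, so client-side throttling
    // triggered by one API does not slow down calls to the others.
    // Field names other than createVolumeRetryer are illustrative assumptions.
    type retryManager struct {
        createVolumeRetryer *retry.AdaptiveMode
        attachVolumeRetryer *retry.AdaptiveMode
        detachVolumeRetryer *retry.AdaptiveMode
        deleteVolumeRetryer *retry.AdaptiveMode
    }

    func newRetryManager() *retryManager {
        return &retryManager{
            createVolumeRetryer: retry.NewAdaptiveMode(),
            attachVolumeRetryer: retry.NewAdaptiveMode(),
            detachVolumeRetryer: retry.NewAdaptiveMode(),
            deleteVolumeRetryer: retry.NewAdaptiveMode(),
        }
    }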
Thank you for the suggestion. I personally think that a call-site option is clearer because it matches the example approach for overriding AWS SDK service operation options and avoids a function pointer. Also, we may want to add more call-site options in the future.
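For context, the call-site option style referred to here looks roughly like the sketch below, reusing the c.ec2 client and c.rm.createVolumeRetryer names from the discussion above; the error handling is illustrative:

    // Call-site override: pass an option function to the operation so the
    // adaptive retryer is applied only to this CreateVolume request.
    response, err := c.ec2.CreateVolume(ctx, requestInput, func(o *ec2.Options) {
        o.Retryer = c.rm.createVolumeRetryer
    })
    if err != nil {
        return nil, err
    }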
@AndrewSirenko Perfect, that is as clear as can be. Can we add a comment in the
Force-pushed from 4f2bef7 to 727c4c4
Force-pushed from b25ad77 to 9e8c545
Force-pushed from 9e8c545 to 9514fab
/retest
/lgtm
/unhold
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: torredil
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Is this a bug fix or adding new feature?
Feature
What is this PR about? / Why do we need it?
If the driver calls an EC2 API frequently enough within a few seconds to exceed its request token bucket's refill rate, EC2 will start returning RequestLimitExceeded client errors to each CSI RPC.
Today, the plugin as a whole is not aware when it is being rate-limited by an EC2 API, so each RPC keeps retrying its EC2 API calls even when they are likely to fail because concurrent RPCs are also failing. This produces surges of RequestLimitExceeded errors.
By introducing client-side throttling for each mutating EC2 API action via the AWS SDK's Adaptive Retry Mode, we can prevent barrages of RateLimitExceeded errors by coordinating CSI RPCs to make their EC2 calls at a slower rate until the request token buckets refill.
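For reference, the AWS SDK for Go v2 can also enable adaptive retry mode for an entire client at construction time; the sketch below shows that variant only to illustrate the SDK feature - this PR instead attaches a separate adaptive retryer per mutating API at each call site. The package name cloud and function name newEC2Client are assumptions for illustration:

    package cloud

    import (
        "context"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/aws/retry"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    // newEC2Client builds an EC2 client whose default retryer uses the SDK's
    // adaptive retry mode (client-wide, unlike the per-API retryers in this PR).
    func newEC2Client(ctx context.Context) (*ec2.Client, error) {
        cfg, err := config.LoadDefaultConfig(ctx,
            config.WithRetryer(func() aws.Retryer {
                return retry.NewAdaptiveMode()
            }),
        )
        if err != nil {
            return nil, err
        }
        return ec2.NewFromConfig(cfg), nil
    }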
What testing is done?
Measured the amount of RateLimitExceeded client errors during scalability tests with default account limits.
Note: Even with Adaptive Retry we will still see RateLimitExceeded errors, because a default AWS account only restores 5 mutating-API tokens per second. With Adaptive Retry we see ~0.6 RateLimitExceeded errors per second, which seems optimal (too low a number would mean extra latency in volume creation time due to client-side throttling). In the design document we deemed 1-2 errors per second acceptable, so this is a success.