Add retry manager to reduce RateLimitExceeded errors #2010
Conversation
/hold Until:
Code Coverage Diff
Force-pushed from 0ec72fc to b9d7bd2
Code largely looks good; currently manually testing the changes. Will report back with lgtm if no issues surface from manual tests.
Did not run into any issues manually testing. Will apply lgtm label after Connor's feedback is addressed. I've got one non-blocking idea and question.
Idea: worth considering a helper function to apply the retryer instead of passing the retryer option inline for each EC2 API call, e.g.:
// withRetryer wraps an EC2 operation so the given adaptive retryer is applied on
// each call. Type parameters keep the operation's concrete input/output types.
func withRetryer[In any, Out any](
    fn func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error),
    retryer *retry.AdaptiveMode,
) func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error) {
    return func(ctx context.Context, input In, optFns ...func(*ec2.Options)) (Out, error) {
        // Apply the per-API retryer by appending an options override to this call.
        optFns = append(optFns, func(o *ec2.Options) {
            o.Retryer = retryer
        })
        return fn(ctx, input, optFns...)
    }
}
and then we can use it like this:
response, err := withRetryer(c.ec2.CreateVolume, c.rm.createVolumeRetryer)(ctx, requestInput)
This way, retryer logic is centralized.
Question:
Currently, each EC2 API call has its own retryer field in the retryManager struct - is there a requirement for doing so, i.e. is that level of granularity necessary, given that a subset of API calls may have similar retry requirements?
We need each mutating EC2 API call to have its own retryer field because the SDK throttles at the level of the retryer object, not by API name. While default AWS accounts share request tokens across all mutating EC2 APIs, if a customer raises the limit for a particular mutating API (e.g. CreateVolume), that API's token bucket is no longer shared. If all calls shared the same retryer, a CreateVolume call hitting RateLimitExceeded would also cause AttachVolume calls to be throttled by the client-side adaptive retryer, even though the two APIs no longer share the same AWS request token bucket (due to the raised limits). Hope that is clearer.
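For illustration only, here is a minimal sketch of how per-API adaptive retryers could be laid out. The createVolumeRetryer field name matches the usage example above; the other field names and the constructor are assumptions rather than the PR's exact code, and retry refers to the AWS SDK for Go v2 package github.com/aws/aws-sdk-go-v2/aws/retry:

    // One adaptive-mode retryer per mutating EC2 API, so client-side throttling
    // triggered by one API does not slow down calls to the others.
    // Field names other than createVolumeRetryer are illustrative assumptions.
    type retryManager struct {
        createVolumeRetryer *retry.AdaptiveMode
        attachVolumeRetryer *retry.AdaptiveMode
        detachVolumeRetryer *retry.AdaptiveMode
        deleteVolumeRetryer *retry.AdaptiveMode
    }

    func newRetryManager() *retryManager {
        return &retryManager{
            createVolumeRetryer: retry.NewAdaptiveMode(),
            attachVolumeRetryer: retry.NewAdaptiveMode(),
            detachVolumeRetryer: retry.NewAdaptiveMode(),
            deleteVolumeRetryer: retry.NewAdaptiveMode(),
        }
    }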
Thank you for the suggestion. I personally think that a call-site option is clearer because it matches the example approach for overriding AWS SDK service operation options and avoids a function pointer. Also, we may want to add more call-site options in the future.
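For context, the call-site option style referred to here looks roughly like the sketch below, reusing the c.ec2 client and c.rm.createVolumeRetryer names from the discussion above; the error handling is illustrative:

    // Call-site override: pass an option function to the operation so the
    // adaptive retryer is applied only to this CreateVolume request.
    response, err := c.ec2.CreateVolume(ctx, requestInput, func(o *ec2.Options) {
        o.Retryer = c.rm.createVolumeRetryer
    })
    if err != nil {
        return nil, err
    }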
@AndrewSirenko Perfect, that is as clear as can be. Can we add a comment in the
Force-pushed from 4f2bef7 to 727c4c4
Force-pushed from b25ad77 to 9e8c545
Force-pushed from 9e8c545 to 9514fab
/retest
/lgtm
/unhold
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: torredil
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Is this a bug fix or adding new feature?
Feature
What is this PR about? / Why do we need it?
If the driver calls an EC2 API frequently enough within a few seconds to exceed its request token bucket's refill rate, EC2 will start returning RequestLimitExceeded client errors to each CSI RPC.
Today, the plugin as a whole is not aware when it is being rate-limited by an EC2 API, so each RPC keeps retrying its EC2 API calls even when they are likely to fail because concurrent RPCs are also failing. This produces surges of RequestLimitExceeded errors.
By introducing client-side throttling for each mutating EC2 API action via the AWS SDK's Adaptive Retry Mode, we can prevent barrages of RateLimitExceeded errors by coordinating CSI RPCs to make their EC2 calls at a slower rate until the request token buckets refill.
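For reference, the AWS SDK for Go v2 can also enable adaptive retry mode for an entire client at construction time; the sketch below shows that variant only to illustrate the SDK feature - this PR instead attaches a separate adaptive retryer per mutating API at each call site. The package name cloud and function name newEC2Client are assumptions for illustration:

    package cloud

    import (
        "context"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/aws/retry"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
    )

    // newEC2Client builds an EC2 client whose default retryer uses the SDK's
    // adaptive retry mode (client-wide, unlike the per-API retryers in this PR).
    func newEC2Client(ctx context.Context) (*ec2.Client, error) {
        cfg, err := config.LoadDefaultConfig(ctx,
            config.WithRetryer(func() aws.Retryer {
                return retry.NewAdaptiveMode()
            }),
        )
        if err != nil {
            return nil, err
        }
        return ec2.NewFromConfig(cfg), nil
    }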
What testing is done?
Measured the amount of RateLimitExceeded client errors during scalability tests with default account limits.
Note: Even with Adaptive Retry we will still see RateLimitExceeded errors, because a default AWS account only restores 5 mutating-API tokens per second. With Adaptive Retry we see ~0.6 RateLimitExceeded errors per second, which seems optimal (too low a number would mean extra latency in volume creation time due to client-side throttling). In the design document we deemed 1-2 errors per second acceptable, so this is a success.