Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RetryStrategy for S3 file system #9736

Closed
wants to merge 4 commits into from

Conversation

yma11
Copy link
Contributor

@yma11 yma11 commented May 7, 2024

This PR add RetryStrategy support for s3 file system, thus it will retry when connection fails. It also upgrade aws sdk to 1.11.321 which supports AdaptiveRetryStrategy which user may choose to use.

For RetryStrategy, 2 configs are added:

hive.s3.max-attempts: Maximum attempts for connections to a single http client.
hive.s3.retry-mode: 'standard', 'adaptive' or legacy, client will be created w/o retrystrategy if it's empty.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 7, 2024
Copy link

netlify bot commented May 7, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit e927721
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6666ac033cbfe600090421b5

@yma11
Copy link
Contributor Author

yma11 commented May 7, 2024

@majetideepak Can you help review this PR? cc @FelixYBW

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yma11 two minor comments. Thanks for this change!

@@ -640,6 +644,39 @@ class S3FileSystem::Impl {
return getDefaultCredentialsProvider();
}

// Return a client RetryStrategy based on the config.
std::shared_ptr<Aws::Client::RetryStrategy> getRetryStrategy() const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make getRetryStrategy() return std::optional<std::shared_ptr<Aws::Client::RetryStrategy>> and use the library default.

Copy link
Contributor Author

@yma11 yma11 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean keeping the choice that create the client without any RetryStrategy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated a little. Please take a look.

* - hive.s3.retry-mode
- string
-
- 'standard' or 'adaptive', use DefaultRetryStrategy if it's empty.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some description here on what standard and adaptive retry strategies are? Thanks!

// Otherwise, use default value 3.
return std::make_shared<Aws::Client::AdaptiveRetryStrategy>();
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the check of invalid value of retry mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -536,7 +536,16 @@ Each query can override the config by setting corresponding query session proper
- integer
-
- Maximum concurrent TCP connections for a single http client.

* - hive.s3.max-attempts
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, aws sdk uses 3, right? Let's mention this here

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 8, 2024
@FelixYBW
Copy link

FelixYBW commented May 8, 2024

@yma11 When the call client_->HeadObject or client_->GetObject fail, can we get the the retry number from outcome? If so, let's print the number in error message. So user can know it does be retried and may increase retry number.

@majetideepak majetideepak removed the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 8, 2024
Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yma11 I will approve again once @FelixYBW comment is addressed.

Comment on lines 174 to 178
if (IsRetryableHttpResponseCode(error.GetResponseCode())) { \
auto retryHint = fmt::format( \
"This request has retried {} times, you may can try increasing 'hive.s3.max-attempts'.", \
outcome.GetRetryCount()); \
errMsg.append(retryHint); \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FelixYBW Please review this change.

* - hive.s3.max-attempts
- integer
-
- Maximum attempts for connections to a single http client, work together with retry-mode. By default, it's 3 for standard/adaptive mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this document describe, default 4 for legacy mode

A default value of 4 for maximum retry attempts,

https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Get this number from code, maybe more reliable?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

their cpp_sdk and document have some difference. Let's follow sdk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe different AWS version has different default value

@@ -120,6 +120,16 @@ std::optional<uint32_t> HiveConfig::s3MaxConnections() const {
config_->get<uint32_t>(kS3MaxConnections));
}

std::optional<uint32_t> HiveConfig::s3MaxAttempts() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it is int32_t in AWS config?

@yma11 yma11 force-pushed the s3-retry branch 3 times, most recently from f052a02 to c2f4d18 Compare May 8, 2024 10:39
@yma11
Copy link
Contributor Author

yma11 commented May 8, 2024

@yma11 I will approve again once @FelixYBW comment is addressed.

Thanks @majetideepak, I have updated. Can you take a look again? There is a build failure in CI, which is caused by the old version aws sdk doesn't have GetRetryCount() API and seems version upgrade in this PR can't be reflected immediately.

@majetideepak
Copy link
Collaborator

majetideepak commented May 8, 2024

There is a build failure in CI, which is caused by the old version aws sdk doesn't have GetRetryCount() API

@yma11 Let's update the library first in a separate PR (changes inside setup-adapters.sh). That will give us the newer CI images with the updated library.

@yma11
Copy link
Contributor Author

yma11 commented May 9, 2024

There is a build failure in CI, which is caused by the old version aws sdk doesn't have GetRetryCount() API

@yma11 Let's update the library first in a separate PR (changes inside setup-adapters.sh). That will give us the newer CI images with the updated library.

Thanks for suggestion. I have created 9756 for version upgrade. Will update this PR once it's merged.

@yma11 yma11 force-pushed the s3-retry branch 2 times, most recently from 6360d25 to 00bc6ad Compare May 11, 2024 01:30
@yma11 yma11 requested a review from majetideepak May 13, 2024 01:16
@yma11
Copy link
Contributor Author

yma11 commented May 14, 2024

@majetideepak can you approve this PR again? Thanks.

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yma11 some minor comments.
Were you able to test this change on your local setup?

if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
(maxAttempts.value() > 0),
"Invalid configuration: specify 'hive.s3.max-attempts' > 0.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe print the value?
Should 0 be allowed here?
VELOX_USER_CHECK( (maxAttempts.value() >= 0), "Invalid configuration: specified 'hive.s3.max-attempts' value {} is < 0.", maxAttempts.value());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As checked in CilentConfiguration, it can be 0. Changed it to use VELOX_USER_CHECK_GE. @jinchengchenghh

} else if (retryMode.value() == "adaptive") {
if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
(maxAttempts.value() > 0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. Should 0 be allowed and specify the value in the error message.

} else if (retryMode.value() == "legacy") {
if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
(maxAttempts.value() > 0),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above. Should 0 be allowed and specify the value in the error message.

getRequestID(error.GetResponseHeaders())); \
if (IsRetryableHttpResponseCode(error.GetResponseCode())) { \
auto retryHint = fmt::format( \
" This request gets retriable response and has retried {} times, you may increase 'hive.s3.max-attempts'.", \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Request failed after retrying {} times. Try increasing the value of 'hive.s3.max-attempts'.

if (retryMode.has_value()) {
if (retryMode.value() == "standard") {
if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VELOX_USER_CHECK -> VELOX_USER_CHECK_GT

}
} else if (retryMode.value() == "adaptive") {
if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

}
} else if (retryMode.value() == "legacy") {
if (maxAttempts.has_value()) {
VELOX_USER_CHECK(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@yma11
Copy link
Contributor Author

yma11 commented May 14, 2024

@yma11 some minor comments. Were you able to test this change on your local setup?

I don't have local env for this test for now. I will add corresponding configs at Gluten side and then test it in AWS.

@majetideepak
Copy link
Collaborator

majetideepak commented May 14, 2024

I don't have local env for this test for now. I will add corresponding configs at Gluten side and then test it in AWS.

@yma11 thanks! Can you please confirm and update here? We cannot add CI tests for some of these changes, but the expectation is that the author tests the changes on their end.

const {
auto retryMode = hiveConfig_->s3RetryMode();
auto maxAttempts = hiveConfig_->s3MaxAttempts();
if (retryMode.has_value()) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user doesn't config retryMode but config maxAttempts, does maxAttempts take effect? It should takes effect since we have default retry mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. In velox, maxAttempts need work together with retryMode. If retryMode isn't configured, S3 client will be created w/o RetryStrategy. But for Gluten, we've set default value for retryMode. I updated the doc in this PR.

@FelixYBW
Copy link

FelixYBW commented May 25, 2024

@yma11 thanks! Can you please confirm and update here? We cannot add CI tests for some of these changes, but the expectation is that the author tests the changes on their end.

Confirmed it works. Retry number is passed to S3 and the query passed.

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 25, 2024
@yma11
Copy link
Contributor Author

yma11 commented Jun 3, 2024

@majetideepak can you help ping anyone who can merge this PR? Thanks.

@majetideepak
Copy link
Collaborator

@pedroerp can you please help merge this? Thanks!

@majetideepak
Copy link
Collaborator

@yma11 I pinged Pedro again. Can you please rebase with main? Sorry for the delay.

@yma11
Copy link
Contributor Author

yma11 commented Jun 10, 2024

@yma11 I pinged Pedro again. Can you please rebase with main? Sorry for the delay.

Done. Really thanks for driving merge.

@facebook-github-bot
Copy link
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@kgpai merged this pull request in 8c3630b.

Copy link

Conbench analyzed the 1 benchmark run on commit 8c3630b3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants