Add RetryStrategy for S3 file system #9736
@majetideepak Can you help review this PR? cc @FelixYBW
@yma11 two minor comments. Thanks for this change!
@@ -640,6 +644,39 @@ class S3FileSystem::Impl {
    return getDefaultCredentialsProvider();
  }

  // Return a client RetryStrategy based on the config.
  std::shared_ptr<Aws::Client::RetryStrategy> getRetryStrategy() const {
We can make `getRetryStrategy()` return `std::optional<std::shared_ptr<Aws::Client::RetryStrategy>>` and use the library default.
Do you mean keeping the option to create the client without any `RetryStrategy`?
Updated a little. Please take a look.
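The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `RetryStrategy`, `StandardRetryStrategy`, and `AdaptiveRetryStrategy` here are hypothetical stand-ins for the AWS SDK classes of the same names, and the function takes the retry mode as a parameter instead of reading it from `HiveConfig`. It covers only the two modes discussed at this point in the review.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for Aws::Client::RetryStrategy and its subclasses.
struct RetryStrategy {
  virtual ~RetryStrategy() = default;
  virtual std::string name() const = 0;
};

struct StandardRetryStrategy : RetryStrategy {
  std::string name() const override { return "standard"; }
};

struct AdaptiveRetryStrategy : RetryStrategy {
  std::string name() const override { return "adaptive"; }
};

// Returns std::nullopt when no retry mode is configured, so the caller can
// leave the client configuration untouched and rely on the library default.
// Throws on an unrecognized mode instead of silently ignoring it.
std::optional<std::shared_ptr<RetryStrategy>> getRetryStrategy(
    const std::optional<std::string>& retryMode) {
  if (!retryMode.has_value()) {
    return std::nullopt;
  }
  if (retryMode.value() == "standard") {
    return std::make_shared<StandardRetryStrategy>();
  }
  if (retryMode.value() == "adaptive") {
    return std::make_shared<AdaptiveRetryStrategy>();
  }
  throw std::invalid_argument("Invalid retry mode: " + retryMode.value());
}
```

A caller would assign the strategy to the client configuration only when the returned optional holds a value.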
velox/docs/configs.rst
Outdated
   * - hive.s3.retry-mode
     - string
     -
     - 'standard' or 'adaptive', use DefaultRetryStrategy if it's empty.
Can you add some description here on what the `standard` and `adaptive` retry strategies are? Thanks!
      // Otherwise, use default value 3.
      return std::make_shared<Aws::Client::AdaptiveRetryStrategy>();
    }
  }
Can you add a check for invalid retry mode values?
Updated.
@@ -536,7 +536,16 @@ Each query can override the config by setting corresponding query session proper
     - integer
     -
     - Maximum concurrent TCP connections for a single http client.

   * - hive.s3.max-attempts
By default, the AWS SDK uses 3, right? Let's mention this here.
@yma11 When a call to client_->HeadObject or client_->GetObject fails, can we get the retry count from the outcome? If so, let's print the count in the error message, so the user knows the request was retried and can increase the retry limit.
    if (IsRetryableHttpResponseCode(error.GetResponseCode())) { \
      auto retryHint = fmt::format( \
          "This request has retried {} times, you may can try increasing 'hive.s3.max-attempts'.", \
          outcome.GetRetryCount()); \
      errMsg.append(retryHint); \
@FelixYBW Please review this change.
   * - hive.s3.max-attempts
     - integer
     -
     - Maximum attempts for connections to a single http client, work together with retry-mode. By default, it's 3 for standard/adaptive mode
As this document describes, the default is 4 for legacy mode:
"A default value of 4 for maximum retry attempts,"
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-retries.html
Their C++ SDK and the documentation differ. Let's follow the SDK.
Maybe different AWS versions have different default values.
velox/connectors/hive/HiveConfig.cpp
Outdated
@@ -120,6 +120,16 @@ std::optional<uint32_t> HiveConfig::s3MaxConnections() const {
      config_->get<uint32_t>(kS3MaxConnections));
}

std::optional<uint32_t> HiveConfig::s3MaxAttempts() const {
Is it `int32_t` in the AWS config?
Thanks @majetideepak, I have updated the PR. Can you take a look again? There is a build failure in CI because the old version of the AWS SDK doesn't have
@yma11 Let's update the library first in a separate PR (changes inside setup-adapters.sh). That will give us the newer CI images with the updated library.
Thanks for the suggestion. I have created #9756 for the version upgrade and will update this PR once it's merged.
@majetideepak can you approve this PR again? Thanks.
@yma11 some minor comments.
Were you able to test this change on your local setup?
      if (maxAttempts.has_value()) {
        VELOX_USER_CHECK(
            (maxAttempts.value() > 0),
            "Invalid configuration: specify 'hive.s3.max-attempts' > 0.");
Maybe print the value?
Should 0 be allowed here?
VELOX_USER_CHECK( (maxAttempts.value() >= 0), "Invalid configuration: specified 'hive.s3.max-attempts' value {} is < 0.", maxAttempts.value());
As checked in `ClientConfiguration`, it can be `0`. Changed it to use `VELOX_USER_CHECK_GE`. @jinchengchenghh
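The outcome of this thread (allow 0, reject negatives, print the offending value) can be sketched as below. This is an illustrative stand-alone version, assuming a plain exception in place of Velox's `VELOX_USER_CHECK_GE` macro; `checkMaxAttempts` is a hypothetical helper name, not a function from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper standing in for a VELOX_USER_CHECK_GE call.
// An unset value is accepted; a set value must be >= 0, and the rejected
// value is included in the error message.
void checkMaxAttempts(const std::optional<int32_t>& maxAttempts) {
  if (maxAttempts.has_value() && maxAttempts.value() < 0) {
    throw std::invalid_argument(
        "Invalid configuration: specified 'hive.s3.max-attempts' value " +
        std::to_string(maxAttempts.value()) + " is < 0.");
  }
}
```

Including the value in the message lets a user spot a mistyped setting without re-reading the configuration file.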
    } else if (retryMode.value() == "adaptive") {
      if (maxAttempts.has_value()) {
        VELOX_USER_CHECK(
            (maxAttempts.value() > 0),
Same as above: should 0 be allowed? Also, specify the value in the error message.
    } else if (retryMode.value() == "legacy") {
      if (maxAttempts.has_value()) {
        VELOX_USER_CHECK(
            (maxAttempts.value() > 0),
Same as above: should 0 be allowed? Also, specify the value in the error message.
        getRequestID(error.GetResponseHeaders())); \
    if (IsRetryableHttpResponseCode(error.GetResponseCode())) { \
      auto retryHint = fmt::format( \
          " This request gets retriable response and has retried {} times, you may increase 'hive.s3.max-attempts'.", \
nit: Request failed after retrying {} times. Try increasing the value of 'hive.s3.max-attempts'.
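With the suggested wording, the retry hint logic might look like the sketch below. It is a simplified, self-contained illustration: `isRetryableHttpResponseCode` is a hypothetical stand-in for the SDK check used in the diff, and its list of status codes is an assumption made for the example; `buildErrorMessage` is likewise a hypothetical helper, using plain string concatenation where the PR uses `fmt::format` inside a macro.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the retryable-response check used in the diff.
// The set of codes below is an assumption for illustration only.
bool isRetryableHttpResponseCode(int code) {
  return code == 429 || code == 500 || code == 502 || code == 503 ||
      code == 504;
}

// Appends a retry hint to the base error message only when the response
// code is one that would have been retried.
std::string buildErrorMessage(
    const std::string& base,
    int responseCode,
    long retryCount) {
  std::string errMsg = base;
  if (isRetryableHttpResponseCode(responseCode)) {
    errMsg += " Request failed after retrying " + std::to_string(retryCount) +
        " times. Try increasing the value of 'hive.s3.max-attempts'.";
  }
  return errMsg;
}
```

For a non-retryable code such as 404, the base message is returned unchanged, so the hint appears only when raising the attempt limit could plausibly help.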
    if (retryMode.has_value()) {
      if (retryMode.value() == "standard") {
        if (maxAttempts.has_value()) {
          VELOX_USER_CHECK(
VELOX_USER_CHECK -> VELOX_USER_CHECK_GT
        }
      } else if (retryMode.value() == "adaptive") {
        if (maxAttempts.has_value()) {
          VELOX_USER_CHECK(
same
        }
      } else if (retryMode.value() == "legacy") {
        if (maxAttempts.has_value()) {
          VELOX_USER_CHECK(
same
I don't have a local environment for this test right now. I will add the corresponding configs on the Gluten side and then test it on AWS.
@yma11 thanks! Can you please confirm and update here? We cannot add CI tests for some of these changes, but the expectation is that the author tests the changes on their end.
      const {
    auto retryMode = hiveConfig_->s3RetryMode();
    auto maxAttempts = hiveConfig_->s3MaxAttempts();
    if (retryMode.has_value()) {
If the user doesn't configure retryMode but does configure maxAttempts, does maxAttempts take effect? It should take effect, since we have a default retry mode.
No. In Velox, `maxAttempts` needs to work together with `retryMode`. If `retryMode` isn't configured, the S3 client will be created without a `RetryStrategy`. But for Gluten, we've set a default value for `retryMode`. I updated the doc in this PR.
Confirmed it works. The retry number is passed to S3 and the query passed.
@majetideepak can you help ping anyone who can merge this PR? Thanks.
@pedroerp can you please help merge this? Thanks!
@yma11 I pinged Pedro again. Can you please rebase with main? Sorry for the delay.
Done. Thanks a lot for driving the merge.
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
This PR adds `RetryStrategy` support for the S3 file system, so that requests are retried when a connection fails. It also upgrades the AWS SDK to 1.11.321, which supports `AdaptiveRetryStrategy`, which users may choose to use.
For `RetryStrategy`, 2 configs are added:
- `hive.s3.max-attempts`: Maximum attempts for connections to a single http client.
- `hive.s3.retry-mode`: 'standard', 'adaptive' or 'legacy'; the client will be created without a retry strategy if it's empty.