@corpoverlords
Contributor

@corpoverlords corpoverlords commented Oct 21, 2025

Rationale for this change

This would allow Arrow to coexist with other libraries that also use the AWS SDK. For about two years now, the AWS SDK has had an internal refcount mechanism for InitAPI and has supported re-init after deinit.

  • Introduced a mutex for thread-safe re-initialization of the S3 client after finalization, replacing the previous std::call_once mechanism.
  • Added an Initialize method to reset the finalized state of the S3 client.
  • Updated EnsureInitialized to allow re-initialization while ensuring thread safety.

This change improves the flexibility and safety of the S3 client lifecycle management.
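
For illustration, the mutex-guarded lifecycle described above could look roughly like the sketch below. It is not the actual diff: `S3Lifecycle`, `FakeSdkInit` and `FakeSdkShutdown` are hypothetical names, and the fake functions stand in for the real AWS SDK init/shutdown calls so the example compiles without the SDK.

```cpp
#include <atomic>
#include <mutex>

namespace reinit_sketch {

// Hypothetical stand-ins for the AWS SDK init/shutdown calls.
// They only count invocations so the behavior can be observed.
static int g_init_calls = 0;
static int g_shutdown_calls = 0;
inline void FakeSdkInit() { ++g_init_calls; }
inline void FakeSdkShutdown() { ++g_shutdown_calls; }

// Assumed shape of the lifecycle state: a mutex plus an atomic flag,
// instead of std::call_once (which can only ever run once per flag).
class S3Lifecycle {
 public:
  // Returns true if this call performed the initialization.
  bool EnsureInitialized() {
    std::lock_guard<std::mutex> lock(init_mutex_);
    if (is_initialized_.load()) return false;  // already up, nothing to do
    FakeSdkInit();
    is_initialized_.store(true);
    return true;
  }

  // Returns true if this call performed the shutdown.
  bool Finalize() {
    std::lock_guard<std::mutex> lock(init_mutex_);
    if (!is_initialized_.load()) return false;  // nothing to tear down
    FakeSdkShutdown();
    is_initialized_.store(false);  // leaves the door open for re-init
    return true;
  }

  bool is_initialized() const { return is_initialized_.load(); }

 private:
  std::mutex init_mutex_;
  std::atomic<bool> is_initialized_{false};
};

}  // namespace reinit_sketch
```

With this shape, calling `EnsureInitialized()` after `Finalize()` initializes again, which a `std::once_flag` cannot express.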

What changes are included in this PR?

S3FS init changes

Are these changes tested?

Tested with our local infra. I can add a unit test in a dedicated cpp file (due to init/deinit usage) if wanted.

Are there any user-facing changes?

No

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

@corpoverlords corpoverlords changed the title from "Allow S3 Filesystem Re-initialization" to "GH-47904: [C++] [Python] Allow S3 Filesystem Re-initialization" Oct 21, 2025
@github-actions

⚠️ GitHub issue #47904 has been automatically assigned in GitHub to PR creator.

@pitrou
Member

pitrou commented Oct 30, 2025

Thanks for doing this, @corpoverlords! It seems that the Python tests need updating now that re-initialization is allowed. Can you take a look?

std::lock_guard<std::mutex> lock(init_mutex_);

if (!is_initialized_.load()) {
  // Not already initialized, allow re-initialization after finalization

Can we perhaps only allow it if the AWS SDK version is recent enough?

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Oct 30, 2025
@corpoverlords
Contributor Author

Hi @pitrou, thanks for the review! After using this internally for a week, I think the entire AWS SDK initialization mechanism probably needs revising. Since the AWS SDK now supports a reference count, we can let the S3FileSystem object itself hold a reference to a global lock object. Based on https://github.com/aws/aws-sdk-cpp/blob/main/src/aws-cpp-sdk-core/include/aws/core/Aws.h#L293, we should also use a dedicated background thread for AWS SDK init/deinit. Please let me know what you think :)

@corpoverlords
Contributor Author

What I think can be done:

  • AwsInitHolder: a global static singleton in Arrow that holds Arrow's refcount on the AWS SDK.
  • AwsInitLock: a user-declarable RAII object. Classes that touch S3, for example S3FileSystem, hold it as a member to keep a reference to AwsInitHolder. Users can also declare one (inside main, for example) to reduce AWS init/deinit overhead.
  • AwsInitHolder holds a reference to a thread handle and signals that thread to deinit the AWS SDK when the refcount reaches zero.
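
A rough sketch of how the first two pieces might fit together is below. The class names come from the list above, but their members, `FakeS3FileSystem`, and the `FakeSdkInit`/`FakeSdkShutdown` stand-ins are assumptions made for illustration. The dedicated background thread is deliberately left out: init and shutdown run synchronously under the mutex here, which a real version would replace with a signal to that thread.

```cpp
#include <mutex>

namespace refcount_sketch {

// Hypothetical stand-ins for the AWS SDK init/shutdown calls.
static int g_init_calls = 0;
static int g_shutdown_calls = 0;
inline void FakeSdkInit() { ++g_init_calls; }
inline void FakeSdkShutdown() { ++g_shutdown_calls; }

// Global singleton holding Arrow's reference count on the SDK.
class AwsInitHolder {
 public:
  static AwsInitHolder& Instance() {
    static AwsInitHolder holder;  // thread-safe initialization since C++11
    return holder;
  }

  void Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (refcount_++ == 0) FakeSdkInit();  // first user brings the SDK up
  }

  void Release() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--refcount_ == 0) FakeSdkShutdown();  // last user tears it down
  }

  int refcount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return refcount_;
  }

 private:
  AwsInitHolder() = default;
  std::mutex mutex_;
  int refcount_ = 0;
};

// RAII handle: each live instance accounts for exactly one reference.
class AwsInitLock {
 public:
  AwsInitLock() { AwsInitHolder::Instance().Acquire(); }
  AwsInitLock(const AwsInitLock&) { AwsInitHolder::Instance().Acquire(); }
  // Both sides already hold a reference, so assignment changes nothing.
  AwsInitLock& operator=(const AwsInitLock&) { return *this; }
  ~AwsInitLock() { AwsInitHolder::Instance().Release(); }
};

// Stand-in for S3FileSystem: holding the lock as a member keeps the
// SDK initialized for as long as the filesystem object is alive.
struct FakeS3FileSystem {
  AwsInitLock init_lock;
};

}  // namespace refcount_sketch
```

An `AwsInitLock` declared at the top of `main` would keep the count above zero for the whole program, so short-lived filesystem objects would not trigger repeated init/deinit cycles.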
