Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use AWS C++ SDK for communicating with S3 #149

Draft
wants to merge 69 commits into
base: master
Choose a base branch
from

Conversation

sjperkins
Copy link
Contributor

@sjperkins sjperkins commented Apr 2, 2024

Based on this discussion:

TODO:

@sjperkins sjperkins marked this pull request as draft April 2, 2024 11:40
Copy link
Contributor Author

@sjperkins sjperkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've included the tensorflow io deps into bazel. The bulk of the work needs to be done, but it's probably worth reviewing how the dependency representation has been done as I'm not very experienced with bazel. I've tried to replicate the tensorflow io deps with tensorstore's workplace flavour.

FWIW it builds locally and hopefully the dummy s3 dep I've added to the s3_request_builder works on CI.

"aws-cpp-sdk-core/source/*.cpp", # AWS_SOURCE
"aws-cpp-sdk-core/source/external/tinyxml2/*.cpp", # AWS_TINYXML2_SOURCE
"aws-cpp-sdk-core/source/external/cjson/*.cpp", # CJSON_SOURCE
"aws-cpp-sdk-core/source/auth/*.cpp", # AWS_AUTH_SOURCE
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively, I assume the challenge will be to see how much of the source can be excised, while retaining the auth capalities.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have tinyxml2, so it would be good to remove that if we can.

:)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, lets see whats possible here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the tinyxml2 vendored into the AWS c++ sdk has been patched. Quoting from https://github.com/aws/aws-sdk-cpp/blob/d79ece080eba11212b10a857ae551a4b8423fe00/src/aws-cpp-sdk-core/include/aws/core/external/tinyxml2/tinyxml2.h#L23-L28.

  (1) Memory management operations use aws memory management api
  (2) Import-export preprocessor logic tweaked for better integration into core library
  (3) Wrapped everything in Amazon namespace to prevent static linking issues if the user includes a version of this code through another dependency

Would it be reasonable to use the AWS C++ SDK version of tinyxml2, instead of the vanilla version currently used with tensorflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naively, I assume the challenge will be to see how much of the source can be excised, while retaining the auth capalities.

Naively. So everything in the core library is tightly coupled. It doesn't seem possible to easily excise portions of the core library out. However, here are the .so binary sizes after stripping:

$ ~/code/tensorstore/bazel-bin/external$ ls -lh com_github_aws_cpp_sdk/libcore.so
-rwxr-xr-x 1 simon simon 3.8M Apr  3 15:42 com_github_aws_cpp_sdk/libcore.so

$ ls -lh com_github_aws_c_common/libaws_c_common.so
-rwxr-xr-x 1 simon simon 193K Apr  3 15:40 com_github_aws_c_common/libaws_c_common.so

$ ls -lh com_github_aws_c_event_stream/libaws_c_event_stream.so
-rwxr-xr-x 1 simon simon 35K Apr  3 15:40 com_github_aws_c_event_stream/libaws_c_event_stream.so

$ ls -lh com_github_aws_checksums/libaws_checksums.so
-rwxr-xr-x 1 simon simon 43K Apr  3 15:40 com_github_aws_checksums/libaws_checksums.so

Lets round that up and say that it costs ~4.1MB total to get the AWS SDK code functionality. Considering that some of the functionality is already replicated already in S3RequestBuilder, depending on core may not be a bad tradeoff in terms of code maintenance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases outside tests the AWS libraries would be statically linked to tensorstore and the linker may automatically remove most unused code anyway.

Copy link
Contributor Author

@sjperkins sjperkins Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In most cases outside tests the AWS libraries would be statically linked to tensorstore and the linker may automatically remove most unused code anyway.

Yes, the .so sizes would only establish a rough upper bound on included code size. I'm trying to estimate before making significant code changes, but I guess using the code and inspecting the final wheel (for e.g.) is the only way to be sure.

libs3.so is 9.2MB stripped.

…tream BUILD. This aligns the BUILD more closely with tensorflow/io
@sjperkins
Copy link
Contributor Author

sjperkins commented Apr 3, 2024

So a first order strategy here might be:

and in S3RequestBuilder::BuildRequest:

  • Create a dummy Aws::Http::StandardHttpRequest.
    • Aws::String objects are simply typedef std::basic_strings so that shouldn't be an issue.
    • An Adapter will be needed to represent the absl::Cord payload as an Aws::Iostream.
  • Sign the Aws::Http::StandardHttpRequest with an AwsAuthV4Signer using the Aws::Auth::DefaultCredentialProviderChain.
  • Get the Authorization Header off the Aws::Http::StandardHttpRequest and use in tensorstore::http::HttpRequest
  • Resolve S3 Endpoints with Aws::S3EndPoint::S3EndPointProvider (apart from the core sdk dependency this would require a dependency on the s3 sdk).

I guess at some point the S3 Encryption API will become relevant.

Edit: Updated with EndPointProvider dependency.

@laramiel
Copy link
Collaborator

I don't think that we want rules_foreign_cc; internally we will need to build with bazel anyway.

@sjperkins
Copy link
Contributor Author

sjperkins commented Jun 24, 2024

I don't think that we want rules_foreign_cc; internally we will need to build with bazel anyway.

I see. Based on a brief look through aws-c-common HEAD (only), one concern was generating the config.h.in file

In a strict sense, it seems that a pure bazel approach would need to customise the above for combinations of different platforms and architectures, for each dependency, perhaps through the use of write_file and @//platforms as done here: https://github.com/google/tensorstore/pull/149/files#diff-cee7f961ceb075b160fbdd37a847f8f68be0872f932924267589a78d4dbe8c57R175-R203

This seems doable for one repo. One concern is that a trial and error approach to this might introduce subtle bugs into the SDK build (e.g. builds with newer instructions sets, although these could be disabled for the sake of safety).

However, I'm probably overthinking this. Did the configs of the above dependencies require extensive customisation in the bazel build mentioned here?

@laramiel
Copy link
Collaborator

laramiel commented Jun 25, 2024

Yes. There are only, I think, about 2 config.h.in style files in the aws repositories that we need, this one in aws_c_common and another in aws_cpp_sdk/aws_cpp_sdk_core or something.

I think that the starting point for c-common is something like

# Configured for Linux x86_64.
_SUBS = {
    "#cmakedefine AWS_HAVE_GCC_OVERFLOW_MATH_EXTENSIONS": "#define AWS_HAVE_GCC_OVERFLOW_MATH_EXTENSIONS",
    "#cmakedefine AWS_HAVE_GCC_INLINE_ASM": "#define AWS_HAVE_GCC_INLINE_ASM",
    "#cmakedefine AWS_HAVE_MSVC_INTRINSICS_X64": "",
    "#cmakedefine AWS_HAVE_POSIX_LARGE_FILE_SUPPORT": "#define AWS_HAVE_POSIX_LARGE_FILE_SUPPORT",
    "#cmakedefine AWS_HAVE_EXECINFO": "#define AWS_HAVE_EXECINFO",
    "#cmakedefine AWS_HAVE_WINAPI_DESKTOP": "",
    "#cmakedefine AWS_HAVE_LINUX_IF_LINK_H": "#define AWS_HAVE_LINUX_IF_LINK_H",
    "#cmakedefine AWS_HAVE_AVX2_INTRINSICS": "#define AWS_HAVE_AVX2_INTRINSICS",
    "#cmakedefine AWS_HAVE_AVX512_INTRINSICS": "#define AWS_HAVE_AVX512_INTRINSICS",
    "#cmakedefine AWS_HAVE_MM256_EXTRACT_EPI64": "#define AWS_HAVE_MM256_EXTRACT_EPI64",
    "#cmakedefine AWS_HAVE_CLMUL": "#define AWS_HAVE_CLMUL",
    "#cmakedefine AWS_HAVE_ARM32_CRC": "",
    "#cmakedefine AWS_HAVE_ARMv8_1": "",
    "#cmakedefine AWS_ARCH_ARM64": "",
    "#cmakedefine AWS_ARCH_INTEL": "#define AWS_ARCH_INTEL",
    "#cmakedefine AWS_ARCH_INTEL_X64": "#define AWS_ARCH_INTEL_X64",
    "#cmakedefine AWS_USE_CPU_EXTENSIONS": "#define AWS_USE_CPU_EXTENSIONS",
}

But we could probably just write the config file directly, as we do in se_curl

Or we could specialize the individual cmake lines like we do in jpeg

Or maybe we pre-generate files, like we do in cares, although the per-system #cmakedefine exists there as well.

... for most of the builds where I had to generate this, I mostly ran the cmake in windows, linux, and on my arm-mac, and did a 3-way merge to see what was common and what was different. Thinking about it now, I think that there is enough commonality here that we could have a library for it.

@laramiel
Copy link
Collaborator

... it might be nice to contribute aws rules to something like the bazel-central repository: https://github.com/bazelbuild/bazel-central-registry/tree/main/modules

But that happened after we started, so we haven't done that with our other modules, and we use the old-style workspace approach.

@sjperkins
Copy link
Contributor Author

sjperkins commented Jul 3, 2024

The build now includes more recent versions of the AWS C, CRT and C++ SDKs.

FWIW the s3_sdk localstack test cases pass locally on linux, but this PR hasn't been tested on Windows or Mac. Unfortunately, I may not have extra bandwidth (or easy access to hardware) for these environments. In these cases, best-guesses have been applied in the BUILD scripts, but require further scrutiny. Might it be possible to try this out on your end and iron out the differences in review?

There's also probably scope to use the aws-cpp-sdk-s3-crt for improved performance, instead of the original aws-cpp-sdk-s3.

@sjperkins sjperkins force-pushed the depend-on-aws-cpp-sdk-for-auth branch from fd75b35 to 3883f04 Compare July 3, 2024 12:25
@laramiel
Copy link
Collaborator

laramiel commented Jul 9, 2024

I've been getting internal imports of these aws c libraries; I wonder if we should not include the aws_cpp_sdk libraries at all, and just limit the use to the aws_crt_cpp libraries; it feels like the aws-cpp-sdk-s3-crt library belongs in aws_crt_cpp...

Mostly, I think, we want the aws_c_auth capabilities to sign requests.

@sjperkins
Copy link
Contributor Author

sjperkins commented Jul 10, 2024

I wonder if we should not include the aws_cpp_sdk libraries at all, and just limit the use to the aws_crt_cpp libraries; it feels like the aws-cpp-sdk-s3-crt library belongs in aws_crt_cpp...

I think aws-cpp-sdk-s3-crt has a dependency on aws-cpp-sdk-core so it's still seems to be part of the AWS C++ SDK library,

https://github.com/aws/aws-sdk-cpp/blob/77b525c975bb2e9840661d4f5beaf5171c8c3673/generated/src/aws-cpp-sdk-s3-crt/include/aws/s3-crt/S3CrtClient.h#L7-L22

and aws-cpp-sdk-core has a dependency on the aws-cpp-crt

https://github.com/aws/aws-sdk-cpp/blob/77b525c975bb2e9840661d4f5beaf5171c8c3673/src/aws-cpp-sdk-core/include/aws/core/Aws.h#L16-L17

Mostly, I think, we want the aws_c_auth capabilities to sign requests.

Without having looked at the internals of aws_c_auth, this seems plausible.

The benefit of using aws-cpp-sdk-s3-crt would be depending on reference implementations of:

  • retry logic
  • xml response handling
  • endpoint setup
  • maybe other encryption-related functionality like AWS KMS

which may be preferable to maintaining tensorstore implementations of the above.

Which approach do you favour?

@sjperkins
Copy link
Contributor Author

sjperkins commented Jul 10, 2024

I'm travelling in July but I'll try find space to swap the S3CrtClient into the test cases.

@laramiel
Copy link
Collaborator

I'm currently getting these aws libraries up to aws-crt-cpp imported into our internal build system. I think that I'd like to start with the aws-crt-cpp / aws_c_s3 libraries.

The aws-cpp-sdk-s3-crt library seems to be built on the aws-c-s3 + aws-crt-cpp along with autogenerated mechanisms to parse the xml responses (in the model directory). However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

To that end I may import the third_party bits that you have here. If you had time to separate that into a distinct pull request it would be great.

@sjperkins
Copy link
Contributor Author

I'm currently getting these aws libraries up to aws-crt-cpp imported into our internal build system. I think that I'd like to start with the aws-crt-cpp / aws_c_s3 libraries.

The aws-cpp-sdk-s3-crt library seems to be built on the aws-c-s3 + aws-crt-cpp along with autogenerated mechanisms to parse the xml responses (in the model directory). However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

To that end I may import the third_party bits that you have here. If you had time to separate that into a distinct pull request it would be great.

Sure. I'll aim to make that available by the end of this week,

@laramiel
Copy link
Collaborator

... I actually plan on getting the third_party working since it needs to work for windows / mac.
So just stay tuned.

Basic changes that I've made:

All the aws libraries now start with aws_... rather than github_com_aws...

The BUILD files are all slightly incorrect. Most of them use something like srcs = glob([ "includes/**/*.h", "sources/**/*.c" ]), however the "public" .h files should be globbed into hdrs = ...

Then there are some other changes to make aws_c_common / aws_c_io work based on the CMakeLists.txt files for those repositories.

@sjperkins
Copy link
Contributor Author

sjperkins commented Jul 11, 2024

... I actually plan on getting the third_party working since it needs to work for windows / mac.
So just stay tuned.

OK, just to overcommunicate: I take this to mean to hold off on the aws third_party PR. Let me know if I'm misinterpreting, otherwise I'll wait on your changes :-)

However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt.

I had a closer look at aws-c-s3. It does advertise retry logic and there's also endpoint resolution code. Given the reduced dependency footprint I could see why this would be an appealing direction to take.

Edit: It's also got some xml response parsing: https://github.com/awslabs/aws-c-s3/blob/51e24febe7542466a998a1e8fee7ac3785084f55/source/s3_list_parts.c#L57-L160

@laramiel
Copy link
Collaborator

laramiel commented Jul 11, 2024

Yes; I published my WIP here for now: #177

I think that it's pretty close.

copybara-service bot pushed a commit that referenced this pull request Jul 12, 2024
This was heavily modified from the original GitHub PR #149

PiperOrigin-RevId: 651917637
Change-Id: I484addac73efc8a633683224f378db565ba230c1
@sjperkins
Copy link
Contributor Author

sjperkins commented Jul 14, 2024

Thanks, that build looks more comprehensive. Regarding the aws-c-s3 approach, I could try a PR on this when I return at the start of August. If that timeline is too far out for any deliverables, please don't wait on me :-)

I'll aim to integrate those libraries into this branch at about the same time.

@laramiel
Copy link
Collaborator

I don't have a timeline for updating s3; I work on it in between other things, somewhat like you do.

Anyway, some pointers / ideas.

The aws-c-s3 github repository has an s3 sample which covers most of what we need to do with s3: reading and listing.
https://github.com/awslabs/aws-c-s3/blob/main/samples/s3/

@sjperkins
Copy link
Contributor Author

sjperkins commented Aug 12, 2024

I could try a PR on this when I return at the start of August.

Apologies, I haven't had time to get to this due to time off sick and conferences. Hoping to pick it up next week.

@sjperkins
Copy link
Contributor Author

sjperkins commented Aug 21, 2024

@sjperkins
Copy link
Contributor Author

Apologies, I haven't had time to get to this due to time off sick and conferences. Hoping to pick it up next week.

I need to get some releases out. I'm still interested, but I'm not sure when I'll have space in the near future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants