-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use AWS C++ SDK for communicating with S3 #149
base: master
Are you sure you want to change the base?
Use AWS C++ SDK for communicating with S3 #149
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've included the tensorflow io deps into bazel. The bulk of the work needs to be done, but it's probably worth reviewing how the dependency representation has been done as I'm not very experienced with bazel. I've tried to replicate the tensorflow io deps with tensorstore's workplace flavour.
FWIW it builds locally and hopefully the dummy s3 dep I've added to the s3_request_builder works on CI.
"aws-cpp-sdk-core/source/*.cpp", # AWS_SOURCE | ||
"aws-cpp-sdk-core/source/external/tinyxml2/*.cpp", # AWS_TINYXML2_SOURCE | ||
"aws-cpp-sdk-core/source/external/cjson/*.cpp", # CJSON_SOURCE | ||
"aws-cpp-sdk-core/source/auth/*.cpp", # AWS_AUTH_SOURCE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively, I assume the challenge will be to see how much of the source can be excised, while retaining the auth capalities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have tinyxml2, so it would be good to remove that if we can.
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, lets see whats possible here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the tinyxml2
vendored into the AWS c++ sdk has been patched. Quoting from https://github.com/aws/aws-sdk-cpp/blob/d79ece080eba11212b10a857ae551a4b8423fe00/src/aws-cpp-sdk-core/include/aws/core/external/tinyxml2/tinyxml2.h#L23-L28.
(1) Memory management operations use aws memory management api
(2) Import-export preprocessor logic tweaked for better integration into core library
(3) Wrapped everything in Amazon namespace to prevent static linking issues if the user includes a version of this code through another dependency
Would it be reasonable to use the AWS C++ SDK version of tinyxml2, instead of the vanilla version currently used with tensorflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively, I assume the challenge will be to see how much of the source can be excised, while retaining the auth capalities.
Naively. So everything in the core library is tightly coupled. It doesn't seem possible to easily excise portions of the core library out. However, here are the .so
binary sizes after stripping:
$ ~/code/tensorstore/bazel-bin/external$ ls -lh com_github_aws_cpp_sdk/libcore.so
-rwxr-xr-x 1 simon simon 3.8M Apr 3 15:42 com_github_aws_cpp_sdk/libcore.so
$ ls -lh com_github_aws_c_common/libaws_c_common.so
-rwxr-xr-x 1 simon simon 193K Apr 3 15:40 com_github_aws_c_common/libaws_c_common.so
$ ls -lh com_github_aws_c_event_stream/libaws_c_event_stream.so
-rwxr-xr-x 1 simon simon 35K Apr 3 15:40 com_github_aws_c_event_stream/libaws_c_event_stream.so
$ ls -lh com_github_aws_checksums/libaws_checksums.so
-rwxr-xr-x 1 simon simon 43K Apr 3 15:40 com_github_aws_checksums/libaws_checksums.so
Lets round that up and say that it costs ~4.1MB total
to get the AWS SDK code functionality. Considering that some of the functionality is already replicated already in S3RequestBuilder, depending on core
may not be a bad tradeoff in terms of code maintenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases outside tests the AWS libraries would be statically linked to tensorstore and the linker may automatically remove most unused code anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In most cases outside tests the AWS libraries would be statically linked to tensorstore and the linker may automatically remove most unused code anyway.
Yes, the .so
sizes would only establish a rough upper bound on included code size. I'm trying to estimate before making significant code changes, but I guess using the code and inspecting the final wheel (for e.g.) is the only way to be sure.
libs3.so
is 9.2MB
stripped.
…tream BUILD. This aligns the BUILD more closely with tensorflow/io
So a first order strategy here might be:
and in
I guess at some point the S3 Encryption API will become relevant. Edit: Updated with EndPointProvider dependency. |
I don't think that we want rules_foreign_cc; internally we will need to build with bazel anyway. |
I see. Based on a brief look through In a strict sense, it seems that a pure bazel approach would need to customise the above for combinations of different platforms and architectures, for each dependency, perhaps through the use of This seems doable for one repo. One concern is that a trial and error approach to this might introduce subtle bugs into the SDK build (e.g. builds with newer instructions sets, although these could be disabled for the sake of safety). However, I'm probably overthinking this. Did the configs of the above dependencies require extensive customisation in the bazel build mentioned here? |
Yes. There are only, I think, about 2 config.h.in style files in the aws repositories that we need, this one in aws_c_common and another in aws_cpp_sdk/aws_cpp_sdk_core or something. I think that the starting point for c-common is something like
But we could probably just write the config file directly, as we do in se_curl Or we could specialize the individual cmake lines like we do in jpeg Or maybe we pre-generate files, like we do in cares, although the per-system #cmakedefine exists there as well. ... for most of the builds where I had to generate this, I mostly ran the cmake in windows, linux, and on my arm-mac, and did a 3-way merge to see what was common and what was different. Thinking about it now, I think that there is enough commonality here that we could have a library for it. |
... it might be nice to contribute aws rules to something like the bazel-central repository: https://github.com/bazelbuild/bazel-central-registry/tree/main/modules But that happened after we started, so we haven't done that with our other modules, and we use the old-style workspace approach. |
The build now includes more recent versions of the AWS C, CRT and C++ SDKs. FWIW the There's also probably scope to use the |
fd75b35
to
3883f04
Compare
I've been getting internal imports of these aws c libraries; I wonder if we should not include the aws_cpp_sdk libraries at all, and just limit the use to the aws_crt_cpp libraries; it feels like the aws-cpp-sdk-s3-crt library belongs in aws_crt_cpp... Mostly, I think, we want the aws_c_auth capabilities to sign requests. |
I think and
Without having looked at the internals of The benefit of using
which may be preferable to maintaining tensorstore implementations of the above. Which approach do you favour? |
I'm travelling in July but I'll try find space to swap the S3CrtClient into the test cases. |
I'm currently getting these aws libraries up to aws-crt-cpp imported into our internal build system. I think that I'd like to start with the aws-crt-cpp / aws_c_s3 libraries. The aws-cpp-sdk-s3-crt library seems to be built on the aws-c-s3 + aws-crt-cpp along with autogenerated mechanisms to parse the xml responses (in the model directory). However it's likely that the raw aws-c-s3 libraries will have most of the official retry, etc. done without the xml response parsing, and I'd like to target that as a first attempt. To that end I may import the third_party bits that you have here. If you had time to separate that into a distinct pull request it would be great. |
Sure. I'll aim to make that available by the end of this week, |
... I actually plan on getting the third_party working since it needs to work for windows / mac. Basic changes that I've made: All the aws libraries now start with The BUILD files are all slightly incorrect. Most of them use something like Then there are some other changes to make aws_c_common / aws_c_io work based on the CMakeLists.txt files for those repositories. |
OK, just to overcommunicate: I take this to mean to hold off on the aws third_party PR. Let me know if I'm misinterpreting, otherwise I'll wait on your changes :-)
I had a closer look at Edit: It's also got some xml response parsing: https://github.com/awslabs/aws-c-s3/blob/51e24febe7542466a998a1e8fee7ac3785084f55/source/s3_list_parts.c#L57-L160 |
Yes; I published my WIP here for now: #177 I think that it's pretty close. |
This was heavily modified from the original GitHub PR #149 PiperOrigin-RevId: 651917637 Change-Id: I484addac73efc8a633683224f378db565ba230c1
Thanks, that build looks more comprehensive. Regarding the I'll aim to integrate those libraries into this branch at about the same time. |
I don't have a timeline for updating s3; I work on it in between other things, somewhat like you do. Anyway, some pointers / ideas. The aws-c-s3 github repository has an s3 sample which covers most of what we need to do with s3: reading and listing. |
Apologies, I haven't had time to get to this due to time off sick and conferences. Hoping to pick it up next week. |
S3 now supports conditional writes, although other implementations may not: |
I need to get some releases out. I'm still interested, but I'm not sure when I'll have space in the near future. |
Based on this discussion:
TODO:
[x] Integrate AWS CPP SDK #defines into aconfig.h
file variant. Use AWS C++ SDK for communicating with S3 #149 (comment)