Disable MultiPart transfer #110

Open
sadortun opened this issue Jul 4, 2019 · 8 comments · May be fixed by #112

Comments

sadortun commented Jul 4, 2019

Hi!

When using AWS_ENDPOINT_URL to connect to Google Storage, the backup fails, because Google Storage does not support multipart transfers.

Would it be possible to add an option to either disable it, or to detect that the endpoint is https://storage.googleapis.com and disable it automatically?

Thanks,
Samuel

Author

sadortun commented Jul 4, 2019

Ahh, while digging through the sources, I found AWS_ENDPOINT_OPT, which is not documented. Will try it!

Author

sadortun commented Jul 4, 2019

Unfortunately, it seems that awscli only supports multipart.

Would you be OK with replacing awscli with s3cmd?

Collaborator

deitch commented Jul 4, 2019

Yes, I believe that aws s3 cp ... automatically switches to multipart upload if the file is beyond a certain size.

I find it surprising that Google Cloud Storage doesn't support multipart upload. The docs seem to imply that it does.

If it doesn't, we can switch to the low-level s3api over the aws s3 ... CLI. But can we have a test case showing that it doesn't?

Author

sadortun commented Jul 4, 2019

Google calls the S3-like access "Interoperability", and the docs can be found here.

I am still trying to find an appropriate source. The only one I have found so far is https://stackoverflow.com/questions/27830432/google-cloud-storage-support-of-s3-multipart-upload

Collaborator

deitch commented Jul 4, 2019

I definitely am fine with modifying it so it can be compatible with both GCP and AWS, but need a proper full understanding of the structure.

I don't mind terribly having an option that forces single upload over multipart, so that the default behaviour is like aws s3 cp ..., which calculates on its own whether to do multipart, but you can override and force it to use a single upload (presuming that the endpoint supports it).

That having been said:

  • we need to get a clear set of tests to show when and where multipart works and where it does not. I think it should work with GCP, but there are conflicting documents (as you pointed out)
  • we need to understand what the default "break-up" point is for aws s3 cp ... so we can mimic that behaviour
  • we need to understand if the default behaviour is parallel multipart? Or sequential? What does the aws s3 cp ... command actually do? I could see theoretical performance benefits for parallel, but survivability (of lost connection) benefits even in sequential.

In other words, trying to mimic the default behaviour as closely as possible. We always can relax those constraints later.
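
A rough sketch of the override described above, for illustration only. The function name `choose_upload_mode`, the `force_single` parameter and the constant are hypothetical. The 8 MB value is an assumption based on the AWS CLI's documented default for its `multipart_threshold` setting, and the `>=` comparison is likewise an assumption; both should be checked against the CLI version actually in use.

```python
# Hypothetical sketch: pick a single PUT or a multipart upload for a file.
# Assumption: 8 MB mirrors the AWS CLI's default multipart_threshold.
DEFAULT_MULTIPART_THRESHOLD = 8 * 1024 * 1024


def choose_upload_mode(size_bytes, threshold=DEFAULT_MULTIPART_THRESHOLD,
                       force_single=False):
    """Return "single" or "multipart" for a file of size_bytes bytes."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if force_single:
        # Explicit override, e.g. for endpoints that reject multipart.
        return "single"
    # Assumed default behaviour: multipart at or above the threshold.
    return "multipart" if size_bytes >= threshold else "single"
```

With this shape the default stays close to what `aws s3 cp ...` is believed to do, and the override is a single flag that can be relaxed or extended later.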

Collaborator

deitch commented Jul 4, 2019

Hmm, reading the AWS REST API docs here, here and here, I see that:

  • Upload an object:

        PUT /ObjectName HTTP/1.1
        Host: BucketName.s3.amazonaws.com
        Date: date
        Authorization: authorization string

  • Initiate a multipart upload:

        POST /ObjectName?uploads HTTP/1.1
        Host: BucketName.s3.amazonaws.com
        Date: date
        Authorization: authorization string

  • Upload part of a multipart upload:

        PUT /ObjectName?partNumber=PartNumber&uploadId=UploadId HTTP/1.1
        Host: BucketName.s3.amazonaws.com
        Date: date
        Content-Length: Size
        Authorization: authorization string
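
To make the difference between the three request shapes above concrete, here is a small sketch that builds only the method, Host header and request target for each. The helper names are hypothetical, and signing (the Authorization header), dates and bodies are deliberately left out.

```python
from urllib.parse import quote, urlencode

# Virtual-hosted-style host suffix, as in the requests quoted above.
S3_HOST_SUFFIX = "s3.amazonaws.com"


def _host(bucket):
    return f"{bucket}.{S3_HOST_SUFFIX}"


def put_object_request(bucket, key):
    """Whole-object upload: one PUT to the object path."""
    return ("PUT", _host(bucket), "/" + quote(key))


def initiate_multipart_request(bucket, key):
    """Multipart initiation: POST to the object path with ?uploads."""
    return ("POST", _host(bucket), "/" + quote(key) + "?uploads")


def upload_part_request(bucket, key, part_number, upload_id):
    """One part: PUT with partNumber and uploadId query parameters."""
    query = urlencode({"partNumber": part_number, "uploadId": upload_id})
    return ("PUT", _host(bucket), "/" + quote(key) + "?" + query)
```

Only the first shape is a plain PUT of the whole object; the other two depend on the `?uploads` and `uploadId` query parameters, which is the part an endpoint has to support for multipart to work.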

The GCP REST API is compatible for putting entire objects:

PUT /paris.jpg HTTP/1.1
Host: travel-maps.storage.googleapis.com
Date: Sat, 20 Feb 2010 16:31:08 GMT
Content-Type: image/jpg
Content-MD5: iB94gawbwUSiZy5FuruIOQ==
Content-Length: 554
Authorization: Bearer ya29.AHES6ZRVmB7fkLtd1XTmq6mo0S1wqZZi3-Lh_s-6Uw7p8vtgSwg

As you said, however, that API (XML) doesn't support multipart uploads (although it does support resumable uploads). The JSON API, on the other hand, does.

We have a few options:

  • use the aws s3api interface to control when we do multipart and when not
  • install gsutil and have an option to override (maybe based on URL?)

Those are the two I can think of right now.
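
For the first option, a minimal sketch of how a forced single upload might be assembled, assuming we shell out to `aws s3api put-object`. The helper name `build_put_object_argv` is hypothetical; `--endpoint-url`, `--bucket`, `--key` and `--body` are standard AWS CLI options. This has not been tested against the Google endpoint.

```python
def build_put_object_argv(bucket, key, path, endpoint_url=None):
    """Build argv for a single-request upload via `aws s3api put-object`.

    Unlike `aws s3 cp`, put-object does not switch to multipart on its own.
    Note: on S3 itself a single PUT is limited to 5 GB per object.
    """
    argv = ["aws"]
    if endpoint_url:
        # Global option, placed before the service name.
        argv += ["--endpoint-url", endpoint_url]
    argv += ["s3api", "put-object",
             "--bucket", bucket,
             "--key", key,
             "--body", path]
    return argv
```

The list can be passed as is to `subprocess.run(argv, check=True)`; building it separately keeps the multipart decision testable without network access.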

Author

sadortun commented Jul 4, 2019

@deitch I've done a few tests with s3cmd.

Off

1 GB file, multipart disabled

@cloudshell:~ (bap-newera)$ ls -lah  file.txt
-rw-r--r-- 1 samuel_denis_dortun samuel_denis_dortun 1G Jul  4 12:46 file.txt
@cloudshell:~ (bap-newera)$ s3cmd put file.txt s3://asdf
upload: 'file.txt' -> 's3://newera_backups_db/file.txt'  [1 of 1]
 737580032 of 737580032   100% in   19s    36.31 MB/s  done

On

1 MB file, enable_multipart = true (multipart is not actually used here)

@cloudshell:~ (bap-newera)$ ls -lah  file.txt
-rw-r--r-- 1 samuel_denis_dortun samuel_denis_dortun 1M Jul  4 12:43 file.txt
@cloudshell:~ (bap-newera)$ s3cmd put -vv  file.txt s3://asdf
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 1 files, this may take some time...
INFO: Summary: 1 local files to upload
upload: 'file.txt' -> 's3://asdf/file.txt'  [1 of 1]
 1048576 of 1048576   100% in    0s     3.52 MB/s  done

1 GB file, multipart enabled

@cloudshell:~ (bap-newera)$ s3cmd put file.txt s3://asdf
ERROR: S3 error: 400 (InvalidArgument): Invalid argument.
@cloudshell:~ (bap-newera)$ s3cmd put -vv  file.txt s3://asdf
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 1 files, this may take some time...
INFO: Summary: 1 local files to upload
ERROR: S3 error: 400 (InvalidArgument): Invalid argument.

Author

sadortun commented Jul 4, 2019

As for a replacement, I would really suggest s3cmd, which can do everything we need and provides a high-level interface.

No need for any special case, since we can pass the --disable-multipart flag in AWS_ENDPOINT_OPT.

Currently doing a few tests in a fork.
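
A sketch of what that could look like, assuming the extra options are passed through as one whitespace-separated string. The helper name `build_s3cmd_put_argv` is hypothetical, and how the project actually consumes AWS_ENDPOINT_OPT is an assumption here. `--disable-multipart` is an existing s3cmd flag.

```python
import shlex


def build_s3cmd_put_argv(path, target, extra_opts=""):
    """Build argv for `s3cmd put`, splicing in user-supplied options.

    extra_opts stands in for the value of AWS_ENDPOINT_OPT (assumed to be
    a shell-style option string such as "--disable-multipart").
    """
    return ["s3cmd", "put", *shlex.split(extra_opts), path, target]
```

For example, `build_s3cmd_put_argv("file.txt", "s3://asdf", "--disable-multipart")` yields the same command line as the successful 1 GB test above, with multipart switched off by the flag rather than in the config file.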
