Implement a lazy multipart writer that does direct PUT for small files #5205
Conversation
I'm a little confused by this; I was perhaps expecting a struct that could be created with a Path and an object store and would implement …
object_store/src/aws/mod.rs
Outdated
@@ -307,6 +307,11 @@ impl MultiPartStore for AmazonS3 {
        self.client.put_part(path, id, part_idx, data).await
    }

    async fn put_object(&self, path: &Path, data: Bytes) -> Result<()> {
If the data is already in memory, i.e. Bytes, what is the motivation for using multipart?
That doesn't seem correct, because that wouldn't be atomic, i.e. you could end up with half the data uploaded. The motivation is: imagine you upload very small files half the time and large ones the rest, but you don't know ahead of time. This saves the needless three calls when you're uploading small files; you just directly PUT the buffered data.
It would need to buffer in memory until the threshold is reached, at which point it falls back to using multipart; otherwise, on close it just does a PUT request with what it has buffered?
Yes, that's what I was thinking.
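To make the behaviour agreed on here concrete, here is a minimal synchronous sketch of the state machine. The names (`Backend`, `LazyWriter`, `THRESHOLD`) and the simplified error handling are illustrative stand-ins, not the PR's actual code:

```rust
use std::io::Write;

// Illustrative threshold, mirroring the 10 MiB figure discussed in this thread.
const THRESHOLD: usize = 10 * 1024 * 1024;

/// Stand-in for the store side: a direct PUT and a streaming multipart sink.
trait Backend {
    fn put(&mut self, data: &[u8]);
    fn start_multipart(&mut self) -> Box<dyn Write>;
}

enum State {
    /// Below the threshold: bytes accumulate in memory.
    Buffering(Vec<u8>),
    /// Threshold crossed: writes stream into the multipart upload.
    Multipart(Box<dyn Write>),
}

struct LazyWriter<B: Backend> {
    backend: B,
    state: State,
}

impl<B: Backend> LazyWriter<B> {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<()> {
        match &mut self.state {
            State::Buffering(pending) => {
                pending.extend_from_slice(buf);
                if pending.len() > THRESHOLD {
                    // Crossing the threshold: start the multipart upload and
                    // replay everything buffered so far into it.
                    let mut sink = self.backend.start_multipart();
                    sink.write_all(pending)?;
                    self.state = State::Multipart(sink);
                }
            }
            State::Multipart(sink) => sink.write_all(buf)?,
        }
        Ok(())
    }

    fn close(mut self) -> std::io::Result<()> {
        match self.state {
            // Never crossed the threshold: one direct PUT, so the object is
            // either written in full or not at all.
            State::Buffering(pending) => self.backend.put(&pending),
            // Already on the multipart path: finish the streaming upload.
            State::Multipart(mut sink) => sink.flush()?,
        }
        Ok(())
    }
}
```

The `Buffering` arm of `close` is the atomicity argument above: a small object costs exactly one PUT instead of the create/upload-part/complete round trips.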
object_store/src/multipart.rs
Outdated
/// up to the configured maximum concurrency
/// The multipart upload will only be created when more than 10 MiB is written,
/// otherwise the data will be uploaded directly as a PUT.
pub struct LazyWriteMultiPart {
Why can't this just call ObjectStore::put_multipart to obtain a boxed AsyncWrite if/when it decides to take this path? That would be simpler, faster, and work for all stores.
Ah yeah, that might be a better way; then you wouldn't need the MultiPartStore.
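That suggestion might look roughly like the following, assuming the `ObjectStore::put_multipart` signature as it was at the time of this PR (returning a `MultipartId` and a boxed `AsyncWrite`); `spill_to_multipart` is a hypothetical helper, not part of the crate:

```rust
use object_store::{path::Path, ObjectStore};
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// Hypothetical helper: called once the in-memory buffer outgrows the threshold.
async fn spill_to_multipart(
    store: &dyn ObjectStore,
    location: &Path,
    buffered: &[u8],
) -> Result<Box<dyn AsyncWrite + Unpin + Send>, Box<dyn std::error::Error + Send + Sync>> {
    // Ask the store for a streaming multipart writer; this works for any
    // ObjectStore implementation, not just those implementing MultiPartStore.
    // (The MultipartId is discarded here; a real implementation would keep it
    // around so a failed upload can be aborted.)
    let (_id, mut writer) = store.put_multipart(location).await?;

    // Replay everything buffered so far, then hand the writer back so
    // subsequent writes can be forwarded to it directly.
    writer.write_all(buffered).await?;
    Ok(writer)
}
```

This is also where the writer would stop needing anything S3-specific, since every `ObjectStore` implementation provides `put_multipart`.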
Marking as draft to signify this is not waiting on review; feel free to unmark when you would like me to take another look.
Force-pushed from 39adfc0 to 6ddbc82.
Thanks, let me know your comments; I will fix the test.
I like where this is heading, left some comments
Force-pushed from 6ddbc82 to 8280a8f.
I made an update according to your suggestions; let me know your comments.
Force-pushed from 8280a8f to cba3bec.
Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
Force-pushed from cba3bec to 372bcbd.
I've run out of time to review this today, as I'm trying to avoid spending all my holiday reviewing code. I will try to find time to take a look in the coming days.
Had a brief look. I think this would benefit from some low-level testing of the interface and state transitions; as it stands I'm not very comfortable that the logic is actually correct.
            }
        }
        LazyWriteState::AsyncWrite(writer) => {
            return Pin::new(writer).poll_write(cx, buf).map_ok(|n| n + wrote)
I'm not sure this is correct: the returned count should be the number of bytes from buf that were consumed, but as it stands I think this will include any data that was previously buffered as well.
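For context, the contract being referenced is that `poll_write` may only report how many bytes of the `buf` passed to *this* call were consumed; bytes buffered on earlier calls were already reported back then. A hedged sketch of accounting that respects this (names are illustrative, not the PR's code):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use tokio::io::AsyncWrite;

/// Illustrative only: `pending` holds bytes accepted (and already reported as
/// written) on earlier poll_write calls.
struct SpillingWriter<W> {
    pending: Vec<u8>,
    inner: W,
}

impl<W: AsyncWrite + Unpin> SpillingWriter<W> {
    fn poll_write_spilled(
        &mut self,
        cx: &mut Context<'_>,
        buf: &[u8],
    ) -> Poll<io::Result<usize>> {
        // 1. Drain previously buffered data first. These bytes were already
        //    counted when they were buffered, so they add nothing to the
        //    return value of this call. (A real implementation would treat
        //    Ok(0) with a non-empty buffer as a write-zero error.)
        while !self.pending.is_empty() {
            match Pin::new(&mut self.inner).poll_write(cx, &self.pending) {
                Poll::Ready(Ok(n)) => {
                    self.pending.drain(..n);
                }
                Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                Poll::Pending => return Poll::Pending,
            }
        }
        // 2. Only now forward this call's `buf`; whatever the inner writer
        //    accepts here is exactly the count the caller is told about.
        Pin::new(&mut self.inner).poll_write(cx, buf)
    }
}
```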
Marking as a draft; please feel free to unmark when you would like me to take another look.
I picked this up in #5431, PTAL.
Which issue does this PR close?
I can create an issue for this.
Rationale for this change
For small files, multipart upload is unnecessary, but it can be convenient to use the same interface, i.e. the same AsyncWrite that can internally either do a direct PUT for small files or use a multipart upload.
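A hedged sketch of what that buys the caller: writing code stays generic over `AsyncWrite` and never has to pick between PUT and multipart up front (`copy_into` is an illustrative helper, not part of this PR):

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};

/// The caller only sees AsyncWrite, so the same code handles both outcomes:
/// with the lazy writer under discussion, a small total would end up as one
/// direct PUT on shutdown, a large one as a multipart upload.
async fn copy_into<W>(mut writer: W, chunks: &[Vec<u8>]) -> std::io::Result<()>
where
    W: AsyncWrite + Unpin,
{
    for chunk in chunks {
        writer.write_all(chunk).await?;
    }
    writer.shutdown().await?;
    Ok(())
}
```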
What changes are included in this PR?
Adds new code to create an async writer that lazily creates the multipart upload after there is enough data.
Are there any user-facing changes?
Yes.
notes