Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a lazy multipart writer that does direct PUT for small files #5205

Closed
wants to merge 1 commit into from

Conversation

Samrose-Ahmed
Copy link
Contributor

Which issue does this PR close?

I can create an issue for this.

Rationale for this change

For small files, multipart upload is unnecessary, but it can be convenient to use the same interface, ie. same AsyncWrite that can internally either do a direct PUT for small files or use multipart upload.

What changes are included in this PR?

Adds new code to create an async writer that lazily creates the multipart upload after there is enough data.

Are there any user-facing changes?

yes

notes

  • i couldn't really find way to avoid duplicating the code from the existing WriteMultipart, lmk if there's a better way.
  • I wasn't sure on naming, feel free to suggest alternative

@github-actions github-actions bot added the object-store Object Store Interface label Dec 12, 2023
@tustvold
Copy link
Contributor

I'm a little confused by this, I was perhaps expecting a struct that could be created with a Path and an object store and would implement AsyncWrite buffering the first x bytes to a put but then deferring to a multipart?

@@ -307,6 +307,11 @@ impl MultiPartStore for AmazonS3 {
self.client.put_part(path, id, part_idx, data).await
}

async fn put_object(&self, path: &Path, data: Bytes) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the data is already in memory, i.e. Bytes what is the motivation for using multipart?

@Samrose-Ahmed
Copy link
Contributor Author

I'm a little confused by this, I was perhaps expecting a struct that could be created with a Path and an object store and would implement AsyncWrite buffering the first x bytes to a put but then deferring to a multipart?

That doesn't seem correct, because that wouldnt be atomic, i.e you could have half data uploaded.

The motivation is imagine u upload very small files half the time and large ones the rest, but you don't know ahead of time. This saves you the needless three calls when ur uploading small files, you just directly PUT the buffered data.

@tustvold
Copy link
Contributor

It would need to buffer in memory until the threshold is reached, at which point it falls back to usinng multipart, otherwise on close it just does a put request with what it has buffered?

@Samrose-Ahmed
Copy link
Contributor Author

Yes that's what I was thinking

/// up to the configured maximum concurrency
/// The multipart upload will only be created when more than the 10 MiB is written,
/// otherwise the data will be uploaded directly as PUT.
pub struct LazyWriteMultiPart {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this just call ObjectStore::put_multipart to obtain a boxed AsyncWrite if/when it decides to take this path? That would simpler, faster and work for all stores?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah that might be a better way, then you wouldn't need the multipartstore.

@tustvold tustvold marked this pull request as draft December 14, 2023 09:12
@tustvold
Copy link
Contributor

Marking as draft to signify this is not waiting on review, feel free to unmark when you would like me to take another look

@Samrose-Ahmed Samrose-Ahmed force-pushed the lazy_multipart branch 2 times, most recently from 39adfc0 to 6ddbc82 Compare December 14, 2023 09:53
@Samrose-Ahmed Samrose-Ahmed marked this pull request as ready for review December 14, 2023 10:20
@Samrose-Ahmed
Copy link
Contributor Author

Thanks let me know ur comments, I will fix the test

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is heading, left some comments

object_store/src/multipart.rs Outdated Show resolved Hide resolved
object_store/src/multipart.rs Outdated Show resolved Hide resolved
object_store/src/multipart.rs Outdated Show resolved Hide resolved
object_store/src/multipart.rs Outdated Show resolved Hide resolved
object_store/src/multipart.rs Show resolved Hide resolved
@Samrose-Ahmed
Copy link
Contributor Author

I made an update according to ur suggestions, let me know comments.

Signed-off-by: 🐼 Samrose Ahmed 🐼 <[email protected]>
@tustvold
Copy link
Contributor

I've run out of time to review this today, trying to avoid spending all my holiday reviewing code, will try to find time to take a look in the coming days.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a brief look, I think this would benefit from some low-level testing of the interface and state transitions, as it stands I'm not very comfortable the logic is actually correct

}
}
LazyWriteState::AsyncWrite(writer) => {
return Pin::new(writer).poll_write(cx, buf).map_ok(|n| n + wrote)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct, the returned count is the number from buf that is consumed, as it stands I think this will include any data that was previously buffered as well

@tustvold tustvold marked this pull request as draft January 22, 2024 11:31
@tustvold
Copy link
Contributor

Marking as a draft, please feel free to unmark when you would like me to take another look

@tustvold
Copy link
Contributor

I picked this up in #5431 PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants