Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocate direct buffers for multipart upload #559

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dopuskh3
Copy link

@dopuskh3 dopuskh3 commented Jun 6, 2024

Allow to allocate multipart upload buffers as direct buffer rather than on the heap.

Reasoning

We try to set a pretty large multipart upload part size on cluster to optimize throughput and reduce S3 requests.

At the same time, we try to keep kafka JVM heap size contained on most kafka installation in order to leave as much memory as possible for the page cache. As a matter of example, we will use 4GB heap size on machines with 64GB available memory.

The consequence of using pretty large multipart upload size on contained JVM heap size is that we can pretty easily run out of heap size if we suddenly have to upload many segments to tiered storage.

The strategy we propose is to allocate multipart buffers in direct memory so that we can more easily configure direct buffer budget.

Usage

We introduced s3.multipart.upload.direct.buffers boolean configuration flag (disabled by default).

@dopuskh3 dopuskh3 requested a review from a team as a code owner June 6, 2024 08:54
@dopuskh3 dopuskh3 changed the title [datadog] Allocate direct buffers for multipart upload Allocate direct buffers for multipart upload Jun 6, 2024
@funky-eyes
Copy link
Contributor

I think this feature is good.

@dopuskh3 dopuskh3 force-pushed the francois.visconte/allocate-direct-buffers branch 2 times, most recently from 8ffac2f to 05ef2ed Compare June 9, 2024 17:30
@dopuskh3
Copy link
Author

dopuskh3 commented Jun 9, 2024

@jeqo fixed tests, PTAL

Allow to allocate multipart upload buffers as direct buffer rather than
on the heap.

We try to set a pretty large multipart upload part size on cluster to
optimize throughput and reduce S3 requests.

At the same time, we try to keep kafka JVM heap size contained on most
kafka installation in order to leave as much memory as possible for the
page cache. As a matter of example, we will use 4GB heap size on
machines with 64GB available memory.

The consequence of using pretty large multipart upload size on contained
JVM heap size is that we can pretty easily run out of heap size if we
suddenly have to upload many segments to tiered storage.

The strategy we propose is to allocate multipart buffer in direct memory
so that we can more easily configure direct buffer budget.
@dopuskh3 dopuskh3 force-pushed the francois.visconte/allocate-direct-buffers branch from 05ef2ed to 9fcdda2 Compare June 10, 2024 08:00
Copy link
Contributor

@jeqo jeqo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @dopuskh3! I agree this is a valuable flag for deployments where JVM memory may not be enough for the part uploads.
I added few minor naming suggestions and some testing coverage; but overall looks good.

@@ -58,6 +58,11 @@ public class S3StorageConfig extends AbstractConfig {
private static final String S3_MULTIPART_UPLOAD_PART_SIZE_DOC = "Size of parts in bytes to use when uploading. "
+ "All parts but the last one will have this size. "
+ "Valid values: between 5MiB and 2GiB";

public static final String S3_MULTIPART_UPLOAD_DIRECT_BUFFERS_CONFIG = "s3.multipart.upload.direct.buffers";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
public static final String S3_MULTIPART_UPLOAD_DIRECT_BUFFERS_CONFIG = "s3.multipart.upload.direct.buffers";
public static final String S3_MULTIPART_UPLOAD_BUFFER_ALLOCATION_DIRECT_CONFIG = "s3.multipart.upload.buffer.allocation.direct";

@@ -261,6 +272,10 @@ public Boolean pathStyleAccessEnabled() {
return getBoolean(S3_PATH_STYLE_ENABLED_CONFIG);
}

public Boolean multipartDirectBuffers() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
public Boolean multipartDirectBuffers() {
public Boolean uploadBufferAllocationDirect() {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these tests be parametrized to have coverage for the direct allocation?

@@ -44,12 +44,15 @@ public class S3Storage implements StorageBackend {
private String bucketName;
private int partSize;

private boolean multipartDirectBuffers;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
private boolean multipartDirectBuffers;
private boolean bufferAllocationDirect;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants