
Figure out larger file support (up to 5GB) #39

Merged
merged 9 commits into main from larger-file-support on Apr 6, 2024

Conversation

ferd (Owner) commented Feb 12, 2024

Right now, big files need to be loaded all in memory to be sent and managed. This is going to be a problem for larger files one might wish to synchronize.

This long-standing PR/branch will aim to add in sub-file transfer messages to break down any sync into manageable bits.


The first step is to rework the file transfer queuing mechanism.

The old queuing mechanism mostly worked by putting all the messages in the `gen_statem`'s `next_event` queue. This is not a great mechanism because it has some limitations:

> The stored events are inserted in the queue as the next to process
> before any already queued events. The order of these stored events is
> preserved, so the first next_event in the containing list becomes the
> first to process.

The issue here is that every time we send ourselves a `ping` message, for example, it will go to the back of the whole queue. In some cases I suspect this could lead to issues in flow control (eg. the TLS server waits for a `pong` to send messages, but the FSM is stuck trying to send more data).

Instead, we start using a queue data structure to more literally hold the messages that are to be processed later.

This may also become useful if we try to support sending large files by breaking them into sub-messages (eg. 15MB per batch), which would just pop them in front of the FSM's queue. Now, with a data structure that's more explicit, we'll have the ability to control the scheduling without preventing other messages from being handled in between.
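A minimal sketch of that buffering idea, assuming a map-based FSM state with a hypothetical `buf` field (the actual record and field names in this PR may differ):

```erlang
%% Sketch only: buffer deferred messages in a queue:queue() held in the FSM
%% state instead of stacking {next_event, ...} actions. The `buf` key is a
%% made-up name for illustration.
enqueue(Msg, Data = #{buf := Q}) ->
    Data#{buf := queue:in(Msg, Q)}.

%% Pop the next buffered message when the FSM is ready for it; regular
%% messages (pings, pongs, flow control) can still be handled in between.
dequeue(Data = #{buf := Q0}) ->
    case queue:out(Q0) of
        {{value, Msg}, Q1} -> {Msg, Data#{buf := Q1}};
        {empty, _}         -> empty
    end.
```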

This avoids messing with regular instances running on docs' sample ports
and accidentally failing tests if the software is running in the
background already.
The data wrapper change introduces a new operation, which means we need
to bump the internal version. At this point in time, the significance of
this is that we eventually need to test the failures of reading the
proper messages, but first let's change the version.

This instantly highlighted shared concerns across _a lot_ of files for
all wrappers, and some inconsistencies, so these get fixed up and
brought under a better centralized umbrella.

Existing tests pass, showing that we're at least consistent within a
single version.
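As an illustration of the versioning concern (the tuple shape and names below are made up for this sketch, not the wrapper actually used in the repo), the idea is that every peer message carries the internal protocol version so a node can reject operations it doesn't know about:

```erlang
%% Hypothetical wrapper; the real module and message format may differ.
-define(VSN, 2).  % bumped because the new multipart operation changes the set of messages

wrap(Payload) -> {peer_msg, ?VSN, Payload}.

unwrap({peer_msg, Vsn, Payload}) when Vsn =< ?VSN -> {ok, Payload};
unwrap({peer_msg, Vsn, _Payload}) -> {error, {unsupported_version, Vsn}}.
```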
ferd (Owner, Author) commented Apr 3, 2024

After some experimentation and checking the docs, there's a bit of bad news for S3 support:

> When an object is uploaded as a multipart upload, the ETag for the object is not an MD5 digest of the entire object. Amazon S3 calculates the MD5 digest of each individual part as it is uploaded. The MD5 digests are used to determine the ETag for the final object. Amazon S3 concatenates the bytes for the MD5 digests together and then calculates the MD5 digest of these concatenated values. The final step for creating the ETag is when Amazon S3 adds a dash with the total number of parts to the end.
>
> If you've enabled additional checksum values for your multipart object, Amazon S3 calculates the checksum for each individual part by using the specified checksum algorithm. The checksum for the completed object is calculated in the same way that Amazon S3 calculates the MD5 digest for the multipart upload. You can use this checksum to verify the integrity of the object.
>
> Because of how Amazon S3 calculates the checksum for multipart objects, the checksum value for the object might change if you copy it. If you're using an SDK or the REST API and you call CopyObject, Amazon S3 copies any object up to the size limitations of the CopyObject API operation. Amazon S3 does this copy as a single action, regardless of whether the object was uploaded in a single request or as part of a multipart upload. With a copy command, the checksum of the object is a direct checksum of the full object. If the object was originally uploaded using a multipart upload, then the checksum value changes even though the data has not.

And from the CopyObject page:

> You can store individual objects of up to 5 TB in Amazon S3. You create a copy of your object up to 5 GB in size in a single atomic action using this API. However, to copy an object greater than 5 GB, you must use the multipart upload Upload Part - Copy (UploadPartCopy) API.

Essentially, it appears that while I can use multipart for better memory usage in file transfers, the way I use full-file hashes for version tracking implies that I'll need to limit myself to 5GB files on S3 by forcing another copy, and ignoring anything bigger than this. This is a bit annoying, but probably an acceptable trade-off for now.
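For reference, the multipart ETag behaviour described in the quoted docs can be reproduced like this (a standalone sketch, not code from this PR; `multipart_etag/1` is a made-up helper):

```erlang
%% Sketch: S3's multipart ETag is md5(md5(Part1) ++ md5(Part2) ++ ...),
%% hex-encoded, followed by "-" and the number of parts. It is not an MD5
%% of the whole object, which is why full-file hashes stop matching.
multipart_etag(Parts) when is_list(Parts) ->
    Digests = [crypto:hash(md5, Part) || Part <- Parts],
    Combined = crypto:hash(md5, iolist_to_binary(Digests)),
    Hex = lists:flatten([io_lib:format("~2.16.0b", [B]) || <<B>> <= Combined]),
    Hex ++ "-" ++ integer_to_list(length(Digests)).
```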

ferd changed the title from "Figure out larger file support" to "Figure out larger file support (up to 5GB)" on Apr 6, 2024
ferd added 4 commits April 6, 2024 16:28
Wire up multipart both for clients and servers. In writing this patch, I
found out that there was a missing synchronization of conflict files
coming from the client while being unknown to the server.

It seems like the tests had hidden that fact because of how easy the setup was: syncs in the opposite direction were tried, but not that one.

This commit adds a test that validates both single-transfer and
multipart conflict portions, and makes the code support it.

The code is messy and is going to need a refactor, which writing this patch helped clarify in my mind.
Pick the common aspects of the client and the server, where files are
read and broken up to be sent, and unite them within a shared structure.

This further splits up the handling of scheduling, serializing, and the
callback management.

It also temporarily undoes tracing support, which will need to be
reintroduced later.
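A rough sketch of the shared part-reading idea, with a hypothetical part size and send callback (the real shared structure in the commit is more involved):

```erlang
%% Sketch: read a file in fixed-size parts so it never sits fully in memory,
%% handing each part to a caller-supplied callback. Names are illustrative.
-define(PART_SIZE, 15 * 1024 * 1024).

send_parts(Path, SendFun) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    try send_loop(Fd, 1, SendFun)
    after file:close(Fd)
    end.

send_loop(Fd, PartNum, SendFun) ->
    case file:read(Fd, ?PART_SIZE) of
        {ok, Bin} -> SendFun(PartNum, Bin), send_loop(Fd, PartNum + 1, SendFun);
        eof       -> ok
    end.
```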
There's some fun stuff to deal with for the multipart checksums. If we
want to use S3 at all, we need to limit all file uploads to 5GB
otherwise we'll break the hashing mechanism we rely on.
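That limit can be enforced with a simple size check before scheduling a transfer for an S3-backed peer (sketch only; the 5GB figure is the CopyObject single-operation limit quoted above):

```erlang
-include_lib("kernel/include/file.hrl").

%% Sketch: skip files that can't be copied atomically on S3.
-define(MAX_S3_OBJECT, 5 * 1024 * 1024 * 1024).

fits_s3(Path) ->
    case file:read_file_info(Path) of
        {ok, #file_info{size = Size}} -> Size =< ?MAX_S3_OBJECT;
        {error, _} -> false
    end.
```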
ferd merged commit 5fede9f into main Apr 6, 2024
1 check passed
ferd deleted the larger-file-support branch April 6, 2024 17:13
ferd mentioned this pull request Apr 6, 2024