Figure out larger file support (up to 5GB) #39
The old queuing mechanism mostly worked by putting all the messages in the `gen_statem`'s `next_event` queue. This is not a great mechanism because it has some limitations:

> The stored events are inserted in the queue as the next to process before any already queued events. The order of these stored events is preserved, so the first next_event in the containing list becomes the first to process.

The issue here is that every time we send ourselves a `ping` message, for example, it will go to the back of the whole queue. In some cases I suspect this could lead to issues in flow control (e.g. the TLS server waits for a `pong` to send messages, but the FSM is stuck trying to send more data).

Instead, we start using a queue data structure to more literally hold the messages that are to be processed later. This may also become useful if we try to support sending large files by breaking them into sub-messages (e.g. 15MB per batch), which could just be popped in front of the FSM's queue. Now, with a data structure that's more explicit, we have the ability to control the scheduling without preventing other messages from being handled in between.
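For illustration, the idea maps onto Erlang/OTP's standard `queue` module. This is a minimal hypothetical sketch (the module and function names are made up, not the actual code in this branch):

```erlang
%% Hypothetical explicit message buffer built on the standard queue
%% module, replacing gen_statem's next_event-based queuing.
-module(msg_buf).
-export([new/0, enqueue/2, enqueue_front/2, dequeue/1]).

new() -> queue:new().

%% Regular deferred messages go to the back of the buffer.
enqueue(Msg, Q) -> queue:in(Msg, Q).

%% Messages that must be processed next (e.g. file sub-messages)
%% can be pushed to the front with in_r/2.
enqueue_front(Msg, Q) -> queue:in_r(Msg, Q).

%% The FSM pops one buffered message at a time, leaving room for
%% other events (pings, pongs) to be handled in between.
dequeue(Q) ->
    case queue:out(Q) of
        {{value, Msg}, Q2} -> {Msg, Q2};
        {empty, _} -> empty
    end.
```

Because the FSM only dequeues when it is ready, a `ping`/`pong` exchange arriving as a regular event no longer has to wait behind everything stashed with `next_event`.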
This avoids messing with regular instances running on docs' sample ports and accidentally failing tests if the software is running in the background already.
The data wrapper change introduces a new operation, which means we need to bump the internal version. At this point in time, the significance of this is that we will eventually need to test the failure modes of reading messages across versions, but first let's change the version.

This instantly highlighted shared concerns across _a lot_ of files for all wrappers, and some inconsistencies, so these get fixed up and brought under a better centralized umbrella. Existing tests pass, showing that we're at least consistent within a single version.
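As a rough illustration of what the internal version implies, the wrapper can be thought of as a version-tagged envelope; the shape and names below are assumptions for the sake of the example, not the repo's actual wire format:

```erlang
%% Hypothetical version-tagged wrapper; the real format may differ.
-define(WRAPPER_VSN, 2). % bumped after adding the new operation

pack(Msg) ->
    term_to_binary({?WRAPPER_VSN, Msg}).

unpack(Bin) ->
    case binary_to_term(Bin) of
        {?WRAPPER_VSN, Msg} -> {ok, Msg};
        {Vsn, _Msg} -> {error, {unsupported_version, Vsn}}
    end.
```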
After some experimentation and checking the docs, there's a bit of bad news for S3 support:

[S3 documentation excerpt not captured here]

And from the CopyObject page:

[CopyObject documentation excerpt not captured here]
Essentially, it appears that while I can use multipart uploads for better memory usage in file transfers, the way I use full-file hashes for version tracking forces another copy, which implies I'll need to limit myself to 5GB files on S3 and ignore anything bigger than that. This is a bit annoying, but probably an acceptable trade-off for now.
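In practice the constraint reduces to a size check before involving S3; a hedged sketch, with the constant and function names invented for illustration:

```erlang
%% Sketch of the 5GB cap described above; names are illustrative.
-include_lib("kernel/include/file.hrl").

-define(S3_COPY_LIMIT, 5 * 1024 * 1024 * 1024). % CopyObject's single-copy cap

s3_syncable(Path) ->
    case file:read_file_info(Path) of
        {ok, #file_info{size = Size}} when Size =< ?S3_COPY_LIMIT ->
            true;
        {ok, #file_info{size = Size}} ->
            {false, {too_large_for_s3, Size}};
        {error, Reason} ->
            {false, Reason}
    end.
```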
Wire up multipart for both clients and servers. In writing this patch, I found out that there was a missing synchronization path for conflict files coming from the client while being unknown to the server. It seems like tests had hidden that fact based on the ease of setup: the opposite sync directions were tried, but not that one. This commit adds a test that validates both the single-transfer and multipart conflict portions, and makes the code support them. The code is messy and is going to need a refactor, which writing this helped clarify in my mind.
Extract the aspects common to the client and the server, where files are read and broken up to be sent, and unite them within a shared structure. This further splits up the handling of scheduling, serializing, and the callback management. It also temporarily undoes tracing support, which will need to be reintroduced later.
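A rough sketch of what such a shared read-and-split structure could look like (illustrative only; the actual module in this branch also covers scheduling, serializing, and callbacks):

```erlang
%% Illustrative chunk reader uniting the client/server logic for
%% reading a file and breaking it up into sendable parts.
-module(chunker).
-export([open/2, next/1]).

-record(chunker, {fd, chunk_size, offset = 0}).

open(Path, ChunkSize) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    {ok, #chunker{fd = Fd, chunk_size = ChunkSize}}.

%% Returns the next chunk, or `done' once the file is exhausted.
next(C = #chunker{fd = Fd, chunk_size = Sz, offset = Off}) ->
    case file:pread(Fd, Off, Sz) of
        {ok, Bin} ->
            {chunk, Bin, C#chunker{offset = Off + byte_size(Bin)}};
        eof ->
            ok = file:close(Fd),
            done;
        {error, Reason} ->
            {error, Reason}
    end.
```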
There's some fun stuff to deal with for the multipart checksums. If we want to use S3 at all, we need to limit all file uploads to 5GB; otherwise we'll break the hashing mechanism we rely on.
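On the sync side, a full-file hash can still be computed incrementally while streaming parts; a small sketch, assuming a SHA-256-style digest (the project's actual hash choice is an assumption here):

```erlang
%% Sketch: fold a running digest over the parts so the full-file
%% hash survives splitting the transfer; assumes SHA-256.
hash_parts(Parts) ->
    Ctx = lists:foldl(fun(Part, Acc) -> crypto:hash_update(Acc, Part) end,
                      crypto:hash_init(sha256),
                      Parts),
    crypto:hash_final(Ctx).
```

S3's own multipart ETags are computed per part rather than over the whole object, which is why they can't stand in for a full-file hash and why the 5GB cap comes into play.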
Right now, big files need to be loaded entirely into memory to be sent and managed. This is going to be a problem for larger files one might wish to synchronize.

This long-standing PR/branch aims to add sub-file transfer messages to break down any sync into manageable bits.

The first step is to rework the file transfer queuing mechanism.