-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dilated File Transfer (transfer-v2) #40
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up again. I would really like to see this finally become reality. I've gone through the whole thing expect for the expansion excercises.
Recall from the Dilation specification that subchannels are _record_ pipes (not simple byte-streams). | ||
That is, a subchannel transmits a series of complete (framed) messages (up to ~4GiB in size). | ||
|
||
For this protocol, each record on the subchannel uses the first byte to indicate the kind of message; the remaining bytes in any message are kind-dependant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we use a vastly different format here than in the control channel? I am especially confused about having to special case the first byte. We already use msgpack, so can't we just use a kind
field as well?
Using a string here would make expansion and experiments much easier. We could specify to allow the X-
prefix for experiments for example. As long as every msgpack-encoded message has a kind
of type str
we should be good to go.
MessagePack allows to serialize arbitrary schemaless data structures, we don't need to specify all valid keys beforehand.
Example for a file offer:
{
"kind": "FileOffer",
"filename": "name",
"timestamp": 1721088154
"bytes": 123
}
Example of binary data
{
"kind": "Data",
"data": bin
}
where bin
would be a msgpack bin encoded stream.
The overhead for 64k blocks would be ~20 bytes, which is a loss of 0.03% of the available bandwidth. I would say this is negligible.
We could go further and just send binaries msgpack bin encoded (everything else has the map data type). This could IMO be enough.. Then we'd only have the msgpack bin format overhead to deal with, in this case with a byte array of 64k it would be 3 bytes instead of one, with the benefit of a clearer and more extensible data format that makes deserialisation easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this needs some more concrete numbers / testing before finalizing.
The reasoning is that you have to pass all the bytes/memory through the msgpack codec then. With the "first byte" thing, you can efficiently (and predictably) chunk your payload data at 65KiB - 1 bytes on both encode and decode.
So when pulling data off, you'll bring "the message" into memory. Now you can simply look at the first byte, then write out the rest. With msgpack you'll end up allocating another 65KiB buffer for every message. (We could perhaps specify FlatBuffers here instead? But that has less widespread support and the zero-copy aspect is overkill for most of the messages ... )
Anyway, that's the thinking. I have not tested this, but I believe it would mean approximately twice as much memory pressure. (Perhaps actual msgpack libraries are more efficient than the api implies?)
I suppose -- similar to your reasoning above about the "mode" -- this could be left to a future extension if it is found to be substantially more efficient to do so. Overall, it seems unlikely that memory-allocator pressure will dominate runtime (although phone users might prefer less overhead). Again, completely untested :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning is that you have to pass all the bytes/memory through the msgpack codec then.
Depends on the codec impl I guess. After reading the first 3 / 5 bytes it is enough to know that the rest will be binary data that can be read into the final buffer.
I don't have a strong opionion here though - if msgpack implementations make this difficult then by all means.
I do think however that we could still then just have one ID for "msgpack encoded informational map", instead of three.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, yeah at least that simplification makes sense (e.g. "just data" versus "msgpack-encoded data" in which case it must have a "kind" field).
I'll at least do some tests and see if there's some obvious, large advantage. This could also be a "future enhancement" / exercise .
dilated-file-transfer.md
Outdated
|
||
Currently, a file-transfer peer that sends zero version data is assumed to be "classic". | ||
A file-transfer peer supporting Dilation and this new protocol sends `"transfer": {...}` as per the "Version Negotiation" section above. | ||
We include a `"version"` key in that dict so that this transfer protocol may be extended in the future (see "Protocol Expansion Exercises" below). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't do this anymore. We only use features instead. However we could always even start doing at a later date in the future, as long as unknown keys get ignored.
files: list[str] # a list containing relative paths for each file | ||
``` | ||
|
||
The filenames in the `"files"` list are unicode relative paths (relative to the `"base"` from the `DirectoryOffer` and NOT including that part. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should specify that using the ..
directory (for example, but not limited to refering to paths above the directory tree) is a protocol error and MUST cause an immediate rejection of the offer by the client libary. Otherwise we allow this sort of behaviour and need to have applications themselves deal with this trickery.
dilated-file-transfer.md
Outdated
Sending a single file like `/home/meejah/Documents/Letter.docx` gets a filename like `Letter.docx` | ||
Sending a whole directory like `/home/meejah/Documents/` would result in a directory-offer with basedir `Documents` and some number of files (possibly with sub-paths). | ||
|
||
Discussion point: should the basedir simply be ".", and names relative to it? (That is, "Documents" is a possibly-sensitive local directory name). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just be left for client applications to decide.
As mentioned above, we must make sure to prevent illegal filenames that contain ..
and illegal paths that are ..
or that contain any /
path separator.
Co-authored-by: Fina <[email protected]>
Co-authored-by: Fina <[email protected]>
Great, thanks for taking a look! 🥳 FYI, I am (slowly!) making progress on turning the Python "proof-of-concept" code into something a bit closer to production / usable, to prove to myself that the document makes sense and isn't missing details. Also, in case you missed it, I'm working on a tool called |
…rotocols into file-transfer-v2
A version of #23 with master mrged in (i.e. md/rst flip) and additional copy-editing, clarifications and cleanup.