Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dilated File Transfer (transfer-v2) #40

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

meejah
Copy link
Member

@meejah meejah commented Jun 4, 2024

A version of #23 with master mrged in (i.e. md/rst flip) and additional copy-editing, clarifications and cleanup.

@meejah meejah changed the title Dilated File Transfer (transfer-vs) Dilated File Transfer (transfer-v2) Jun 4, 2024
Copy link
Contributor

@felinira felinira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up again. I would really like to see this finally become reality. I've gone through the whole thing expect for the expansion excercises.

dilated-file-transfer.md Outdated Show resolved Hide resolved
dilated-file-transfer.md Outdated Show resolved Hide resolved
dilated-file-transfer.md Outdated Show resolved Hide resolved
Recall from the Dilation specification that subchannels are _record_ pipes (not simple byte-streams).
That is, a subchannel transmits a series of complete (framed) messages (up to ~4GiB in size).

For this protocol, each record on the subchannel uses the first byte to indicate the kind of message; the remaining bytes in any message are kind-dependant.
Copy link
Contributor

@felinira felinira Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use a vastly different format here than in the control channel? I am especially confused about having to special case the first byte. We already use msgpack, so can't we just use a kind field as well?

Using a string here would make expansion and experiments much easier. We could specify to allow the X- prefix for experiments for example. As long as every msgpack-encoded message has a kind of type str we should be good to go.

MessagePack allows to serialize arbitrary schemaless data structures, we don't need to specify all valid keys beforehand.

Example for a file offer:

{
  "kind": "FileOffer",
  "filename": "name",
  "timestamp": 1721088154
  "bytes": 123
}

Example of binary data

{
  "kind": "Data",
  "data": bin
}

where bin would be a msgpack bin encoded stream.

The overhead for 64k blocks would be ~20 bytes, which is a loss of 0.03% of the available bandwidth. I would say this is negligible.

We could go further and just send binaries msgpack bin encoded (everything else has the map data type). This could IMO be enough.. Then we'd only have the msgpack bin format overhead to deal with, in this case with a byte array of 64k it would be 3 bytes instead of one, with the benefit of a clearer and more extensible data format that makes deserialisation easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this needs some more concrete numbers / testing before finalizing.

The reasoning is that you have to pass all the bytes/memory through the msgpack codec then. With the "first byte" thing, you can efficiently (and predictably) chunk your payload data at 65KiB - 1 bytes on both encode and decode.

So when pulling data off, you'll bring "the message" into memory. Now you can simply look at the first byte, then write out the rest. With msgpack you'll end up allocating another 65KiB buffer for every message. (We could perhaps specify FlatBuffers here instead? But that has less widespread support and the zero-copy aspect is overkill for most of the messages ... )

Anyway, that's the thinking. I have not tested this, but I believe it would mean approximately twice as much memory pressure. (Perhaps actual msgpack libraries are more efficient than the api implies?)

I suppose -- similar to your reasoning above about the "mode" -- this could be left to a future extension if it is found to be substantially more efficient to do so. Overall, it seems unlikely that memory-allocator pressure will dominate runtime (although phone users might prefer less overhead). Again, completely untested :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning is that you have to pass all the bytes/memory through the msgpack codec then.

Depends on the codec impl I guess. After reading the first 3 / 5 bytes it is enough to know that the rest will be binary data that can be read into the final buffer.

I don't have a strong opionion here though - if msgpack implementations make this difficult then by all means.

I do think however that we could still then just have one ID for "msgpack encoded informational map", instead of three.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, yeah at least that simplification makes sense (e.g. "just data" versus "msgpack-encoded data" in which case it must have a "kind" field).

I'll at least do some tests and see if there's some obvious, large advantage. This could also be a "future enhancement" / exercise .

dilated-file-transfer.md Outdated Show resolved Hide resolved

Currently, a file-transfer peer that sends zero version data is assumed to be "classic".
A file-transfer peer supporting Dilation and this new protocol sends `"transfer": {...}` as per the "Version Negotiation" section above.
We include a `"version"` key in that dict so that this transfer protocol may be extended in the future (see "Protocol Expansion Exercises" below).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't do this anymore. We only use features instead. However we could always even start doing at a later date in the future, as long as unknown keys get ignored.

files: list[str] # a list containing relative paths for each file
```

The filenames in the `"files"` list are unicode relative paths (relative to the `"base"` from the `DirectoryOffer` and NOT including that part.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should specify that using the .. directory (for example, but not limited to refering to paths above the directory tree) is a protocol error and MUST cause an immediate rejection of the offer by the client libary. Otherwise we allow this sort of behaviour and need to have applications themselves deal with this trickery.

dilated-file-transfer.md Show resolved Hide resolved
dilated-file-transfer.md Show resolved Hide resolved
Sending a single file like `/home/meejah/Documents/Letter.docx` gets a filename like `Letter.docx`
Sending a whole directory like `/home/meejah/Documents/` would result in a directory-offer with basedir `Documents` and some number of files (possibly with sub-paths).

Discussion point: should the basedir simply be ".", and names relative to it? (That is, "Documents" is a possibly-sensitive local directory name).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just be left for client applications to decide.

As mentioned above, we must make sure to prevent illegal filenames that contain .. and illegal paths that are .. or that contain any / path separator.

meejah and others added 2 commits July 15, 2024 21:41
Co-authored-by: Fina <[email protected]>
@meejah
Copy link
Member Author

meejah commented Jul 16, 2024

Thanks for picking this up again. I would really like to see this finally become reality. I've gone through the whole thing expect for the expansion excercises.

Great, thanks for taking a look! 🥳

FYI, I am (slowly!) making progress on turning the Python "proof-of-concept" code into something a bit closer to production / usable, to prove to myself that the document makes sense and isn't missing details.

Also, in case you missed it, I'm working on a tool called fowl so that non-Python applications can more easily and directly play with Dilation use-cases (whether that is this, or any other Dilation-based protocol).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants