Proposal: info-dict extension to the File-Transfer Protocol #39
Comments
I would be thrilled to hear comments or suggestions on this proposal. Thanks for reading! |
This sounds like it has a lot of overlap with "Dilation" (and "Dilated File Transfer" or "transfer v2"). Have you seen those? |
I've read through the dilation spec multiple times. Brief comments:
I had not read the "dilated file transfer" spec before now. Unfortunately it was a bit hidden in a pull request, so I didn't know it existed. I'm reading through it now, so here are a few (less well considered!) thoughts:
I'm glad to see someone else is working on file transfer improvements! I'd be delighted to try to merge my work with this, provided that it can accomplish some of the basic things I'm trying to do here:
If it helps clarify anything, my end goal here is extremely high speed secure transfers for very large datasets (multiple TB at multiple Gbps), for clients that are both behind NAT or firewalls. AFAIK the Rust implementation of Magic Wormhole is the only version that supports hole punching, so it's the biggest target for me. |
So, to explain some context: "Dilation" is definitely a low-ish level protocol change, as you've identified. So only together with "transfer v2" (or "Dilated File Transfer") does it address the sort of "end-user" features you're wanting here (i.e. transferring more than 1 file / directory per wormhole setup). However, putting in a separate "layer" also makes it more widely useful for other wormhole protocols. At a high level Dilation gives you:
So, this (should!) mean that re-tries are redundant: you don't need to re-do a transfer, you just keep the application alive (or, if it correctly serializes state, re-start it) until the transfer concludes. Note that there is a privacy concern with features like "do you already have X?", especially if it operates w/o human involvement: one end could use this to confirm if the peer has particular files.
On top of the above is where the Dilated File Transfer protocol runs. So it has all the benefits of Dilation and some additional features beyond the rather simplistic existing file-transfer (multiple offers / answers, bi-directional transfers, etc).
The stuff about "viable connections" and so on is because you can have multiple connection hints in play (just as with the existing system) and need a way to decide which one to use now. Also, "which one is viable" may change as your local network conditions change. It may be that you can add more options, too (e.g. if you got a public IP address later). So for example, you may have a "direct" hint, two different transit helpers and a Tor Onion hint in play -- and it's possible that more than one will successfully connect.
For Dilated File Transfer itself:
For the "granularity of transfers" part, we did discuss this quite a bit with @warner and others, and the thinking is that we wanted Offers to correspond to "things the user drops", conceptually. That is: a user drops "a Directory" and so that is accepted (or not). A particular client (or user) could make each file an individual Offer if they preferred (all offers can be in parallel) -- so the idea is that for whatever reason "a directory" is a cohesive collection of files. That said, maybe we've gotten this wrong -- I'd love to see some UX research indicating which choice is best. (Also maybe the names are just bad: the important point is that there are two sorts of offers available: "a single file" and "a collection of files"). But, yes, you're correct that the protocol only offers a "y/n" on an entire Offer.
I also believe we've got enough versioning information that we can expand the protocol in the future in case more offer-types are desired. For example, perhaps a third type which is "a collection of files, but you might not want them all" (whatever that is named ;) ).
Least Authority is currently executing a grant to add Dilation and Dilated File Transfer to the Rust implementation (and work on limitations in the specification, and appropriate changes to the Python reference implementation). If you're interested in collaborating further to ensure the specification either already meets your goals or at least can be extended to do so, I'm happy to schedule a video-call or "meeting" on IRC or similar?
I would be very interested to see any testing data you have on transfer-speeds etc. for "parallel" connections. Obviously, the Dilation specification is (currently) just multiplexing on top of a single TCP stream -- although if it's significantly faster to use more connections, I don't see any reason we couldn't introduce a future revision that multiplexed over multiple streams.
Definitely in protocols like BitTorrent there are obvious advantages since the endpoints are different network elements (but here, they are not). I see a lot of advantages to keeping concerns separate: keeping Dilation at a different layer lets other higher-level Wormhole protocols take advantage. (For example, if it was made parallel then any protocol including file-transfer could immediately use any speedups). Thanks for the writeup! |
p.s. "resumption" (or I guess "partial file transfer"?) could be a very interesting feature for sure if we can figure out the privacy aspects properly. So I don't want any of the above to imply I'm dismissing that feature, but we left it out of this first revision of Dilated File Transfer because it's complex, hard to get privacy angles correct and may not be needed very often in practice due to the underlying re-try/durability of Dilation itself -- more experience + research needed :) |
The situation you're thinking about is that the sender and receiver have a long-lived connection, and the receiver is set to automatically download everything that the sender offers except if the receiver already has the files in question in the target directory? I guess that's a possible issue, though probably far from a common one. At any rate, I think it would be okay to put this behind a flag like it is in wget ( On the other hand I think benefits go beyond resumption - if I have a ~20 GB directory of log files and I use a long-lived wormhole to sync them to another server, it would be nice not to have to send the full 20 GB every time the transfer happens.
This is the sort of thing that's potentially just as privacy impacting as the "do you already have X" thing. E.g. if I'm using a VPN to hide my IP address from the receiver, and the VPN connection drops, I might expect the connection to fail. But if it is automatically re-established and my routes revert to pushing data back over
Right - I don't see the point of making this a protocol requirement, because there's no way to enforce it. It's best effort anyway, because a (malicious?) program could absolutely make changes to a file without changing its size or modification time in between when the Wormhole client does its initial directory walk and when it actually sends the files. The only way to do this 100% safely is to checksum a file in advance, send the checksum to the receiver in the initial offer, and then you can check it again (and the receiver can check it) while you're streaming it in at some point in the future. (Maybe discussion of the dilated protocol needs to happen elsewhere. This is all nitpicking on my part, which could be a helpful contribution in some contexts but probably not here.)
I can only make a partial case for each of the features I've proposed, but my view was that because they go together so well, to have support for one of them is to get the others for free. (Or at least, easy support for them in the protocol.) Partial transfers, resumption from cold start, checksums (and retransmissions) for individual failed pieces, etc etc, all come together in one nice package if you allow the receiver to specify which files they want from an offer. So as my proposal implements it, the receiver "requests" pieces from an offer (after accepting it), rather than accepting and then receiving it passively.
A very trivial test I performed right now: on my home Internet connection, an
Or more to the point, the last time I had a major use case for Wormhole was assisting some scientists trying to transfer a large amount (~1TB) of data between computer nodes across the country. Unfortunately, both nodes were behind firewalls that they had no control over. You could do
More generally, the speed of a TCP stream is limited by the amount of data that can be in flight: throughput is at most the size of the TCP buffer for the socket divided by the latency (round trip time) of the connection. E.g. the default Linux buffer is something like 3 MB, so over a 100 ms connection (what we were dealing with in the case above), that's a limit of 30 MB/sec. That's why you can often see a speed up proportional to the number of TCP streams. (Some systems, e.g. the BSDs I believe, have difficulty handling high bitrates on a single TCP stream because the processing is single threaded. It's been a while since I looked at this, but back when I used an OPNsense router, I couldn't route more than 300 Mbps in a single stream on low power hardware.)
None of this is an issue for BitTorrent clients, but only because wanting to download at 30 MB/sec or more from a single peer would be extremely rare. So it's okay for BitTorrent to only do 1 connection per peer.
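To make the buffer-over-latency arithmetic concrete, here is a small Python sketch. The function name is made up for illustration, and the 3 MB buffer and 100 ms round trip are only the rough figures quoted above, not measurements. The multi-stream line assumes ideal scaling with no shared bottleneck, which real links will not always give you.

```python
def max_stream_throughput(buffer_bytes: int, rtt_ms: int) -> float:
    """Upper bound on one TCP stream's throughput, in bytes per second.

    At most one buffer's worth of data can be in flight per round trip.
    """
    return buffer_bytes * 1000 / rtt_ms


# Rough figures from the discussion: ~3 MB buffer, 100 ms round trip.
one_stream = max_stream_throughput(3_000_000, 100)
print(one_stream / 1_000_000)      # 30.0 (MB/sec for a single stream)

# Assumption: N independent streams each get their own buffer, so the
# bound scales linearly until something else becomes the bottleneck.
print(4 * one_stream / 1_000_000)  # 120.0 (MB/sec for four streams)
```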
Thanks for the offer. I may end up taking you up on that after I understand the current work being done better. If there's an IRC, Matrix channel, or maybe even a mailing list (!) for wormhole development, I'm happy to idle / subscribe to have an ear on what's happening. (It would be helpful not to miss stuff like dilated file transfers.) My biggest concern at present is that multiple streams for a single file transfer is likely to be hard to add in to the protocol later. :-) Your file transfer protocol needs to say "I don't care which transit pipe I get this data from, as long as I can clearly identify it", and then it doesn't seem that bad (at least with v1 transit, I'm not clear enough on the dilation specifics to comment on that). Most of the work has to be done in the low level transit protocol, and the more complex this is, the harder it is to modify. Is there a timeline for any of the work that is happening? The dilation stuff dates to 2018, and to be honest I assumed it was dead or happening very, very slowly. If we are hoping to do file transfers over the dilated protocol in the Rust client by the end of the year, that changes things significantly. |
We do have an IRC channel, on Libera |
I wouldn't add this to the transfer protocol, I'd add it to Dilation. Then the transfer (or whatever other higher-level) protocol still doesn't have to care about transport details -- it just opens a subchannel (and they happen to multiplex over however many TCP connections you want). This is of course an extremely complex topic. That's part of the point of separating the transport (Dilation) from the protocol (file transfer) here: they each deal with separate concerns. Of course, the Dilation API needs to be rich enough to handle application-protocol concerns. (Currently, for example, you can't obviously express "open a subchannel, but not on the same stream as this other one"). |
Okay, this is more interesting -- in other discussions about use-cases etc we've thought of this as "definitely about file-transfer" and not, e.g., "a synchronization protocol". For example, with UIDs, GIDs, timestamps, symlinks, hardlinks, etc. there are lots of issues here. So, I don't think we see "synchronize these two directories" as a direct use-case for the protocol (and I guess as you point out here, it's not currently going to be very great at that). However, perhaps that could be expressed as another offer-type? Or something? (I would also ask: "why not rsync?" or maybe "can we make rsync work over a wormhole?") I have played around with Dilation for general "forwarding of connections" (as has the Rust implementation), see e.g. https://meejah.ca/blog/fow-wormhole-forward (definitely "proof-of-concept" territory) and further thought of this as a possible "integration point" for experiments: anything that can be expressed as a localhost-listening server or a localhost-connecting client can "do stuff over wormhole" easily in this manner (at least as an experiment). |
Yes, absolutely. The "hint" system does have to be careful here. Currently, that's expressed as "use Tor" (or not) and if you're "using tor" then it only does Tor hints. (I don't think it's impossible to get a "partial file" system working well, but there are some extra things to get right here for sure). |
Yeah, some more eyes on the Dilated File Transfer (and/or Dilation) would be great -- please do feel free to comment on those parts directly too (e.g. on #23 for the Dilated File Transfer stuff). Re-reading that section now (I think you mean around line 226) it does seem fairly prescriptive in implementation details (when the real ask is that the peer should try its best to ensure the "actually sent" files match what was originally offered). Indeed, a hash is probably the only good way to ensure they do in fact match. I believe the thinking there might have been that it's faster to do |
I was thinking along the lines of: "here is |
To the general point about "offer a thing, but only take some of it" -- would that use-case be answered by having a new offer-type that's like the Directory offer, but gives the receiver an opportunity to answer back with more than a "yes/no"? That is they can reply with "yes, but: only A, D and Z..." or similar? And to clarify timeframes: yes, we expect to be doing Dilated File Transfer over Rust before the end of 2023. (There is PoC-level code for this in Python already, so I also expect to have full-featured Python support in that timeframe as well). I do like incremental approaches where possible, hence the "features" flags etc. So, for example, the above suggestion could take that form (similar to compression) and thus be implemented on a longer timeframe. |
@afontenot it would be nice to answer "why this exists" directly upfront. As I understand it, the answer is given later.
I don't think uncompressed transfer is a feature. The ability to negotiate a compression protocol may be a feature, but for one-shot transfers KISS is better. "Receiver wanting only some files" looks like an attempt at bandwidth optimization that increases server, client and usage complexity. Again, why would a sender use one-shot transfers to transmit unneeded files, and why does the client need to filter them at the protocol level? There are more maintainable ways to handle this use case. Resumption of downloads looks like a wrong protocol choice too. There are many file syncing solutions out there. |
The following is a draft of a protocol extension.
Versions:
File-Transfer Protocol (info-dict extension)
The Magic Wormhole File-Transfer Protocol involves two stages. In the first, a Wormhole connection is mediated between the sender and the receiver by a third party Rendezvous Server. The connection is established by a PAKE which results in encrypted communications not readable by a third party, including the server.
In the first stage, at present, the sender provides an offer to the receiver. This offer is currently one of three types:

- `message` for a text message
- `file` for sending a single file
- `directory` for sending a directory of files compressed into a single archive file

If the receiver accepts the offer, the protocol moves into the second stage. The transit protocol involves the transfer and validation of the message, file, or archived directory (as appropriate) over a different connection, which is created using connection hints sent over the Wormhole. Once this transit connection is created, the Wormhole is typically closed.
This extension to the File-Transfer protocol involves two components:

- `info` key (and associated values) to the `offer` message
- `info` offers

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The info-dict offer
The `info` dictionary offer is based on the info dictionary from the BitTorrent protocol, and it is intended that info dictionaries from compliant BitTorrent v1 torrents also be valid under this specification, when encoded as JSON.
An info-dict offer enables three features that are not available in other modes:
To offer an info-dict for download, the sender SHALL include an `info` key in the `offer` dictionary. This `info` key SHALL include the following top level keys:

- `name` is the suggested file name (if the offer contains a single file) or directory name (if the offer contains multiple files).
- `piece length` is an integer and is the base 2 logarithm of the piece size (as defined below) in bytes.
- `pieces` is a string containing the concatenated hashes of every piece in the offer. The order of the hashes is the order of the files (if the offer contains multiple files), and then in-order through each file from beginning to end. The pieces are the offered file set split into chunks (which can span across files) of size equal to the piece size, which is `2 ^ piece length`.

The `info` key SHALL also contain either but not both of the following keys:

- `length` is the exact number of bytes in the file, if the offer contains only one file (with the name given by the `name` key).
- `files` specifies a list of files provided by the offer, if the offer contains multiple files (with the top level directory given by the `name` key). The ordering of the files is significant as it specifies the relationship between the pieces and the files. Each item in the list is a dictionary that SHALL contain the following two keys:
  - `length` is the number of bytes in the file
  - `path` is a list of strings providing the path to the file under the top level directory. In this list, the last string is the suggested name of the file. Each string before this final string (if any) is the name of a directory contained by the directory immediately preceding it, and the first directory in the list (if any) is contained by the top level directory.

Receivers SHALL either ignore or replace characters in the `path` that are invalid on their operating system or file system, or reject the offer if it contains these characters. In addition, receivers SHALL NOT interpret any `path` component that would cause directory traversal (such as a ".." component on some systems) or placing files outside the top level directory.

The `info` key SHOULD also contain the following key:

- `hashtype` specifies the hash that is used for the piece hashes provided in the `pieces` string.

Receivers compliant with this specification SHALL support the following values for `hashtype`: "sha256", "blake", "blake160", and "sha1". Receivers MAY support other hash types. The "sha256", "blake", and "sha1" values indicate that the corresponding hashes are standard SHA-256, BLAKE2b, and SHA-1 hashes (respectively) with the default digest size. The value "blake160" is the BLAKE2b hash function with a digest size of 20 bytes. This was chosen to correspond to the hash size of the SHA-1 function (which is used by the BitTorrent protocol), while retaining excellent resistance to collision attacks.
An info-dict offer with no provided `hashtype` SHALL be interpreted to have a hash type of SHA-1 for historical compatibility. However, senders SHOULD provide a `hashtype` value and SHOULD NOT use the SHA-1 hash.
Senders and receivers SHOULD NOT limit the piece size beyond the expected limitations of the hardware they run on. It is RECOMMENDED that senders default to a 64 MiB piece size (2^26 bytes). Where the final piece of the last offered file does not coincide with the exact piece size boundary, the hash for the piece SHALL be the hash of the actual data, with no padding.
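As an illustration of the piece and hash rules above, the following Python sketch splits a file set into pieces that may span files and hashes each one, leaving the final short piece unpadded. The function names are hypothetical, and the files are held in memory purely for brevity. The draft does not say how the binary digests are carried inside the JSON `pieces` string, so this sketch just returns the raw concatenated bytes and leaves that encoding open.

```python
import hashlib
from typing import Iterable


def new_hasher(hashtype: str):
    """Map the draft's hashtype values onto hashlib constructors."""
    if hashtype == "sha256":
        return hashlib.sha256()
    if hashtype == "blake":
        return hashlib.blake2b()                # default 64-byte digest
    if hashtype == "blake160":
        return hashlib.blake2b(digest_size=20)  # same length as SHA-1
    if hashtype == "sha1":
        return hashlib.sha1()
    raise ValueError(f"unsupported hashtype: {hashtype}")


def piece_hashes(file_contents: Iterable[bytes], piece_length: int,
                 hashtype: str = "blake160") -> bytes:
    """Concatenated digests of every piece, in offer order.

    piece_length is the base 2 logarithm of the piece size, as in the draft.
    """
    piece_size = 2 ** piece_length
    # Files are treated as one continuous byte stream, in list order,
    # so a piece can start in one file and end in the next.
    stream = b"".join(file_contents)
    digests = []
    for start in range(0, len(stream), piece_size):
        hasher = new_hasher(hashtype)
        # The last slice may be shorter than piece_size; it is hashed as-is.
        hasher.update(stream[start:start + piece_size])
        digests.append(hasher.digest())
    return b"".join(digests)


# Two tiny files and an 8-byte piece size (piece_length = 3): 17 bytes
# in total, so three pieces, the last one a single byte.
pieces = piece_hashes([b"a" * 10, b"b" * 7], piece_length=3)
print(len(pieces) // 20)  # 3
```

A real sender would stream from disk rather than join everything in memory, especially with the recommended 64 MiB piece size.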
The receiver SHALL indicate acceptance of an info-dict offer in the same way as for other offers under the File-Transfer Protocol.
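Before accepting, a receiver has to decide how to treat the offered `path` components. Of the options the draft allows (ignore, replace, or reject), the sketch below takes the strictest one and rejects. The function name and the exact set of rejected characters are assumptions for illustration; a real client would tailor the check to its own operating system and file system.

```python
from pathlib import PurePosixPath
from typing import List


def safe_relative_path(name: str, path: List[str]) -> PurePosixPath:
    """Join the offer's top level name and a file's path components.

    Raises ValueError for components that could escape the top level
    directory. This is a conservative, illustrative policy.
    """
    parts = [name, *path]
    for part in parts:
        if part in ("", ".", ".."):
            raise ValueError(f"unsafe path component: {part!r}")
        if "/" in part or "\\" in part or "\x00" in part:
            raise ValueError(f"unsafe path component: {part!r}")
    return PurePosixPath(*parts)


print(safe_relative_path("logs", ["2023", "a.log"]))  # logs/2023/a.log
```

The returned path is relative, so the receiver still chooses the download directory it is joined onto.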
Transit protocol extensions for info-dict support
This specification is intentionally opaque about the nature of the transit protocol. The only requirement is that the protocol support both transfer of binary data as well as JSON-encoded control messages, and that both the sender and receiver be able to distinguish the two.
In particular, it is not specified whether the connection happens directly or through a relay server, whether the connection is TCP or UDP, or whether a single stream or multiple simultaneous streams are used.
However, typical connections will be established using connection hints as specified in the File-Transfer Protocol specification, and they are expected to be encrypted using secrets exchanged through the rendezvous connection. See the Wormhole Transit Protocol specification for more information.
Clients compatible with this specification add support for several message types over the transit protocol.
Receiver size hints
Immediately upon establishing a transit connection, a receiver SHOULD send a message containing a `wants` key. If provided, this key MUST contain a value indicating the exact number of bytes from the offer that the receiver expects to request. A client on the sending side SHOULD use this information to provide an accurate indication of progress, if the client provides progress indicators.
If for any reason (except for checksum validation errors) the number of bytes the receiver expects to download changes, the receiver SHOULD send an updated `wants` message. These messages MUST contain the total number of bytes the receiver expects from the entire transfer, including from pieces already downloaded. Receivers that send this message MUST NOT double-count bytes from pieces that fail checksum validation or are otherwise downloaded multiple times.
Receiver requested pieces
Pieces are sent by the sender only when they are requested by the receiver. Receivers queue up pieces to be sent with a request message. Receivers SHOULD keep enough requests queued up that they are not left waiting for data between downloading pieces. Request messages SHALL take the following form:
Here, the numbers indicate the (zero-indexed) offset to the pieces provided in the offer. Note that the sender and receiver can determine both the byte offset (in the set of offer files) and the hash offset (in the `pieces` string), because both the piece size and hash digest size are defined in the `offer`.
Receivers SHOULD always request the pieces they want in numerical order. Requesting data sequentially through the files allows for more efficient, predictable i/o on many systems.
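The offset calculations mentioned above can be sketched in a few lines of Python. All function names here are hypothetical. `pieces_for_file` shows one way a receiver that only wants a single file could work out which piece indices to request, already in numerical order. Because pieces can span files, the first and last piece may also carry bytes from neighbouring files.

```python
from typing import List


def piece_byte_offset(index: int, piece_length: int) -> int:
    """Byte offset of a piece within the concatenated set of offer files."""
    return index * 2 ** piece_length


def piece_hash_offset(index: int, digest_size: int) -> int:
    """Byte offset of a piece's digest within the raw concatenated hashes."""
    return index * digest_size


def pieces_for_file(file_lengths: List[int], file_index: int,
                    piece_length: int) -> range:
    """Piece indices (ascending) that overlap the given file."""
    piece_size = 2 ** piece_length
    start = sum(file_lengths[:file_index])   # first byte of the file
    end = start + file_lengths[file_index]   # one past its last byte
    if end == start:
        return range(0)                      # empty file: nothing to fetch
    return range(start // piece_size, (end - 1) // piece_size + 1)


# Files of 10 and 7 bytes with 8-byte pieces (piece_length = 3).
print(list(pieces_for_file([10, 7], 1, 3)))  # [1, 2]
print(piece_byte_offset(2, 3))               # 16
print(piece_hash_offset(2, 20))              # 40
```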
Upon receiving a request for a piece, the sender SHALL send it through the transit protocol in the appropriate manner for binary data.
Accepting / rejecting / re-sending pieces
The receiver SHALL check each piece as it comes in against the hash digest given in the offer, and SHALL reject any piece that does not match, unless strong mitigating circumstances prevail. Examples of such circumstances include the sender having an incorrect or incomplete copy of the file, and the user / operator of the receiver having actively requested to accept data that fails checksum validation. If such circumstances are expected to occur, receiver software MAY choose to implement support for ignoring checksum failures, with an appropriate warning.
When a piece fails a check, a receiver MAY choose to request the same piece again. Senders are RECOMMENDED to provide a piece again if requested. Either side MAY choose to hang up the connection if a request repeatedly fails.
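A minimal sketch of the check a receiver might run on each incoming piece, assuming the "blake160" hash type and assuming `pieces` has already been decoded into raw bytes (the draft does not state how the hashes are encoded in JSON). The function name is made up for illustration.

```python
import hashlib
import hmac


def piece_matches(index: int, data: bytes, pieces: bytes,
                  digest_size: int = 20) -> bool:
    """Return True if data hashes to the digest stored for piece `index`."""
    expected = pieces[index * digest_size:(index + 1) * digest_size]
    actual = hashlib.blake2b(data, digest_size=digest_size).digest()
    # compare_digest also returns False when the lengths differ, which
    # covers an index past the end of the pieces string.
    return hmac.compare_digest(expected, actual)


pieces = (hashlib.blake2b(b"abc", digest_size=20).digest()
          + hashlib.blake2b(b"def", digest_size=20).digest())
print(piece_matches(1, b"def", pieces))  # True
print(piece_matches(0, b"def", pieces))  # False
```

A receiver could call this once per piece and re-request the piece, up to some retry limit of its own choosing, whenever it returns False.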
Acknowledgements
When a piece succeeds, the receiver SHALL send an acknowledgement in the following form:
Note that the receiver MAY send individual acknowledgements for each piece separately, but if multiple pieces enter the finished state before it sends an acknowledgement, it MAY acknowledge both at once as shown above.
Ending the connection
At any point, the receiver MAY hang up the connection with a success indication by sending
FAQ
Why include support for legacy hashes like SHA-1?
The intention is to make adding support for this protocol extension as easy as possible for implementers. A large quantity of software already exists for creating and handling BitTorrent format info dictionaries, and using this software is likely to be the quickest way to implement support in many cases. Furthermore, collision resistance is rarely relevant to file transfer cases. Preimage resistance is far more important, and SHA-1 retains this. Other than the `blake160` hash with a non-standard digest length, SHA-1 also has the shortest digest of any hash with REQUIRED support in this specification. Shorter hashes make for more efficient `info` dictionaries.
Is this implementing BitTorrent support for Magic Wormhole?
No. This protocol extension provides a new `offer` format that allows sending a set of files between a single sender and (usually) one receiver, where the metadata provided for the offer is compatible with that used by the BitTorrent info-dict specification, but the protocols are otherwise unrelated.
What does this achieve that Magic Wormhole cannot achieve without it? Is using BitTorrent a better choice for this use case?
As mentioned above, this allows sending multiple files without involving the overhead of an archive format, as well as partial downloads and updates to previously shared data. BitTorrent is not a plausible alternative to this use case. In particular, with this extension, Magic Wormhole implements:
- a highly secure connection between a sender and receiver. BitTorrent does not support modern, secure forms of encryption between clients.
- an efficient transport mechanism for one-to-one and one-to-many transfers, thanks to the opacity of the file transfer protocol to the underlying transit protocol. BitTorrent is optimized for the many-to-many case, and only creates a single TCP or UDP connection between pairs of peers.
- a conversation establishing mechanism for two peers who want to talk to each other, and no one else, via the mailbox protocol. BitTorrent would require two peers to know each other's IP addresses and does not provide any mechanism for authentication.
Why emphasize the opacity of the transit protocol so much?
This feature gives Wormhole clients a lot of flexibility and potential for speed. The author of this extension specification is also working on a transit protocol extension that would allow two clients to keep open multiple transit connections between them and use them simultaneously when exchanging binary data (e.g. the pieces in this specification). Parallel transfers frequently offer an enormous speedup over sequential ones. Hopefully, with both extensions in place, Wormhole clients will be capable of multiple-Gbps transfers on commodity hardware.