Users who join an encrypted room at the same time as a message is sent may receive a UTD (join/send race) #2268
I think we could address this with something like the following:

Step 1: make sure that Alice's client's idea of the room membership matches that of the room DAG at the point of the sent message. We could do this by having the client submit a hash of what it thinks the room membership is when it sends the message. Alice's server can then cross-reference the hash with the actual room membership once it has decided where in the DAG to hang the new event. If there is a mismatch, the server rejects the send attempt and Alice's client resyncs the membership somehow. (If matrix-org/matrix-spec#1209 (edit: and element-hq/synapse#16940 (comment)) were fixed, that might just be a matter of having Alice's client redo the encryption attempt after a new …)

Step 2: this now crosses over with historical room key sharing (cf. element-hq/element-web#26867), but in short: when an event is served to Bob by Bob's server, we could include an indication of whether Bob was a member of the room at the point in the DAG of the event. This would give Bob's client an indication of whether it should expect to be able to decrypt the event.
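A minimal sketch of the Step 1 check follows. It assumes the hash covers only the set of joined user IDs and is made order-independent by sorting. The function names, the choice of SHA-256 and the newline separator are illustrative assumptions, not part of any spec.

```python
import hashlib


def membership_hash(members: set[str]) -> str:
    """Hash a room's joined membership in an order-independent way.

    Sorting first means client and server get the same digest regardless
    of the order in which they learned about the members. Matrix user IDs
    cannot contain newlines, so "\n" is a safe separator.
    """
    canonical = "\n".join(sorted(members)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def server_accepts_send(client_hash: str, members_at_dag_position: set[str]) -> bool:
    """Hypothetical server-side check: reject the send if the client's view of
    the membership differs from the membership at the point in the DAG where
    the server has decided to place the new event."""
    return client_hash == membership_hash(members_at_dag_position)
```

In the join/send race, Alice's client hashes `{"@alice:a.org", "@bob:b.org"}` while the server already sees `@carol:c.org` as joined. The hashes differ, so the send is rejected and Alice's client can resync the membership and re-encrypt.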
Outcome from workshop today
In a room with no history visibility, imagine these events happen (because there is a netsplit):

graph TD;
Before-->Send;
Send-->After;
Before-->Join;
Join-->After;
Before: User1 and User2 are members.

Imagine that Join happens well before Send in wall-clock time. Does User3's client receive the event sent in …

It would be great if we could have a test from @kegsay that demonstrates whether this event arrives at User3's client.
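The key point in the graph is that membership "at" an event depends on its ancestors in the DAG, not on wall-clock order. Below is a deliberately simplified model of this, assuming that the Join event is User3 joining. Real servers compute room state with state resolution, not the plain union used here, and the names are hypothetical.

```python
# Edges point from parent (earlier) event to child (later) event,
# mirroring the graph above.
EDGES = {
    "Before": ["Send", "Join"],
    "Send": ["After"],
    "Join": ["After"],
    "After": [],
}

# Membership added by each event (assumption: Join is User3 joining).
JOINS = {"Before": {"User1", "User2"}, "Join": {"User3"}}


def ancestors(event: str) -> set[str]:
    """All events reachable by walking parent edges back from `event`."""
    parents = {p for p, children in EDGES.items() if event in children}
    result = set(parents)
    for p in parents:
        result |= ancestors(p)
    return result


def members_at(event: str) -> set[str]:
    """Members at the point in the DAG of `event` (state before it),
    simplified as the union of joins in its ancestors."""
    members: set[str] = set()
    for anc in ancestors(event):
        members |= JOINS.get(anc, set())
    return members
```

Here `members_at("Send")` is `{"User1", "User2"}` even if Join happened first in wall-clock time, because Join is not an ancestor of Send. Only at After do all three users appear.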
I encountered this in the wild today. Matthew did not realise I was part of the room and hence did not encrypt his message for me. As a result, all my clients are unable to decrypt this message. See screenshots. In this case:
Total lag: 47s. I don't think the hash solution works here because, from matrix.org's point of view:
I am not in the room, hence the hash would always be valid (assuming the lag isn't due to intra-HS components). TODO: check the room to see if the room forked at this time to be sure.

I think we can all agree that in this case, it is an expected UTD and should be suppressed by the receiving client. The problem is how to reliably detect this. The lag could have been minutes or hours, causing more than one UTD from Matthew.

Sending some extra information in the …

In the most naive solution, if the client added the entire set of user|device ID pairs to the event, my client could clearly see I'm not part of that list and hence suppress the event. However, large E2EE rooms with many members exist, as do pathological users with many devices, so this data could be large. We could use a bloom filter here to provide probabilistic suppression, and then send the bitmask with the event. We can calculate the values here. Assuming we want:
In extremely large rooms we could tweak these values:
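The sizing mentioned above can be estimated with the standard Bloom filter formulas. The filter needs m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, where n is the number of user|device pairs and p is the acceptable false-positive rate. The specific targets from the original comment were lost, so the inputs in the usage note are only examples.

```python
import math


def bloom_parameters(n: int, p: float) -> tuple[int, int]:
    """Optimal Bloom filter size for n entries and false-positive rate p.

    Returns (size_in_bytes, number_of_hash_functions) using
    m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2.
    """
    m_bits = -n * math.log(p) / (math.log(2) ** 2)
    k = max(1, round(m_bits / n * math.log(2)))
    return math.ceil(m_bits / 8), k
```

For example, `bloom_parameters(20, 1e-7)` gives `(84, 23)`: 20 entries at a 1-in-10-million false-positive rate fit in 84 bytes with 23 hash functions. Size grows linearly with n, so tweaking p is the main lever for very large rooms.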
The sending logic would be something like:
The receiving logic would be something like:
Critically, these properties mean:
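The sending and receiving logic can be sketched as follows. This is a minimal, hypothetical implementation; the class and function names and the use of SHA-256 with double hashing (h1 + i·h2) are assumptions. The sender adds every "user_id|device_id" it shared the room key with. The receiver checks its own pair: a Bloom filter never gives false negatives, so "not present" means the sender definitely did not encrypt for this device and the UTD is expected. The default of 671 bits and 23 hashes is what the standard formulas give for 20 entries at a 1-in-10-million false-positive rate.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over "user_id|device_id" strings."""

    def __init__(self, size_bits: int, num_hashes: int) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # ensure h2 is non-zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_sent_filter(recipients, size_bits=671, num_hashes=23) -> bytes:
    """Sender: record every (user_id, device_id) the room key was shared with."""
    bf = BloomFilter(size_bits, num_hashes)
    for user_id, device_id in recipients:
        bf.add(f"{user_id}|{device_id}")
    return bytes(bf.bits)


def is_expected_utd(sent_bits: bytes, user_id: str, device_id: str,
                    size_bits=671, num_hashes=23) -> bool:
    """Receiver: True if this device was definitely not a recipient,
    so the UTD can be suppressed rather than shown as an error."""
    bf = BloomFilter(size_bits, num_hashes)
    bf.bits = bytearray(sent_bits)
    return not bf.might_contain(f"{user_id}|{device_id}")
```

A genuine recipient is never told its UTD is expected. A non-recipient is suppressed except with probability roughly p, in which case it sees an ordinary UTD, which is the current behaviour anyway.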
How about simply hiding all consecutive UTDs you receive after joining a room, on the assumption that they're old messages which predate you joining?
Assuming the hiding is per-sender rather than for the entire timeline (remember this was just you having this problem), that won't reliably stop UTDs from being displayed. It definitely helps in this particular case (you joined and get UTDs), but there are related failure modes which this doesn't help with because there isn't a convenient start point (the join event) which you can use as a guide for when to start hiding messages. Notably, when you log in on a new device, until that device is synchronised with the sender, you'll get UTDs from the sender. We could generalise this even further and just hide all UTDs in the room until you successfully decrypt a single event from that sender (though really it should be that device), but I worry that would mask a ton of real failure modes.
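To make the generalised heuristic concrete, here is a hypothetical per-device version. UTDs from a sender's device are hidden until at least one event from that device decrypts successfully. The class name is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class UtdHider:
    """Hide UTDs from a (sender, device) pair until one of its events decrypts."""

    decrypted_from: set = field(default_factory=set)

    def on_decrypted(self, sender: str, device_id: str) -> None:
        # Once anything from this device decrypts, later UTDs are "real".
        self.decrypted_from.add((sender, device_id))

    def should_show_utd(self, sender: str, device_id: str) -> bool:
        return (sender, device_id) in self.decrypted_from
```

The masking concern is visible in the code. A device whose keys never reach us never calls `on_decrypted`, so every UTD it produces stays hidden, including genuine failures.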
We did something similar in the past for historical messages, and it was extremely flaky. The problem with any heuristics around UTDs is that it's very easy to end up catching lots of other failure modes by accident, and hiding half the timeline.
To clarify #2268 (comment): there are basically two orthogonal solutions being proposed, both of which help guard against UTDs:
Or if it did (e.g. because history visibility settings permit it), your server should have given your client a caveat along the lines of "your membership state was actually …"
This would be a lovely way to suppress UTDs of messages prior to joining, in a more reliable way than an entirely client-side approach, imo. My concern with all of this, though, is that it's only a partial solution. You can get the same failure mode when you log in with a new device, and then the DAG markup solution falls short. The bloom filter approach would work for both joins and new devices, and handles race conditions through the entire stack (CS and federation). The downside is extra bandwidth consumed, but for very low bandwidth use cases you'd not be using E2EE at all anyway (as ciphertexts cannot be compressed well, and typically you'd rely on network-level encryption if you're that focused on byte saving).
The other downsides are:
I do think the bloom filter could be useful for some problems where we don't have better ideas (element-hq/synapse#2165, maybe?), but I think there's room for both solutions.
I agree, but with limited time and developer effort, we surely want the biggest bang for our collective buck. All these solutions need MSCs. The sending hash would require client and server changes. The receiving server markup also requires client and server changes. Even with both of those, they don't help new device logins. The bloom filter just needs client changes, adding a few keys to the
We could probably layer more bloom filters on top to include that information. E.g.:
{
"sent": 00010101... // the main "did I send this to your user|device key?" bloom filter
"only_verified": 000101... // if you're in this map then this is because we only send to verified devices
"no_otk": 0001... // if you're in this map then this is because there were no otks/fallback key available to use
}
This would then form multiple sets:
Because we assume these other filters are only going to contain a few elements, if any, they can be very small whilst keeping very low false-positive rates: a 1-in-10-million false-positive rate for 20 entries needs only 84 bytes. This assumes that if you're in …
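A receiving client could combine the three filters into a reason for the UTD, roughly as below. Each filter is modelled as a membership test (for example a Bloom filter's "might contain" method; plain sets work for illustration). The check order and the returned labels are assumptions, since the comment's list of resulting sets did not survive.

```python
from typing import Callable

# Any "is this user|device key in the filter?" predicate.
Filter = Callable[[str], bool]


def classify_utd(key: str, sent: Filter, only_verified: Filter, no_otk: Filter) -> str:
    """Hypothetical classification of a UTD for the "user_id|device_id" `key`."""
    if sent(key):
        return "unexpected"          # sender (probably) shared the key with us
    if only_verified(key):
        return "unverified_device"   # withheld: sender only sends to verified devices
    if no_otk(key):
        return "no_otk"              # withheld: no one-time/fallback key available
    return "not_a_recipient"         # sender did not know about this device
```

Only the "unexpected" case would be reported as a real decryption failure. The others could be shown with a specific explanation or suppressed.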
matrix-org/matrix-spec-proposals#4115 proposes a protocol change for this part of my idea.
We have to be careful here: the sender needs to be aware that their message is going to be sent to people who were not in the room when they wrote the message. This may mean we need to have the client prompt the sender about the membership changes and give them a chance to cancel sending.
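The check that would trigger such a prompt is a set difference. It compares the membership the sender saw while composing with the membership at the time the message is actually sent. The function names below are hypothetical.

```python
def newly_joined(members_when_composing: set[str], members_at_send: set[str]) -> set[str]:
    """Users who will receive the message but were not in the room when it was written."""
    return members_at_send - members_when_composing


def should_prompt_sender(members_when_composing: set[str], members_at_send: set[str]) -> bool:
    """Hypothetical client rule: ask before sending if anyone new has joined."""
    return bool(newly_joined(members_when_composing, members_at_send))
```

If Bob joins while Alice is typing, `newly_joined` returns `{"@bob:b.org"}` and the client can ask Alice whether to send anyway or cancel.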
This is exacerbated by federation, but the same thing can happen even on a single server. Ultimately, there is always a race between Alice's client calculating/fetching the membership list, and the message being sent.
Now: it's not clear that Bob should be able to decrypt a message that was sent 3 seconds after he joined the room — at a point where Alice didn't even know he was in the room — any more than he should be able to decrypt a message sent before he joined the room. The problem here is that the message is perceived as a product defect (and contributes to our "user saw a UISI" metric). So probably what we want to do is convey to Bob's client that it should not expect to be able to decrypt this message.
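On Bob's client, such a server-provided hint could drive the UTD display roughly as follows. The sketch assumes the server annotates each event with the requesting user's membership at that point in the DAG, under `unsigned.membership`. That field placement is an assumption made for illustration, in the spirit of matrix-org/matrix-spec-proposals#4115.

```python
def utd_is_expected(event: dict) -> bool:
    """Return True if the server says we were not joined when this event was sent.

    A missing annotation returns False, so the UTD is still shown rather than
    silently hidden (older servers would not send the hint).
    """
    membership = event.get("unsigned", {}).get("membership")
    return membership is not None and membership != "join"
```

Events marked as expected could be rendered as "sent before you joined" and excluded from the UISI metric. Unannotated events keep today's behaviour.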