Clarify that arbitrary unicode is allowed in user/room IDs and room aliases #1506

Open
wants to merge 8 commits into base: main

Conversation

tulir
Member

@tulir tulir commented May 1, 2023

Nobody cares enough to implement matrix-org/matrix-spec-proposals#2828 or other spec changes that would disallow arbitrary unicode in the three identifiers, so here's a spec clarification to document how Matrix currently works.

Preview: https://pr1506--matrix-spec-previews.netlify.app

@tulir tulir requested a review from a team as a code owner May 1, 2023 10:23
Member

@anoadragon453 anoadragon453 left a comment

This looks like the right thing to do to me, but I'd like a second opinion.

@richvdh
Member

richvdh commented May 16, 2023

For the record: whilst I would love to see MSC2828 implemented, we can't actually do that without a room version bump, so, regardless of what happens with MSC2828, we have to document the status quo as it is :(.

content/appendices.md Outdated Show resolved Hide resolved
@turt2live
Member

@richvdh I've updated this with permission from tulir - please take another look.

@turt2live turt2live requested a review from richvdh August 9, 2023 18:13
content/appendices.md Outdated Show resolved Hide resolved
@@ -0,0 +1 @@
Clarify that arbitrary unicode is allowed in user/room IDs and room aliases.
Member

Strongly disagree on this.

The original vintage 2014 Synapse implementation only allowed characters that need no URL quoting in user localparts when registering for an account. Unfortunately, federation did not apply the same check, as alluded to by other comments here:

Arbitrary unicode ids exist in the wild too, but it has never been possible to create them without modifying the server.

If you are playing silly games (removing validation checks on your server) we should NOT set the precedent that we are just going to accept your games imo, otherwise where do you draw the line? Some comments in this proposal mention excluding the null byte, but why? Sure, Postgres cannot represent it when used with TEXT columns, but hey, I am using a modified Synapse which doesn't have this problem, so why are we allowing some silly games but not others?

The result is an inconsistent mess, and it was never designed to be that way. This isn't like other cases where "hey this is what synapse does, in the spec it goes" because in order to get the failure mode of unicode characters you need a malicious and/or buggy actor.

The consequence of making this the rule in the specification, and hence removing these checks in Dendrite, Conduit, et al., is an increased risk of homograph attacks. I cannot and will not support increasing the attack surface of Matrix just because a few people back in 2014 removed validation and sent unicode user IDs into a room, which Synapse accepted.

It is worth emphasising that room versions ARE NOT a get-out-of-jail-free card here, as user IDs are outside the scope of rooms. For example, the sliding sync proxy recently had an issue with unicode user IDs in device list changes. It's not hard to see how this can also be an issue with the user directory and to-device messages, both of which sit outside of rooms.

Counter proposal: sorry folks with smiley poos as user localparts, you're going to be broken in the next release of Synapse, and we remove / subsequently ignore events with malformed user IDs aka what we should have done in 2014.
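
To make the homograph concern above concrete, here is a minimal Python sketch. The two user IDs and the helper `differing_code_points` are hypothetical and exist only for illustration; the point is that two IDs which render almost identically can be different strings once arbitrary Unicode is accepted.

```python
import unicodedata

# Hypothetical user IDs: the second uses CYRILLIC SMALL LETTER A (U+0430)
# where the first uses LATIN SMALL LETTER A (U+0061).
latin_id = "@alice:example.org"
lookalike_id = "@\u0430lice:example.org"


def differing_code_points(a: str, b: str) -> list:
    """List (index, name_in_a, name_in_b) for each position where a and b differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


# The strings compare unequal even though most fonts render them alike.
print(latin_id == lookalike_id)
print(differing_code_points(latin_id, lookalike_id))
```

A server that compares user IDs as plain strings would treat these as two distinct users, which is exactly what makes the lookalike useful to an attacker.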

Contributor

While I agree that this only really happens because of people modifying their servers, the result of enforcing those checks in one implementation and not in another can end up being outright disastrous for innocent users.

For example, if Dendrite enforces these checks but Synapse doesn't, then all it takes is for a Synapse user with an emoji localpart to make a power level change or perform any kind of power action to break the room for Dendrite users, probably irreversibly.

Hell, they might not even have to perform a power action, because we might drop auth events (such as the user joining to begin with) or refuse to pull in prev events (from normal timeline events), which in turn causes us to run state resolution with a different set of input events (which can result in a different output state set or a complete state reset) or to accidentally propagate broken state to other servers when they ask us for /state_ids or /state.

This situation 100% sucks but Matrix only functions if implementations agree to handle these things in the same way, otherwise different implementations will never arrive at a consistent state. We already see this happen due to other corner cases and it just ends up feeling terrible for all users involved.

As for room versions, it is true that it's not a great solution to the problem due to the fact that it still leaks into device updates or other areas, but at least it's possible in a future room version to make sure that users with invalid localparts can't join the rooms to begin with. That is a huge step towards stamping out invalid localparts.

Servers SHOULD NOT produce user IDs with localparts outside of the following
character set, and SHOULD NOT forward such user IDs to clients when referenced
outside the context of an event. For example, device list updates from "invalid"
user IDs would be dropped by the receiving server.
Member

This is worse than just removing events by these users, as things will only half work (E2EE will entirely break for them in weird and wonderful ways).

Member

If you set your localpart to a smiley poo, E2EE will break for you. That's fine by me. What I care about is you not being able to DoS a room by setting your localpart to a smiley poo, which is why we have to accept events over federation whose senders are invalid; we don't have to do much else to help such users.

@richvdh
Member

richvdh commented Feb 27, 2024

@tulir, @turt2live: would one of you be able to address the questions in my review #1506 (review)?

@turt2live
Member

[...], @turt2live: would one of you be able to address the questions in my review #1506 (review)?

Which questions? The link hits a generic review. If there's outstanding comments needed, please ping inline on the threads.

@richvdh
Member

richvdh commented Mar 4, 2024

Which questions? The link hits a generic review. If there's outstanding comments needed, please ping inline on the threads.

Looks like @tulir has updated the PR and addressed my questions since I wrote that. Thanks @tulir!

@richvdh richvdh self-requested a review March 4, 2024 11:54
content/appendices.md Outdated Show resolved Hide resolved
@richvdh richvdh self-requested a review January 14, 2025 20:58
Comment on lines +614 to +616
and servers MUST accept user IDs with localparts consisting of any legal
unicode codepoint except for `:` and `NUL` (U+0000), including other control
characters and the empty string. Localparts MUST be valid UTF-8 sequences.
Member

We're talking about characters rather than bytes here, so "MUST be valid UTF-8 sequences" seems confusing. I think we want:

Suggested change
- and servers MUST accept user IDs with localparts consisting of any legal
- unicode codepoint except for `:` and `NUL` (U+0000), including other control
- characters and the empty string. Localparts MUST be valid UTF-8 sequences.
+ and servers MUST accept user IDs with localparts consisting of any legal
+ non-surrogate Unicode code points except for `:` and `NUL` (U+0000), including other control
+ characters and the empty string.
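
As a rough illustration of what the suggested wording would mean for an accepting server, here is a minimal Python sketch. The function name `is_acceptable_localpart` is hypothetical and not taken from any implementation; it only encodes the three exclusions named in the suggestion (`:`, `NUL`, and the surrogate range U+D800 to U+DFFF).

```python
def is_acceptable_localpart(localpart: str) -> bool:
    """Check a localpart against the suggested wording (illustrative only).

    Everything is accepted, including control characters and the empty
    string, except ':' , NUL (U+0000) and surrogate code points.
    """
    for ch in localpart:
        code_point = ord(ch)
        if ch == ":" or code_point == 0x0000:
            return False
        # Python strings can hold lone surrogates, so check for them explicitly.
        if 0xD800 <= code_point <= 0xDFFF:
            return False
    return True


for candidate in ["", "alice", "\U0001F4A9", "bell\x07", "a:b", "a\x00b", "\ud800"]:
    print(repr(candidate), is_acceptable_localpart(candidate))
```

Note that this is the rule for what a server must accept over federation, not a recommendation for what a server should allow at registration.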

Member

I did a bit of testing on Synapse. As far as I can tell, it rejects any attempt to use the surrogates. If you encode the surrogates as UTF-8 (<ed> <a0> <80>, say, which is the UTF-8 encoding of U+D800), and send that over HTTP, Synapse will reject the request when it tries to decode the UTF-8.

Likewise, it refuses to persist any strings containing the surrogates to the database, or to put them in outgoing HTTP requests/responses, because it fails to encode them as UTF-8.

I'm therefore pretty happy to say that nobody has to deal with the surrogates.
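
The behaviour described above matches how Python's standard strict UTF-8 codec treats surrogates. The sketch below is not Synapse code; the two helper names are made up for illustration and only show the codec-level behaviour in isolation.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if the bytes are accepted by Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encodes_as_utf8(text: str) -> bool:
    """Return True if the string can be encoded by Python's strict UTF-8 encoder."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# <ed> <a0> <80> is what U+D800 would look like if surrogates were encodable.
print(decodes_as_utf8(b"\xed\xa0\x80"))
# A string holding a lone surrogate cannot be written out as UTF-8 either.
print(encodes_as_utf8("\ud800"))
```

Both directions fail for surrogates, so a Python server using the default codec settings cannot receive or emit them by accident.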

@@ -663,6 +671,11 @@ Room IDs are case-sensitive. They are not meant to be
human-readable. They are intended to be treated as fully opaque strings
by clients.

The localpart of a room ID (`opaque_id` above) may contain any valid
unicode codepoints, including control characters, except `:` and `NUL`
Member

Again, I think we can and should exclude the surrogates here.

Suggested change
- unicode codepoints, including control characters, except `:` and `NUL`
+ non-surrogate Unicode code points, including control characters, except `:` and `NUL`
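
One practical reason for keeping `:` out of the localpart is that it lets a parser split on the first colon. The Python sketch below assumes the `!opaque_id:domain` form discussed here and assumes the domain part may itself contain a colon (for example a port); `split_room_id` is a hypothetical helper, not a function from any Matrix library.

```python
def split_room_id(room_id: str) -> tuple:
    """Split a room ID of the assumed form '!opaque_id:domain' into its parts.

    Because ':' is excluded from the localpart, the first ':' is always the
    separator, even if the domain contains further colons.
    """
    if not room_id.startswith("!"):
        raise ValueError("room ID must start with the '!' sigil")
    opaque_id, separator, domain = room_id[1:].partition(":")
    if not separator:
        raise ValueError("room ID has no ':' separator")
    return opaque_id, domain


print(split_room_id("!abc\u00e9\U0001F4A9:example.org:8448"))
```

If `:` were allowed in the localpart, the same string could be split in more than one way and two servers might disagree about which domain owns the room.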

Comment on lines +692 to +693
The localpart of a room alias may contain any valid unicode codepoints
except `:`.
Member

... do we allow NUL in aliases?

6 participants