Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading large rooms takes 1000s+, times out resulting in confusing client error #17530

Open
ara4n opened this issue Aug 6, 2024 · 1 comment
Labels
A-Performance A-Room-Upgrades good first issue This is a fix that might be an easy place for someone to start for their first contribution O-Uncommon S-Major T-Defect

Comments

@ara4n
Copy link
Member

ara4n commented Aug 6, 2024

Description

Upgrading large rooms takes more than 10s, but after 10s synapse emits some kind of timeout error, leaving the client in a mess.

Steps to reproduce

  • upgrade a big room in EW with lots of bans (e.g. #element-web:matrix.org which had 15K people in it)
  • "failed to fetch" error turns up after 10s leaving room half-upgraded in some horrible limbo: Failed to upgrade room version element-meta#2530

Homeserver

matrix.org

Synapse Version

Whatever was live 2024-07-29T14:23:58.642Z

Installation Method

Other (please mention below)

Database

postgres

Workers

Multiple workers

Platform

debian

Configuration

No response

Relevant log output

Go look at the logs, if they're still there...

Anything else that would be useful to know?

No response

@anoadragon453
Copy link
Member

Here are the logs from Synapse:

3670827:2024-07-29 14:53:13,497 - synapse.http.server - 782 - WARNING - POST-8aadc2a6593fbef6-LHR - Not sending response to request <XForwardedForRequest at 0x7f18a38b3ca0 method='POST' uri='/_matrix/client/v3/rooms/!YTvKGNlinIzlkMTVRl%3Amatrix.org/upgrade' clientproto='HTTP/1.1' site='8080'>, already disconnected.
3670828:2024-07-29 14:53:13,498 - synapse.access.http.8080 - 473 - INFO - POST-8aadc2a6593fbef6-LHR - 77.108.142.78 - 8080 - {@matthew:matrix.org} Processed request: 1854.869sec/-1674.869sec (132.897sec, 15.687sec) (90.976sec/342.120sec/42165) 0B 200! "POST /_matrix/client/v3/rooms/!YTvKGNlinIzlkMTVRl%3Amatrix.org/upgrade HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) ElementNightly/2024072901 Chrome/126.0.6478.127 Electron/31.2.1 Safari/537.36" [35554 dbevts]

Parsing the request log, we can see that the request took 1854s to complete. It looks like Synapse did not time out the request - but potentially the reverse proxy? Either way, this is exceedingly long.

It looks like this request took a long time due to copying over user's push rules and account data from this room to the other. The logs are littered with:

...
Waiting for repl stream 'account_data' to reach 3244591758 (account_data-1); currently at: 3244591756
Waiting for repl stream 'push_rules' to reach 236699064 (push_rules-1); currently at: 236699062
...

This appears to happen user-by-user sequentially over replication:

async def copy_user_state_on_room_upgrade(
self, old_room_id: str, new_room_id: str, user_ids: Iterable[str]
) -> None:
"""Copy user-specific information when they join a new room when that new room is the
result of a room upgrade
Args:
old_room_id: The ID of upgraded room
new_room_id: The ID of the new room
user_ids: User IDs to copy state for
"""
logger.debug(
"Copying over room tags and push rules from %s to %s for users %s",
old_room_id,
new_room_id,
user_ids,
)
for user_id in user_ids:
try:
# It is an upgraded room. Copy over old tags
await self.copy_room_tags_and_direct_to_room(
old_room_id, new_room_id, user_id
)
# Copy over push rules
if self._is_push_writer:
await self.store.copy_push_rules_from_room_to_room_for_user(
old_room_id, new_room_id, user_id
)
else:
await self._copy_push_client(
instance_name=self._push_writer,
user_id=user_id,
old_room_id=old_room_id,
new_room_id=new_room_id,
)
except Exception:
logger.exception(
"Error copying tags and/or push rules from rooms %s to %s for user %s. "
"Skipping...",
old_room_id,
new_room_id,
user_id,
)
continue

The main process handles the room upgrade request, while the push-rules-1 and account-data-1 workers handles those changes respectively.

We could shave a lot of time and resources off processing this request by just doing those two operations in larger batches than 1.

@anoadragon453 anoadragon453 added A-Performance A-Room-Upgrades good first issue This is a fix that might be an easy place for someone to start for their first contribution S-Major O-Uncommon T-Defect labels Aug 7, 2024
@anoadragon453 anoadragon453 changed the title Room upgrade timed out and errored after 10s, causing the client grief Upgrading large rooms takes 1000s+, times out resulting in confusing client error Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Performance A-Room-Upgrades good first issue This is a fix that might be an easy place for someone to start for their first contribution O-Uncommon S-Major T-Defect
Projects
None yet
Development

No branches or pull requests

2 participants