Slow user queries for an Application Service leads to persistent events failing to be sent #17621

Open
anoadragon453 opened this issue Aug 28, 2024 · 0 comments

For some context, #17206 was caused by this issue.

When an application service registers to receive persistent events (i.e. messages) from certain users/rooms, Synapse will push those events to the application service. It does this using the ApplicationServicesHandler.

Different types of persistent events are pushed in different code paths. m.room.member events with a membership type of join have their own path, while all other persistent events are pushed through ApplicationServicesHandler._notify_interested_services.

Within this method, Synapse iterates through each event to be sent and calls handle_event with it. During handle_event, Synapse blocks on querying GET /_matrix/app/v1/users/{userId} on the application service:

# Do we know this user exists? If not, poke the user
# query API for all services which match that user regex.
# This needs to block as these user queries need to be
# made BEFORE pushing the event.
await self._check_user_exists(event.sender)
if event.type == EventTypes.Member:
    await self._check_user_exists(event.state_key)

Following self.appservice_api.query_user down the stack, the HTTP request has a timeout of 60s.

This means that on a slow connection, Synapse will wait up to 60s per user query, and up to 2 minutes per event(!) when both the sender and the state_key are queried for an m.room.member event. It's highly possible that more than 0.5 events/min will be generated, causing this AS to fall behind.
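To make the arithmetic behind that estimate explicit, here is a rough back-of-the-envelope sketch. It assumes the worst case where every user query runs into the full 60s timeout and events are handled one after another; the function names are made up for illustration and are not part of Synapse.

```python
# Timeout of a single user query, as stated in this issue.
QUERY_TIMEOUT_S = 60


def worst_case_delay_s(num_events: int, queries_per_event: int,
                       timeout_s: int = QUERY_TIMEOUT_S) -> int:
    """Worst-case total wait if every query times out and events are
    processed sequentially (hypothetical helper, not Synapse code)."""
    return num_events * queries_per_event * timeout_s


def max_events_per_minute(queries_per_event: int,
                          timeout_s: int = QUERY_TIMEOUT_S) -> float:
    """Highest event rate the AS can keep up with in that worst case."""
    return 60 / (queries_per_event * timeout_s)


# One m.room.member event with two queries (sender and state_key).
print(worst_case_delay_s(1, 2))
# Sustainable rate when each event needs two queries.
print(max_events_per_minute(2))
```

Under these assumptions, any sustained rate above the second figure means the backlog for this AS keeps growing.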

Furthermore, we seem to be creating one AS txn per event, whereas we should be batching these up:

https://github.com/element-hq/synapse-private/blob/550d760364bf79147046c7c939af59db06ae17f1/synapse/handlers/appservice.py#L202-L206

I suggest we:

  • extract and de-duplicate the senders from all the events
  • query each user against the AS with a much lower timeout than 60s (5s?)
  • send those user queries in parallel to avoid one blocking all the others.
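The suggestion above could look roughly like the following. This is only a sketch under some loud assumptions: it uses plain asyncio rather than Synapse's Twisted-based machinery, `Event` is a simplified stand-in for Synapse's event class, and `users_to_query`, `query_users_in_parallel` and the `query_user` callable are hypothetical names, not existing Synapse APIs. It also treats a timed-out query as "unknown" (`None`), which is a design choice the real fix would need to decide on.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class Event:
    """Simplified stand-in for a Synapse event (assumption)."""
    type: str
    sender: str
    state_key: Optional[str] = None


# Hypothetical async callable that asks the AS whether a user exists.
QueryUser = Callable[[str], Awaitable[bool]]


def users_to_query(events: Iterable[Event]) -> Set[str]:
    """Extract and de-duplicate the users referenced by a batch of events."""
    users: Set[str] = set()
    for event in events:
        users.add(event.sender)
        # Membership events also reference the target user via state_key.
        if event.type == "m.room.member" and event.state_key is not None:
            users.add(event.state_key)
    return users


async def query_users_in_parallel(
    user_ids: Iterable[str],
    query_user: QueryUser,
    timeout_s: float = 5.0,
) -> Dict[str, Optional[bool]]:
    """Query every user concurrently, each with its own short timeout.

    Returns a mapping of user ID to the query result, or None if that
    query timed out. One slow query does not hold up the others.
    """

    async def query_one(user_id: str) -> Optional[bool]:
        try:
            return await asyncio.wait_for(query_user(user_id), timeout=timeout_s)
        except asyncio.TimeoutError:
            return None

    ordered = sorted(user_ids)
    results = await asyncio.gather(*(query_one(u) for u in ordered))
    return dict(zip(ordered, results))
```

With this shape, the wait for a whole batch is bounded by roughly one timeout (5s here) rather than 60s per query per event, and the events themselves could then be pushed only after all queries have settled.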