Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Duplicated events are always removed #4706

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

Hywan
Copy link
Member

@Hywan Hywan commented Feb 21, 2025

When a duplicated event is found, sometimes we keep the older one and the new one is dropped, but sometimes the older one is removed and the new one is kept (backwards pagination vs. sync). So we need to be able to remove an event.

However, since #4662 and #4632, this can fail. Why? Because we can find a duplicated event in the store, and that is not loaded in the in-memory LinkedChunk. What happens in this case? We simply give up! See the error! case?

pub fn remove_events_by_id(&mut self, event_ids: Vec<OwnedEventId>) {
for event_id in event_ids {
let Some(event_position) = self.revents().find_map(|(position, event)| {
(event.event_id().as_ref() == Some(&event_id)).then_some(position)
}) else {
error!(?event_id, "Cannot find the event to remove");
continue;
};

Well, this is pretty bad, because it means the duplicated event will not be removed. And it means a new event with the same ID will be inserted. Kaboom 💥.

These patches fix that in simple steps:

  1. EventCacheStore::filter_duplicated_events returns the event positions,
  2. Deduplicator is redesigned to produce a DeduplicatorOutcome which makes the difference between in-memory vs. in-store duplicated events,
  3. Use this information accordingly.

These patches also fix inconsistencies between the BloomFilterDeduplicator (used when there is no store) and StoreDeduplicator (used when there is a store), like: the new events can contain duplicates (new_events = [$ev0, $ev1, $ev0]) in theory (not in practise so far). This check was only done in the bloom filter deduplicator. Now it's done for all of them, directly in Deduplicator. Same for detecting events with no ID, it's done beforehand. See the patches one by one.

Tasks

  • Do everything
  • Test

Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 97.11538% with 3 lines in your changes missing coverage. Please review.

Project coverage is 85.91%. Comparing base (f900db4) to head (a7ca6e5).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
crates/matrix-sdk/src/event_cache/deduplicator.rs 97.72% 1 Missing ⚠️
crates/matrix-sdk/src/event_cache/room/events.rs 83.33% 1 Missing ⚠️
crates/matrix-sdk/src/event_cache/room/mod.rs 96.15% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4706   +/-   ##
=======================================
  Coverage   85.90%   85.91%           
=======================================
  Files         292      292           
  Lines       33906    33950   +44     
=======================================
+ Hits        29128    29168   +40     
- Misses       4778     4782    +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Hywan Hywan force-pushed the fix-sdk-event-cache-remove-events-by-id branch from 32c7aa3 to a7ca6e5 Compare February 24, 2025 09:38
@Hywan Hywan marked this pull request as ready for review February 24, 2025 09:45
@Hywan Hywan requested a review from a team as a code owner February 24, 2025 09:45
@Hywan Hywan requested review from andybalaam and stefanceriu and removed request for a team and andybalaam February 24, 2025 09:45
Copy link
Member

@stefanceriu stefanceriu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments inline but otherwise makes sense and lgtm 👍

Comment on lines 246 to 247
/// Events **must** be sorted by their position (descending, i.e. from
/// newest to oldest).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit footgun-y, can't we sort them here too, similar to what the deduplicator is doing in its sort_events_by_position_descending?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I've rewritten this part. Deduplicator is no longer responsible to sort events by their position. This is done in a new RoomEventCacheState::remove_events, which is responsible to remove events in the store, and in RoomEvents.

Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! just skimmed the high-level changes so I could understand it, and it makes sense to me 👍

Hywan added 11 commits February 24, 2025 15:37
…tion`.

This patch changes `EventCacheStore::filter_duplicated_events` to return
the `Position` of the duplicated event.
This patch redesigns `Deduplicator::filter_duplicate_events`.

First off, `filter_duplicate_events` does remove events with no valid
ID. At the same time, it removes duplicate events within the new events
(`events`). This check was done in the `BloomFilterDeduplicator` but
not in the `StoreDeduplicator`. Now it's done at the front of these
implementations, directly inside `Deduplicator`.

Second, this patch introduces `DeduplicationOutcome` to replace the
return type `(Vec<Event>, Vec<OwnedEventId>)`, especially because
now it would have become `(Vec<Event>, Vec<(OwnedEventId, Position)>,
Vec<(OwnedEventId, Position)>)`. Why?

1. Because the positions of the duplicated events are returned,
2. We differentiate between in-memory vs. in-store duplicated events.

Third, now there are positions associated to duplicated events, events
must be sorted. It's the role of `sort_events_by_position_descending`.

This way, `DeduplicatorOutcome` brings guarantees and less checks are
required.
…ct place.

This patch uses `DeduplicationOutcome` to remove events either in
memory, or in the store, when required. The `remove_events_by_id` method
has been renamed `remove_events_by_position`.
This patch adds a test for `sort_events_by_position_descending`. It
also updates this function so that events are sorted by their chunk
identifier from newest to oldest, it makes no difference but it matches
the order of the position indices too. Everything “dimension” is
descending.
This patch adds a test ensuring that `Deduplicator` is able to find
duplicates in its own inputs.
This patch adds a test ensuring that `Deduplicator` excludes invalid
events, i.e. event with no ID.
Nothing fancy here. Just regular chore tasks.
… store.

This patch tests that `Deduplicator` dispatches duplicated events in the
correct field of `DeduplicationOutcome`.
This patch renames `EmptyChunk` into `EmptyChunkRule`. Name suggested by
@stefanceriu, it makes a lot more sense, thanks!
This patch makes the code more robust around event removals. Sorting
events by their position is no longer done in the `Deduplicator` but in
a new `RoomEventCacheState::remove_events` method, which removes events
in the store and in the `RoomEvents`. This method is responsible to sort
events, this stuff is less fragile like so.
@Hywan Hywan force-pushed the fix-sdk-event-cache-remove-events-by-id branch from a7ca6e5 to c1d8e6a Compare February 24, 2025 14:40
This patch puts the `Ok` outside the `match` for a better ergonomics.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants