
feat: de-duplicate events at the database level #88

Merged

mmalenic merged 20 commits into main from feat/event-order-macros on Jan 24, 2024

Conversation

Member

@mmalenic mmalenic commented Jan 23, 2024

Closes #72

This was more complex than I expected, especially because there are many caveats in Postgres related to upserts and concurrency errors. I think the on conflict solution works well to avoid these.

Ignore the commits related to macros. I was experimenting with writing Postgres functions in Rust using plrust, and with converting regular Rust functions into plrust functions using attribute macros. However, I don't think this is useful for this PR.

Changes

  • Introduce sequencer values and version ids into the s3_object table so that duplicate events can be ignored using insert ... on conflict.
    • on conflict avoids the concurrency issues that come with manually upserting objects.
    • bucket and key are moved from object to s3_object. This helps with de-duplication, and also makes more sense when considering an object which is the same but lives in two different places.
    • For now, checksum and size stay in object, but this could change if two objects that are otherwise the same have different checksums, for example, due to different kinds of compression. If that is the case, I think it would make sense to have another table one level up which links physical objects with the same checksums/sizes to logical or semantic objects.
  • This de-duplication logic is mirrored in the code within one Lambda call. It is the same logic as described here, where events with identical key, bucket, version_id and sequencer values are considered duplicates.
  • Add various tests to verify de-duplication.
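The in-Lambda de-duplication described above can be sketched as follows. This is a minimal illustration, not the filemanager's actual code: the record type, field layout, and function name are all hypothetical, but the identity key (bucket, key, version_id, sequencer) matches the PR description.

```python
from dataclasses import dataclass


# Hypothetical flattened event record; the real filemanager types differ.
@dataclass(frozen=True)
class FlatS3Event:
    bucket: str
    key: str
    version_id: str
    sequencer: str


def dedup(events: list[FlatS3Event]) -> list[FlatS3Event]:
    """Keep the first occurrence of each (bucket, key, version_id, sequencer),
    mirroring the database's unique constraint within one Lambda call."""
    seen: set[FlatS3Event] = set()
    unique: list[FlatS3Event] = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique
```

Because the dataclass is frozen, the whole record doubles as the de-duplication key, so a plain set suffices.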

Comment on lines +1 to +31
```sql
-- Bulk insert of s3 objects.
insert into s3_object (
    s3_object_id,
    object_id,
    bucket,
    key,
    -- We default the created date to a value even if this is a deleted event,
    -- as we are expecting this to get updated.
    created_date,
    deleted_date,
    last_modified_date,
    e_tag,
    storage_class,
    version_id,
    deleted_sequencer
)
values (
    unnest($1::uuid[]),
    unnest($2::uuid[]),
    unnest($3::text[]),
    unnest($4::text[]),
    unnest($5::timestamptz[]),
    unnest($6::timestamptz[]),
    unnest($7::timestamptz[]),
    unnest($8::text[]),
    unnest($9::storage_class[]),
    unnest($10::text[]),
    unnest($11::text[])
) on conflict on constraint deleted_sequencer_unique do update
set number_duplicate_events = s3_object.number_duplicate_events + 1
returning object_id, number_duplicate_events;
```
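The unnest($n::type[]) form above binds one array per column rather than one tuple per row, so the caller transposes its row records into column vectors before binding. A rough illustration of that transposition (the names here are hypothetical; the Rust code would do the equivalent with Vecs passed to sqlx):

```python
def transpose_rows(rows: list[dict], fields: list[str]) -> list[list]:
    """Turn row records into one list per column, in the order of the
    unnest($n::type[]) bind parameters of a bulk insert."""
    return [[row[field] for row in rows] for field in fields]


# Hypothetical events flattened into column vectors for binding.
rows = [
    {"bucket": "bucket-1", "key": "key-1"},
    {"bucket": "bucket-2", "key": "key-2"},
]
columns = transpose_rows(rows, ["bucket", "key"])
# columns == [["bucket-1", "bucket-2"], ["key-1", "key-2"]]
```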
Member Author
This isn't being used yet, however I think it will be for #73.
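The duplicate-counting behaviour of on conflict ... do update can be tried outside Postgres too: SQLite supports the same clause, though it takes a column list rather than a named constraint, and the existing row's columns are referenced unqualified instead of via the table name. A minimal sketch with an invented schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table s3_object (
        bucket text,
        key text,
        version_id text,
        sequencer text,
        number_duplicate_events integer not null default 0,
        unique (bucket, key, version_id, sequencer)
    )
""")

upsert = """
    insert into s3_object (bucket, key, version_id, sequencer)
    values (?, ?, ?, ?)
    on conflict (bucket, key, version_id, sequencer) do update
    set number_duplicate_events = number_duplicate_events + 1
"""

# The same event delivered three times yields one row and two counted
# duplicates.
for _ in range(3):
    conn.execute(upsert, ("bucket", "key", "v1", "0055AED6DCD90281E5"))

rows = conn.execute(
    "select count(*), max(number_duplicate_events) from s3_object"
).fetchone()
# rows == (1, 2)
```

Repeated deliveries never raise a constraint violation; they just bump the counter, which is exactly what makes this approach safe under concurrency.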

@mmalenic mmalenic self-assigned this Jan 23, 2024
@mmalenic mmalenic added the filemanager (an issue relating to the filemanager) and feature (New feature) labels and removed the bug (Something isn't working) label on Jan 23, 2024
@mmalenic mmalenic requested a review from brainstorm January 23, 2024 05:11
@mmalenic mmalenic merged commit a74114c into main Jan 24, 2024
2 checks passed
@mmalenic mmalenic deleted the feat/event-order-macros branch January 24, 2024 05:02
Successfully merging this pull request may close these issues.

filemanager: de-duplicate S3 event messages