MI events: failing on duplicate data in scrape #5091


@braykuka braykuka commented Nov 19, 2024

1261

  • added status value to dedupe_key

@jessemortenson
Contributor

Thanks! Can you share your thinking behind adding the status variable to the dedupe key? I'm imagining a scenario:

  1. Yesterday: the scraper ingests Event A which has status tentative
  2. Today: the scraper ingests the same Event A, but status has been changed on the source website to cancelled.
  3. What should happen is that the data pipeline updates the existing Event A to change its status from tentative to cancelled, because it matches on the existing dedupe key

I'm worried that adding status to the dedupe key will instead cause a duplicate event to be created, so that there will be both an Event A - Tentative and an Event A - Cancelled.
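To make the concern concrete, here is a minimal sketch of a key-based upsert. It is not the actual Open States pipeline: `dedupe_key`, `upsert`, the field names, and the sample event are all hypothetical, and the only assumption is that the pipeline matches incoming events to existing ones by their dedupe key.

```python
def dedupe_key(event, fields):
    """Build a key from the chosen fields (hypothetical helper)."""
    return "#".join(str(event[f]) for f in fields)


def upsert(store, event, fields):
    """Insert the event, or overwrite the stored one that has the same key."""
    store[dedupe_key(event, fields)] = event


# Same Event A scraped on two consecutive days; only the status changed.
yesterday = {"name": "Event A", "start": "2024-11-07T09:00", "status": "tentative"}
today = {**yesterday, "status": "cancelled"}

# Key without status: the second scrape updates the existing record.
stable = {}
for scraped in (yesterday, today):
    upsert(stable, scraped, ("name", "start"))

# Key with status: the second scrape no longer matches, so a duplicate appears.
with_status = {}
for scraped in (yesterday, today):
    upsert(with_status, scraped, ("name", "start", "status"))

print(len(stable), len(with_status))
```

With the status left out of the key, one record remains and its status is `cancelled`; with the status included, both a tentative and a cancelled Event A are stored.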

@braykuka
Contributor Author

[Screenshot: events listing with the 11/7 and 11/14 entries]

As you can see, there are duplicate entries for 11/7 and 11/14. Each pair has the same event name and the same date and time.
For 11/7, the first meeting is tentative and the second meeting is cancelled.
For 11/14, the first meeting is cancelled and the second meeting is tentative.

Could you please let me know which one should be treated as the duplicate in these cases?
Thanks.

@jessemortenson
Contributor

Thank you for the screenshot, I can see how that is confusing!

Looking at the source website, it appears that each of those events has a unique URL:

The content of the pages on those URLs does differ: the agenda seems to be different between the "cancelled" and the "new" events (even though the name, location, time are all the same). So it seems these are actually not duplicate events in the source website (my assumption was wrong).

Those URLs seem to imply each event has a unique meetingID. If that is true, perhaps the meetingID is the best dedupe key for MI events? When we have a "natural" unique identifier that seems to be consistent at a source site, that can be a good candidate for the dedupe key.
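As a rough illustration of that idea, the sketch below pulls a `meetingID` out of an event URL and uses it as the dedupe key. The URL shape, the `example.com` host, the query-parameter form, and the `dedupe_key_from_url` helper are assumptions made for illustration; the real source URLs may carry the ID differently.

```python
from urllib.parse import parse_qs, urlparse


def dedupe_key_from_url(url):
    """Return the meetingID query parameter of an event URL (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    values = query.get("meetingID")
    if not values:
        # Fail loudly instead of falling back to a weaker key.
        raise ValueError(f"no meetingID found in {url!r}")
    return values[0]


# Two events with the same name, date, and time but different source pages
# (placeholder URLs, not taken from the real site).
cancelled_url = "https://example.com/Committees/Meeting?meetingID=1001"
new_url = "https://example.com/Committees/Meeting?meetingID=1002"

keys = {dedupe_key_from_url(cancelled_url), dedupe_key_from_url(new_url)}
print(keys)
```

Because the key depends only on the source's own identifier, a later status change on the same page still maps to the same record, while the "cancelled" and "new" meetings stay distinct.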

@braykuka
Contributor Author

@jessemortenson Thank you for the update. I've changed dedupe_key to use meetingID. Please review it again.


@jessemortenson jessemortenson left a comment

LGTM, merging in

@jessemortenson jessemortenson merged commit 6696b62 into openstates:main Nov 20, 2024
1 check passed