Handle duplication of records between auckland_museum and wikimedia #3659

Open
stacimc opened this issue Jan 12, 2024 · 13 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@stacimc
Contributor

stacimc commented Jan 12, 2024

Problem

As noted by @sarayourfriend in this comment, many records from the Auckland Museum's collection are already in Openverse due to their inclusion in Wikimedia Commons. If we run both DAGs and do nothing to address this, these records will be duplicated in Openverse.

Description

Suggestion taken directly from Sara's comment:

Either we'd need to suppress the entries from Wikimedia Commons, or, (probably my preference) improve our ingestion of Wikimedia Commons to be able to identify sources like this in Wikimedia Commons. Glancing at the Wikimedia Commons provider script, I don't think we currently save the "collections" metadata present in the file summary on the Wikimedia Commons page.

I think this is a big opportunity to expand the list of high quality sources without introducing duplicates, and while cleaning up the Wikimedia Commons data ingestion, cleanup, and overall handling. For this institution in particular, there is a great page describing how the metadata is structured: https://commons.wikimedia.org/wiki/Commons:Batch_uploading/AucklandMuseumCCBY

The same information would also be relevant for the National Gallery of Art (#3167) (see this Wikimedia result, which is in Openverse with similarly poorly handled metadata and is in an NGA collection in Wikimedia Commons's data). I imagine there are a handful of other such institutions that we could add, just by improving the Wikimedia Commons script and our handling of their data.

And actually, when digging through Wikimedia Commons and Wikidata pages researching this comment, I found this amazing spreadsheet that would help us identify these exact kinds of institutions, for Wikimedia Commons, Europeana, Flickr, and even TROVE (#2653): https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120

Additional context

The auckland_museum DAG is currently blocked on other issues (see DAG Status page), but this issue should not necessarily prevent us from turning the DAG on.

However, we should not add the provider as a source in the API until this has been resolved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Jan 12, 2024
@sarayourfriend
Collaborator

Just adding a note that it would be great to do this in a way that lets us also identify other big sources within Wikimedia Commons, like the NGA (#3167) or others in the spreadsheet I linked in the comment Staci quoted: https://docs.google.com/spreadsheets/d/1WPS-KJptUJ-o8SXtg00llcxq0IKJu8eO6Ege_GrLaNc/edit#gid=1216556120

Specifically, getting the "Collection" metadata from Wikimedia Commons (probably into the meta_data block?) would allow deduplicating Auckland Museum Tamaki Paenga Hira (because we can identify the Wikimedia Commons records to suppress in favour of the first party ones) and also make additional source querying possible in the future.

This would be in contrast to not storing that metadata and just reading it to exclude these records.

@stacimc
Contributor Author

stacimc commented Jun 18, 2024

We may also need to identify duplicated records uploaded from Flickr (https://www.flickr.org/introducing-flickypedia/).

@szymon-polaczy
Contributor

Hi, I'd like to work on this one

@AetherUnbound
Collaborator

I've been poking at this and wanted to share what I've found. Unfortunately, even though "Collection" shows up as a section on the page (example), I can't find anywhere it would appear in the metadata. Here's the documentation on what's available in the query module for the API, and here's an example of what our standard API query would be. I've looked for where "collection" can be found in the API documentation, and the only page I've found is Extension:Collection, which doesn't provide any information on actually gathering that data from the API 😞 Unfortunately, I'm not sure we can get the collection info via the API, though I'll keep looking.

@szymon-polaczy
Contributor

Hi @AetherUnbound, I have two ideas, one fairly rough and the other a bit less so. Both just came up while I was looking at this, so I don't expect you to take them too seriously.

The rougher idea: would it be possible to keep a hash of each image's data when ingesting it, and then compare new hashes against old ones? This would mean we'd need a hash for every ingested image, which is probably not feasible up front, but it would allow checking for any and all duplicates in the database in the future. Generating the hashes for existing records could be a background job, so the hashes would slowly fill in over time.
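For concreteness, a rough sketch of what a hash-based duplicate check could look like (purely illustrative; it uses the third-party Pillow and ImageHash packages and an arbitrary distance threshold, none of which come from the Openverse codebase):

```python
import imagehash
from PIL import Image


def perceptual_hash(path: str) -> imagehash.ImageHash:
    """Compute a perceptual hash of a local image file."""
    with Image.open(path) as img:
        return imagehash.phash(img)


# Compare a newly ingested image against a previously stored hash.
# A small Hamming distance suggests the two images are likely duplicates.
new_hash = perceptual_hash("auckland_museum_record.jpg")
existing_hash = perceptual_hash("wikimedia_commons_copy.jpg")
if new_hash - existing_hash <= 5:  # threshold chosen arbitrarily for illustration
    print("Likely duplicate")
```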

The other idea would be to use a crawler and get the collection data from the page itself. We could use the descriptionshorturl property to get the URL and then collect the collections from the HTML. The downside is that any HTML changes would break the ingestion, so it might have to be put behind an option for stricter checks or something.
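A rough sketch of that crawler idea, assuming the descriptionshorturl points at the file page and that the "Collection" row keeps its current markup (the selectors here are guesses and would break with any HTML change, as noted above):

```python
import requests
from bs4 import BeautifulSoup


def scrape_collections(description_url: str) -> list[str]:
    """Fetch a Commons file page and pull link text out of any 'Collection' row."""
    html = requests.get(description_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    collections = []
    # Look for table rows whose header cell reads "Collection"
    for row in soup.find_all("tr"):
        header = row.find("th")
        if header and header.get_text(strip=True).lower() == "collection":
            collections.extend(a.get_text(strip=True) for a in row.find_all("a"))
    return collections
```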

@AetherUnbound
Collaborator

On further investigation, it looks like the "Collection" section may use the Institutions template, which is something we can match on (using the templates module). That would help us whenever an institution is referenced on a page (you can see Institution:National Gallery of Art, Washington DC in the example result); perhaps that's the info we can passively collect and store in the meta_data block 🤔

I was thinking we could have an alert similar to the Flickr subprovider auditor, which might try to monitor the list of "Institutions" in Wikimedia. That'd probably be a massive list though, and I'm not sure about the viability of that approach. At least for the Auckland Museum, there is an existing Institution:Auckland War Memorial Museum template we could use to identify potential candidates and exclude them from Wikimedia results (example). We could even filter to just template namespace 106, which appears to be the Institution:... namespace based on the examples I've supplied! That'd mean we'd have to change our props query to include templates:

And we'd also want to filter down to just that template namespace by adding "tlnamespace": 106 to the get_next_query_params function:

def get_next_query_params(self, prev_query_params: dict | None):

Here's what that response looks like when added: https://commons.wikimedia.org/w/api.php?action=query&prop=info|templates&inprop=displaytitle&tlnamespace=106&titles=File:(Figure_sketches)_PD-1952-2-34.jpg
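For concreteness, a small standalone sketch that reproduces the example request above for a single title (this is not the provider script's actual code, which batches many files per request via the allimages generator):

```python
import requests

params = {
    "action": "query",
    "format": "json",
    "prop": "info|templates",
    "inprop": "displaytitle",
    "tlnamespace": 106,  # the Institution: template namespace
    "titles": "File:(Figure_sketches)_PD-1952-2-34.jpg",
}
response = requests.get("https://commons.wikimedia.org/w/api.php", params=params)
for page in response.json()["query"]["pages"].values():
    # Any template returned here (filtered to namespace 106) identifies an
    # institution associated with the file.
    for template in page.get("templates", []):
        print(template["title"])  # e.g. an "Institution:..." page title
```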

The assumption would be that a piece of media is from one institution, but this approach on a technical level might allow more than one institution to be present on the page, since we're just using whether or not the institution's template was used. It seems like we might want to store all institution values in a list in the meta_data block in that case.

@stacimc and @sarayourfriend, I'm interested to hear your thoughts on all this!


@szymon-polaczy to respond to your questions (thanks for asking!):

Hashing the image (using either similarity methods like perceptual hashing or more standard cryptographic/file-contents hashing) would indeed help us identify duplicates. As you point out though, for our dataset this would be a massive undertaking, and likely a full project to execute. Not that that makes it impossible, but we'd have to scope and prioritize it among all our other projects 🙂

On the crawling front - the Wikimedia ingestion process gathers nearly all of the data it needs using the "Allimages" generator. This means that we're parsing through pages of images at a given time and ingesting them all at once, rather than combing through one result at a time (you can read more about this on the DAG documentation we have). This mechanism already pushes up against API/throughput limits (while trying to be respectful of Wikimedia's API request rate), and would be slowed down significantly by querying individual images instead of performing that querying in bulk. That's part of the motivation behind trying to find the right set of properties to use above that can be paired with the allimages generator 😅 We've had to take the per-record-request approach for some other providers (here's an example from Freesound) and it's proved to be quite the headache.
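For reference, a simplified sketch of the kind of bulk request the allimages generator approach implies (the parameter values here are illustrative and don't mirror the provider's actual configuration):

```python
import requests

# One request returns a whole batch of files plus their image info,
# rather than one request per record.
params = {
    "action": "query",
    "format": "json",
    "generator": "allimages",
    "gailimit": 250,  # batch size; illustrative value
    "prop": "imageinfo",
    "iiprop": "url|size|mime|extmetadata",
}
batch = requests.get("https://commons.wikimedia.org/w/api.php", params=params).json()
print(len(batch["query"]["pages"]), "files in this batch")
```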

That said, in addition to action=query, the Wikimedia API does have an action=parse operation which will produce the HTML for a given record. If we did want to go the per-record request route, we could do so using that! Here's an example using one of the previous images, which actually produces the Collection HTML table row in the result if you search for it on the page.
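If we ever did take the per-record route, a minimal sketch of that action=parse call might look like this (the title reuses the earlier example file; purely illustrative):

```python
import requests

params = {
    "action": "parse",
    "format": "json",
    "page": "File:(Figure_sketches)_PD-1952-2-34.jpg",
    "prop": "text",  # return the rendered HTML of the page
}
result = requests.get("https://commons.wikimedia.org/w/api.php", params=params).json()
page_html = result["parse"]["text"]["*"]
# The "Collection" table row appears somewhere in this HTML and would still
# need to be scraped out, as in the crawler idea above.
print("Collection" in page_html)
```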

All that to say, I'm frustrated that the information isn't more easily available, when Wikimedia is clearly able to find it somewhere for the record in order to know that there are collections associated with an image 😕

@szymon-polaczy
Contributor

szymon-polaczy commented Jul 24, 2024

Thank you for the thorough explanation.
Since I had a moment, I took a deeper look at the documentation you sent over earlier and saw this prop:

prop=iwlinks

here's a full url with it

and in the response there was this:
{ "prefix": "d", "*": "Q62104323" },
which is the last part of a link to one of the items in the collection
[screenshot]

The Wikipedia link for the National Gallery of Art didn't get caught, but it was mentioned as the Artist and the Credit in the metadata.

So maybe it would be possible to also combine other props and create some mapping to see if an element exists somewhere else? @AetherUnbound
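To make that concrete, a very rough sketch of the mapping idea, assuming prop=iwlinks and a hand-curated map from Wikidata item IDs to first-party sources (the map, the matching logic, and the file title are all made up for illustration):

```python
import requests

# Hypothetical map from Wikidata item IDs to Openverse providers; entries
# would have to be curated by hand ahead of time.
KNOWN_ITEMS = {
    "Q62104323": "example_source",  # the item seen in the response above
}

params = {
    "action": "query",
    "format": "json",
    "prop": "iwlinks",
    "titles": "File:Some_example_file.jpg",  # placeholder title
}
data = requests.get("https://commons.wikimedia.org/w/api.php", params=params).json()
for page in data["query"]["pages"].values():
    for link in page.get("iwlinks", []):
        # Interwiki links with the "d" prefix point at Wikidata items
        if link.get("prefix") == "d" and link.get("*") in KNOWN_ITEMS:
            print("Possibly duplicated from", KNOWN_ITEMS[link["*"]])
```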

When I find more time I'll look through the last message to see if I can help with anything there but I wanted to send this through as maybe another option.

edit: I quickly read through the explanation for your idea and I think that what I sent over might just be a worse / roundabout version of what you found

@AetherUnbound
Collaborator

Heh, thanks Szymon! I was also looking at iwlinks as a possible option, but it seemed inconsistent too (and again, another piece of information we might need to know ahead of time rather than something we could just pull out and use directly, which would be more ideal). We appreciate you looking into this with us!

@sarayourfriend
Collaborator

@AetherUnbound using the templates to determine it sounds like the right way to go, based on my very small understanding of MediaWiki's data model.

Would we need to reingest records for this solution?

@stacimc
Contributor Author

stacimc commented Aug 6, 2024

Templates seem like the best approach as far as I can see, although I've spent much less time working with this API than you, @AetherUnbound :)

Would we need to reingest records for this solution?

We'd need some kind of backfill regardless, because there's no way to update the Wikimedia DAG to delete existing records even once we've identified them as duplicates. The simple way we'd do it with a smaller provider would be to update the DAG to add a new meta_data field for the collection/institution, run reingestion to update all records, and then drop or otherwise preserve the duplicates. Full reingestion for Wikimedia will be a beast, though -- I'm not even sure we can do that reliably given the issues we've had with flakiness in the reingestion DAGs. Were you considering a different approach, @AetherUnbound?

@sarayourfriend
Collaborator

update the Wikimedia DAG to delete

I wonder if they should be deleted or if they should be marked as duplicates of a provider-specific DAG's works? It would be good if we could exclude them from search (and I guess other analysis) without chucking the data or skipping them during ingestion.

I wonder if the wikimedia API makes it possible to filter based on the template parameter value. In which case, we could do a targeted backfill of the Auckland Museum works in wikimedia to suppress/exclude them.

If that's possible, there might be some crossover with the targeted reingestion work we've discussed in #4452.

@stacimc
Contributor Author

stacimc commented Aug 9, 2024

I wonder if they should be deleted or if they should be marked as duplicates of a provider-specific DAG's works?

Totally agree, although I was intentionally vague when I said "otherwise preserve the duplicates" because we don't have an "official" way of doing that yet as far as I know 😄 FWIW, the simplest thing to do is probably to move them to the deleted_image table in the catalog with duplicate as the deleted_reason, possibly using the delete_records DAG (although it's been a minute since I've looked at that, so I'd need to double-check whether it makes sense to use here). This makes it extremely easy to restore them later, or else do something else with them if we come up with a more sophisticated plan for handling duplicates.

@AetherUnbound
Collaborator

I have seen this and will respond to this discussion when I have time!
