Audit and update provider scripts to collect image dimensions #1628
Comments
I've taken another pass at auditing these. It looks like @obulat may have written the checklist in the PR description and found more information than I was able to -- do you mind taking a look at what I've found here? I was unable to see support for dimensions in any of the listed APIs except for Europeana.
Europeana
Unfortunately this still isn't a quick win, because Europeana is currently blocked on needing a refactor to use the new API endpoint (#109), and also on obtaining an API key (#569).
Metropolitan
NYPL
I can’t find documentation for this, but the
Smithsonian
Walters Art Museum
Finnish Museum
This is a great write-up, @stacimc ⭐
Interesting, I did not know that the Flickr script requests each image separately. When I was writing the |
This is super helpful @stacimc! And +1 @obulat, I had initially thought that individual image queries would be a bad idea, but if we have precedent for it 🤷🏼‍♀️ I think what we should do in that case, though, is leverage Python's async features to issue and handle multiple requests concurrently, so we don't have to do it all synchronously. In fact, we may want to go that way with ingestion in general in the long run 🤔 Something to ponder!
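To make the concurrency idea concrete, here is a minimal sketch using `asyncio` from the standard library. It is not taken from any existing provider script: `gather_dimensions`, `fake_fetch`, and the `max_concurrency` parameter are hypothetical names, and `fake_fetch` only simulates a single-record API call so the example runs without network access. A real script would swap in an actual HTTP client call and respect each provider's rate limits.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

# Hypothetical type: a coroutine that takes an item ID and returns (width, height).
FetchDimensions = Callable[[str], Awaitable[Tuple[int, int]]]


async def gather_dimensions(
    item_ids: Iterable[str],
    fetch: FetchDimensions,
    max_concurrency: int = 5,
) -> Dict[str, Tuple[int, int]]:
    """Fetch dimensions for many items concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(item_id: str) -> Tuple[str, Tuple[int, int]]:
        # The semaphore limits how many requests are in flight at once.
        async with semaphore:
            return item_id, await fetch(item_id)

    results = await asyncio.gather(*(fetch_one(i) for i in item_ids))
    return dict(results)


async def fake_fetch(item_id: str) -> Tuple[int, int]:
    """Stand-in for a real single-record API call (no network involved)."""
    await asyncio.sleep(0.01)
    return (len(item_id) * 100, len(item_id) * 50)


if __name__ == "__main__":
    print(asyncio.run(gather_dimensions(["a", "bb", "ccc"], fake_fetch)))
```

The semaphore is the main design choice here: it keeps the per-item requests concurrent without sending an unbounded burst to the provider.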
Follow-up issues created:
Closing this issue as the auditing is complete.
Current Situation
Only two of our provider scripts that I am aware of collect a height and width for each image:
We would like to use this information on the frontend for calculating our grid layout and improving the loading appearance of search results.
Suggested Improvement
APIs that do not return dimensions (updated based on @stacimc's comment):
europeana
- dimensions are only provided in the single record API result, which means that the ingestion script would need to request each individual item. Europeana does provide a way of downloading all of the data using ftp: https://pro.europeana.eu/page/harvesting-and-downloads, which is probably what we should use. The archive file is fairly big, though, at more than 40 GB.
metropolitan_museum_of_art
nypl
smithsonian
walters_art_museum
finnish_museums
Image DAGs that already collect dimensions data:
brooklyn_museum
cleveland_museum_of_art
flickr
museum_victoria
phylopic
raw_pixel
science_museum
smk
stocksnap
wikimedia_commons
wordpress
Then, separately, we'd need to write a script to backfill all existing records.
Finally, we would need a solution to collect the dimensions for images whose provider scripts do not provide the width and height.
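One possible approach for providers without dimension data, sketched here only as an illustration, is to read the dimensions from the image file itself. The example below covers PNG only, where width and height sit at fixed offsets in the `IHDR` chunk within the first 24 bytes of the file. The function name `png_dimensions` is hypothetical, how the bytes are obtained is left out, and other formats such as JPEG would need different parsing or an imaging library.

```python
import struct
from typing import Optional, Tuple

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the first 24 bytes of a PNG, or None.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then width and height as big-endian unsigned 32-bit integers.
    """
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


if __name__ == "__main__":
    # Build a synthetic header for a 640x480 image to exercise the parser.
    sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
    print(png_dimensions(sample))
```

Because only the first 24 bytes are needed, this kind of check would not require downloading whole images, assuming the host allows partial downloads.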