Audit and update provider scripts to collect image dimensions #1628

zackkrida · 2022-03-04T14:49:48Z

Current Situation

Only two of our provider scripts that I am aware of collect a height and width for each image:

We would like to use this information on the frontend for calculating our grid layout and improving the loading appearance of search results.

Suggested Improvement

APIs that do not return dimensions (updated based on @stacimc's comment):

europeana - dimensions are only provided in the single record API result, which means that the ingestion script would need to request each individual item. Europeana does provide a way of downloading all of the data over FTP: https://pro.europeana.eu/page/harvesting-and-downloads, which is probably what we should use. The archive file is fairly big, though, at more than 40 GB.
metropolitan_museum_of_art
nypl
smithsonian
walters_art_museum
finnish_museums

Image DAGs that already collect dimensions data:

✅ brooklyn_museum
✅ cleveland_museum_of_art
✅ flickr
✅ museum_victoria
✅ phylopic
✅ raw_pixel
✅ science_museum
✅ smk
✅ stocksnap
✅ wikimedia_commons
✅ wordpress

Then, separately, we'd need to write a script to backfill all existing records.
Finally, we would need a solution to collect the dimensions for images whose provider scripts do not provide the width and height.
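
As a rough illustration of what "collecting the dimensions ourselves" could look like, here is a minimal sketch that reads the width and height from the first bytes of a PNG file. It assumes the image is a PNG and that we already have the first 24 bytes in hand; `png_dimensions` is a hypothetical helper name, not something in the catalog. Other formats (JPEG, GIF, WebP) store dimensions differently, so a real solution would more likely rely on an imaging library.

```python
import struct
from typing import Optional, Tuple

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the first 24 bytes of a PNG, or None.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then width and height as big-endian unsigned 32-bit ints.
    """
    if len(header) < 24:
        return None
    if header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height
```

Because only the first 24 bytes are needed, the header could in principle be fetched without downloading the whole image, but whether a given provider's servers support partial downloads is something we would have to check.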


stacimc · 2022-07-21T02:02:41Z

I've taken another pass at auditing these. It looks like @obulat may have written the checklist in the PR description and found more information than I was able to -- do you mind taking a look at what I've found here? I was unable to see support for dimensions in any of the listed APIs except for Europeana.

Europeana
As pointed out, dimensions are not provided in the Search results but we can get them on the single record page for each result. I think this is probably fine because there's plenty of precedent for doing so in other provider scripts, like flickr, Metropolitan, NYPL, etc.

Unfortunately, this still isn't a quick win because Europeana is currently blocked on a refactor to use the new API endpoint (#109) and on obtaining an API key (#569).

Metropolitan
The API does include dimensions, but they’re the dimensions of the physical artwork being photographed, e.g. “46 5/8 x 18 3/4 in. (118.4 x 47.6 cm)”: https://metmuseum.github.io/#object

NYPL
The API does not appear to include dimensions http://api.repo.nypl.org/#items-item-details

I can’t find documentation for this, but the imageLinks ‘description’ field does contain some of this information, though only as free text that is not easy to parse (e.g. “Cropped .jpeg (1600 pixels on the long side)”).
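
For what it's worth, a description in that exact shape could be picked apart with a regular expression. The sketch below assumes every useful description follows the "(N pixels on the long side)" wording from the example above, which I have not verified; `long_side_pixels` is a hypothetical helper name. Note that it only recovers the long side, not a full width and height.

```python
import re
from typing import Optional

# Matches text such as "(1600 pixels on the long side)".
LONG_SIDE_PATTERN = re.compile(
    r"\((\d+)\s*pixels? on the long side\)", re.IGNORECASE
)


def long_side_pixels(description: str) -> Optional[int]:
    """Return the long-side pixel count from a description, or None."""
    match = LONG_SIDE_PATTERN.search(description)
    return int(match.group(1)) if match else None
```

Since the aspect ratio is still unknown, this would not be enough on its own to populate both width and height.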

Smithsonian
As far as I can see the API does not return dimensions data: https://edan.si.edu/openaccess/docs/more.html

Walters Art Museum
Like Metropolitan, this API also contains dimensions related to the physical object being photographed: https://github.com/WaltersArtMuseum/walters-api/blob/master/objects/objects-id.md

Finnish Museums
Again, as far as I can tell, the API does not appear to have dimensions: https://api.finna.fi/swagger-ui/?url=%2Fapi%2Fv1%3Fswagger#/Search/get_search

obulat · 2022-07-21T14:08:34Z

This is a great write-up, @stacimc ⭐

Europeana: As pointed out, dimensions are not provided in the Search results but we can get them on the single record page for each result. I think this is probably fine because there's plenty of precedent for doing so in other provider scripts, like flickr, Metropolitan, NYPL, etc.

Interesting, I did not know that the Flickr script requests each image separately. When I was writing the filesize/filetype PRs, I remember deciding against separate requests for each image to get those pieces of data as that would significantly slow down ingestion. However, we might decide that image dimensions are important enough to actually do it...

AetherUnbound · 2022-07-21T18:34:04Z

This is super helpful @stacimc! And +1 @obulat, I had initially thought that individual image queries would be a bad idea, but if we have precedent for it 🤷🏼‍♀️ I think what we should do in that case, though, is leverage Python's async features to issue and handle multiple requests concurrently, so we don't have to do it all synchronously. In fact, we may want to go that way with ingestion in general in the long run 🤔 Something to ponder!
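
A minimal sketch of that idea, using only the standard library: `asyncio.gather` runs the per-record lookups concurrently, and an `asyncio.Semaphore` caps how many are in flight at once so we don't hammer a provider. Everything named here (`fetch_all_dimensions`, `fake_fetch`, the `max_concurrency` default) is hypothetical, and `fake_fetch` stands in for a real HTTP call to a single-record endpoint.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Dimensions = Tuple[int, int]


async def fetch_all_dimensions(
    record_ids: List[str],
    fetch_one: Callable[[str], Awaitable[Dimensions]],
    max_concurrency: int = 5,
) -> Dict[str, Dimensions]:
    """Fetch dimensions for many records, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(record_id: str) -> Tuple[str, Dimensions]:
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return record_id, await fetch_one(record_id)

    pairs = await asyncio.gather(*(bounded_fetch(r) for r in record_ids))
    return dict(pairs)


async def fake_fetch(record_id: str) -> Dimensions:
    # Stand-in for a real request; sleeps briefly to simulate network latency.
    await asyncio.sleep(0.01)
    return (800, 600)
```

Something like `asyncio.run(fetch_all_dimensions(["a", "b", "c"], fake_fetch))` would drive it from synchronous code. Retries and error handling are left out here and would matter a lot in a real ingestion run.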

stacimc · 2022-08-02T23:02:43Z

Follow up issues created:

Closing this issue as the auditing is complete.

zackkrida added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature labels Mar 4, 2022

obulat added 🟧 priority: high Stalls work on the project or its dependents and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Mar 18, 2022

krysal added the 💻 aspect: code Concerns the software code in the repository label May 10, 2022

krysal mentioned this issue Apr 17, 2023

Document the recommended image size to choose from providers #1551

Open

1 task

obulat added the data normalization label May 24, 2022

obulat mentioned this issue Apr 17, 2023

Add width and height to all images in the catalog database #1559

Closed

4 tasks

stacimc self-assigned this Jun 21, 2022

krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Jun 21, 2022

stacimc mentioned this issue Apr 17, 2023

Detect image dimensions for provider APIs that do not support them #1486

Open

1 task

stacimc closed this as completed Aug 2, 2022

AetherUnbound mentioned this issue Apr 17, 2023

Don't move issues back & forth in project, reduce API calls for project automation #765

Merged

7 tasks

obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023

dhruvkb added this to the Data normalization milestone Dec 2, 2023
