Audit and update provider scripts to collect image dimensions #1628

Closed
6 tasks
zackkrida opened this issue Mar 4, 2022 · 4 comments
Assignees
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon

Comments

@zackkrida
Member

zackkrida commented Mar 4, 2022

Current Situation

Only two of our provider scripts that I am aware of collect a height and width for each image:

We would like to use this information on the frontend for calculating our grid layout and improving the loading appearance of search results.

Suggested Improvement

APIs that do not return dimensions (updated based on @stacimc's comment):

  • europeana - dimensions are only provided in the single-record API result, so the ingestion script would need to request each item individually. Europeana also offers a bulk download of all its data over FTP (https://pro.europeana.eu/page/harvesting-and-downloads), which is probably what we should use, although the archive is fairly large at more than 40 GB.
  • metropolitan_museum_of_art
  • nypl
  • smithsonian
  • walters_art_museum
  • finnish_museums

Image DAGs that already collect dimensions data:

  • brooklyn_museum
  • cleveland_museum_of_art
  • flickr
  • museum_victoria
  • phylopic
  • raw_pixel
  • science_museum
  • smk
  • stocksnap
  • wikimedia_commons
  • wordpress

Then, separately, we'd need to write a script to backfill all existing records.
Finally, we would need a solution to collect the dimensions for images whose provider scripts do not provide the width and height.
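
One possible approach for providers that do not expose width and height, sketched here rather than taken from the catalog code: read the dimensions out of the image file's own header. The function names below are hypothetical. The sketch only covers PNG and GIF, whose headers store the size at fixed offsets; JPEG would need a scan over its segment markers, or a library such as Pillow (`Image.open(...).size`). It also assumes the first bytes of the file have already been downloaded somehow.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the first 24 bytes of a PNG, else None."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # The IHDR chunk stores width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def gif_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the first 10 bytes of a GIF, else None."""
    if len(header) < 10 or header[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # The logical screen size is stored as little-endian 16-bit integers.
    width, height = struct.unpack("<HH", header[6:10])
    return width, height


def sniff_dimensions(header: bytes) -> Optional[Tuple[int, int]]:
    """Try each supported format in turn; None means 'unknown format'."""
    return png_dimensions(header) or gif_dimensions(header)
```

Because only the first few bytes are needed, this would not require downloading whole files, which matters if it were also used for the backfill.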

@zackkrida zackkrida added 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work ✨ goal: improvement Improvement to an existing user-facing feature labels Mar 4, 2022
@obulat obulat added 🟧 priority: high Stalls work on the project or its dependents and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Mar 18, 2022
@krysal krysal added the 💻 aspect: code Concerns the software code in the repository label May 10, 2022
@stacimc stacimc self-assigned this Jun 21, 2022
@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟧 priority: high Stalls work on the project or its dependents labels Jun 21, 2022
@stacimc
Contributor

stacimc commented Jul 21, 2022

I've taken another pass at auditing these. It looks like @obulat may have written the checklist in the PR description and found more information than I was able to -- do you mind taking a look at what I've found here? I was unable to see support for dimensions in any of the listed APIs except for Europeana.

Europeana
As pointed out, dimensions are not provided in the Search results but we can get them on the single record page for each result. I think this is probably fine because there's plenty of precedent for doing so in other provider scripts, like flickr, Metropolitan, NYPL, etc.

Unfortunately, this still isn't a quick win: Europeana is currently blocked on a refactor to use the new API endpoint (#109) and on obtaining an API key (#569).

Metropolitan
The API does include dimensions, but they’re the dimensions of the physical artwork being photographed, e.g. “46 5/8 x 18 3/4 in. (118.4 x 47.6 cm)”: https://metmuseum.github.io/#object

NYPL
The API does not appear to include dimensions: http://api.repo.nypl.org/#items-item-details

I can’t find documentation for this, but the imageLinks ‘description’ field does contain some of the information, though only as free text that is not easy to parse (e.g. “Cropped .jpeg (1600 pixels on the long side)”).
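
For what it's worth, here is a rough sketch of pulling the number out of that kind of description. It assumes every description follows the wording of the single example above, which is not documented anywhere I can find, and the function name is made up. Note that it only recovers the long side, not a width and a height, so it would not be enough on its own.

```python
import re
from typing import Optional

# Matches text such as "(1600 pixels on the long side)".
LONG_SIDE_PATTERN = re.compile(r"\((\d+)\s*pixels on the long side\)", re.IGNORECASE)


def long_side_pixels(description: str) -> Optional[int]:
    """Return the long-side pixel count mentioned in a description, if any."""
    match = LONG_SIDE_PATTERN.search(description)
    return int(match.group(1)) if match else None
```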

Smithsonian
As far as I can see the API does not return dimensions data: https://edan.si.edu/openaccess/docs/more.html

Walters Art Museum
Like Metropolitan, this API also contains dimensions related to the physical object being photographed: https://github.com/WaltersArtMuseum/walters-api/blob/master/objects/objects-id.md

Finnish Museum
Again, as far as I can tell, the API does not appear to include dimensions: https://api.finna.fi/swagger-ui/?url=%2Fapi%2Fv1%3Fswagger#/Search/get_search

@obulat
Contributor

obulat commented Jul 21, 2022

This is a great write-up, @stacimc

Europeana: As pointed out, dimensions are not provided in the Search results but we can get them on the single record page for each result. I think this is probably fine because there's plenty of precedent for doing so in other provider scripts, like flickr, Metropolitan, NYPL, etc.

Interesting, I did not know that the Flickr script requests each image separately. When I was writing the filesize/filetype PRs, I remember deciding against separate requests for each image to get those pieces of data as that would significantly slow down ingestion. However, we might decide that image dimensions are important enough to actually do it...

@AetherUnbound
Collaborator

This is super helpful @stacimc! And +1 @obulat, I had initially thought that individual image queries would be a bad idea, but if we have precedent for it 🤷🏼‍♀️ I think what we should do in that case, though, is leverage Python's async features to issue and handle multiple requests concurrently, so we don't have to do it all synchronously. In fact, we may want to go that way with ingestion in general in the long run 🤔 Something to ponder!
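
A minimal sketch of what concurrent per-record requests could look like with asyncio. Everything named here (`fetch_all_dimensions`, `fake_fetch`, `max_concurrency`) is hypothetical and not part of the existing provider scripts. `fake_fetch` only simulates latency; a real script would swap in an async HTTP call to the provider's single-record endpoint. The semaphore caps the number of in-flight requests so the provider's rate limits can still be respected.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple

Dimensions = Tuple[int, int]


async def fetch_all_dimensions(
    record_ids: Iterable[str],
    fetch_one: Callable[[str], Awaitable[Dimensions]],
    max_concurrency: int = 10,
) -> Dict[str, Dimensions]:
    """Fetch dimensions for many records with a cap on concurrent requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(record_id: str) -> Tuple[str, Dimensions]:
        # Wait for a free slot before issuing the request.
        async with semaphore:
            return record_id, await fetch_one(record_id)

    pairs = await asyncio.gather(*(bounded_fetch(r) for r in record_ids))
    return dict(pairs)


async def fake_fetch(record_id: str) -> Dimensions:
    """Stand-in for a single-record API request (assumption, no network)."""
    await asyncio.sleep(0.01)  # simulated network latency
    return (len(record_id) * 100, 100)
```

Usage would be something like `asyncio.run(fetch_all_dimensions(ids, fake_fetch))`, with `fake_fetch` replaced by the real request function.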

@stacimc
Contributor

stacimc commented Aug 2, 2022

Follow-up issues created:

Closing this issue as the auditing is complete.

@stacimc stacimc closed this as completed Aug 2, 2022
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@dhruvkb dhruvkb added this to the Data normalization milestone Dec 2, 2023