Retrieve Auckland Museum Image Data #3258

Merged: 31 commits into main
Jan 12, 2024

Conversation

ngken0995 (Collaborator)

Fixes

Fixes #1771 by @obulat

Description

Adds the script to get all the media from aucklandmuseum.com. Currently, only 10,000 images can be collected because the search results are capped at 10,000 items.

To collect the filesize, this script makes HEAD requests for individual media items, which makes the script slower than expected, but that should be acceptable given the number of available images.
Image dimensions are not available, so they will need to be collected separately in the future.

Testing Instructions

Run just catalog/test -k auckland_museum

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@ngken0995 ngken0995 requested a review from a team as a code owner October 25, 2023 20:23
@ngken0995 ngken0995 requested review from krysal and stacimc October 25, 2023 20:23
@openverse-bot added labels 🟩 priority: low, 🧹 status: ticket work required, 🧱 stack: catalog, 🚦 status: awaiting triage on Oct 25, 2023
@obulat added labels 🌟 goal: addition, 💻 aspect: code and removed labels 🧹 status: ticket work required, 🚦 status: awaiting triage on Oct 26, 2023
@obulat (Contributor) left a comment

This is awesome work, @ngken0995!

To add the DAG that would run your script at specific intervals, you'll also need to add it to catalog/dags/providers/provider_workflows.py. Then, if you run just up and go to 0.0.0.0:9090 (log in with airflow/airflow), you can see all of the current DAGs and start the Auckland Museum DAG by clicking the green triangle. From there you can view the logs from the DAG and "mark" the "pull_data" step "successful" after several minutes. The DAG will then go on to save the data to a Postgres database on your local machine!

get_record_data must return all of the required pieces of data; otherwise it must return None.
You can see which pieces are required, and which other pieces you can return, in the add_item method of the ImageStore.

When I ran the DAG, I got an error saying TypeError: ImageStore.add_item() missing 2 required positional arguments: 'foreign_landing_url' and 'foreign_identifier'. So you need to add foreign_landing_url (the page for the media item on the provider website) and the foreign_identifier. For the following item (the first ingested item in the script), I think the foreign_identifier would be 7109b24c87dbc582327584848d3ee481b2bf5c6e.

{'creator': 'Auckland War Memorial Museum',
 'filesize': 167,
 'license_info': LicenseInfo(license='by', version='4.0', url='https://creativecommons.org/licenses/by/4.0/', raw_url='https://creativecommons.org/licenses/by/4.0/'),
 'meta_data': {'department': 'ephemera',
               'geopos': '',
               'type': 'ecrm:E84_Information_Carrier'},
 'thumbnail_url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e?rendering=thumbnail.jpg',
 'title': 'New Zealand Contemporary Furniture',
 'url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e'}
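For illustration, here is a minimal sketch of that kind of guard: derive the two missing pieces and return None whenever a required field is absent. The field handling below is an assumption for illustration, not the PR's actual implementation.

import urllib.parse


def get_record_data(data: dict) -> dict | None:
    """Sketch only: pull the required pieces out of one search hit and return
    None if any of them is missing, so ImageStore.add_item is never called
    with incomplete data."""
    url = data.get("primaryRepresentation")
    if not url:
        return None

    # The trailing path segment of the media URL can serve as the
    # foreign_identifier, e.g. "7109b24c87dbc582327584848d3ee481b2bf5c6e"
    # in the item shown above.
    foreign_identifier = urllib.parse.urlparse(url).path.rsplit("/", 1)[-1]
    # Assumption: the record's landing page can be derived from its "_id" field.
    foreign_landing_url = data.get("_id")

    if not foreign_identifier or not foreign_landing_url:
        return None

    return {
        "url": url,
        "foreign_identifier": foreign_identifier,
        "foreign_landing_url": foreign_landing_url,
    }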


ngken0995 commented Oct 26, 2023

@obulat I was able to find the foreign_landing_url and foreign_identifier by cleaning up the _id url. I have added AucklandMuseumDataIngester to provider_workflows and had a successful DAG run. How do I know whether the DAG ingested the correct data? Can you double-check that it ran correctly?

@stacimc (Collaborator) left a comment

Fantastic @ngken0995, I'm excited to see you jumping in to add a new provider! I suspect you might be having trouble running the DAG locally because of the start_date being in the future (see my comment below). Once you update the start_date, you'll be able to run it locally and actually get data.

Once you've ingested some data locally, you can see what it looks like by querying your local catalog database. Run just catalog/pgcli to open pgcli in your terminal, and then you can run SQL queries (e.g. select * from image where provider='aucklandmuseum' limit 10;).

I was not initially able to ingest any data, because the DAG fails on every image when trying to fetch the file size with a 301. I commented this part out for the sake of testing the rest of the code.

On a higher-level note: the 100,000 number comes from the rate limit and max response size, right? Meaning there's actually more data than we can fetch in a day? My concern is that when the DAG is run a second time, it'll start processing from the beginning all over again. As it currently stands, I don't think we'll ever be able to ingest more than those first 100k rows.

If the API supports date range queries, we could consider making this a dated DAG instead. Otherwise, we might need to be a bit creative. The absolute simplest, somewhat silly solution I can think of is to give it a huge timeout and greatly extend the delay between requests such that the DAG runs over a week or so, only fetching 100k a day in order to respect the rate limit. I'm not sure what their total dataset size is, so not sure if that's feasible!

# copyright:CC state Creative Commons Attribution 4.0
return {
    "q": "_exists_:primaryRepresentation+copyright:CC",
    "size": "100",
stacimc (Collaborator):

Since this is the default batch_limit from the parent class, we can use self.batch_limit here (and in the increment in the else statement).

ngken0995 (Collaborator, author):

The maximum amount of data retrievable from the API is 10,000 items (see hits -> total -> value in the API response). size is the number of records returned per request, and from is the offset into the total result set. We can keep incrementing from until it reaches 10,000, and the get_should_continue function should know when it has reached that limit.
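As a rough sketch of that size/from pagination (the endpoint URL and payload shape here are assumptions for illustration; only the parameter names follow the snippet above):

import requests

# Assumed endpoint; the real one is defined in the ingester script.
ENDPOINT = "https://api.aucklandmuseum.com/search/collectionsonline/_search"
BATCH_LIMIT = 100      # the default batch_limit discussed above
MAX_RESULTS = 10_000   # the search API stops returning hits past this offset


def fetch_all_hits():
    """Illustrative pagination loop: bump `from` by `size` until the
    10,000-result ceiling is reached or the response runs out of hits."""
    offset = 0
    while offset < MAX_RESULTS:
        response = requests.post(
            ENDPOINT,
            json={
                "q": "_exists_:primaryRepresentation+copyright:CC",
                "size": BATCH_LIMIT,
                "from": offset,
            },
            timeout=30,
        )
        response.raise_for_status()
        hits = response.json().get("hits", {}).get("hits", [])
        if not hits:  # the equivalent of get_should_continue returning False
            break
        yield from hits
        offset += BATCH_LIMIT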

stacimc (Collaborator):

I meant that by default self.batch_limit is 100, and that you can just say:

    "size": self.batch_limit,

Rather than hard-coding it separately here.

ngken0995 (Collaborator, author) commented Oct 30, 2023

I will update "size" to use batch_limit. Please take a look at the comment below about the batch_limit default value.

catalog/dags/providers/provider_workflows.py (outdated review thread, resolved)

url = information.get("primaryRepresentation")

thumbnail_url = f"{url}?rendering=thumbnail.jpg"
stacimc (Collaborator):

The thumbnail their API provides is tiny, fixed width of 70px. @obulat would know best -- should we use this or just default to None here and use our own thumbnail service? They also have a slightly bigger preview rendering with a fixed width of 100px.

obulat (Contributor):

We have previously discussed the thumbnail sizes, and decided against using thumbnails smaller than 600px: #675 (comment)

@ngken0995 (Collaborator, author):

I was not initially able to ingest any data, because the DAG fails on every image when trying to fetch the file size with a 301. I commented this part out for the sake of testing the rest of the code.

@stacimc I tried adding ?rendering=original.jpg to the url, and some of them returned a message: "Internal server error" response page. What is the best way to handle the file size fetch when every image returns a 301?

@krysal krysal removed their request for review October 28, 2023 11:13

stacimc commented Oct 30, 2023

What is the best way to handle every image when trying to fetch the file size with a 301?

From a quick look, it does look like the API returns image urls with http but redirects to https in browser. I tried updating the urls to https in get_filesize and this results in 403s on the HEAD request. I didn't have much time to look into this, but if it's a problem on their side we may have to just skip getting file size information for now.
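A minimal sketch of that skip-on-failure approach, assuming the requests library (illustrative only; if the https HEAD request still returns a 403, this simply records no filesize):

import requests


def get_filesize(url: str) -> int | None:
    """Illustrative only: follow the http -> https redirect on a HEAD request
    and return None instead of failing the record when the provider errors."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None  # skip the filesize rather than dropping the whole record
    content_length = response.headers.get("Content-Length")
    return int(content_length) if content_length else None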

On a higher-level note: the 100,000 number comes from the rate limit and max response size, right? Meaning there's actually more data than we can fetch in a day? My concern is that when the DAG is run a second time, it'll start processing from the beginning all over again. As it currently stands, I don't think we'll ever be able to ingest more than those first 100k rows.

Reiterating this point from earlier: it occurred to me as I was looking at this again, is there a reason the batch_limit here needs to be the default 100? The API limits to 1,000 requests per day, but now that I look I don't see anywhere in the documentation saying that we have to limit it to 100 records per request. Can you increase that limit in order to process more data?


ngken0995 commented Oct 30, 2023

On a higher-level note: the 100,000 number comes from the rate limit and max response size, right? Meaning there's actually more data than we can fetch in a day? My concern is that when the DAG is run a second time, it'll start processing from the beginning all over again. As it currently stands, I don't think we'll ever be able to ingest more than those first 100k rows.

Reiterating this point from earlier: it occurred to me as I was looking at this again, is there a reason the batch_limit here needs to be the default 100? The API limits to 1,000 requests per day, but now that I look I don't see anywhere in the documentation saying that we have to limit it to 100 records per request. Can you increase that limit in order to process more data?

My apologies, I should have explained why size is set to a default of 100. I didn't know what the correct batch_limit should be and used 100 as a placeholder. I believe it can be set to 2,000. How should I determine a good batch_limit?

@obulat (Contributor) left a comment

Thank you for such a great contribution, @ngken0995! I ran the DAG locally, and it works well.

I am concerned with the quality of the data we collect, though. I got around 60 items locally, and a very large proportion of them show either a "Server error" or an "Online image not available" placeholder for the main image file.

I think we should check the main url before saving the item to the catalog for this provider, otherwise we risk downloading a lot of dead links here. What do you think, @openverse-catalog?

Here's a sample of urls I got locally:

+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| url                                               | foreign_landing_url                                                                                     | title                              |
|---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------|
| https://api.aucklandmuseum.com/id/media/v/2882    | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-18430     | jar                                |
| https://api.aucklandmuseum.com/id/media/v/117250  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-15507     | glass, wine                        |
| https://api.aucklandmuseum.com/id/media/v/3191    | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923      | jar, lidded                        |
| https://api.aucklandmuseum.com/id/media/v/861840  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-22679     | briefs, pair                       |
| https://api.aucklandmuseum.com/id/media/v/370276  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-608181    | cartridges                         |
| https://api.aucklandmuseum.com/id/media/v/325015  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-4375      | teabowl                            |
| https://api.aucklandmuseum.com/id/media/v/528116  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-1151      | bowl, lidded                       |
| https://api.aucklandmuseum.com/id/media/v/34541   | https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-368805 | Carex resectans Cheeseman          |
| https://api.aucklandmuseum.com/id/media/v/828322  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-90135     | skirt, wool                        |
| https://api.aucklandmuseum.com/id/media/v/229298  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-61319     | tablecloth, signature              |
| https://api.aucklandmuseum.com/id/media/v/75802   | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-14260     | cup                                |
| https://api.aucklandmuseum.com/id/media/v/117280  | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-12592     | goblet                             |
+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
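A minimal sketch of the kind of pre-save check suggested above, assuming a lightweight HEAD request is acceptable within the rate limit. Note that placeholder graphics served with a 200 status and an image content type (like the "Online image not available" image) would still slip through, so a content-based check might also be needed:

import requests


def is_image_available(url: str) -> bool:
    """Illustrative check: treat error responses and non-image content types
    (e.g. the "Internal server error" page) as dead links before saving."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
    except requests.RequestException:
        return False
    content_type = response.headers.get("Content-Type", "")
    return response.status_code == 200 and content_type.startswith("image/")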

@sarayourfriend (Collaborator):

@obulat which ones did you see that errored? I chose some random ones from the list you shared and they all worked for me. I'm curious to see whether they're present in the wikimedia commons dataset (perhaps they've already sorted through those ones in the Wikimedia dump).


obulat commented Jan 3, 2024

@obulat which ones did you see that errored? I chose some random ones from the list you shared and they all worked for me. I'm curious to see whether they're present in the wikimedia commons dataset (perhaps they've already sorted through those ones in the Wikimedia dump).

I think you can get to the main images for the next 2 sections from their landing URLs, but it's not easy to derive their URLs.

"Online image not available" placeholder

https://api.aucklandmuseum.com/id/media/v/3191, https://api.aucklandmuseum.com/id/media/v/2882,

Internal server error

https://api.aucklandmuseum.com/id/media/v/861840
https://api.aucklandmuseum.com/id/media/v/528116
https://api.aucklandmuseum.com/id/media/v/828322
https://api.aucklandmuseum.com/id/media/v/229298

No image on the landing page

https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923

Placeholder image on the landing page

https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-368805

Maybe there's a geographical access problem?

It's a good idea to check for these items in Wikimedia.

@ngken0995 (Collaborator, author):

I have the same errors and I'm based in the US.


sarayourfriend commented Jan 3, 2024

From what my spouse has told me, the museum is quick to remove public access things for cultural sensitivity reasons, in favour of having controlled access with a culturally relevant approach. I wouldn't be surprised if there are public records for those items where the images aren't available.

If we want to check, I'm pretty sure that the museum would respond to an email from us, and if not, I can ask my spouse to get us in touch with someone, they still have connections with folks there. At the very least we could clarify which of these intentionally do not have access so that we 100% do not index them (knowing that they won't ever be available) and which are technical (potentially temporary) access issues, if such a distinction even exists.


stacimc commented Jan 4, 2024

Very interesting, I'm glad you spotted this @obulat!

Given the complexity of the data quality questions and the fact that we already have outstanding work required for de-duplicating these results with Wikimedia, I think it would be reasonable to open a separate issue and PR for addressing this.

In the meantime, I think this PR could be merged as-is, although the data quality should be addressed before we actually turn the DAG on in production and begin ingesting. This is in line with how we've managed the addition of some other providers that ended up being very complex (eg iNaturalist). We should add a row to the DAG Status page explaining why the DAG is not yet enabled, and prioritize that work separately.

@stacimc stacimc requested a review from obulat January 9, 2024 17:24
@openverse-bot (Collaborator):

Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 14 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)[2].

@ngken0995, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.


obulat commented Jan 12, 2024

In the meantime, I think this PR could be merged as-is, although the data quality should be addressed before we actually turn the DAG on in production and begin ingesting. This is in line with how we've managed the addition of some other providers that ended up being very complex (eg iNaturalist). We should add a row to the DAG Status page explaining why the DAG is not yet enabled, and prioritize that work separately.

I like this solution, @stacimc, nice to have such a workaround.

@obulat (Contributor) left a comment

Thank you for adding a new DAG, @ngken0995, and for all of your patience during the review! It's interesting that this DAG adds a new POST request to the ingester.

@ngken0995 merged commit ab09a8a into WordPress:main on Jan 12, 2024
39 checks passed
Labels
💻 aspect: code (Concerns the software code in the repository), 🌟 goal: addition (Addition of new feature), 🟩 priority: low (Low priority and doesn't need to be rushed), 🧱 stack: catalog (Related to the catalog and Airflow DAGs)
Development

Successfully merging this pull request may close these issues.

Auckland Museum
5 participants