Retrieve Auckland Museum Image Data #3258
Conversation
This is awesome work, @ngken0995!

To add the DAG that would run your script at specific intervals, you'll also need to add it to `catalog/dags/providers/provider_workflows.py`. Then, if you run `just up` and go to 0.0.0.0:9090 (log in with `airflow`/`airflow`), you can see all of the current DAGs, and you can start the Auckland Museum DAG by clicking on the green triangle. Then you can view the logs from the DAG, and "mark" the "pull_data" step "successful" after several minutes. The DAG will then go on to save the data to a Postgres database on your local machine!
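For reference, a new entry in `provider_workflows.py` might look roughly like this (a sketch only; check the existing entries in that file for the exact `ProviderWorkflow` fields, which may differ from what's assumed here):

```python
# In catalog/dags/providers/provider_workflows.py (sketch; the
# ingester_class and start_date fields mirror other entries, but
# verify against the real ProviderWorkflow dataclass definition).
from datetime import datetime

from providers.provider_api_scripts.auckland_museum import (
    AucklandMuseumDataIngester,
)

PROVIDER_WORKFLOWS = [
    # ... existing providers ...
    ProviderWorkflow(
        ingester_class=AucklandMuseumDataIngester,
        start_date=datetime(2023, 9, 1),  # must not be in the future
    ),
]
```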
`get_record_data` must return all of the required pieces of data; otherwise it must return `None`.

You can see the pieces that are required, and the other pieces you can return, in the `add_item` method of the `ImageStore`:

```python
def add_item(
```
Currently, when I ran the DAG, I got an error saying `TypeError: ImageStore.add_item() missing 2 required positional arguments: 'foreign_landing_url' and 'foreign_identifier'`. So, you need to add `foreign_landing_url` (the page for the media item on the provider website) and the `foreign_identifier`. For the following item (the first ingested item in the script), I think the `foreign_identifier` would be `7109b24c87dbc582327584848d3ee481b2bf5c6e`.
```python
{'creator': 'Auckland War Memorial Museum',
 'filesize': 167,
 'license_info': LicenseInfo(license='by', version='4.0', url='https://creativecommons.org/licenses/by/4.0/', raw_url='https://creativecommons.org/licenses/by/4.0/'),
 'meta_data': {'department': 'ephemera',
               'geopos': '',
               'type': 'ecrm:E84_Information_Carrier'},
 'thumbnail_url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e?rendering=thumbnail.jpg',
 'title': 'New Zealand Contemporary Furniture',
 'url': 'http://api.aucklandmuseum.com/id/media/p/7109b24c87dbc582327584848d3ee481b2bf5c6e'}
```
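To make that concrete, here is a minimal sketch of the required-field guard in `get_record_data` (the response keys other than `primaryRepresentation` are hypothetical placeholders, not the real API fields):

```python
def get_record_data(self, data: dict) -> dict | None:
    url = data.get("primaryRepresentation")
    if not url:
        return None

    # Sketch: take the foreign identifier from the last path segment
    # of the media URL, e.g. "7109b24c87dbc582327584848d3ee481b2bf5c6e".
    foreign_identifier = url.rsplit("/", maxsplit=1)[-1]

    # "landingPageUrl" is a hypothetical key, for illustration only.
    foreign_landing_url = data.get("landingPageUrl")
    if not foreign_landing_url:
        return None

    return {
        "url": url,
        "foreign_identifier": foreign_identifier,
        "foreign_landing_url": foreign_landing_url,
        # plus license_info, title, meta_data, etc.
    }
```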
@obulat I was able to find the …
Fantastic @ngken0995, I'm excited to see you jumping in to add a new provider! I suspect you might be having trouble running the DAG locally because of the `start_date` being in the future (see my comment below). Once you update the `start_date`, you'll be able to run it locally and actually get data.
Once you've ingested some data locally, you can see what it looks like by querying your local catalog database. Run `just catalog/pgcli` to open pgcli in your terminal, and then you can run SQL queries (e.g. `select * from image where provider='aucklandmuseum' limit 10;`).
I was not initially able to ingest any data, because the DAG fails on every image when trying to fetch the file size with a 301. I commented this part out for the sake of testing the rest of the code.
On a higher-level note: the `100,000` number comes from the rate limit and max response size, right? Meaning there's actually more data than we can fetch in a day? My concern is that when the DAG is run a second time, it'll start processing from the beginning all over again. As it currently stands, I don't think we'll ever be able to ingest more than those first 100k rows.

If the API supports date range queries, we could consider making this a dated DAG instead. Otherwise, we might need to be a bit creative. The absolute simplest, somewhat silly solution I can think of is to give it a huge timeout and greatly extend the `delay` between requests, such that the DAG runs over a week or so, only fetching 100k a day in order to respect the rate limit. I'm not sure what their total dataset size is, so I'm not sure whether that's feasible!
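To illustrate the throttling idea (the numbers and the `delay` attribute are assumptions, not a tested configuration):

```python
class AucklandMuseumDataIngester(ProviderDataIngester):
    # Sketch: at 100 records per request, ~1,000 requests cover 100k
    # records; spacing requests ~86 seconds apart spreads that over a
    # full day, respecting the rate limit while the DAG runs all week.
    delay = 86.4  # assumed seconds-between-requests knob
```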
```python
# copyright:CC state Creative Commons Attribution 4.0
return {
    "q": "_exists_:primaryRepresentation+copyright:CC",
    "size": "100",
```
Since this is the default `batch_limit` from the parent class, we can use `self.batch_limit` here (and in the increment in the `else` statement).
The max amount of data retrievable from the API is `10,000`; look at `hits -> total -> value` in the API response. `size` is the number of records to return in a GET request, and `from` is the offset into the total results. We can keep incrementing `from` until it reaches `10,000`, and the `get_should_continue` function should know when it reaches that limit.
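Putting that together, the pagination hooks could look something like this sketch (the hook names come from `ProviderDataIngester` as discussed in this thread; the rest is illustrative):

```python
class AucklandMuseumDataIngester(ProviderDataIngester):
    # The API stops paging at hits -> total -> value = 10,000.
    total_amount_of_data = 10_000

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.from_start = 0

    def get_next_query_params(self, prev_query_params, **kwargs):
        if prev_query_params is not None:
            # Advance the window by one batch after the first request.
            self.from_start += self.batch_limit
        return {
            "q": "_exists_:primaryRepresentation+copyright:CC",
            "size": self.batch_limit,
            "from": self.from_start,
        }

    def get_should_continue(self, response_json):
        # Stop before `from` would step past the 10,000-result cap.
        return self.from_start + self.batch_limit < self.total_amount_of_data
```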
I meant that by default `self.batch_limit` is 100, and that you can just say:

```python
"size": self.batch_limit,
```

rather than hard-coding it separately here.
I will update `"size"` with `batch_limit`. Please take a look at the comment below about the `batch_limit` default value.
```python
url = information.get("primaryRepresentation")

thumbnail_url = f"{url}?rendering=thumbnail.jpg"
```
The thumbnail their API provides is tiny, with a fixed width of 70px. @obulat would know best -- should we use this, or just default to `None` here and use our own thumbnail service? They also have a slightly bigger `preview` rendering with a fixed width of 100px.
We have previously discussed the thumbnail sizes, and decided against using thumbnails smaller than 600px: #675 (comment)
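In that case, the simplest fix is to not record the provider's 70px rendering at all (a one-line sketch), letting Openverse's own thumbnail service handle previews:

```python
# Skip the provider's fixed-width 70px rendering; Openverse generates
# its own thumbnails when thumbnail_url is None.
thumbnail_url = None
```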
@stacimc I tried to add the url with …

From a quick look, it does look like the API returns image urls with …

Reiterating this point from earlier: it occurred to me as I was looking at this again, is there a reason the …

My apologies, I should have stated why the size was set to a default of 100. I didn't know what the correct …
Thank you for such a great contribution, @ngken0995! I ran the DAG locally, and it works well.

I am concerned with the quality of the data we collect, though. I got around 60 items locally, and a very large proportion of them show either a "Server error" or an "Online image not available" image for the main image file.

I think we should check the main `url` before saving the item to the catalog for this provider, otherwise we risk collecting a lot of dead links here (see the sketch after the table below). What do you think, @openverse-catalog?

Here's a sample of urls I got locally:
+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
| url | foreign_landing_url | title |
|---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------|
| https://api.aucklandmuseum.com/id/media/v/2882 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-18430 | jar |
| https://api.aucklandmuseum.com/id/media/v/117250 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-15507 | glass, wine |
| https://api.aucklandmuseum.com/id/media/v/3191 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923 | jar, lidded |
| https://api.aucklandmuseum.com/id/media/v/861840 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-22679 | briefs, pair |
| https://api.aucklandmuseum.com/id/media/v/370276 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-608181 | cartridges |
| https://api.aucklandmuseum.com/id/media/v/325015 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-4375 | teabowl |
| https://api.aucklandmuseum.com/id/media/v/528116 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-1151 | bowl, lidded |
| https://api.aucklandmuseum.com/id/media/v/34541 | https://www.aucklandmuseum.com/collections-research/collections/record/am_naturalsciences-object-368805 | Carex resectans Cheeseman |
| https://api.aucklandmuseum.com/id/media/v/828322 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-90135 | skirt, wool |
| https://api.aucklandmuseum.com/id/media/v/229298 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-61319 | tablecloth, signature |
| https://api.aucklandmuseum.com/id/media/v/75802 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-14260 | cup |
| https://api.aucklandmuseum.com/id/media/v/117280 | https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-12592 | goblet |
+---------------------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------+
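For the liveness check, something along these lines could work (a sketch using plain `requests`; the real ingester would presumably go through its shared delayed requester, and note the API's 301 redirects mentioned above):

```python
import requests

def url_is_live(url: str) -> bool:
    # HEAD avoids downloading the image; follow redirects because the
    # API 301-redirects media URLs.
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        return response.ok
    except requests.RequestException:
        return False
```

The trade-off is one extra request per record, which would slow ingestion further on top of the existing filesize HEAD requests.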
Co-authored-by: Olga Bulat <[email protected]>
@obulat which ones did you see that errored? I chose some random ones from the list you shared and they all worked for me. I'm curious to see whether they're present in the Wikimedia Commons dataset (perhaps they've already sorted through those ones in the Wikimedia dump).
I think you can get to the main images for the next 2 sections from their landing URLs, but it's not easy to derive their URLs.

"Online image not available" placeholder: https://api.aucklandmuseum.com/id/media/v/3191, https://api.aucklandmuseum.com/id/media/v/2882

Internal server error: https://api.aucklandmuseum.com/id/media/v/861840

No image on the landing page: https://www.aucklandmuseum.com/collections-research/collections/record/am_humanhistory-object-9923

Placeholder image on the landing page

Maybe there's a geographical access problem? It's a good idea to check for these items in Wikimedia.
I have the same errors, and I'm based in the US.
From what my spouse has told me, the museum is quick to remove public access to things for cultural sensitivity reasons, in favour of having controlled access with a culturally relevant approach. I wouldn't be surprised if there are public records for those items where the images aren't available. If we want to check, I'm pretty sure the museum would respond to an email from us, and if not, I can ask my spouse to get us in touch with someone; they still have connections with folks there. At the very least, we could clarify which of these intentionally do not have access, so that we 100% do not index them (knowing that they won't ever be available), and which are technical (potentially temporary) access issues, if such a distinction even exists.
Very interesting, I'm glad you spotted this @obulat! Given the complexity of the data quality questions, and the fact that we already have outstanding work required for de-duplicating these results with Wikimedia, I think it would be reasonable to open a separate issue and PR to address this. In the meantime, I think this PR could be merged as-is, although the data quality should be addressed before we actually turn the DAG on in production and begin ingesting. This is in line with how we've managed the addition of some other providers that ended up being very complex (e.g. iNaturalist). We should add a row to the DAG Status page explaining why the DAG is not yet enabled, and prioritize that work separately.
I like this solution, @stacimc, nice to have such a workaround.
Thank you for adding a new DAG, @ngken0995, and all of your patience during the review! It's interesting that this DAG adds a new POST request to the ingester.
Fixes
Fixes #1771 by @obulat
Description
Adds the script to get all the media from aucklandmuseum.com. Currently, that amounts to 10,000 images, because the search result maximum is 10,000 images.

To collect the filesize, this script makes HEAD requests for individual media items, which makes the script slower than expected. But it should be okay, considering the number of available images.

Image dimensions are not available, so they will need to be collected separately in the future.
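For reference, the filesize lookup via HEAD boils down to reading `Content-Length` without fetching the body (a sketch with plain `requests`; the script itself presumably reuses its shared requester):

```python
import requests

def get_filesize(url: str) -> int | None:
    # HEAD returns only headers, so the image bytes are never downloaded.
    response = requests.head(url, allow_redirects=True, timeout=10)
    size = response.headers.get("Content-Length")
    return int(size) if size else None
```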
Testing Instructions
Run `just catalog/test -k auckland_museum`.
Checklist

Developer Certificate of Origin