-
Notifications
You must be signed in to change notification settings - Fork 54
Conversation
Thanks for jumping on this @krysal!! We actually got a response back from Mike Keesey in the upstream ticket I filed, looks like we might have some new endpoints available: keesey/phylopic#5 (comment) I'll be responding to their comment today 😄 |
I read it! While that gets shipped, this update will enable us to reactivate PhyloPic after a run :) |
openverse_catalog/dags/providers/provider_api_scripts/phylopic.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the quick fix of a ❗ critical issue, @krysal!
I will approve after the creator_url
and last page are fixed.
if response_json and response_json.get("success") is True: | ||
return response_json.get("result") | ||
def get_should_continue(self, response_json): | ||
return self.current_page < self.total_pages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return self.current_page < self.total_pages | |
return self.current_page <= self.total_pages |
This DAG ran really well until the last page, when the last requested page returned 404:
[2023-03-27, 15:45:24 +03] {requester.py:78} WARNING - Unable to request URL: https://api.phylopic.org/images?build=196&page=146&embed_items=true Status code: 404
[2023-03-27, 15:45:24 +03] {requester.py:148} WARNING - Bad response_json: None
[2023-03-27, 15:45:24 +03] {requester.py:149} WARNING - Retrying:
_get_response_json(
https://api.phylopic.org/images,
{'build': 196, 'page': 146, 'embed_items': 'true'},
retries=-1)
[2023-03-27, 15:45:24 +03] {requester.py:134} ERROR - No retries remaining. Failure.
[2023-03-27, 15:45:24 +03] {media.py:230} INFO - Writing 74 lines from buffer to disk.
[2023-03-27, 15:45:24 +03] {provider_data_ingester.py:501} INFO - Committed 6974 records
[2023-03-27, 15:45:24 +03] {taskinstance.py:1768} ERROR - Task failed with exception
providers.provider_api_scripts.provider_data_ingester.IngestionError: Retries exceeded
query_params: {"build": 196, "page": 146, "embed_items": "true"}
The initial_request.json
(tests/dags/providers/provider_api_scripts/resources/phylopic/initial_request.json) seems to suggest that we should use <=
or even ==
here, because even thought the totalPages is 145, the lastPage is 144.
"lastPage": {
"href": "/images?build=194&page=144"
},
"totalPages": 145
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I noticed it's trying to request the page 146, which is actually the page 147 because it starts at 0
, from a total of 146, so this requires changing the initial page to 1
then we pass current_page - 1
as the query parameter and the condition <
is maintained, so we don't try to get a page that doesn't exists and avoid the DAG fails.
openverse_catalog/dags/providers/provider_api_scripts/phylopic.py
Outdated
Show resolved
Hide resolved
"title": title, | ||
"meta_data": meta_data, | ||
# TODO: Evaluate whether to include upstream thumbnails. | ||
# Sizes available: 192x192, 128x128, 64x64. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the other images we get in the sample_record.json
: while the SVGs are all crisp on any resolution, the raster image of an acceptable size (https://images.phylopic.org/images/5b1e88b5-159d-495d-b8cb-04f9e28d2f02/raster/512x480.png) looks much worse.
I wonder if we should handle SVGs differently: they can be basically any size, their width and height don't mean much beyond the aspect ratio (which of course is very important for the frontend rendering), and their thumbnails will probably be larger in size and worse in quality than the original image. Not for this PR specifically, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the raster images are an option but look really blurry, not sure if it's worth including those for thumbnails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚀
@krysal it seems the old parameters have been re-added to the new API 😮 keesey/phylopic#5. How would you like to proceed? I'm happy to review this if you'd like to get it in and follow up with making this a dated DAG once more, or we could pivot to just updating the dated portions of the existing DAG. |
@AetherUnbound It's great the filters are available already. I'd like to merge this so we can run the DAG to fix all the PhyloPic data and after that we can improve making the DAG dated again along with the other TODO notes I left. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for jumping on this so fast! Code-wise, I just have a few non-blocking comments.
On the data front though, it looks like this will give us an entirely new set of Phylopic data (rather than updating existing values). I was able to test this by:
- Download production phylopic data:
\copy (SELECT * FROM image WHERE provider='phylopic') to '/tmp/phylopic_v1_prod.csv' DELIMITER ',' CSV HEADER
. This returned 4,156 rows - Start the catalog stack (I had to rebuild the postgres image with a newer version of pgcli, PR coming out for that in a little bit)
- Copy the file into the postgres image:
docker cp /tmp/phylopic_v1_prod.csv openverse-catalog-postgres-1:/tmp/phylopic_v1_prod.csv
- Load the file into the table using
j db-shell
:\copy image from '/tmp/phylopic_v1_prod.csv' WITH (FORMAT CSV, HEADER)
- Run the full DAG, which processed 6,974 rows.
- Ran a count using
j db-shell
:
openledger> select count(*) from image;
+-------+
| count |
|-------|
| 10521 |
+-------+
SELECT 1
Time: 0.040s
All that to say that we will be losing the old data, which I think is fine because it's not working anyway 😅 We could explicitly delete the existing Phylopic data from the catalog prior to running this DAG though, that might be the best bet to ensure we don't have records hanging around that can't be accessed.
"""Get the required `build` param from the API and set the total pages.""" | ||
resp = self.get_response_json(query_params={}) | ||
if not resp: | ||
raise Exception("No response from Phylopic API.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this might be a more specific exception?
raise Exception("No response from Phylopic API.") | |
raise ValueError("No response from Phylopic API.") |
Good find! Unfortunately, it seems some olders rows have -- Sample `foreign_identifier` from PhyloPic image
http://phylopic.org/assets/images/submissions/68e12ccc-dbab-4bd9-98c8-6bf33a85b080.1024.png I think that will require an extra DAG to run a one time pass through PhyloPic to fix those values. What do you think? I can work on that, so we don't start PhyloPic DAG until that is done. |
Honestly, it's only 4k records 😅 I think it would be okay to move them all, especially since they were changed on the upstream site (and since this precludes our ability to have Phylopic displaying on the API I think). But I'm fine either way! |
Fixes
Fixes WordPress/openverse#902 🚨
Fixes WordPress/openverse#1288
Description
This PR updates PhyloPic to use v2 of the API, and with this, it's now converted to a non-dated DAG since v2 doesn't support a filter by modified date yet. The good news is that getting the fields is much easier and should run faster since we can get more items per request.