-
Notifications
You must be signed in to change notification settings - Fork 54
Refactor Rawpixel to use ProviderDataIngester
#795
Conversation
00d03ca
to
98c41a6
Compare
I added thumbnail capture to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this now! Seems that this refactor came at just the right time :) I haven't run the DAG yet but I wanted to comment that I strongly believe we should not include the thumbnails in the meta_data
when there is a column for it. I see that the old script wasn't using thumbnails so can we exclude it until we reach an agreement?
Will be reviewing the rest of the script in a moment...
openverse_catalog/dags/providers/provider_api_scripts/rawpixel.py
Outdated
Show resolved
Hide resolved
openverse_catalog/dags/providers/provider_api_scripts/rawpixel.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing and essentially adding a new provider script 😄 I ran it locally and got 67,200 records in about 11 minutes of testing 🥳 A lot of very cool images as well!
The generated TSV looks good, although it's a shame that creator
seems to be populated very sparsely. The only other thing I noticed was that a fair number of titles seem to be cut off. For example, this Rawpixel image has the title Paul Klee's Rich Port (a
. I'm curious if they're coming back cut off in the API response?
@@ -10,7 +10,9 @@ INSERT INTO public.image_popularity_metrics ( | |||
) VALUES | |||
('flickr', 'views', 0.85), | |||
('wikimedia', 'global_usage_count', 0.85), | |||
('stocksnap', 'downloads_raw', 0.85); | |||
('stocksnap', 'downloads_raw', 0.85), | |||
('rawpixel', 'download_count', 0.85) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! This has me curious if there are any other easy wins for popularity metrics with our other providers, created #815
openverse_catalog/dags/providers/provider_api_scripts/rawpixel.py
Outdated
Show resolved
Hide resolved
Desktop wallpaper summer beach landscape, | Free Photo - rawpixel | ||
Branch with a sunflower (1714–1760) | Free Photo Illustration - rawpixel | ||
Free public domain CC0 photo. | Free Photo - rawpixel | ||
Flower background. Free public domain | Free Photo - rawpixel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the text cleaning you've added here is fantastic 🥇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks 😅 It was a ton of trial/error & tinkering!
Yes, unfortunately they're coming back from the API like that 😞 I think they must have some algorithm that takes the description and chops it off after a certain number of words, which is why we're seeing titles like that. Without trying to do a bunch of extra data cleaning, I'm not sure what'd be the best approach. We could potentially use the description as the title 🤔 they're usually not too long and match the title, what do you think? |
Okay, after a bit more investigation here's some data on title vs description. Note that this is after the cleaning that we do to strip away some of the dead text. Script used: from providers.provider_api_scripts.rawpixel import RawpixelDataIngester
rp = RawpixelDataIngester()
qp = rp.get_next_query_params({})
batch, _ = rp.get_batch(qp)
for r in batch:
print(f'{r["id"]}:\n\ttitle={RawpixelDataIngester._get_title(r["metadata"])}\n\tdescr={RawpixelDataIngester._clean_text(r["metadata"]["description_text"])}') Results
There's a lot of duplication there, but using the |
Co-authored-by: Krystle Salazar <[email protected]>
This reverts commit f2b6a3a.
71c8322
to
b6538c8
Compare
@@ -87,7 +87,7 @@ def get_record_data(self, data): | |||
return None | |||
license_url = data.get("license_url") | |||
license_info = get_license_info(license_url=license_url) | |||
if license_info == LicenseInfo(None, None, None, None): | |||
if license_info == NO_LICENSE_FOUND: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great fix! It was really cumbersome before.
openverse_catalog/dags/providers/provider_api_scripts/rawpixel.py
Outdated
Show resolved
Hide resolved
Testing the dismiss review feature, will receive more feedback later today
ProviderDataIngester
if not foreign_url: | ||
@staticmethod | ||
def _get_category(metadata: dict) -> str | None: | ||
keywords = set(metadata.get("popular_keywords", [])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to have more categorized media!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that we have more metadata for this provider than we usually do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great :) Above and beyond on the title/description work, I really like how you've handled it. Super excited for this one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome cleaning work, looks great! I just left some minor non-blocking comments.
# Unescape HTMl sequences | ||
text = html.unescape(text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be quite useful for other providers as well! (I think Jamendo is another one returning scaped text?)
openverse_catalog/dags/providers/provider_api_scripts/rawpixel.py
Outdated
Show resolved
Hide resolved
from common.licenses import get_license_info | ||
from airflow.models import Variable | ||
from common import constants | ||
from common.licenses import NO_LICENSE_FOUND, get_license_info |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pycharm complains here:
Cannot find reference 'NO_LICENSE_FOUND' in
'__init__.py'
Interesting that it still works 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How odd.... because it's definitely there 😅
Fixes
Fixes WordPress/openverse#1515 by @stacimc, fixes WordPress/openverse#1419 by @AetherUnbound, fixes WordPress/openverse#1689 by @zackkrida
Description
This PR refactors the Rawpixel ingester to use the
ProviderDataIngester
class. This required a bit more work than I was anticipating, mostly because Rawpixel's API surface has changed so much (for the better!) since this script was written. We now need to use an HMAC signature based query mechanism in order to pass 100,000 records, which is the public limit.I've also been in communication with Rawpixel directly regarding some of these fields (namely
image_url
).I took my best attempt at heuristics for the
category
field, and worked to scrub some unnecessary text from the title & description which might hurt our search relevancy down the road. If either of these seem inappropriate, please let me know!We're also missing filesize, regrettably. It looks to be a field present in their frontend (see: https://www.rawpixel.com/image/6439018/vector-plant-flower-vintage) but not always (see: https://www.rawpixel.com/image/6516667/image-aesthetic-vintage-public-domain). It looks like we could extract it from the download options description if it's available, let me know if you think that's the best route @WordPress/openverse-catalog.
Testing Instructions
just test
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin