This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Refactor Rawpixel to use ProviderDataIngester #795

Merged (29 commits) on Oct 27, 2022

Contributor

@AetherUnbound AetherUnbound commented Oct 14, 2022

Fixes

Fixes WordPress/openverse#1515 by @stacimc, fixes WordPress/openverse#1419 by @AetherUnbound, fixes WordPress/openverse#1689 by @zackkrida

Description

This PR refactors the Rawpixel ingester to use the ProviderDataIngester class. This required a bit more work than I was anticipating, mostly because Rawpixel's API surface has changed so much (for the better!) since this script was written. We now need to use an HMAC-signature-based query mechanism in order to get past 100,000 records, which is the public limit.
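
For readers unfamiliar with signed queries, here is a minimal sketch of what an HMAC-signed request can look like. It is not Rawpixel's actual scheme, which this description does not spell out: the `sign_query` helper, the `s` parameter name, the sorted URL-encoded canonical form, SHA-256, and the URL-safe base64 encoding are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode


def sign_query(params: dict, secret: str) -> dict:
    """Return a copy of ``params`` with a hypothetical ``s`` signature added."""
    # Canonicalize: sort the keys so the same params always produce the same string.
    canonical = urlencode(sorted(params.items()))
    # Keyed hash of the canonical query string (SHA-256 is an assumption).
    digest = hmac.new(secret.encode(), canonical.encode(), hashlib.sha256).digest()
    # URL-safe base64 without padding, so the value can travel in a query string.
    signature = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return {**params, "s": signature}


# Placeholder secret and parameter names, for illustration only.
signed = sign_query({"page": 1, "pagesize": 100}, secret="not-a-real-key")
```

The server would recompute the same digest from the received parameters and its copy of the secret, which is what lets it trust requests that go beyond the public limit.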

I've also been in communication with Rawpixel directly regarding some of these fields (namely image_url).

I made my best attempt at heuristics for the category field, and worked to scrub some unnecessary text from the title & description which might hurt our search relevancy down the road. If either of these seems inappropriate, please let me know!

We're also missing filesize, regrettably. It appears in their frontend for some images (see: https://www.rawpixel.com/image/6439018/vector-plant-flower-vintage) but not all (see: https://www.rawpixel.com/image/6516667/image-aesthetic-vintage-public-domain). It looks like we could extract it from the download options description when it's available. Let me know if you think that's the best route @WordPress/openverse-catalog.

NOTE: We can technically receive popularity metrics for Rawpixel! I have added a metrics field in the table definition; we will need to add this record to the prod table manually. I haven't seen any values that aren't 0 yet, but it'll be useful to have this as part of the ingestion nonetheless if data becomes available there.

Testing Instructions

  1. just test
  2. You may be able to run this without an API key, at least for testing. If it's needed, there's one present in the maintainer secret store. Run this locally and check the results in the database!

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner October 14, 2022 00:09
@openverse-bot openverse-bot added ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🟨 priority: medium Not blocking but should be addressed soon labels Oct 14, 2022
@AetherUnbound AetherUnbound force-pushed the feature/rawpixel-refactor#592 branch from 00d03ca to 98c41a6 Compare October 21, 2022 19:02
@AetherUnbound
Contributor Author

I added thumbnail capture to meta_data["thumbnail_url"] 💯

krysal previously requested changes Oct 21, 2022
Member

@krysal krysal left a comment

Looking at this now! Seems that this refactor came at just the right time :) I haven't run the DAG yet but I wanted to comment that I strongly believe we should not include the thumbnails in the meta_data when there is a column for it. I see that the old script wasn't using thumbnails so can we exclude it until we reach an agreement?

Will be reviewing the rest of the script in a moment...

Contributor

@stacimc stacimc left a comment

This is amazing and essentially adding a new provider script 😄 I ran it locally and got 67,200 records in about 11 minutes of testing 🥳 A lot of very cool images as well!

The generated TSV looks good, although it's a shame that creator seems to be populated very sparsely. The only other thing I noticed was that a fair number of titles seem to be cut off. For example, this Rawpixel image has the title Paul Klee's Rich Port (a. I'm curious if they're coming back cut off in the API response?

@@ -10,7 +10,9 @@ INSERT INTO public.image_popularity_metrics (
 ) VALUES
 ('flickr', 'views', 0.85),
 ('wikimedia', 'global_usage_count', 0.85),
-('stocksnap', 'downloads_raw', 0.85);
+('stocksnap', 'downloads_raw', 0.85),
+('rawpixel', 'download_count', 0.85)
Contributor

Cool! This has me curious if there are any other easy wins for popularity metrics with our other providers, created #815

Desktop wallpaper summer beach landscape, | Free Photo - rawpixel
Branch with a sunflower (1714–1760) | Free Photo Illustration - rawpixel
Free public domain CC0 photo. | Free Photo - rawpixel
Flower background. Free public domain | Free Photo - rawpixel
Contributor

All of the text cleaning you've added here is fantastic 🥇

Contributor Author

Thanks 😅 It was a ton of trial/error & tinkering!
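
To make the kind of cleaning discussed in this thread concrete, here is a rough sketch that handles only the page-title suffix visible in the sample strings above. The `clean_title` name and the regular expression are hypothetical, and the cleaning in this PR covers more cases than this (for example, it also drops filler such as "Free public domain").

```python
import re

# Hypothetical pattern for a trailing " | Free <something> - rawpixel" suffix.
SITE_SUFFIX = re.compile(r"\s*\|\s*Free [\w ]+ - rawpixel\s*$", re.IGNORECASE)


def clean_title(raw: str) -> str:
    """Strip the site suffix, then any punctuation left dangling at the end."""
    text = SITE_SUFFIX.sub("", raw)
    return text.rstrip(" ,.").strip()


example = clean_title(
    "Desktop wallpaper summer beach landscape, | Free Photo - rawpixel"
)
# example == "Desktop wallpaper summer beach landscape"
```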

@AetherUnbound
Contributor Author

The only other thing I noticed was that a fair number of titles seem to be cut off. For example, this Rawpixel image has the title Paul Klee's Rich Port (a. I'm curious if they're coming back cut off in the API response?

Yes, unfortunately they're coming back from the API like that 😞 I think they must have some algorithm that takes the description and chops it off after a certain number of words, which is why we're seeing titles like that. Without trying to do a bunch of extra data cleaning, I'm not sure what'd be the best approach. We could potentially use the description as the title 🤔 they're usually not too long and match the title, what do you think?

@AetherUnbound
Contributor Author

Okay, after a bit more investigation here's some data on title vs description. Note that this is after the cleaning that we do to strip away some of the dead text.

Script used:

from providers.provider_api_scripts.rawpixel import RawpixelDataIngester
rp = RawpixelDataIngester()
qp = rp.get_next_query_params({})
batch, _ = rp.get_batch(qp)
for r in batch:
    print(f'{r["id"]}:\n\ttitle={RawpixelDataIngester._get_title(r["metadata"])}\n\tdescr={RawpixelDataIngester._clean_text(r["metadata"]["description_text"])}')
Results:
4032668:
	title=Bull elk searches for food
	descr=Bull elk searches for food beneath the snow. Frank. Original public domain image from Flickr
3864377:
	title=Desktop wallpaper summer beach landscape
	descr=Desktop wallpaper summer beach landscape, background HD images. Original public domain image from Wikimedia Commons
6329121:
	title=Japanese autumn tree color drawing
	descr=Japanese autumn tree color drawing.
6755932:
	title=Smoking skeleton png sticker, vintage
	descr=Smoking skeleton png sticker, vintage illustration, transparent background.
6535407:
	title=Sugar skull png sticker, Day
	descr=Sugar skull png sticker, Day of the dead illustration on transparent background.
3864432:
	title=HD wallpaper winter landscape, nature
	descr=HD wallpaper winter landscape, nature image background. Original public domain image from Flickr
6318158:
	title=Butterfly & moth painting
	descr=Butterfly & moth painting.  Digitally enhanced from our own 1842 edition of Le Jardin Des Plantes by Paul Gervais
7615733:
	title=Open hand, palm reading
	descr=Open hand, palm reading. Original from the Library of Congress.
5909875:
	title=Free abstract watercolor background image
	descr=Free abstract watercolor background image
4036226:
	title=A Young Girl Defending Herself
	descr=A Young Girl Defending Herself against Eros (1825-1905) illustration in high resolution by William-Adolphe Bouguereau. Original from Getty Museum.
5995578:
	title=Japanese butterfly. Digitally enhanced from our
	descr=Japanese butterfly. Digitally enhanced from our own original 1904 edition of Kamisaka Sekka's Cho senshu (One Thousand Butterflies)
4036186:
	title=A Hare in the Forest
	descr=A Hare in the Forest (1585) painting in high resolution by Hans Hoffmann. Original from Getty Museum.
6118621:
	title=Blue flower pattern, Examples of Chinese
	descr=Blue flower pattern, Examples of Chinese Ornament selected from objects in the South Kensington Museum and other collections by Owen Jones. Digitally enhanced plate from our own original 1867 edition of the book
6043535:
	title=Beautiful butterfly on white background
	descr=Beautiful butterfly on white background.
4026340:
	title=Mural by Beastman, Spotlight Sydenham
	descr=Mural by Beastman, Spotlight Sydenham - Sydenham, Christchurch, Canterbury, 23 December 2013
4032812:
	title=Juniper illustration in high resolution
	descr=Juniper illustration in high resolution by Georg Dionysius Ehret (1708-1770). Original from Getty Museum.
3896191:
	title=Claude Monet's The Magpie (1868–1869)
	descr=Claude Monet's The Magpie (1868–1869) famous painting. Original from Wikimedia Commons.
4036224:
	title=A Sheet of Studies with French
	descr=A Sheet of Studies with French Roses and an Oxeye Daisy (1570) illustration in high resolution by Jacques Le Moyne de Morgues. Original from Getty Museum.
3864388:
	title=Galaxy wallpaper desktop background, HD
	descr=Galaxy wallpaper desktop background, HD aesthetic night sky image. Original public domain image from Wikimedia Commons
6031570:
	title=Flower background
	descr=Flower background.
7514064:
	title=“Cosmic Cliffs” in the Carina
	descr=“Cosmic Cliffs” in the Carina Nebula from NASA’s James Webb Space Telescope (NIRCam Image)
4036206:
	title=Studies of Peonies (1472-1473) painting
	descr=Studies of Peonies (1472-1473) painting in high resolution by Martin Schongauer. Original from Getty Museum.
3819803:
	title=Zebra (1763) painting in high
	descr=Zebra (1763) painting in high resolution by George Stubbs. Original from The Yale University Art Gallery.
3864317:
	title=Flower wallpaper desktop, aesthetic HD
	descr=Flower wallpaper desktop, aesthetic HD background nature image. Original public domain image from Wikimedia Commons
2780735:
	title=Bouquet of Flowers on a Ledge
	descr=Bouquet of Flowers on a Ledge (1619) in high resolution by Ambrosius Bosschaert. Original from the Los Angeles County Museum of Art.
5920316:
	title=Free bold abstract painting background
	descr=Free bold abstract painting background
3578359:
	title=A civilian actor dressed in moulage
	descr=A civilian actor dressed in moulage to simulate an injury stands by to be placed at an accident site during a full scale exercise involving over 600 Army and Air National Guardsmen from New York, New Jersey, and West Virginia at Joint Base McGuire-Dix-Lakehurst, N.J., April 17, 2015
2742839:
	title=Menselijk oog met een afwijking
	descr=Menselijk oog met een afwijking (1836–1912) print in high resolution by Isaac Weissenbruch. Original from The Rijksmuseum.
3864331:
	title=Desktop wallpaper Koi fish, aesthetic
	descr=Desktop wallpaper Koi fish, aesthetic HD image background. Original public domain image from Wikimedia Commons
6283170:
	title=Colorful abstract cat, animal hand
	descr=Colorful abstract cat, animal hand drawn illustration, transparent background.
6175816:
	title=Colorful floral pattern, Examples of Chinese
	descr=Colorful floral pattern, Examples of Chinese Ornament selected from objects in the South Kensington Museum and other collections by Owen Jones. Digitally enhanced plate from our own original 1867 edition of the book
3864303:
	title=Desktop wallpaper, beach aesthetic HD
	descr=Desktop wallpaper, beach aesthetic HD nature image background. Original public domain image from Wikimedia Commons
4032753:
	title=Bison group in the road
	descr=Bison group in the road near Frying Pan Spring. Original public domain image from Flickr
3864631:
	title=Vincent van Gogh's Starry Night
	descr=Vincent van Gogh's Starry Night Over the Rhone (1888) famous landscape painting. Original from Wikimedia Commons.
3970503:
	title=Ryū shōten (1897) print in high
	descr=Ryū shōten (1897) print in high resolution by Ogata Gekko
6261216:
	title=Vintage praying skeleton hand drawn
	descr=Vintage praying skeleton hand drawn illustration.
5978857:
	title=Butterfly pochoir pattern in Art
	descr=Butterfly pochoir pattern in Art Nouveau oriental style. Original from our own 1914 edition of Samarkande: 20 Compositions en couleurs dans le Style oriental" (Samarkand: 20 Color Compositions in the Oriental Style) by E. A. Séguy
6516667:
	title=Butterfly fairy clipart, mythical creature
	descr=Butterfly fairy clipart, mythical creature vintage illustration.
6182487:
	title=Flower pattern, Examples of Chinese
	descr=Flower pattern, Examples of Chinese Ornament selected from objects in the South Kensington Museum and other collections by Owen Jones. Digitally enhanced plate from our own original 1867 edition of the book
3970688:
	title=Crimson topaz hummingbird, Cyclamen, Red
	descr=Crimson topaz hummingbird, Cyclamen, Red Postman and shells from the Natural History Cabinet of Anna Blackburne (1768) painting in high resolution by James Bolton. Original from The Yale University Art Gallery.
6031223:
	title=Floral glasses
	descr=Floral glasses.
6029772:
	title=Red poppy field
	descr=Red poppy field.
3864349:
	title=Abstract wallpaper desktop, beautiful fluid
	descr=Abstract wallpaper desktop, beautiful fluid art design background. Original public domain image from Wikimedia Commons
3846747:
	title=Leonardo da Vinci's (1503–1506) Portrait
	descr=Leonardo da Vinci's (1503–1506) Portrait of Mona Lisa del Giocondo famous painting. Original from Wikimedia Commons.
3868942:
	title=Vincent van Gogh's Almond blossom
	descr=Vincent van Gogh's Almond blossom (1890) famous painting. Original from Wikimedia Commons.
3309051:
	title=Pigeons in white and blue
	descr=Pigeons in white and blue (1928) pattern in high resolution. Original from the Rijksmuseum.
7559009:
	title=Png trick or treat sticker
	descr=Png trick or treat sticker illustration, transparent background.
6195677:
	title=Wild animal, safari lithograph. Digitally
	descr=Wild animal, safari lithograph. Digitally enhanced from our own 1900 edition of The Great and Small Game of India, Burma, & Tibet by Richard Lydekker
3970645:
	title=Caesalpinoid legume, Blackburn's Earth Boring
	descr=Caesalpinoid legume, Blackburn's Earth Boring Beetle, Seven-Spotted Ladybird Beetle, Purple Emperor and shells from the Natural History Cabinet of Anna Blackburne (1768) painting in high resolution by James Bolton. Original from The Yale University Art Gallery.
3864278:
	title=Whale wallpaper desktop, animal illustration
	descr=Whale wallpaper desktop, animal illustration HD background. Original public domain image from Wikimedia Commons
5997199:
	title=Japanese butterfly. Digitally enhanced from our
	descr=Japanese butterfly. Digitally enhanced from our own original 1904 edition of Kamisaka Sekka's Cho senshu (One Thousand Butterflies)
3590250:
	title=Paul Klee's Rich Port (a
	descr=Paul Klee's Rich Port (a travel picture, 1938) painting in high resolution. Original from the Kunstmuseum Basel Museum.
2698177:
	title=Alphonse Maria Mucha's Zodiaque or
	descr=Alphonse Maria Mucha's Zodiaque or La Plume (ca. 1896–1897) by. Famous Art Nouveau artwork, original from The Art Institute of Chicago.
3065118:
	title=Katsuyama Neighborhood (ca.1929–1932) print in high
	descr=Katsuyama Neighborhood (ca.1929–1932) print in high resolution by Hiroaki Takahashi. Original from The Los Angeles County Museum of Art.
6284511:
	title=Monoline couple, hand drawn illustration
	descr=Monoline couple, hand drawn illustration.
4042370:
	title=Howl-O-Ween Spooktacular pet costume contest
	descr=Howl-O-Ween Spooktacular pet costume contest. Original public domain image from Flickr
3820038:
	title=Human Skeleton, Lateral View (Close
	descr=Human Skeleton, Lateral View (Close to the Final Study for Table III But Differs in Detail), (1795–1806) drawing in high resolution by George Stubbs. Original from The Yale University Art Gallery.
6178724:
	title=Colorful flower pattern, Examples of Chinese
	descr=Colorful flower pattern, Examples of Chinese Ornament selected from objects in the South Kensington Museum and other collections by Owen Jones. Digitally enhanced plate from our own original 1867 edition of the book
3984066:
	title=Grant Wood's American Gothic (1930)
	descr=Grant Wood's American Gothic (1930) famous painting. Original from Wikimedia Commons.
3545034:
	title=The Boulevard Montmartre on a Winter
	descr=The Boulevard Montmartre on a Winter Morning (1897) by Camille Pissarro. Original from The MET museum.
6839601:
	title=Cat gang  png sticker
	descr=Cat gang  png sticker, black and white illustration, transparent background.
3570933:
	title=Two cats, blue and yellow
	descr=Two cats, blue and yellow (1912) painting in high resolution by Franz Marc. Original from the Kunstmuseum Basel Museum.
3338300:
	title=Cute gray cat
	descr=Cute gray cat. Original public domain image from Wikimedia Commons
7615878:
	title=Rae's St. Louis mammoth chart
	descr=Rae's St. Louis mammoth chart (1849). Original from the Library of Congress.
3856482:
	title=Henri Rousseau's Virgin Forest with Sunset
	descr=Henri Rousseau's Virgin Forest with Sunset (1910) famous painting. Original from the Kunstmuseum Basel Museum.
7665935:
	title=Modern recut copy of The
	descr=Modern recut copy of The Great Wave off Kanagawa (神奈川沖波裏), from 36 Views of Mount Fuji, Color woodcut. Although it is often used in tsunami literature, there is no reason to suspect that Hokusai intended it to be interpreted in that way. The waves in this work are sometimes mistakenly referred to as tsunami (津波), but they are more accurately called okinami (沖波), great off-shore waves
6111838:
	title=Stormtroopers and Vendetta character mural
	descr=Stormtroopers and Vendetta character mural art. Location unknown - 04/21/2017
7665426:
	title=Vincent van Gogh - Head
	descr=Vincent van Gogh - Head of a skeleton with a burning cigarette - Google Art Project
3895804:
	title=Claude Monet's Impression, Sunrise (1872)
	descr=Claude Monet's Impression, Sunrise (1872) famous painting. Original from Wikimedia Commons.
3894606:
	title=Piet Mondrian's Composition with Red
	descr=Piet Mondrian's Composition with Red, Yellow, Blue, and Black (1921) famous painting. Original from Wikimedia Commons.
3338349:
	title=Lotus summer
	descr=Lotus summer. Original public domain image from Wikimedia Commons
5949467:
	title=None
	descr=
3864329:
	title=Desktop wallpaper, beautiful canyon travel
	descr=Desktop wallpaper, beautiful canyon travel destination image background. Original public domain image from Wikimedia Commons
4036140:
	title=Study of Clouds with a Sunset
	descr=Study of Clouds with a Sunset near Rome (1786-1801) painting in high resolution by Simon Alexandre Clément Denis. Original from Getty Museum.
3896220:
	title=Claude Monet's Water Lilies and
	descr=Claude Monet's Water Lilies and Japanese Bridge (1899) famous painting. Original from Wikimedia Commons.
3970654:
	title=European robin and wild strawberry
	descr=European robin and wild strawberry from the Natural History Cabinet of Anna Blackburne (1768) painting in high resolution by James Bolton. Original from The Yale University Art Gallery.
2940042:
	title=White Rabbit with Herald's Costume
	descr=White Rabbit with Herald's Costume Design (1915) for Alice in Wonderland in high resolution by William Penhallow Henderson. Original from The Smithsonian.
3152237:
	title=Trogan variegatus (1804–1908) print in high
	descr=Trogan variegatus (1804–1908) print in high resolution by John Gould and William Matthew Hart. Original from The National Gallery of Art.
3864780:
	title=Vincent van Gogh's Café Terrace
	descr=Vincent van Gogh's Café Terrace at Night (1888) famous painting. Original from Wikimedia Commons.
4087712:
	title=Vintage geometric floral motifs, variations
	descr=Vintage geometric floral motifs, variations 8 from our own Variations Quatre-Vingt-Six Motifs Décoratifs En Vingt Planches (1928) by Édouard Bénédictus.
3309056:
	title=White, gray, pink and red
	descr=White, gray, pink and red flowers (1929) pattern in high resolution by Charles Goy. Original from the Rijksmuseum.
3590279:
	title=In the realm of air
	descr=In the realm of air (1917) painting in high resolution by Paul Klee. Original from the Kunstmuseum Basel Museum.
2552950:
	title=Twin Stars (1851–1896) by Luis
	descr=Twin Stars (1851–1896) by Luis Falero. Original from The MET Museum.
3590290:
	title=Branch with a sunflower (1714–1760)
	descr=Branch with a sunflower (1714–1760) painting in high resolution by Michiel van Huysum. Original from The Rijksmuseum.
4036227:
	title=Cinnamomum illustration in high resolution
	descr=Cinnamomum illustration in high resolution by Georg Dionysius Ehret (1708-1770). Original from Getty Museum.
3844930:
	title=Johannes Vermeer’s Girl with a Pearl
	descr=Johannes Vermeer’s Girl with a Pearl Earring (ca. 1665) famous painting. Original from the Mauritshuis Museum.
3896957:
	title=Claude Monet's Lady in the garden
	descr=Claude Monet's Lady in the garden (1867) famous painting. Original from Wikimedia Commons.
3123663:
	title=N.V. The Scene. Dir. Willem
	descr=N.V. The Scene. Dir. Willem Royaards. Lucifer mourning game of Vondel. Music by Hubert Cuyper. Design, decor, costumes by R.N. Roland Holst. (1910) print in high resolution by Richard Roland Holst. Original from the Rijksmuseum.
6111583:
	title=Audrey Hepburn in Breakfast at Tiffany's
	descr=Audrey Hepburn in Breakfast at Tiffany's. Unknown location, unknown date
3848165:
	title=Leonardo da Vinci's The Last
	descr=Leonardo da Vinci's The Last Supper (1495-1498) famous painting. Original from Wikimedia Commons.
6289494:
	title=Tiger clipart, wild animal illustration
	descr=Tiger clipart, wild animal illustration.
3338121:
	title=A classy dog in a punk
	descr=A classy dog in a punk hat. Original public domain image from Wikimedia Commons
5923637:
	title=Colorful papers
	descr=Colorful papers
3864892:
	title=Vincent van Gogh's Olive Trees
	descr=Vincent van Gogh's Olive Trees with the Alpilles in the Background (1889) famous landscape painting. Original from Wikimedia Commons.
3864381:
	title=Water wallpaper desktop, wave aesthetic
	descr=Water wallpaper desktop, wave aesthetic HD nature photo background. Original public domain image from Wikimedia Commons
3970498:
	title=Picture of the Great Japanese
	descr=Picture of the Great Japanese Victory at a Navy Battle (1894) print in high resolution by Ogata Gekko
2968518:
	title=Kleine Welten I (Small Worlds
	descr=Kleine Welten I (Small Worlds I) (1922) print in high resolution by Wassily Kandinsky. Original from The MET Museum.
3283890:
	title=An overhead shot of a grouping
	descr=An overhead shot of a grouping of red and white flowers. Original public domain image from Wikimedia Commons
3896105:
	title=Claude Monet's The Cliffs at Étretat
	descr=Claude Monet's The Cliffs at Étretat (1885) famous painting. Original from the Sterling and Francine Clark Art Institute.
5924641:
	title=Free herd of horses digital
	descr=Free herd of horses digital art image, public domain animal CC0 photo

There's a lot of duplication there, but using the title that gets returned to us at least prevents us from having the "Original public domain image from " in the title. What do you think @stacimc @krysal?
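
One way to quantify the pattern in the results above is to flag records whose description continues past the title within the same sentence. The sketch below is an illustration only: `is_truncated` is a hypothetical helper, not part of the ingester, and treating `.`, `!` and `?` as sentence endings is an assumption.

```python
def is_truncated(title: str, description: str) -> bool:
    """True when the description starts with the title and carries on mid-sentence."""
    if not title or not description.startswith(title):
        return False
    remainder = description[len(title):]
    # An empty remainder, or one that opens with sentence-ending punctuation,
    # means the title already reads as a complete phrase.
    return bool(remainder) and remainder[0] not in ".!?"


samples = [
    ("Paul Klee's Rich Port (a",
     "Paul Klee's Rich Port (a travel picture, 1938) painting in high resolution."),
    ("Flower background", "Flower background."),
    ("Colorful papers", "Colorful papers"),
]
flags = [is_truncated(title, descr) for title, descr in samples]
# flags == [True, False, False]
```

A check like this only reports how often titles are cut off; it does not decide what the title should be instead.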

@AetherUnbound AetherUnbound force-pushed the feature/rawpixel-refactor#592 branch from 71c8322 to b6538c8 Compare October 25, 2022 15:15
@@ -87,7 +87,7 @@ def get_record_data(self, data):
             return None
         license_url = data.get("license_url")
         license_info = get_license_info(license_url=license_url)
-        if license_info == LicenseInfo(None, None, None, None):
+        if license_info == NO_LICENSE_FOUND:
Contributor

Great fix! It was really cumbersome before.
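
As a sketch of why the named constant reads better, the snippet below models the idea with stand-ins. The field names on `LicenseInfo` and the body of `get_license_info` are assumptions made for illustration (the diff only shows a four-field value and the `NO_LICENSE_FOUND` name), so this is not the catalog's real implementation.

```python
from typing import NamedTuple, Optional


class LicenseInfo(NamedTuple):
    # Field names are assumptions; the diff only shows four positional values.
    license: Optional[str]
    version: Optional[str]
    url: Optional[str]
    raw_url: Optional[str]


# One shared "nothing matched" value instead of rebuilding the tuple at each call site.
NO_LICENSE_FOUND = LicenseInfo(None, None, None, None)


def get_license_info(license_url: Optional[str]) -> LicenseInfo:
    """Stand-in lookup that recognizes a single CC0 URL only."""
    if license_url == "https://creativecommons.org/publicdomain/zero/1.0/":
        return LicenseInfo("cc0", "1.0", license_url, license_url)
    return NO_LICENSE_FOUND


missing = get_license_info(None) == NO_LICENSE_FOUND
# missing == True
```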

@AetherUnbound AetherUnbound dismissed krysal’s stale review October 25, 2022 15:34

Testing the dismiss review feature, will receive more feedback later today

@AetherUnbound AetherUnbound changed the title Refactor Rawpixel to use ProviderDataIngester Refactor Rawpixel to use ProviderDataIngester Oct 25, 2022
if not foreign_url:
@staticmethod
def _get_category(metadata: dict) -> str | None:
keywords = set(metadata.get("popular_keywords", []))
Contributor

Nice to have more categorized media!
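
For context, a keyword-based category heuristic of the kind started in the snippet above could be sketched as follows. Only the `popular_keywords` lookup comes from the diff; the category names, the trigger keywords, and the order in which they are checked are hypothetical.

```python
from typing import Optional

# Hypothetical mapping, checked in order; the PR's real rules may differ.
CATEGORY_KEYWORDS = [
    ("illustration", {"illustration", "vector", "clipart", "sticker"}),
    ("photograph", {"photo", "photography"}),
]


def get_category(metadata: dict) -> Optional[str]:
    """Return the first category whose trigger keywords overlap the record's keywords."""
    keywords = set(metadata.get("popular_keywords", []))
    for category, triggers in CATEGORY_KEYWORDS:
        if keywords & triggers:
            return category
    # No keyword matched, so leave the category unset rather than guess.
    return None


category = get_category({"popular_keywords": ["vector", "flower"]})
# category == "illustration"
```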

Contributor

@obulat obulat left a comment

Nice that we have more metadata for this provider than we usually do!

Contributor

@stacimc stacimc left a comment

This looks great :) Above and beyond on the title/description work, I really like how you've handled it. Super excited for this one!

Member

@krysal krysal left a comment

Awesome cleaning work, looks great! I just left some minor non-blocking comments.

Comment on lines +168 to +169
# Unescape HTMl sequences
text = html.unescape(text)
Member

This could be quite useful for other providers as well! (I think Jamendo is another one returning escaped text?)
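
If this were pulled out for reuse, a shared helper might be as small as the sketch below. `unescape_text` is a hypothetical name, not an existing function in the catalog; `html.unescape` itself is the standard-library call used in the commented lines.

```python
import html


def unescape_text(text: str) -> str:
    """Convert HTML character references such as ``&amp;`` back into characters."""
    return html.unescape(text)


cleaned = unescape_text("Butterfly &amp; moth painting")
# cleaned == "Butterfly & moth painting"
```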

from common.licenses import get_license_info
from airflow.models import Variable
from common import constants
from common.licenses import NO_LICENSE_FOUND, get_license_info
Member

Pycharm complains here:

Cannot find reference 'NO_LICENSE_FOUND' in '__init__.py'

Interesting that it still works 🤔

Development

Successfully merging this pull request may close these issues.

  • RawPixel Improvements
  • Refactor rawpixel to use ProviderDataIngester
  • RawPixel does not process any data
5 participants