Remove tag filtering steps during ingestion #4542
Labels
💻 aspect: code
Concerns the software code in the repository
🗄️ aspect: data
Concerns the data in our catalog and/or databases
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: catalog
Related to the catalog and Airflow DAGs
⛔ status: blocked
Blocked & therefore, not ready for work
Description
Blocked by #4541 and, more broadly, by #3925.
Per #4465 and the clarity gleaned in other places within the project, we are moving towards the catalog database serving as a "data warehouse". Operationally, this means we intend to store as much information as we can in it and filter out low-quality or inaccurate data during the data refresh process. As such, once the ingestion server is removed and we have data filtering in place, we should remove the steps in provider ingestion that remove denylisted tags.
Specifically, we can remove the step that removes tags (the _tag_denylisted function) in openverse/catalog/dags/common/storage/media.py, lines 296 to 300 at commit 47fe5df.
The filtering of denylisted tags will happen entirely in the data refresh process instead.
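As a rough illustration of the intended split, the sketch below stores every tag at ingestion and applies the denylist only when preparing data for the refresh. It is not the actual Openverse code: the function names, the denylist values, and the tag dictionary shape are all assumptions made for this example.

```python
# Hypothetical denylist values, for illustration only.
TAG_DENYLIST = {"no person", "squareformat"}
TAG_CONTAINS_DENYLIST = {"flickriosapp"}


def tag_denylisted(tag_name: str) -> bool:
    """Return True if the tag matches an exact or substring denylist entry."""
    name = tag_name.strip().lower()
    if name in TAG_DENYLIST:
        return True
    return any(fragment in name for fragment in TAG_CONTAINS_DENYLIST)


def enrich_tags_at_ingestion(raw_tags: list[str], provider: str) -> list[dict]:
    """Ingestion after this change (assumed shape): keep every tag, only format it."""
    return [{"name": tag, "provider": provider} for tag in raw_tags]


def filter_tags_at_refresh(tags: list[dict]) -> list[dict]:
    """Data refresh (assumed shape): drop denylisted tags before serving."""
    return [tag for tag in tags if not tag_denylisted(tag["name"])]


stored = enrich_tags_at_ingestion(
    ["cat", "No Person", "uploaded:by=flickriosapp"], "flickr"
)
served = filter_tags_at_refresh(stored)
```

With this arrangement the catalog keeps all three tags, while only the non-denylisted one survives the refresh-time filter, so the denylist can change later without re-ingesting provider data.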
Additional context
See the discussion that prompted this in #4464.