Remove and de-duplicate tags with leading/trailing whitespace #4199

Closed
AetherUnbound opened this issue Apr 24, 2024 · 6 comments · Fixed by #4429
Labels
🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@AetherUnbound
Collaborator

Description

Some records in our data contain duplicate tags where the duplicate differs from the original only by leading or trailing whitespace. Here's an example: https://api.openverse.engineering/v1/images/2d454032-0cc1-48a5-8f40-e9235f1a4f12/

"tags": [
        {
            "name": "abigfave",
            "accuracy": null
        },
        {
            "name": " abigfave",
            "accuracy": null
        },
        {
            "name": "artisanal",
            "accuracy": null
        },
        {
            "name": " artisanal",
            "accuracy": null
        },
        {
            "name": "color",
            "accuracy": null
        },
        {
            "name": " color",
            "accuracy": null
        },
        ...
    ],

This might need to be tackled in two steps, or at least with an operation that covers both cases:

  1. Remove leading/trailing whitespace from existing tags
  2. Deduplicate tags which might previously have been separate ones

We will also want to check, similar to #1566, that any new tags added always have extra whitespace stripped.
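
As a rough illustration of the two steps above, here is a minimal Python sketch that strips whitespace from each tag name and then drops tags that become identical as a result. The function name `trim_and_deduplicate` is hypothetical, and treating two tags as duplicates only when every field matches after trimming is an assumption, not something the issue specifies.

```python
import json
from typing import Any


def trim_and_deduplicate(tags: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Strip whitespace from tag names, then drop tags that became duplicates.

    Order is preserved and the first occurrence of each tag wins.
    Assumption: two tags are duplicates only if all fields match after trimming.
    """
    seen: set[str] = set()
    cleaned: list[dict[str, Any]] = []
    for tag in tags:
        # Step 1: remove leading/trailing whitespace from the name.
        trimmed = {**tag, "name": tag["name"].strip()}
        # Step 2: skip tags already seen in their trimmed form.
        key = json.dumps(trimmed, sort_keys=True)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(trimmed)
    return cleaned


# Sample data copied from the example record in the description.
tags = [
    {"name": "abigfave", "accuracy": None},
    {"name": " abigfave", "accuracy": None},
    {"name": "artisanal", "accuracy": None},
    {"name": " artisanal", "accuracy": None},
    {"name": "color", "accuracy": None},
    {"name": " color", "accuracy": None},
]

result = trim_and_deduplicate(tags)
print([tag["name"] for tag in result])  # ['abigfave', 'artisanal', 'color']
```

The same function could in principle be applied to incoming tags at ingestion time so that new records never carry the extra whitespace, though where that hook would live in the catalog is not covered here.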

Additional context

Related to #430

@AetherUnbound AetherUnbound added 🗄️ aspect: data Concerns the data in our catalog and/or databases 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Apr 24, 2024
@krysal krysal added this to the Data normalization milestone Apr 24, 2024
@sarayourfriend sarayourfriend self-assigned this May 30, 2024
@sarayourfriend
Collaborator

sarayourfriend commented May 30, 2024

@WordPress/openverse-catalog I'd like to take a shot at this issue. Am I correct to assume this should use the batched update DAG? And if so, I think I'd like to try it in two steps, as suggested, basically doing something like this:

  1. Batch update to strip leading/trailing whitespace
  2. Batch update using something like select distinct for each set of tags?

Is such a thing possible with the batched update DAG? Are there any potentially helpful examples of how we've used that recently I could work off of?

@krysal
Member

krysal commented May 30, 2024

@sarayourfriend You're correct. It's possible to do it with the batched_update DAG. The deletion of duplicates was resolved in #1566 (comment) and Postgres has string functions for trimming.

@AetherUnbound
Collaborator Author

If possible, it'd be best to combine both of those steps into a single batched update, so that we don't have to do two passes on the data! Might make for a tricky query, but then we only have to run it once 😄

@sarayourfriend
Collaborator

I guess select distinct trimming_function(tag.name), or something along those lines, would work?

Thanks for the input, y'all.
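
The single-pass idea discussed above (trim and de-duplicate in one query) can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for Postgres, and the flat `tag` table with `image_id` and `name` columns is hypothetical. In the API response the tags appear as a JSON array on each record, so a real query against the catalog would likely need to unnest that array first.

```python
import sqlite3

# In-memory SQLite stands in for Postgres here; the `tag` table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag (image_id TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO tag (image_id, name) VALUES (?, ?)",
    [
        ("img-1", "abigfave"),
        ("img-1", " abigfave"),
        ("img-1", "color"),
        ("img-1", "color "),
    ],
)

# SELECT DISTINCT over the trimmed name collapses variants that differ only
# by leading/trailing whitespace, so trimming and de-duplication happen together.
rows = conn.execute(
    "SELECT DISTINCT image_id, trim(name) AS trimmed_name "
    "FROM tag ORDER BY trimmed_name"
).fetchall()

names = [trimmed_name for _, trimmed_name in rows]
print(names)  # ['abigfave', 'color']
conn.close()
```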

@krysal
Member

krysal commented Jun 14, 2024

Reopening for the pending execution of the trim_and_deduplicate_tags DAG.

@krysal krysal reopened this Jun 14, 2024
@krysal
Member

krysal commented Jun 28, 2024

Solved in #4557.

@krysal krysal closed this as completed Jun 28, 2024
3 participants