Some images have duplicate incorrectly decoded unicode tags #1303

Open
obulat opened this issue Jan 9, 2023 · 4 comments
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@obulat
Contributor

obulat commented Jan 9, 2023

Description

Some media items that were ingested a long time ago and have non-ASCII characters in their tags have duplicate tags: one with the correct UTF-8 character and one with an incorrectly escaped sequence.

Reproduction

  1. Go to https://api.openverse.engineering/v1/images/ab5150da-5d83-47ec-ad66-bf08dcfef78f/ (or https://search-production.openverse.engineering/image/ab5150da-5d83-47ec-ad66-bf08dcfef78f for the frontend view)
  2. Look at the tags list.
  3. See error: many tags are unreadable, and many of them are duplicated. For example, there is "arapça", with a ç (c with a cedilla), and also "arapu00e7a", where the ç was replaced with the incorrectly escaped sequence u00e7 (the Unicode code point for this letter, without the leading \ escape character).
  4. This is the way they are saved in the catalog:
    {"name": "arapça", "provider": "flickr"}, {"name": "arapu00e7a", "provider": "flickr"},
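The relationship between the two saved tags can be sketched in Python. This is only an illustration, not code from the catalog: the `BARE_ESCAPE` pattern and the `restore_bare_escapes` helper are hypothetical names, and the sketch assumes every `u` followed by four hex digits is a `\uXXXX` escape that lost its backslash.

```python
import re

# Assumption: a "u" followed by exactly four hex digits is a \uXXXX escape
# whose backslash was dropped. This is a heuristic, not a reliable rule.
BARE_ESCAPE = re.compile(r"u([0-9a-fA-F]{4})")


def restore_bare_escapes(name: str) -> str:
    """Replace each bare uXXXX sequence with the character it encodes."""
    return BARE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


tags = [
    {"name": "arapça", "provider": "flickr"},
    {"name": "arapu00e7a", "provider": "flickr"},
]
# Both names collapse to the same string once the bare escape is restored.
restored = [restore_bare_escapes(t["name"]) for t in tags]
```

The heuristic cannot tell a mangled escape from a legitimate tag: a name such as "menu2024" also matches the pattern and would be corrupted, so a fix along these lines would need extra safeguards.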

Screenshots

Tags displayed on the frontend:
Screenshot 2023-01-09 at 12 13 56 PM

Additional context

I think we had the same problem for other fields such as title and description, but most of those were fixed when the items were re-ingested. When we upsert tags, we add every tag that differs from the ones already saved. Since the correctly decoded tag differs from the mangled one, both were saved.
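A minimal sketch of why the upsert keeps both tags, assuming tags are merged by exact comparison. The `merge_tags` function is a hypothetical stand-in written for illustration; it is not the catalog's actual upsert logic.

```python
def merge_tags(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Append every incoming tag that is not already stored (exact match)."""
    merged = list(existing)
    for tag in incoming:
        if tag not in merged:
            merged.append(tag)
    return merged


# The mangled tag is already stored; re-ingestion brings the correct one.
stored = [{"name": "arapu00e7a", "provider": "flickr"}]
reingested = [{"name": "arapça", "provider": "flickr"}]

# The names differ as strings, so the merge keeps both entries.
result = merge_tags(stored, reingested)
```

With exact matching, nothing signals that the two names refer to the same word, so the stored list only ever grows.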

This item has a non-mangled title and mangled and non-mangled tags, which suggests that the titles were fixed, and the tags were simply added to:
https://api.openverse.engineering/v1/images/829eb0a7-3ce8-44ca-8194-4a78757a88aa/

There is also an over-corrected variant of the unicode decoding error: instead of the backslash before u being removed, it is escaped by another backslash, so the tag is saved as arap\\u00e7a rather than arapu00e7a.
On the frontend, we compensate for this problem for title, creator and tag name in decode-string: https://github.com/WordPress/openverse-frontend/blob/26fb744449cbe4c25b895c75fad57ab2646b1737/src/utils/decode-data.ts
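For comparison, the over-escaped form is easier to handle than the bare one, because the backslash is still present. The following Python sketch is an assumption-laden illustration and not a port of the frontend utility linked above; `ESCAPED` and `decode_escaped` are hypothetical names.

```python
import re

# One or more literal backslashes followed by uXXXX (four hex digits).
ESCAPED = re.compile(r"\\+u([0-9a-fA-F]{4})")


def decode_escaped(name: str) -> str:
    """Decode \\uXXXX sequences that kept one or more backslashes."""
    return ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), name)


single = decode_escaped(r"arap\u00e7a")   # one literal backslash
double = decode_escaped(r"arap\\u00e7a")  # over-escaped, two backslashes
```

Because the pattern requires at least one backslash, it leaves the bare `arapu00e7a` form untouched, which is the case that needs inference.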

@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Jan 9, 2023
@obulat obulat added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Feb 23, 2023
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@AetherUnbound
Collaborator

We may be able to do some analysis on the tags to determine the provider or range of dates this is limited to!
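One rough way to start such an analysis is to count suspicious tag names per provider. This sketch is hypothetical: the `SUSPECT` pattern, the `suspect_counts` helper, and the sample rows (including `example_provider`) are made up for illustration, and the pattern will also flag legitimate names that happen to contain `u` plus four hex digits.

```python
import re
from collections import Counter

# Zero or more backslashes, then "u" and four hex digits: covers both the
# bare (u00e7) and the over-escaped (\\u00e7) forms.
SUSPECT = re.compile(r"\\*u[0-9a-fA-F]{4}")


def suspect_counts(tags: list[dict]) -> Counter:
    """Count tags per provider whose name looks like a mangled escape."""
    return Counter(t["provider"] for t in tags if SUSPECT.search(t["name"]))


# Made-up sample rows; real input would come from the catalog.
sample = [
    {"name": "arapça", "provider": "flickr"},
    {"name": "arapu00e7a", "provider": "flickr"},
    {"name": "sunset", "provider": "example_provider"},
]
counts = suspect_counts(sample)
```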

@sarayourfriend
Collaborator

This will be solved by #4452. Any way of addressing this issue runs into the exact same problem of "inferring" unescaped unicode that exists for that issue. I do not believe there is anything unique to do here aside from continuing the work on #4452.

@sarayourfriend
Collaborator

@AetherUnbound and @obulat if y'all agree with that (please let me know if not), feel free to close this as won't do; I didn't want to close it without your explicit input in case I have missed something that makes this unrelated to the issue I linked.

@AetherUnbound
Collaborator

I have seen this and will respond to this discussion when I have time!

Projects
Status: 📅 To Do
Development

No branches or pull requests

3 participants