
Implementation plan: Catalog Data Cleaning #3848

Merged
merged 4 commits into from
Mar 12, 2024

Conversation

krysal
Member

@krysal krysal commented Feb 29, 2024

Fixes

Related to #430 by @obulat

Description

Since the project proposal is clear about what we want to achieve in this case, and a past attempt raised more doubts about the technical details than about the proposal itself, I am adding an implementation plan directly, with the most straightforward path identified.

This proposal also represents work that @obulat pushed last year; it's partly possible thanks to her effort and @stacimc's precedent in the batched_update DAG. Kudos to both!


This discussion follows the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.

Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarize yourself with and follow the process.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is March 12, 2024.


@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon 📄 aspect: text Concerns the textual material in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: catalog Related to the catalog and Airflow DAGs skip-changelog 🧭 project: implementation plan An implementation plan for a project labels Feb 29, 2024

Full-stack documentation: https://docs.openverse.org/_preview/3848

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

@krysal krysal marked this pull request as ready for review February 29, 2024 21:47
@krysal krysal requested a review from a team as a code owner February 29, 2024 21:47
Collaborator

@AetherUnbound AetherUnbound left a comment


Thanks for drafting this @krysal! I have a number of editorial suggestions and questions as part of the clarification round.

Comment on lines 135 to 152
If confirmed the time is reduced to zero, optionally the cleaning steps can be
removed, or leave them in case we want to perform a similar cleaning effort
later.
Collaborator


I think we can maybe leave a cleaning "shim" in for the future, but the current cleaning logic that gets called should be removed.

Member Author


How would a cleaning shim look, in your opinion?

Collaborator


I was thinking we could perhaps leave this section in, even if clean_image_data is a noop:

if downstream_table == "image":
    # Step 5: Clean the data
    log.info("Cleaning data...")
    clean_image_data(downstream_table)
    log.info("Cleaning completed!")
    slack.status(
        downstream_table, "Data cleaning complete | _Next: Elasticsearch reindex_"
    )

Now that I say it though, it seems silly 😅 we probably don't need to keep that either!
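For illustration, a noop shim along those lines might look like the following. This is a hypothetical sketch only; the real `clean_image_data` in the ingestion server may have a different signature and return type.

```python
import logging

log = logging.getLogger(__name__)


def clean_image_data(table: str) -> int:
    """Hypothetical noop shim: all cleanup functions have been removed,
    so there is nothing left to clean.

    Keeping the call site intact means a future cleaning effort only needs
    to fill this function back in. Returns the number of rows cleaned (0).
    """
    log.info("No cleanup functions are configured for %s; skipping cleaning.", table)
    return 0
```

The trade-off discussed here is exactly whether keeping such an empty function is worth it, or whether the whole call site should go.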

Contributor


I kind of think we do need to have this shim, because I'm pretty sure we will find something to clean, at least in the short run. Would be nice to have some criteria to determine when to remove the shim. Maybe, when the second stage of the data normalization project is done?

With this shim in place, the functions that clean tags and URLs in cleanup.py would be removed, and this dictionary would have empty values:

_cleanup_config = {
    "tables": {
        "image": {
            "sources": {
                # Applies to all sources.
                "*": {
                    "fields": {
                        "tags": CleanupFunctions.cleanup_tags,
                        "url": CleanupFunctions.cleanup_url,
                        "creator_url": CleanupFunctions.cleanup_url,
                        "foreign_landing_url": CleanupFunctions.cleanup_url,
                    }
                }
            }
        }
    }
}

Member Author


I was debating between the two options, too.

Would be nice to have some criteria to determine when to remove the shim. Maybe, when the second stage of the data normalization project is done?

That's a great idea! I agree with it 💯

Contributor

@obulat obulat left a comment


Great plan, @krysal !

Could we also add some notes on how we keep track of the cleaned up data, and how we report on it? This will probably be useful for the person who runs the DAGs, and maybe the MSR runner?

@krysal krysal marked this pull request as draft March 4, 2024 15:20
@krysal krysal force-pushed the rfc/data-cleaning branch 3 times, most recently from 7c0d57b to 3d70e64 Compare March 5, 2024 19:28
@krysal
Member Author

krysal commented Mar 5, 2024

Could we also add some notes on how we keep track of the cleaned up data, and how we report on it? This will probably be useful for the person who runs the DAGs, and maybe the MSR runner?

@obulat Good note! I see you're already thinking about re-using the DAG outside of cleaning tasks 😄 I added that step to the DAG specification. Thanks for bringing it up and for the quick review!

@krysal krysal marked this pull request as ready for review March 5, 2024 22:09
@krysal
Member Author

krysal commented Mar 5, 2024

@AetherUnbound @obulat I made the corrections from your excellent suggestions. This is ready for another review 📑

@krysal krysal mentioned this pull request Mar 6, 2024
Collaborator

@AetherUnbound AetherUnbound left a comment


Just a couple more comments, no further input after this from me on the clarification round! 🙂

Contributor

@obulat obulat left a comment


I think we can go on to the next stage.


@krysal krysal force-pushed the rfc/data-cleaning branch 5 times, most recently from 0532dae to d17e708 Compare March 7, 2024 20:34
Co-authored-by: Madison Swain-Bowden <[email protected]>
Co-authored-by: Olga Bulat <[email protected]>

Fix and add links

Add suggested extra issue and adjust the Expected Outcomes

Include smart_open in the Tools & packages section

Apply editorial suggestions
@krysal
Member Author

krysal commented Mar 7, 2024

Document updated. I moved the discussion to the Decision round 🥁

Collaborator

@AetherUnbound AetherUnbound left a comment


No blocking objections, the plan looks good! 😄

@openverse-bot
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend[1] days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)[2].

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Contributor

@obulat obulat left a comment


Excited to get this project started!

Co-authored-by: Olga Bulat <[email protected]>
Co-authored-by: Madison Swain-Bowden <[email protected]>
@krysal
Member Author

krysal commented Mar 12, 2024

Thank you folks! ✨

@krysal krysal merged commit d7d4208 into main Mar 12, 2024
37 checks passed
@krysal krysal deleted the rfc/data-cleaning branch March 12, 2024 16:54
Projects
Status: Accepted