
Data normalization #430

Closed
obulat opened this issue Feb 18, 2023 · 22 comments
Assignees
Labels
🧭 project: thread An issue used to track a project and its progress

Comments

@obulat
Contributor

obulat commented Feb 18, 2023

Start Date    Project Lead    Actual Ship Date
2023-09-01    @krysal         TBD

Description

This project aims to persist the cleaned data produced by the Data Refresh process and to remove those cleaning steps from the process itself to save time.

Documents

Milestone / Issues

Prior Art


Future work - Phase Two

Prerequisites

@obulat obulat added the 🧭 project: thread An issue used to track a project and its progress label Feb 18, 2023
@zackkrida

This comment was marked as outdated.

@krysal
Member

krysal commented Mar 6, 2024

The implementation plan is up for discussion at #3848. Writing it helped me clarify where we were starting from and define a scope for the project, while indicating what could be done in a second phase, as suggested in the initial post. I hope others find it helpful too.

After its approval, the milestone should be complemented with some issues:

  • Modify Ingestion Server to upload TSV files to AWS S3 and save fixed tags
  • Check the Ingestion Server's cleanup step times after running the batched update from files DAG.

@krysal
Member

krysal commented Apr 3, 2024

Since the last update, the implementation plan has been approved, and work has started on fixing duplicated tags. This has been a bit delayed due to differing solution proposals, but once the modification to the catalog is settled (#3926), we can delete the current duplicates in the upstream DB (#1566) and continue with the rest of the milestone (#23).

@openverse-bot
Collaborator

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@krysal
Member

krysal commented Apr 24, 2024

Done

In progress

Added

@openverse-bot
Collaborator

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@krysal
Member

krysal commented May 13, 2024

Done

In progress

Previously merged PRs should solve #3912. I'm waiting for a run of the image data refresh to confirm we save and have the files; that run is currently blocked on #4315, but it should be resolved between today and tomorrow. I'm hoping the process resumes soon so we can have the files this week.

To do

In the meantime, I can work on the next step:

@krysal
Member

krysal commented May 24, 2024

An image data refresh in production couldn't finish with the changes from #4163, so we added more logging (#4358), rolled back the production ingestion server, and decided to perform the cleanup process in the dev environment. An attempt with a data refresh limit resurrected an old problem (#736, #4381), which we already have a fix for (#4382). After merging #4382 on Monday, we must deploy the dev ingestion server and trigger the image data refresh to continue debugging.

The add_license_url DAG also ran into timeout issues and was refactored. The PR is pending review:

@openverse-bot
Collaborator

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@krysal
Member

krysal commented Jun 14, 2024

Done

In progress

To do

@sarayourfriend
Collaborator

> I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful.

For what it's worth @krysal, you can definitely test that locally, rather than needing to use a live environment. We use the extension already for iNaturalist, so there are examples in the codebase of how to do it (including with support for local files for testing and development). Check this one out, for example:

SELECT aws_s3.table_import_from_s3('inaturalist.observations',
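
For readers following along, a complete call to that function generally takes the shape below. This is only a sketch based on the documented aws_s3.table_import_from_s3 and aws_commons.create_s3_uri signatures; the column list, COPY options, bucket, key, and region are illustrative placeholders, not the actual values from the Openverse iNaturalist DAG.

-- Sketch: import a CSV object from S3 into an existing table via the aws_s3 extension.
-- Every name below is a placeholder; substitute the values for your own environment.
SELECT aws_s3.table_import_from_s3(
    'inaturalist.observations',          -- target table
    'observation_uuid, observer_id',     -- column list (illustrative)
    '(FORMAT csv, HEADER true)',         -- COPY-style options
    aws_commons.create_s3_uri(
        'example-bucket',                -- S3 bucket (placeholder)
        'inaturalist/observations.csv',  -- object key (placeholder)
        'us-east-1'                      -- region (placeholder)
    )
);

Passing an empty string for the column list imports into all of the table's columns in order, which is often the simplest choice when the file layout matches the table definition.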

@krysal
Member

krysal commented Jun 14, 2024

@sarayourfriend I did not think of iNaturalist as a reference here, and the relationship had not been mentioned until now. That's good to know! I thought of testing in the staging DB first because, from the documentation, I understood the extension was specifically for an Amazon RDS Postgres instance, so it's great to know it works for a local Postgres instance as well. Thank you!

@krysal
Member

krysal commented Jun 28, 2024

Done

In progress

@krysal
Member

krysal commented Jul 12, 2024

This week, maintainers were off from Openverse work, so the tasks will resume next week.

@krysal
Member

krysal commented Jul 26, 2024

The catalog_cleaner DAG ran successfully for the programmed fields, so it clears the way for #1411 and #700 next week, after the data refresh, provided the process doesn't produce more files with changes :)

Besides that, what remains to be done is #4452.

@openverse-bot
Collaborator

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@krysal
Member

krysal commented Aug 16, 2024

Done

To Do

@openverse-bot
Collaborator

Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.

@zackkrida
Member

@WordPress/openverse-maintainers last week @krysal and I discussed the idea of sunsetting this project, with #4452 extracted out as a standalone issue to be worked on later this year.

In hindsight, this project was defined with two goals that were a bit less clear than we initially thought:

  • The catalog database (upstream) contains the cleaned data outputs of the
    current Ingestion Server's cleaning steps
  • The image Data Refresh process is simplified by significantly reducing cleaning times.

The first goal, in particular, is very open to interpretation and changes over time. Our data will never be perfect; does that mean we need to incorporate every new cleanup action into the scope of this work? That seems untenable.

The goal to remove the cleanup step from the data refresh has been met; I propose we close this project and move on.

If anyone objects: please share. Otherwise, I'll ask @krysal to move the project to success and close this issue next week.

@sarayourfriend
Collaborator

I agree. The first goal actually is clear (in my reading), in that it specifies the "outputs of the current Ingestion Server's cleaning steps". I think, rather, we've let the scope get away from that boundary of the ingestion server cleaning steps, into a total "data cleaning" project.

@AetherUnbound
Collaborator

Definitely okay closing this out based on that - we can prioritize the rest of the data cleaning issues that come up alongside other work!

@zackkrida
Member

This project has been closed and moved to success.
