
Save cleaned data of Ingestion Server to AWS S3 #4163

Merged: 10 commits into main on May 9, 2024
Conversation


@krysal krysal commented Apr 19, 2024

Fixes

Part of #3912 by @krysal

Description

This PR changes the Ingestion Server's cleanup step to save the cleaned data to an S3 bucket once it reaches a certain number of rows. I chose 10M because, among the previously saved files, the one with the most lines has just over 9M, so we know that quantity is manageable in memory while producing fewer file parts.

```
wc -l creator_url.tsv && wc -l foreign_landing_url.tsv && wc -l url.tsv
 9035462 creator_url.tsv
 8963194 foreign_landing_url.tsv
   57390 url.tsv
```

S3 does not support appending to existing objects; you can only replace them. The file for each column to modify is therefore split into chunks, which are uploaded as they are generated.
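The append-free chunking described above can be sketched roughly like this (a minimal sketch; the function, key pattern, and `upload_object` callback are illustrative stand-ins, not the PR's actual code):

```python
from typing import Callable

def upload_in_chunks(
    rows: list[str],
    field: str,
    upload_object: Callable[[str, str], None],
    max_rows: int = 10_000_000,
) -> int:
    """Buffer cleaned rows and upload each full buffer as its own S3
    object, since S3 objects cannot be appended to in place.

    `upload_object(key, body)` stands in for a real client call such as
    boto3's put_object; all names here are hypothetical.
    Returns the number of parts uploaded.
    """
    buffer: list[str] = []
    part = 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_rows:
            part += 1
            # Each full buffer becomes a separate object, e.g. url_001.tsv
            upload_object(f"{field}_{part:03d}.tsv", "\n".join(buffer) + "\n")
            buffer = []
    if buffer:  # flush the final, partially filled chunk
        part += 1
        upload_object(f"{field}_{part:03d}.tsv", "\n".join(buffer) + "\n")
    return part
```

With a threshold of 10M rows, most fields above would produce a single part each.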

This is the first part, uploading the files. Additionally, the server instance will need permission on the bucket, as @sarayourfriend pointed out.

Testing Instructions

Using Airflow UI

To test this locally with MinIO, create the openverse-catalog bucket if it's not automatically created at start. Go to http://localhost:5011/ (username: test_user & password: test_secret).

```
just api/up && just catalog/up
```

Then make some rows of the image table in the catalog dirty by removing the protocol from one or several of the URLs (url, creator_url, or foreign_landing_url), or by modifying the tags for low accuracy. Run the data refresh from the Airflow UI, wait for it to finish, and check the bucket in MinIO.

http://localhost:5011/browser/openverse-catalog/

The data refresh should continue even if the upload to S3 fails for whatever reason. Try shutting down the S3 container and clearing the ingest_upstream step in the DAG to confirm it continues despite the upload failure.

Set the CLEANUP_BUFFER_SIZE environment variable to a low number, like 5 (having more rows to clean), to test uploading several files for the same field.
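As a rough sketch of how such a threshold might be wired up (the `CLEANUP_BUFFER_SIZE` variable name comes from the PR; the helper names and default value are assumptions for illustration):

```python
import os

def get_buffer_size(default: int = 10_000_000) -> int:
    # Read the row-count threshold from the environment, falling back
    # to an illustrative default; the real code may differ.
    return int(os.getenv("CLEANUP_BUFFER_SIZE", default))

def should_flush(buffered_rows: list) -> bool:
    # Upload and reset the buffer once it holds this many cleaned rows;
    # a low threshold (e.g. 5) forces several files per field.
    return len(buffered_rows) >= get_buffer_size()
```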

Using init script

Alternatively, follow instructions provided by @AetherUnbound in this comment.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@krysal krysal requested review from a team as code owners April 19, 2024 01:46
@krysal krysal requested review from fcoveram, AetherUnbound and obulat and removed request for fcoveram April 19, 2024 01:46
@github-actions github-actions bot added 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Apr 19, 2024
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Apr 19, 2024
Contributor

@obulat obulat left a comment


I ran the data refresh locally and saw the saved data in the MinIO bucket. I'll look through the code again later, but wanted to add a comment here now.

When running locally, I got a "bucket does not exist" error. Do you think it's better to reuse an existing bucket, or to add the new one to the MinIO docker template:

```
BUCKETS_TO_CREATE=openverse-storage,openverse-airflow-logs
```

@AetherUnbound (Collaborator)

> When running locally, I got the bucket does not exist error. Do you think it's better to reuse an existing bucket, or to add the new one to the minio docker template:
> `BUCKETS_TO_CREATE=openverse-storage,openverse-airflow-logs`

I checked in S3 and we've used s3://${OPENVERSE_BUCKET}/shared/data-refresh-cleaned-data/ for previous versions of these files, so I think that would be okay to use!

> upload the files to an S3 bucket once they reach a certain size (chose 10 MB kind of arbitrarily, to make them manageable)

Having checked S3, it looks like the creator_url.tsv and foreign_landing_url.tsv (the largest ones) are 700-800MB (IIRC tags.tsv was too large to even save and upload). We have 16GB of storage available on the primary ingestion server and 8GB available on each indexer worker. Given that a 10MB limit would produce 80 files for creator_url.tsv for example, I think setting the limit to something like 1GB might make more sense!
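A quick back-of-envelope check of the chunk-count estimate above (sizes are the approximate figures quoted, not measured values):

```python
# An ~800 MB file with a 10 MB chunk limit yields ~80 parts, which is
# what motivates raising the limit toward 1 GB given the 16 GB of
# storage on the primary ingestion server.
file_size_mb = 800
chunk_limit_mb = 10
parts = -(-file_size_mb // chunk_limit_mb)  # ceiling division
print(parts)  # 80
```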

@AetherUnbound AetherUnbound left a comment

I'm excited about this! There are a few things that will need to be adjusted, in addition to my comments above about file size before upload.

Additionally, the testing instructions appear to no longer be accurate, since that initialization based on the environment appears to be a part of the code now.

Lastly, I adjusted the default bucket name and removed some HTTP schemes from the sample data as I noted below. I was able to get the upload to work, but the CSV which was uploaded had multiple copies of each identifier (the screenshot below has the rows sorted just to show the duplications, they were not in that order in the originally produced file). It appears that some of the logic is causing duplicate file writing.
[screenshot: rows sorted to show the duplicated identifiers]

```diff
@@ -364,7 +413,7 @@ def clean_image_data(table):
     log.info(f"Starting {len(jobs)} cleaning jobs")

     for result in pool.starmap(_clean_data_worker, jobs):
-        batch_cleaned_counts = save_cleaned_data(result)
+        batch_cleaned_counts = data_uploader.save(result)
```

Just noting this because I had to talk myself through it - I was worried that the data uploader step was happening in each process as part of the multiprocessing, but it looks like each result that comes out of pool.starmap is processed serially, so we don't need to worry about multiple processes stepping on each other.
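A minimal illustration of that behavior, assuming made-up worker and job names (`pool.starmap` gathers all worker results, then the parent iterates over them serially):

```python
from multiprocessing import Pool

def _mul(x, y):
    """Stand-in for `_clean_data_worker`; purely illustrative."""
    return x * y

def run_jobs(jobs):
    with Pool(2) as pool:
        # pool.starmap blocks until every worker has finished, then
        # returns the results in input order. This loop therefore runs
        # serially in the parent process, so anything called inside it
        # (e.g. an uploader) is never invoked from two processes at once.
        collected = []
        for result in pool.starmap(_mul, jobs):
            collected.append(result)  # serial, parent-process side
    return collected

if __name__ == "__main__":
    print(run_jobs([(1, 2), (3, 4), (5, 6)]))  # [2, 12, 30]
```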

@krysal krysal marked this pull request as draft April 20, 2024 16:23
@krysal krysal force-pushed the ing_server_save_to_s3 branch 3 times, most recently from 7e6c8e1 to a9d9894 Compare April 21, 2024 03:28

krysal commented Apr 21, 2024

@obulat @AetherUnbound Thanks for your valuable suggestions. I was too quick to mark this as ready before but I'm still glad I did!

@AetherUnbound Regarding the file size, I set it to 850 MB, so only one file is created for each field except tags. 1 GB is on the big side for downloading without a high-speed connection. About the duplicated rows in the file, I'm not sure what could have happened there; I don't see those results in my tests 🤔 Could you try again? I applied the rest of the suggested changes.

@krysal krysal marked this pull request as ready for review April 21, 2024 03:57

@obulat obulat left a comment


This works great now, @krysal !

I left several comments for improvements (non-blocking).

ingestion_server/ingestion_server/cleanup.py Outdated Show resolved Hide resolved
ingestion_server/ingestion_server/cleanup.py Outdated Show resolved Hide resolved
ingestion_server/ingestion_server/cleanup.py Show resolved Hide resolved
ingestion_server/ingestion_server/cleanup.py Outdated Show resolved Hide resolved
@krysal krysal force-pushed the ing_server_save_to_s3 branch 2 times, most recently from 92b4eec to 98d3334 Compare April 22, 2024 21:04
@AetherUnbound AetherUnbound left a comment


This is unfortunately still producing duplicate rows for me locally 😕 here's everything I did to get this working:

  1. Add `OPENVERSE_BUCKET=openverse-storage` because my existing MinIO environment file didn't have the new bucket, and the new bucket defaults to `openverse-catalog`
  2. Remove `https://` from the `foreign_landing_url` for the first 10 records of `sample_image.csv`
  3. Run `just down -v`
  4. Run `just c` to start the catalog (and get S3 running)
  5. Run `just api/init` (this performs the image data refresh for us)
  6. Download the produced TSV from localhost:5011

Although the logs say 10 records were written for foreign_landing_url, the CSV I download has 40 records (4 full copies of each 10 rows).
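For anyone reproducing this, one quick way to spot exact duplicate rows in a downloaded TSV is a sort/uniq pipeline (shown here on inline sample data; substitute the real file for the `printf`):

```shell
# `uniq -d` prints only repeated lines, so empty output means the file
# has no exact duplicate rows.
printf 'id1\tfoo\nid2\tbar\nid1\tfoo\n' | sort | uniq -d
```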


krysal commented May 3, 2024

@sarayourfriend That is very helpful, thank you!

If there is a new bucket, please also add the new bucket in Terraform and ensure we control and own the bucket. There's a trend of scanning GitHub for bucket name references and hijacking ones that aren't reserved.

I'm using the existing bucket so there shouldn't be surprises with it :) Good to know that either way!


@AetherUnbound I resorted to simplifying the file management, given that the previous approach was producing strange results in integration tests. This is ready for a re-review now.

@krysal krysal marked this pull request as ready for review May 3, 2024 02:12
@krysal krysal requested a review from AetherUnbound May 3, 2024 02:12
@AetherUnbound AetherUnbound left a comment


This looks good, I was able to test it with the ingestion altering method I had used in the past. The output had no duplicate rows this time! I have one non-blocking suggestion.

@krysal krysal removed 🧱 stack: api Related to the Django API 🧱 stack: catalog Related to the catalog and Airflow DAGs labels May 8, 2024

krysal commented May 8, 2024

It turns out the quotes are needed for tags, or the file creation will raise errors. I undid that change, and locally the tests passed, so I need to investigate why they aren't working in CI.


krysal commented May 9, 2024

Okay, so I just moved the instructions regarding whether the file upload is skipped or not, so there is not much change, which is interesting in relation to tests... Anyway, I'm merging this now!

@krysal krysal merged commit 43d4d09 into main May 9, 2024
44 checks passed
@krysal krysal deleted the ing_server_save_to_s3 branch May 9, 2024 19:51
krysal added a commit that referenced this pull request Jun 5, 2024
krysal added a commit that referenced this pull request Jun 7, 2024
* Revert "Save cleaned data of Ingestion Server to AWS S3 (#4163)"

This reverts commit 43d4d09.

* Update default environment in test to omit warning

* Use perf_counter and limit decimals to three digits in time