
Remove single quotes in values of Ingestion Server's TSV files #4471

Merged · 4 commits into main · Jun 26, 2024

Conversation

@krysal (Member) commented Jun 11, 2024

Fixes

Related to #3912 by @krysal

Description

This PR fixes several issues on the Ingestion Server and prepares it to upload the files.

  • Saves files of cleaned data to a specific local temporary directory before they are sent to S3 (a previous suggestion from @AetherUnbound)
  • Recreates the temp directory before cleaning to avoid mixing values between different data refresh processes
  • Removes single quotes from cleaned values; otherwise the files contain URLs wrapped in ', which is unnecessary and makes it quite complicated to load the data into a database table later (a sketch follows this list)
  • Adds stocksnap to TLS_CACHE manually (for some reason it was being registered as not supporting TLS, which is not true)
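
As a rough illustration of the first three points, a minimal sketch of the write step (the real logic lives in ingestion_server/ingestion_server/cleanup.py; the directory path matches the testing instructions below, but the helper name and exact quoting behavior here are assumptions):

import csv
import shutil
from pathlib import Path

# Path assumed from the testing instructions below.
CLEANED_DATA_DIR = Path("/tmp/cleaned_data")

def save_cleaned_data(cleaned_values: dict[str, list[tuple[str, str]]]) -> None:
    """Write one <field>.tsv of identifier<TAB>value rows, unquoted."""
    # Recreate the directory so values from earlier data refreshes don't mix in.
    shutil.rmtree(CLEANED_DATA_DIR, ignore_errors=True)
    CLEANED_DATA_DIR.mkdir(parents=True)
    for field, rows in cleaned_values.items():
        with open(CLEANED_DATA_DIR / f"{field}.tsv", "w", newline="") as f:
            writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for identifier, value in rows:
                # Strip the single quotes the cleaning step adds for SQL use,
                # so the file holds bare URLs.
                writer.writerow((identifier, value.strip("'")))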

Testing Instructions

  1. Spin up the services:

just a && just c

  2. Make some rows of the image table in the catalog "dirty" by removing the protocol from one or more of the URL fields (url, creator_url, or foreign_landing_url).

  3. Run an image data refresh:

# Optionally, delete the image index first to avoid an "index already exists" error
just docker/es/delete-index image-init

just ingestion_server/ingest-upstream "image" "init"

  4. Check that the files are in the container and inspect their content:

just exec ingestion_server bash
ls /tmp/cleaned_data/
cat /tmp/cleaned_data/url.tsv

Verify that the content of each file is the identifier and values without quotes, e.g.:

0041b4ec-f55e-4cd0-a491-84f41be72232	https://cdn.stocksnap.io/img-thumbs/960w/YUJXHHKSMI.jpg
00252271-e1dc-4faf-b840-7fcc386f2529	https://cdn.stocksnap.io/img-thumbs/960w/7HHEXNL6AQ.jpg

Meanwhile, in our bucket, you can find the manually uploaded files containing rows like the following:

4e406052-5d90-4c30-87b6-6f1b6ba533e7	'http://musee-mccord.qc.ca/ObjView/MP-1986.85.4.jpg'
3eba813d-6e24-4aa3-9fdb-9242d4bdfc4e	'https://flora-on.pt/Carlina-vulgaris_ori_v6Wn.jpg'

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog PRs) or the media properties generator (just catalog/generate-docs media-props for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: ingestion server Related to the ingestion/data refresh server labels Jun 11, 2024
@krysal krysal changed the title from "Upload" to "Upload Ingestion Server's TSV files to AWS S3 and fix several issues" Jun 11, 2024
@krysal krysal force-pushed the feat/ing_server_s3_upload branch 3 times, most recently from d0caa71 to 6e52ba2 on June 13, 2024 16:28
@krysal krysal marked this pull request as ready for review June 13, 2024 16:37
@krysal krysal requested a review from a team as a code owner June 13, 2024 16:37
@krysal krysal requested review from dhruvkb and stacimc June 13, 2024 16:37
@krysal krysal mentioned this pull request Jun 14, 2024
@stacimc (Contributor) left a comment

I'm able to see the rows I updated getting cleaned, and all the expected logs in the ingestion server locally -- but I don't see the actual file in MinIO 🤔 No errors of any kind. Any ideas what could be going wrong?

continue
update_field_expressions.append(f"{field} = '{clean_value}'")
Contributor:

Why was this moved? I think this was easier to parse when the comment was before the continue, as well.

Member Author (@krysal):

The URL fields had single quotes added twice. As you can see above in the file (lines 117 and 119), I removed the quotes from the cleaned value so they won't appear in the file either (single quotes as the quoting character are problematic for later loading in the DB).

Collaborator (@sarayourfriend):

I'm also confused about this 😕

Where were the duplicate quotes coming from, exactly? And why does the clean_value for tags not need to be quoted but everything else does?

Maybe avoiding the continue would help the clarity of this block:

for field, clean_value in cleaned_data.items():
    if field != "tags":
        # Save cleaned values for later
        # (except for tags, which take up too much space)
        cleaned_values[field].append((identifier, clean_value))

    update_field_expressions.append(f"{field} = {clean_value}")

Never mind that I don't understand where the difference in the format of the string added to update_field_expressions comes from; this version is a lot easier for me to understand.

Member Author (@krysal):

Where were the duplicate quotes coming from, exactly?

The cleaned values have quotes added in the return of the cleaning function:

if tls_supported:
    return f"'https://{url}'"
else:
    return f"'http://{url}'"

And why does the clean_value for tags not need to be quoted but everything else does?

The psycopg2.extras.Json function for tags already adapts a Python object to the SQL json data type, so they don't need extra quotes wrapping.
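
For reference, a minimal standalone example of that adaptation (assuming psycopg2 is installed; the tag value is made up):

from psycopg2.extras import Json

tags = [{"name": "nature", "provider": "stocksnap"}]
# Json implements psycopg2's adaptation protocol: getquoted() returns the
# value already wrapped as a SQL string literal, so no extra quotes are needed.
print(Json(tags).getquoted())
# e.g. b'\'[{"name": "nature", "provider": "stocksnap"}]\''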

The difference in the string added to update_field_expressions is the quotes. I assume this was previously done that way to avoid this confusion, but it's necessary to have the cleaned values without these quotes in the file so they don't interfere with the COPY upload to a DB table later.

An alternative would be to perform more string manipulation to remove the quotes before saving the rows to files, but that seemed unnecessary and more error-prone to me.
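
To make the double-quoting concrete, a small illustration (the URL is made up):

clean_value = "'https://example.com/img.jpg'"  # the cleaning function already adds quotes

# The previous expression wrapped the value in quotes a second time:
old = f"url = '{clean_value}'"  # url = ''https://example.com/img.jpg''
# Without the extra wrapping, only the value's own quoting remains:
new = f"url = {clean_value}"    # url = 'https://example.com/img.jpg'

print(old)
print(new)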

Contributor (@stacimc):

Thanks for the explanation! That's fine with me. I don't think it's worth re-working this too much since these steps are going to be removed afterward anyway, but it would be good to expand that comment and move it before the continue to make this more clear, since multiple people found it confusing!

Collaborator (@sarayourfriend):

I agree with Staci, but I also don't want to block on it because of this code being deleted soon anyway 🤷

Member Author (@krysal):

I updated the comment. If it's still not clear, I'm open to suggestions. I didn't think it would cause so much confusion 😅

ingestion_server/ingestion_server/cleanup.py (outdated thread; resolved)
@krysal (Member Author) commented Jun 14, 2024

@stacimc That sounds like, potentially, you don't have the openverse-catalog bucket that was recently made the default for the catalog. That is not good, because it should have raised an error and been logged. I realized my way of checking for the bucket's existence wasn't working; I have now updated it and confirmed that it checks for the bucket.

You can manually create the bucket through MinIO, or running just recreate should work. Could you try again? If the bucket exists and you still see no files, please share what is in the logs so we can get some hints.
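
For reference, a minimal bucket-existence check with boto3 (a sketch only; the PR's actual code may differ):

import boto3
from botocore.exceptions import ClientError

def bucket_exists(bucket_name: str) -> bool:
    """Return True if the bucket exists and is reachable with current credentials."""
    client = boto3.client("s3")
    try:
        # head_bucket raises a ClientError (404 missing, 403 forbidden)
        # instead of failing silently, so problems surface in the logs.
        client.head_bucket(Bucket=bucket_name)
    except ClientError:
        return False
    return True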

@krysal krysal requested a review from stacimc June 14, 2024 21:52
@krysal krysal force-pushed the feat/ing_server_s3_upload branch from 9fae52d to f22023c on June 14, 2024 21:59
@sarayourfriend (Collaborator) left a comment

Requesting changes to simplify the get_s3_resource function (or even just remove it), clarify further the changes regarding update_field_expressions, and clarify the loop in _upload_to_s3 (specifically the case where the file does not exist).


ingestion_server/ingestion_server/cleanup.py (resolved)
Comment on lines 306 to 309
"""
Locally, it connects to a MinIO instance through its endpoint and test credentials.
On live environments, the connection is allowed via IAM roles.
"""
Collaborator (@sarayourfriend):

Suggestion to clarify this comment. I wasn't sure what "it" referred to (might have both been MinIO and boto3?), and the reason for the difference between local and live wasn't clear (like, why MinIO matters).

Suggested change
"""
Locally, it connects to a MinIO instance through its endpoint and test credentials.
On live environments, the connection is allowed via IAM roles.
"""
"""
Retrieve a correctly configured boto3 S3 resource.
Locally, we use MinIO to emulate S3, so we must specify the endpoint and credentials.
On live environments, S3 itself is used, and authentication works via the instance profile.
"""

Although, overall, I do think this would be clearer if we didn't use default values here, and instead required setting them, and allowed them to be None when not configured in the environment. Then the code wouldn't need to worry about the implementation detail at all. I'll share a suggestion for that code in my next comment.
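
That follow-up code suggestion isn't captured in this thread, but a sketch of the idea, with hypothetical environment variable names, might look like:

import os
import boto3

def get_s3_resource():
    # Locally, AWS_S3_ENDPOINT points at MinIO and test credentials are set;
    # in live environments none of these are set, so boto3 falls back to the
    # instance profile. Variable names here are illustrative, not the PR's.
    return boto3.resource(
        "s3",
        endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )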

Member Author (@krysal) commented Jun 20, 2024:

We use MinIO to emulate and test some AWS operations; in this case, it's the replacement for S3. Otherwise, we would need to use live resources, which would complicate things a lot and likely incur costs.

ingestion_server/ingestion_server/cleanup.py (outdated thread; resolved)
Comment on lines 338 to 339
if not file_path.exists():
    continue
Collaborator (@sarayourfriend):

In what case is this file missing not an error condition (in other words, why can we just continue here rather than raise an exception)? The intention for this loop would be a lot safer and more explicit if instead of looping through fields, we only looped through the specific ones we want to upload.

Member Author (@krysal):

Once we start fixing the rows, we will reach the point where there is nothing to clean, so in that case there is no file to upload, and the tags field won't have a file here either (a sketch follows).
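
A condensed sketch of the loop under discussion (field names assumed from the testing instructions; tags is intentionally absent because no file is written for it):

from pathlib import Path

CLEANED_DATA_DIR = Path("/tmp/cleaned_data")
UPLOADABLE_FIELDS = ("url", "creator_url", "foreign_landing_url")

def files_to_upload() -> list[Path]:
    files = []
    for field in UPLOADABLE_FIELDS:
        file_path = CLEANED_DATA_DIR / f"{field}.tsv"
        if not file_path.exists():
            # Nothing was cleaned for this field in this run, which is
            # expected once the source rows are fixed; skip rather than raise.
            continue
        files.append(file_path)
    return files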

@krysal krysal marked this pull request as draft June 17, 2024 20:55
@krysal krysal force-pushed the feat/ing_server_s3_upload branch from f22023c to b4172e4 on June 20, 2024 19:50
@krysal krysal changed the title from "Upload Ingestion Server's TSV files to AWS S3 and fix several issues" to "Remove single quotes in values of Ingestion Server's TSV files" Jun 20, 2024
@krysal (Member Author) commented Jun 20, 2024

@stacimc @sarayourfriend I split the PR into two parts since many things were happening here and it's easier to explain by simplifying it. Here, I left the fix for the quotes in the TSV files, updated the testing instructions, and opened #4529 for the upload to S3. I hope I have answered all your questions. Let me know if anything needs further clarification.

@krysal krysal marked this pull request as ready for review June 20, 2024 21:35
@stacimc (Contributor) left a comment

@krysal I'm not sure what's happening, but I'm getting really strange behavior when testing this. Locally in my catalog I updated three different records, removing the protocol from the url on one, the creator_url on the second, and the foreign_landing_url on the third. When I ran the data refresh, only the foreign_landing_url actually got cleaned, and I still am not seeing the file in MinIO even though I see no errors in my ingestion server logs. Edit: I see that you moved the S3 uploading out of this PR, so that part makes sense.

I didn't have much time to look further into this, but is anyone else able to replicate my results?


@openverse-bot (Collaborator):

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@dhruvkb
@sarayourfriend
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days[1], this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)[2].

@krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@sarayourfriend (Collaborator) left a comment

I wasn't able to replicate Staci's results. I modified an image in the catalog to remove the scheme from all three URLs, and all three have separate files correctly adding the scheme.

I think there are ways to make the intention of the code clearer, but as I said, I am unwilling to block the PR given that the code will be gone soon anyway. That said, I would be firmer in requesting clarification comments or a reorganisation of that loop if this code were long-term, so if similar code is introduced elsewhere, I would block on its clarification. I just wanted to clarify that it's a matter of circumstance, not that I'm otherwise ignoring the issue of the loop's clarity in this review.


@krysal krysal force-pushed the feat/ing_server_s3_upload branch from b4172e4 to 08ab020 on June 26, 2024 16:57
@krysal (Member Author) commented Jun 26, 2024

@stacimc Did you try doing a full just recreate before all the steps? ... @dhruvkb are you able to test and review the PR?


@sarayourfriend unrelated but I was about to suggest using ./ov just recreate but it didn't work for me. I got the following:

➜ ./ov just recreate
Unable to find image 'openverse-dev_env:latest' locally
docker: Error response from daemon: pull access denied for openverse-dev_env, repository does not exist or may require 'docker login'.
See 'docker run --help'.

The plain just recreate worked.

@krysal krysal requested a review from stacimc June 26, 2024 17:11
@dhruvkb (Member) left a comment

The code changes make sense based on the PR description. I followed the testing instructions and verified the correct output in /tmp.

@sarayourfriend (Collaborator):

@krysal that's because of @dhruvkb's PR to #4526.

@stacimc (Contributor) commented Jun 26, 2024

@krysal I did recreate, but if three other people are unable to reproduce I'm willing to believe something was going awry in my environment. No objections from me 👍

@krysal krysal dismissed stacimc’s stale review June 26, 2024 22:42

It's working for most reviewers.

@krysal krysal merged commit 3cfdb8f into main Jun 26, 2024
48 checks passed
@krysal krysal deleted the feat/ing_server_s3_upload branch June 26, 2024 22:43