Explicitly include Filter Data step in ingestion server removal IP #4524

Merged
merged 2 commits into from
Jun 21, 2024

Conversation

AetherUnbound
Collaborator

Fixes

Fixes #4456 by @AetherUnbound

Description

This PR modifies the ingestion server removal IP to explicitly include a plan for the data filtering that currently happens during the "cleanup" step and that will not be handled by the data normalization project.

While the change to the plan is minimal, I added a significant footnote just to provide context, describe what's needed, and then provide suggestions for the future. The suggestions made are not necessary at this moment, in my opinion. We can start with just slapping what used to be the "cleanup" step of the ingestion server into the new data refresh DAG and calling it good for now. I do think, though, that we could make an issue for optimizing this down the line and doing it in a more "Airflow-y" manner. It'll require more work to do that and it's worth doing eventually, but not needed now!
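
To make the "simple approach" concrete, here is a minimal sketch of what a standalone tag filtering step could look like in plain Python. This is not the ingestion server's cleanup code: the `filter_tags` and `filter_batch` helpers, the `name` and `accuracy` keys, the denylist entries, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Any, Optional

# Hypothetical values for illustration only; not taken from the ingestion server.
DENYLIST = {"no person", "squareformat"}
MIN_ACCURACY = 0.9

Tag = dict[str, Any]


def filter_tags(tags: Optional[list[Tag]]) -> Optional[list[Tag]]:
    """Drop denylisted tags and tags whose accuracy is below the threshold."""
    if not tags:
        # Nothing to filter: leave None or an empty list untouched.
        return tags
    kept = []
    for tag in tags:
        name = str(tag.get("name", "")).strip().lower()
        if not name or name in DENYLIST:
            continue
        accuracy = tag.get("accuracy")
        # Tags without an accuracy value are kept as-is.
        if accuracy is not None and accuracy < MIN_ACCURACY:
            continue
        kept.append(tag)
    return kept


def filter_batch(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the rows whose tags changed, with the filtered tags applied."""
    changed = []
    for row in rows:
        filtered = filter_tags(row.get("tags"))
        if filtered != row.get("tags"):
            changed.append({**row, "tags": filtered})
    return changed
```

Returning only the changed rows keeps any follow-up write small. A function like this could be called from a task in the data refresh DAG, but how it would be wired into Airflow is left open here.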

Testing Instructions

View the rendered document and make sure the description looks okay!

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (./ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (./ov just catalog/generate-docs media-props
    for the catalog or ./ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@AetherUnbound AetherUnbound requested a review from a team as a code owner June 19, 2024 22:32
@openverse-bot openverse-bot added 🧱 stack: documentation Related to Sphinx documentation 🧭 project: implementation plan An implementation plan for a project 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: ingestion server Related to the ingestion/data refresh server 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Jun 19, 2024
Full-stack documentation: https://docs.openverse.org/_preview/4524

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

Changed files 🔄:

Member

@krysal krysal left a comment

LGTM! I agree that the step being a copy of the current cleanup step is fine. Thanks for writing this addendum!

Contributor

@stacimc stacimc left a comment

LGTM -- I was envisioning something much more involved (along the lines of the future optimizations you've documented here), but I'm sold on your plan. The simple approach is so much simpler to implement and there's really no effort wasted if we end up wanting to optimize down the road. Let's not overcomplicate things too early!

takes about 8 hours for all cleanup steps, but that includes the URL
cleaning which is certainly more time intensive than the tag filtering since
it makes outbound requests. Running the tag filtering on Airflow should not
impact any of the other running tasks or saturate the instance.
Contributor

but that includes the URL cleaning which is certainly more time intensive than the tag filtering since it makes outbound requests.

Great point I totally hadn't thought of!

@AetherUnbound AetherUnbound merged commit 2befb89 into main Jun 21, 2024
42 checks passed
@AetherUnbound AetherUnbound deleted the docs/filtering-tags-ip-mod branch June 21, 2024 00:09
Projects
Status: Accepted
Archived in project
Development

Successfully merging this pull request may close these issues.

Update ingestion server removal IP to include plan for filtering tags
4 participants