
Update init scripts to use new data refresh #4962

Merged

stacimc merged 11 commits into main from update/init-scripts-to-use-new-data-refresh on Oct 8, 2024

Conversation

@stacimc (Contributor) commented Sep 20, 2024

Fixes

Fixes #4153 by @stacimc

Description

Updates the load_sample_data script to load sample data by actually triggering the new data refreshes, removing the dependency on the ingestion server and also helping to test the new DAGs! If AIRFLOW_CONN_SENSITIVE_TERMS is not configured, the script sets it up automatically by pulling the value from env.template, to prevent setup from breaking for folks who have not yet added this env variable (see discussion).
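For context, Airflow turns any environment variable named AIRFLOW_CONN_<CONN_ID> into a connection with that ID, which is why appending the line from env.template is sufficient. A minimal sketch of the mechanism (the URI below is a placeholder, not the real value shipped in env.template):

# Sketch only: Airflow exposes this env var as a connection with the ID
# "sensitive_terms"; the actual value lives in catalog/env.template.
export AIRFLOW_CONN_SENSITIVE_TERMS="https://example.com/sensitive-terms.txt"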

It also fixes a few dependency issues that were noticed as part of this work, explained in commit messages.

Note that this change also updates CI, so the data refreshes are run during setup for the API tests. This should go a long way toward integration testing the data refreshes; if they break, or don't populate the sample data correctly, we should see test failures in CI.

The downside is that the overhead of Airflow, while it should be negligible in an actual production data refresh compared to the time a refresh takes, does slow down the init scripts and, by extension, CI. I've seen the API tests in CI take 9-10 minutes, up from 5-7. To compensate, I've taken @sarayourfriend's suggestion to add an option that allows the data refreshes to run concurrently, which is enabled in the init scripts.
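For illustration, kicking off both refreshes side by side could look roughly like the following (a sketch only: the conf key is a hypothetical stand-in since the new param's name isn't shown here, the image DAG ID is inferred from the audio one in the testing instructions below, and the just dc exec wrapper mirrors the compose commands used elsewhere in this PR):

# Hedged sketch: trigger both staging data refreshes back to back so they
# run concurrently. `airflow dags trigger --conf` passes JSON params to the
# run; "allow_concurrent_data_refreshes" is a placeholder param name.
just dc exec scheduler airflow dags trigger \
  --conf '{"allow_concurrent_data_refreshes": true}' staging_audio_data_refresh
just dc exec scheduler airflow dags trigger \
  --conf '{"allow_concurrent_data_refreshes": true}' staging_image_data_refresh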

Testing Instructions

First, remove AIRFLOW_CONN_SENSITIVE_TERMS from your .env to verify that the init scripts will still work without it.

Run ov j recreate. Once the catalog stack is up, you can open http://localhost:9090/dags/staging_audio_data_refresh/grid and watch as the DAG is actually triggered locally. Verify that both the audio DAG and the related image DAG pass, and that the recreate works. Check out the image data refresh at the same time to see them running concurrently.
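If you prefer the terminal to the web UI, the run states can also be checked with the Airflow CLI (a sketch, assuming the CLI is reachable inside the scheduler container as in the compose commands used elsewhere in this PR):

# Sketch: list recent runs of the audio refresh DAG and their states.
just dc exec scheduler airflow dags list-runs -d staging_audio_data_refresh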

Then try ov j api/init and verify this works successfully as well.

CI should also pass, of course.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 20, 2024
@stacimc stacimc self-assigned this Sep 20, 2024
@openverse-bot openverse-bot added the 🧱 stack: api Related to the Django API label Sep 20, 2024
@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch 2 times, most recently from 4adb242 to 737099c on September 25, 2024 at 17:45

github-actions bot commented Sep 25, 2024

Latest k6 run output [1]

     ✓ status was 200

     checks.........................: 100.00% ✓ 8000      ✗ 0   
     data_received..................: 1.8 GB  8.1 MB/s
     data_sent......................: 1.0 MB  4.7 kB/s
     http_req_blocked...............: avg=16.2µs   min=1.65µs  med=4.31µs   max=11.36ms p(90)=5.61µs   p(95)=6.08µs  
     http_req_connecting............: avg=10.29µs  min=0s      med=0s       max=11.28ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=584.12ms min=44.01ms med=524.38ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
       { expected_response:true }...: avg=584.12ms min=44.01ms med=524.38ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
   ✓ http_req_failed................: 0.00%   ✓ 0         ✗ 8000
     http_req_receiving.............: avg=177.02µs min=40.86µs med=124.53µs max=31.14ms p(90)=200.48µs p(95)=249.57µs
     http_req_sending...............: avg=23.26µs  min=6.49µs  med=20.14µs  max=2.66ms  p(90)=26.82µs  p(95)=33.59µs 
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s      p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=583.92ms min=43.82ms med=524.15ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
     http_reqs......................: 8000    35.528913/s
     iteration_duration.............: avg=3.9s     min=1.32s   med=3.31s    max=11.45s  p(90)=8.45s    p(95)=8.97s   
     iterations.....................: 1200    5.329337/s
     vus............................: 4       min=4       max=30
     vus_max........................: 30      min=30      max=30

Footnotes

  1. This comment will automatically update with new output each time k6 runs for this PR


Full-stack documentation: https://docs.openverse.org/_preview/4962

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.


@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch from 676e7f7 to adc26dd on September 25, 2024 at 22:15
@openverse-bot openverse-bot added 🧱 stack: documentation Related to Sphinx documentation 🧱 stack: mgmt Related to repo management and automations labels Sep 25, 2024
@stacimc stacimc marked this pull request as ready for review September 25, 2024 23:44
@stacimc stacimc requested review from a team as code owners September 25, 2024 23:45
@stacimc stacimc requested review from AetherUnbound, sarayourfriend and krysal and removed request for a team September 25, 2024 23:45
@sarayourfriend (Collaborator) left a comment

This LGTM and I was able to get everything working as expected locally. I am requesting changes for a very small but, I think, significant issue with the development experience of this change.

The data refresh DAGs will fail if the sensitive terms Airflow connection is not configured. A default configuration is in catalog/env.template, but nothing in this change indicates a hard requirement to get new .env values (whether manually or by recreating your .env altogether). When the wait-for-index check times out in load_sample_data, even if you check Airflow, all you will see is that the connection is missing.

I was able to figure it out and remembered how the sensitive terms connection works in the new data refresh, but that was with my context and experience on the project. Newer contributors will run into this, and will not have any direct indication of how to resolve the problem. In fact, they may not even know to check the DAG status.

I believe there is a relatively simple fix for this: in the load sample data script, grep catalog/.env for the sensitive terms connection envvar. If it is not present, log a message and either:

  1. Automatically append the one from env.template to .env. For a script used only in CI (where this check will never be relevant) or locally, I feel this is fine.
  2. Just log the message but with instructions for adding the new connection manually, and then exit 1.

Both of these could eventually be removed, say in 2 or 3 months. There is no hard "backwards compatibility" requirement for our development environment (it would be untenable), but transitions should ideally be smooth and transparent at the time.

The diff below solves this problem for me locally. I tested by removing the connection variable from my .env and re-running the init script:

diff --git a/load_sample_data.sh b/load_sample_data.sh
index 46df60c8f..702bba335 100755
--- a/load_sample_data.sh
+++ b/load_sample_data.sh
@@ -6,6 +6,18 @@ CACHE_SERVICE_NAME="${CACHE_SERVICE_NAME:-cache}"
 UPSTREAM_DB_SERVICE_NAME="${UPSTREAM_DB_SERVICE_NAME:-upstream_db}"
 DB_SERVICE_NAME="${DB_SERVICE_NAME:-db}"
 
+# `true` resolves to an empty string, and prevents the script from failing
+# due to `set -e` and grep's non-zero status code if the pattern isn't found
+has_sensitive_terms_airflow_conn=$(grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/.env || true)
+
+if [[ ! "$has_sensitive_terms_airflow_conn" ]]; then
+  echo "Adding new Airflow connection environment variable required for sample data loading"
+  grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/env.template >> catalog/.env
+
+  echo "Restarting Airflow to populate the new connection variable"
+  just dc restart webserver scheduler triggerer
+fi
+
 while getopts 'c' OPTION; do
   case "$OPTION" in
   c)

Otherwise, this PR LGTM. I have some questions about the DevEx, but nothing that is blocking or can't be iterated on. Those are:

  1. Loading sample data takes noticeably longer than it did with the ingestion server. Is the DAG concurrency check necessary for local usage? Could we run the audio and image data refreshes concurrently when loading sample data? Given the regularity of recreating sample data locally (at least when working on some tickets), anything we can do to cut down the time this takes would be a great benefit.
  2. We trigger the DAGs in order of audio then image. Is it reliable that audio will always happen first? If not, and image could somehow go first, the load sample data script would sit there waiting for audio much longer. I don't think it would reach the 5-minute timeout, but, speaking personally, I would almost certainly kill the process if it were taking longer than 2 minutes.
     • If it's even possible for the DAG order to be inconsistent, running the DAGs concurrently obviates any concern about it.
  3. There are four rather visible warnings emitted by the Airflow CLI: "/home/airflow/.local/lib/python3.12/site-packages/airflow/cli/commands/dag_command.py:48 UserWarning: Could not import graphviz. Rendering graph to the graphical format will not be possible.". Can we suppress these (see the sketch below)? However unlikely it may seem, warnings could be taken as the cause if something else goes wrong in the load sample data script, and could confuse someone unfamiliar with how this all works.
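For the third point, one possibility (untested here) is Python's own warning filter, which the Airflow CLI respects like any Python process:

# Sketch: filter the graphviz UserWarning by message prefix before invoking
# the Airflow CLI; PYTHONWARNINGS uses an "action:message" format where the
# message is matched against the start of the warning text.
export PYTHONWARNINGS="ignore:Could not import graphviz"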

To clarify again, none of those 3 are blockers, but I would create issues for any you agree are valid improvements to make after the fact.

The only blocker is making sure the sensitive terms connection doesn't cause unexpected and niche failures during this transitionary period.

@stacimc stacimc marked this pull request as draft September 26, 2024 23:14
@stacimc (Contributor, Author) commented Sep 27, 2024

@sarayourfriend Great call about the sensitive terms Airflow connection. I added your suggestion in d9b2a70; I agree, we should handle this proactively since it could break all the init scripts for any dev who is unaware of the need to add the new connection, and this is only a temporary measure.

As to running the DAGs concurrently during the sample data script -- I see no reason why not! There were a number of naming collisions (connection IDs, FDW, etc.) that we didn't need to worry about when the data refreshes were guaranteed to be isolated, but that can be handled. I think it's worth getting this working in this PR, because it should help cut down the time in the recreate/init scripts and CI. I've added a param to the DAG to allow data refreshes to run concurrently, and enabled it during the load_sample_data script :)
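For reference, the "wait for both" behavior boils down to something like the following (a simplified sketch of the idea, not the exact load_sample_data code; the image DAG ID is inferred from the audio one):

# Simplified sketch: poll each staging refresh DAG until it no longer has a
# run in the "running" state before continuing with sample data loading.
for dag_id in staging_audio_data_refresh staging_image_data_refresh; do
  while just dc exec scheduler airflow dags list-runs \
      -d "$dag_id" --state running | grep -q running; do
    echo "Waiting for $dag_id to finish..."
    sleep 10
  done
done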

@stacimc stacimc marked this pull request as ready for review September 27, 2024 02:03
@sarayourfriend (Collaborator) left a comment

Amazing!

The TaskGroups that created new indices were returning the name of the index from the TaskGroup, in order to allow accessing the index names later in other parts of the DAG. Unfortunately, this was breaking the expected dependency list in the DAG; only the returned task was being set as upstream of downstream tasks, rather than the last task in the TaskGroup.

TL;DR: it was possible, for example, for creation of the filtered index to fail but for table/index promotion to proceed anyway!

This commit fixes that by pulling the tasks to generate the index names out of the TaskGroups so they can be used directly at any point in the DAG.
This is necessary for the load sample data script, so we can load sample data by running the staging DAGs. If they were on any kind of automated schedule, we would have to contend with scheduled runs in addition to our manual ones.
@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch from 547aa42 to 276bd49 on October 7, 2024 at 22:30
@stacimc stacimc merged commit 424702c into main Oct 8, 2024
53 checks passed
@stacimc stacimc deleted the update/init-scripts-to-use-new-data-refresh branch October 8, 2024 17:21
Development

Successfully merging this pull request may close these issues.

Update the load_sample_data scripts to remove ingestion server usage