
Update init scripts to use new data refresh #4962

Merged

stacimc merged 11 commits into main from update/init-scripts-to-use-new-data-refresh on Oct 8, 2024

Conversation

@stacimc (Contributor) commented Sep 20, 2024

Fixes

Fixes #4153 by @stacimc

Description

Updates the load_sample_data script to load sample data by actually triggering the new data refreshes, removing the dependency on the ingestion server and also helping to test the new DAGs! If AIRFLOW_CONN_SENSITIVE_TERMS is not configured, the script sets it up automatically by pulling the value from env.template, to prevent setup from breaking for folks who have not yet added this env variable (see discussion).
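For context, Airflow turns any environment variable named AIRFLOW_CONN_<CONN_ID> into a connection with that ID, which is why appending the line from env.template is sufficient. A minimal sketch of the mechanism (the URI below is a placeholder, not the real value shipped in env.template):

# Sketch only: Airflow exposes this env var as a connection with the ID
# "sensitive_terms"; the actual value lives in catalog/env.template.
export AIRFLOW_CONN_SENSITIVE_TERMS="https://example.com/sensitive-terms.txt"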

It also fixes a few dependency issues that were noticed as part of this work, explained in commit messages.

Note that this change also updates CI, so the data refreshes are run during setup for the API tests. This should go a long way toward integration testing the data refreshes; if they break, or don't populate the sample data correctly, we should see test failures in CI.

The downside is that the overhead of Airflow, while it should be negligible in an actual production data refresh compared to the time a refresh takes, does slow down the init scripts and, by extension, CI. I've seen the API tests in CI take 9-10 minutes, up from 5-7. To compensate, I've taken @sarayourfriend's suggestion to add an option that allows the data refreshes to run concurrently, which is enabled in the init scripts.
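For illustration, kicking off both refreshes side by side could look roughly like the following (a sketch only: the conf key is a hypothetical stand-in since the new param's name isn't shown here, the image DAG ID is inferred from the audio one in the testing instructions below, and the just dc exec wrapper mirrors the compose commands used elsewhere in this PR):

# Hedged sketch: trigger both staging data refreshes back to back so they
# run concurrently. `airflow dags trigger --conf` passes JSON params to the
# run; "allow_concurrent_data_refreshes" is a placeholder param name.
just dc exec scheduler airflow dags trigger \
  --conf '{"allow_concurrent_data_refreshes": true}' staging_audio_data_refresh
just dc exec scheduler airflow dags trigger \
  --conf '{"allow_concurrent_data_refreshes": true}' staging_image_data_refresh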

Testing Instructions

First, remove AIRFLOW_CONN_SENSITIVE_TERMS from your .env to verify that the init scripts will still work without it.

Run ov j recreate. Once the catalog stack is up, you can open http://localhost:9090/dags/staging_audio_data_refresh/grid and watch as the DAG is actually triggered locally. Verify that both the audio DAG and the related image DAG pass, and that the recreate works. Check out the image data refresh at the same time to see them running concurrently.
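If you prefer the terminal to the web UI, the run states can also be checked with the Airflow CLI (a sketch, assuming the CLI is reachable inside the scheduler container as in the compose commands used elsewhere in this PR):

# Sketch: list recent runs of the audio refresh DAG and their states.
just dc exec scheduler airflow dags list-runs -d staging_audio_data_refresh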

Then try ov j api/init and verify this works successfully as well.

CI should also pass, of course.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (ov just catalog/generate-docs for catalog
    PRs) or the media properties generator (ov just catalog/generate-docs media-props
    for the catalog or ov just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Sep 20, 2024
@stacimc stacimc self-assigned this Sep 20, 2024
@openverse-bot openverse-bot added the 🧱 stack: api Related to the Django API label Sep 20, 2024
@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch 2 times, most recently from 4adb242 to 737099c on September 25, 2024 at 17:45

github-actions bot commented Sep 25, 2024

Latest k6 run output [1]

     ✓ status was 200

     checks.........................: 100.00% ✓ 8000      ✗ 0   
     data_received..................: 1.8 GB  8.1 MB/s
     data_sent......................: 1.0 MB  4.7 kB/s
     http_req_blocked...............: avg=16.2µs   min=1.65µs  med=4.31µs   max=11.36ms p(90)=5.61µs   p(95)=6.08µs  
     http_req_connecting............: avg=10.29µs  min=0s      med=0s       max=11.28ms p(90)=0s       p(95)=0s      
     http_req_duration..............: avg=584.12ms min=44.01ms med=524.38ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
       { expected_response:true }...: avg=584.12ms min=44.01ms med=524.38ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
   ✓ http_req_failed................: 0.00%   ✓ 0         ✗ 8000
     http_req_receiving.............: avg=177.02µs min=40.86µs med=124.53µs max=31.14ms p(90)=200.48µs p(95)=249.57µs
     http_req_sending...............: avg=23.26µs  min=6.49µs  med=20.14µs  max=2.66ms  p(90)=26.82µs  p(95)=33.59µs 
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s      p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=583.92ms min=43.82ms med=524.15ms max=2.47s   p(90)=1.09s    p(95)=1.21s   
     http_reqs......................: 8000    35.528913/s
     iteration_duration.............: avg=3.9s     min=1.32s   med=3.31s    max=11.45s  p(90)=8.45s    p(95)=8.97s   
     iterations.....................: 1200    5.329337/s
     vus............................: 4       min=4       max=30
     vus_max........................: 30      min=30      max=30

Footnotes

  1. This comment will automatically update with new output each time k6 runs for this PR


Full-stack documentation: https://docs.openverse.org/_preview/4962

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.


@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch from 676e7f7 to adc26dd on September 25, 2024 at 22:15
@openverse-bot openverse-bot added 🧱 stack: documentation Related to Sphinx documentation 🧱 stack: mgmt Related to repo management and automations labels Sep 25, 2024
@stacimc stacimc marked this pull request as ready for review September 25, 2024 23:44
@stacimc stacimc requested review from a team as code owners September 25, 2024 23:45
@stacimc stacimc requested review from AetherUnbound, sarayourfriend and krysal and removed request for a team September 25, 2024 23:45
@sarayourfriend (Collaborator) left a comment

This LGTM and I was able to get everything working as expected locally. I am requesting changes for a very small but, I think, significant issue with the development experience of this change.

The data refresh DAGs will fail if the sensitive terms Airflow connection is not configured. A default configuration is in catalog/env.template, but nothing in this change indicates a hard requirement to get new .env values (whether manually or by recreating your .env altogether). When the wait-for-index check times out in load_sample_data, even if you check Airflow, all you will see is that the connection is missing.

I was able to figure it out and remembered how the sensitive terms connection works in the new data refresh, but that was with my context and experience on the project. Newer contributors will run into this, and will not have any direct indication of how to resolve the problem. In fact, they may not even know to check the DAG status.

I believe there is a relatively simple fix for this: in the load sample data script, grep catalog/.env for the sensitive terms connection envvar. If it is not present, log a message and either:

  1. Automatically append the one from env.template to .env. For a script used only in CI (where this check will never be relevant) or locally, I feel this is fine.
  2. Just log the message but with instructions for adding the new connection manually, and then exit 1.

Both of these could eventually be removed, say in 2 or 3 months. There is no hard "backwards compatibility" requirement for our development environment (it would be untenable), but transitions should ideally be smooth and transparent at the time.

The diff below solves this problem for me locally. I tested by removing the connection variable from my .env and re-running the init script:

diff --git a/load_sample_data.sh b/load_sample_data.sh
index 46df60c8f..702bba335 100755
--- a/load_sample_data.sh
+++ b/load_sample_data.sh
@@ -6,6 +6,18 @@ CACHE_SERVICE_NAME="${CACHE_SERVICE_NAME:-cache}"
 UPSTREAM_DB_SERVICE_NAME="${UPSTREAM_DB_SERVICE_NAME:-upstream_db}"
 DB_SERVICE_NAME="${DB_SERVICE_NAME:-db}"
 
+# `true` resolves to an empty string, and prevents the script from failing
+# due to `set -e` and grep's non-zero status code if the pattern isn't found
+has_sensitive_terms_airflow_conn=$(grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/.env || true)
+
+if [[ ! "$has_sensitive_terms_airflow_conn" ]]; then
+  echo "Adding new Airflow connection environment variable required for sample data loading"
+  grep "AIRFLOW_CONN_SENSITIVE_TERMS" catalog/env.template >> catalog/.env
+
+  echo "Restarting Airflow to populate the new connection variable"
+  just dc restart webserver scheduler triggerer
+fi
+
 while getopts 'c' OPTION; do
   case "$OPTION" in
   c)

Otherwise, this PR LGTM. I have some questions about the DevEx, but nothing that is blocking or can't be iterated on. Those are:

  1. Loading sample data takes noticeably longer than it did with the ingestion server. Is the DAG concurrency check necessary for local usage? Could we run the audio and image data refreshes concurrently when loading sample data? Given the regularity of recreating sample data locally (at least when working on some tickets), anything we can do to cut down the time this takes would be a great benefit.
  2. We trigger the DAGs in order of audio then image. Is it reliable that audio will always happen first? If not, and image could somehow go first, the load sample data script would sit there waiting for audio much longer. I don't think it would reach the 5-minute timeout, but, speaking personally, I would almost certainly kill the process if it were taking longer than 2 minutes.
     • If it's even possible for the DAG order to be inconsistent, running the DAGs concurrently obviates any concern about it.
  3. There are four rather visible warnings emitted by the Airflow CLI: "/home/airflow/.local/lib/python3.12/site-packages/airflow/cli/commands/dag_command.py:48 UserWarning: Could not import graphviz. Rendering graph to the graphical format will not be possible.". Can we suppress these (see the sketch below)? However unlikely it may seem, warnings could be taken as the cause if something else goes wrong in the load sample data script, and could confuse someone unfamiliar with how this all works.
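For the third point, one possibility (untested here) is Python's own warning filter, which the Airflow CLI respects like any Python process:

# Sketch: filter the graphviz UserWarning by message prefix before invoking
# the Airflow CLI; PYTHONWARNINGS uses an "action:message" format where the
# message is matched against the start of the warning text.
export PYTHONWARNINGS="ignore:Could not import graphviz"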

To clarify again, none of those 3 are blockers, but I would create issues for any you agree are valid improvements to make after the fact.

The only blocker is making sure the sensitive terms connection doesn't cause unexpected and niche failures during this transitionary period.

@stacimc stacimc marked this pull request as draft September 26, 2024 23:14
@stacimc (Contributor, Author) commented Sep 27, 2024

@sarayourfriend Great call about the sensitive terms Airflow connection. I added your suggestion in d9b2a70; I agree, we should handle this proactively since it could break all the init scripts for any dev who is unaware of the need to add the new connection, and this is only a temporary measure.

As to running the DAGs concurrently during the sample data script -- I see no reason why not! There were a number of naming collisions (connection IDs, FDW, etc.) that we didn't need to worry about when the data refreshes were guaranteed to be isolated, but that can be handled. I think it's worth getting this working in this PR, because it should help cut down the time in the recreate/init scripts and CI. I've added a param to the DAG to allow data refreshes to run concurrently, and enabled it during the load_sample_data script :)
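For reference, the "wait for both" behavior boils down to something like the following (a simplified sketch of the idea, not the exact load_sample_data code; the image DAG ID is inferred from the audio one):

# Simplified sketch: poll each staging refresh DAG until it no longer has a
# run in the "running" state before continuing with sample data loading.
for dag_id in staging_audio_data_refresh staging_image_data_refresh; do
  while just dc exec scheduler airflow dags list-runs \
      -d "$dag_id" --state running | grep -q running; do
    echo "Waiting for $dag_id to finish..."
    sleep 10
  done
done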

@stacimc stacimc marked this pull request as ready for review September 27, 2024 02:03
@sarayourfriend (Collaborator) left a comment

Amazing!

The TaskGroups that created new indices were returning the name of the index from the TaskGroup, in order to allow accessing the index names later in other parts of the DAG. Unfortunately, this was breaking the expected dependency list in the DAG; only the returned task was being set as upstream of downstream tasks, rather than the last task in the TaskGroup.

TL;DR: it was possible, for example, for creation of the filtered index to fail but for table/index promotion to proceed anyway!

This commit fixes that by pulling the tasks to generate the index names out of the TaskGroups so they can be used directly at any point in the DAG.
This is necessary for the load sample data script, so we can load sample data by running the staging DAGs. If they were on any kind of automated schedule, we would have to contend with scheduled runs in addition to our manual ones.
@stacimc stacimc force-pushed the update/init-scripts-to-use-new-data-refresh branch from 547aa42 to 276bd49 on October 7, 2024 at 22:30
@stacimc stacimc merged commit 424702c into main Oct 8, 2024
53 checks passed
@stacimc stacimc deleted the update/init-scripts-to-use-new-data-refresh branch October 8, 2024 17:21
Development

Successfully merging this pull request may close these issues.

Update the load_sample_data scripts to remove ingestion server usage