Use Google Batch for full ETL runs #3211

Merged
merged 6 commits into from
Jan 29, 2024

Conversation

jdangerx
Member

@jdangerx jdangerx commented Jan 3, 2024

Overview

Closes #3208.

This lets us kick off nightly builds via Google Batch.

Opens up:

  • faster, searchable, filterable logging
  • scaling up/down the instances easily
  • kicking off multiple builds at once without stepping on each other's toes
  • trying to use spot instances - we'd probably want to set this job up in a more "restartable" way, so that we don't end up having to re-run the whole dang thing when our VM gets interrupted

What did you change?

  • add script to create batch configuration, based on the same CLI args we were using before for VM configuration
  • run Postgres within the container, so we can stop running the Cloud SQL instance
  • slightly nicer Slack message outputs
  • exit 1 on failure so that the batch job dashboard correctly identifies a failure

Testing

How did you make sure this worked? How can a reviewer verify this?

Ran lots of nightly builds using Batch :) You can kick off your own via workflow_dispatch.

To-do list

@jdangerx jdangerx linked an issue Jan 3, 2024 that may be closed by this pull request
@jdangerx jdangerx force-pushed the try-google-batch branch 3 times, most recently from 2b6ab6c to da11dc9 Compare January 9, 2024 18:40
@zaneselvans zaneselvans added cloud Stuff that has to do with adapting PUDL to work in cloud computing context. nightly-builds Anything having to do with nightly builds or continuous deployment. labels Jan 9, 2024
@jdangerx jdangerx force-pushed the try-google-batch branch 2 times, most recently from 9670eac to 0399687 Compare January 10, 2024 17:42
@zaneselvans zaneselvans added this to the v2024.01 milestone Jan 12, 2024
@jdangerx jdangerx force-pushed the try-google-batch branch 3 times, most recently from c23a4ac to ee423f1 Compare January 16, 2024 20:54
@jdangerx jdangerx force-pushed the try-google-batch branch 2 times, most recently from 7196e2a to 7c26216 Compare January 23, 2024 16:26
@jdangerx jdangerx force-pushed the try-google-batch branch 6 times, most recently from 96e4479 to e842d7b Compare January 26, 2024 20:58
Comment on lines -149 to -152
--container-env DAGSTER_PG_USERNAME="postgres" \
--container-env DAGSTER_PG_PASSWORD="$DAGSTER_PG_PASSWORD" \
--container-env DAGSTER_PG_HOST="104.154.182.24" \
--container-env DAGSTER_PG_DB="dagster-storage" \
Member Author

These guys got moved into docker/dagster.yaml because:

  • I wanted all the dagster postgres configs in one place
  • dagster postgres port configuration has to be in dagster.yaml, otherwise we run into "port is a string, not an int" issues
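
The "port is a string, not an int" problem follows from environment variables always being strings. A minimal Python illustration of that behavior, assuming a hypothetical variable name `DAGSTER_PG_PORT` that does not come from this PR:

```python
import os

# Hypothetical variable name, used only for illustration.
os.environ["DAGSTER_PG_PORT"] = "5432"

raw_port = os.environ["DAGSTER_PG_PORT"]

# Environment variables are always strings, so a consumer that expects an
# integer must convert explicitly (or read the value from a typed config file).
port = int(raw_port)

print(type(raw_port).__name__, type(port).__name__)
```

Writing the port as a literal in a YAML file sidesteps the conversion, because the YAML parser yields an integer directly.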

.github/workflows/build-deploy-pudl.yml

send_slack_msg "$message"
upload_file_to_slack "$LOGFILE" "pudl_etl logs for $BUILD_ID:"
Member Author

Do we want to keep this here? It's kind of nice to be able to click and download logs right from the failure notification. But we can already just click that Google Batch Console link...

Member

@zaneselvans zaneselvans Jan 26, 2024

Maybe I just don't know how to use it correctly, but I find it annoying to search the logs in the console, and they only show a few lines no matter how big the window is. I also can't just click on the link, because it opens in the wrong browser window or under the wrong Google ID, so I have to copy the link, open a tab in the right window in the right user container, paste the URL, and then I get an interface I don't like anyway.

The thing I'd like to have is direct links / URLs for both the output directory and the logfile, with the $BUILD_ID visible rather than hidden in a link:

  • Outputs: gs://builds.catalyst.coop/${BUILD_ID}/
  • Logfile: gs://builds.catalyst.coop/${BUILD_ID}/${BUILD_ID}-pudl-etl.log

so I can pull the logfile down with gsutil cp and open it in my editor / grep through it, and know without having to hover over it what build it's from. I find the "download file" dialog in Slack a bit involved, but it's probably more accessible to the less cloudy folks.

Another potential improvement, especially with our newfound ability to run an arbitrary number of builds all at the same time, would be to make sure that every slack message includes the $BUILD_ID so that we know which build the messages pertain to, since if there's more than one going at a time they get interleaved with each other and that can be very confusing. Or if there were some way to group all of the messages that pertain to a given build in a single thread that would be even better.
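
As a rough sketch of the request above, the links and the build ID prefix could be generated like this. The bucket name and path layout come from the comment; the function names `build_links` and `format_slack_message` are hypothetical and are not part of the PR:

```python
BUCKET = "builds.catalyst.coop"  # bucket name taken from the comment above


def build_links(build_id: str) -> dict[str, str]:
    """Return gs:// URIs for a build's output directory and logfile."""
    base = f"gs://{BUCKET}/{build_id}"
    return {
        "outputs": f"{base}/",
        "logfile": f"{base}/{build_id}-pudl-etl.log",
    }


def format_slack_message(build_id: str, text: str) -> str:
    """Prefix a message with the build ID and append copy-pasteable links."""
    links = build_links(build_id)
    return (
        f"[{build_id}] {text}\n"
        f"Outputs: {links['outputs']}\n"
        f"Logfile: {links['logfile']}"
    )


print(format_slack_message("2024-01-27-0602-52a2741ff-main", "Build failed"))
```

Because the URIs are plain text rather than hidden behind link labels, they can be pasted straight into `gsutil cp`, and the prefix keeps interleaved messages from concurrent builds distinguishable.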

Member

I guess what I'm saying with the links to logs & outputs is different people will prefer different access methods so maybe we should just provide them all.


# NOTE (daz): the best documentation of the actual data structure I've found is at
# https://cloud.google.com/python/docs/reference/batch/latest/google.cloud.batch_v1.types.Job
config = {
Member Author

I briefly thought about doing some dataclass thing here instead of a dictionary, but that seemed super verbose so I abandoned that idea.
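
For readers unfamiliar with the shape of the config, here is a minimal sketch of the dictionary approach. It is not the contents of devtools/generate_batch_config.py: the function name, arguments, and example values are hypothetical, and the field names reflect my reading of the Job reference linked above, so check them against it before relying on them.

```python
import json


def make_batch_config(image: str, command: list[str], vcpus: int, mem_gb: int) -> dict:
    """Build a Job-shaped dictionary for a single containerized task.

    Field names are assumed from the Batch Job reference, not copied from the PR.
    """
    return {
        "taskGroups": [
            {
                "taskCount": 1,
                "taskSpec": {
                    "runnables": [
                        {"container": {"imageUri": image, "commands": command}}
                    ],
                    "computeResource": {
                        # Batch expresses CPU in thousandths of a vCPU
                        # and memory in MiB.
                        "cpuMilli": vcpus * 1000,
                        "memoryMib": mem_gb * 1024,
                    },
                    "maxRetryCount": 0,
                },
            }
        ],
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


# Placeholder image and command, for illustration only.
config = make_batch_config("example/image:tag", ["echo", "hello"], vcpus=8, mem_gb=32)
print(json.dumps(config, indent=2))
```

A plain dictionary serializes to JSON with no extra mapping layer, which is the verbosity trade-off mentioned above; the cost is that typos in field names are not caught until the job is submitted.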

@jdangerx jdangerx changed the title WIP: Try google batch Use Google Batch for full ETL runs Jan 26, 2024
Member

@zaneselvans zaneselvans left a comment

Looks great to the extent that I understand what it's doing :)

I have some desires around what shows up in the Slack notifications so we can easily tell which build a message is coming from and directly access the outputs.

devtools/generate_batch_config.py

@jdangerx jdangerx marked this pull request as ready for review January 29, 2024 16:00
@jdangerx
Member Author

I think I know how to get the Slack notifications working even better, but I figured that was out of scope for this PR. I just updated the wording and made sure that you can (a) click the links and go straight to the correct console page and (b) copy the links directly into gsutil or gcloud storage commands.

@zaneselvans
Member

I note that since we distribute the log file associated with the nightly builds, it's also possible to provide a clickable HTTP download link, though that would only work if the build has gotten as far as distributing the outputs, in which case I guess we'll usually not need to look at the logs.

https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/2024-01-27-0602-52a2741ff-main-pudl-etl.log

Another nice bonus of giving the logfile a readable and unique name is we can now at a glance see what build the nightlies came from.
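
To illustrate what can be read off such a name, here is a small sketch that splits a build ID into its parts. It assumes the ID follows the pattern seen in the URL above (date, HHMM time, short git hash, branch); that pattern is inferred from the single example, and `parse_build_id` is a hypothetical helper, not something in the PR.

```python
from datetime import datetime


def parse_build_id(build_id: str) -> dict:
    """Split an ID like 2024-01-27-0602-52a2741ff-main into its components."""
    # Limit the split so that branch names containing hyphens stay intact.
    parts = build_id.split("-", 5)
    if len(parts) != 6:
        raise ValueError(f"Unexpected build ID format: {build_id!r}")
    year, month, day, hhmm, git_hash, branch = parts
    started = datetime.strptime(f"{year}-{month}-{day}-{hhmm}", "%Y-%m-%d-%H%M")
    return {"started": started, "git_hash": git_hash, "branch": branch}


info = parse_build_id("2024-01-27-0602-52a2741ff-main")
print(info["started"].isoformat(), info["git_hash"], info["branch"])
```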

@jdangerx
Member Author

Yeah, I didn't realize what I didn't have until I had it, but being able to see both the date & the git hash is 🎆

@jdangerx jdangerx enabled auto-merge (squash) January 29, 2024 17:23
@jdangerx jdangerx merged commit c48766d into main Jan 29, 2024
17 checks passed
@jdangerx jdangerx deleted the try-google-batch branch January 29, 2024 17:35
Labels
cloud Stuff that has to do with adapting PUDL to work in cloud computing context. nightly-builds Anything having to do with nightly builds or continuous deployment.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Port nightly build process to Google Batch
2 participants