-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Google Batch for full ETL runs #3211
Conversation
2b6ab6c
to
da11dc9
Compare
9670eac
to
0399687
Compare
c23a4ac
to
ee423f1
Compare
7196e2a
to
7c26216
Compare
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
96e4479
to
e842d7b
Compare
--container-env DAGSTER_PG_USERNAME="postgres" \ | ||
--container-env DAGSTER_PG_PASSWORD="$DAGSTER_PG_PASSWORD" \ | ||
--container-env DAGSTER_PG_HOST="104.154.182.24" \ | ||
--container-env DAGSTER_PG_DB="dagster-storage" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These guys got moved into docker/dagster.yaml
because:
- I wanted all the dagster postgres configs in one place
- dagster postgres port configuration has to be in dagster.yaml, otherwise we run into "port is a string, not an int" issues
docker/gcp_pudl_etl.sh
Outdated
|
||
send_slack_msg "$message" | ||
upload_file_to_slack "$LOGFILE" "pudl_etl logs for $BUILD_ID:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to keep this here? It's kind of nice to be able to click and download logs right from the failure notification. But we can already just click that Google Batch Console link...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I just don't know how to use it correctly but I have found the logs on the console annoying to find anything in and they only show a few lines no matter how bit the window is. And I can't just click on the link because then it opens in the wrong browser window, or under the wrong Google ID, so I have to copy link, open a tab in the right window in the right user container, paste the URL and then get an interface I don't like anyway.
The thing I'd like to have is direct links / URLs for both the output directory and the logfile, with the $BUILD_ID
visible rather than hidden in a link:
- Outputs:
gs://builds.catalyst.coop/${BUILD_ID}/
- Logfile:
gs://builds.catalyst.coop/${BUILD_ID}/${BUILD_ID}-pudl-etl.log
so I can pull the logfile down with gsutil cp
and open it in my editor / grep through it, and know without having to hover over it what build it's from. I find the "download file" dialog in Slack a bit involved, but it's probably more accessible to the less cloudy folks.
Another potential improvement, especially with our newfound ability to run an arbitrary number of builds all at the same time, would be to make sure that every slack message includes the $BUILD_ID
so that we know which build the messages pertain to, since if there's more than one going at a time they get interleaved with each other and that can be very confusing. Or if there were some way to group all of the messages that pertain to a given build in a single thread that would be even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what I'm saying with the links to logs & outputs is different people will prefer different access methods so maybe we should just provide them all.
|
||
# NOTE (daz): the best documentation of the actual data structure I've found is at | ||
# https://cloud.google.com/python/docs/reference/batch/latest/google.cloud.batch_v1.types.Job | ||
config = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly thought about doing some dataclass thing here instead of a dictionary, but that seemed super verbose so I abandoned that idea.
e842d7b
to
2c8bb16
Compare
2c8bb16
to
39569c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to the extent that I understand what it's doing :)
I have some desires around what shows up in the Slack notifications so we can easily tell which build a message is coming from and directly access the outputs.
docker/gcp_pudl_etl.sh
Outdated
|
||
send_slack_msg "$message" | ||
upload_file_to_slack "$LOGFILE" "pudl_etl logs for $BUILD_ID:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I just don't know how to use it correctly but I have found the logs on the console annoying to find anything in and they only show a few lines no matter how bit the window is. And I can't just click on the link because then it opens in the wrong browser window, or under the wrong Google ID, so I have to copy link, open a tab in the right window in the right user container, paste the URL and then get an interface I don't like anyway.
The thing I'd like to have is direct links / URLs for both the output directory and the logfile, with the $BUILD_ID
visible rather than hidden in a link:
- Outputs:
gs://builds.catalyst.coop/${BUILD_ID}/
- Logfile:
gs://builds.catalyst.coop/${BUILD_ID}/${BUILD_ID}-pudl-etl.log
so I can pull the logfile down with gsutil cp
and open it in my editor / grep through it, and know without having to hover over it what build it's from. I find the "download file" dialog in Slack a bit involved, but it's probably more accessible to the less cloudy folks.
Another potential improvement, especially with our newfound ability to run an arbitrary number of builds all at the same time, would be to make sure that every slack message includes the $BUILD_ID
so that we know which build the messages pertain to, since if there's more than one going at a time they get interleaved with each other and that can be very confusing. Or if there were some way to group all of the messages that pertain to a given build in a single thread that would be even better.
39569c2
to
13416cf
Compare
I think I know how to get the Slack notifications working even better but I figured that was out of scope for this PR. Just decided to update the wording & make sure that you can (a) click the links and go straight to the correct console page (b) copy the links directly into |
I note that since we we distribute the log file associated with the nightly builds, it's also possible to provide a clickable HTTP download link, though that would only work if the build has gotten as far as distributing the outputs. In which case I guess we'll usually not need to look at the logs. Another nice bonus of giving the logfile a readable and unique name is we can now at a glance see what build the nightlies came from. |
Yeah, I didn't realize what I didn't have until I had it, but being able to see both the date & the git hash is 🎆 |
Overview
Closes #3208 .
This lets us kick off nightly builds via Google Batch.
Opens up:
What did you change?
Testing
How did you make sure this worked? How can a reviewer verify this?
Ran lots of nightly builds using Batch :) You can kick off your own via
workflow_dispatch
.To-do list