
Automate the data release process #2756

Closed · 13 tasks done · Tracked by #3061
bendnorman opened this issue Jul 26, 2023 · 4 comments · Fixed by #3124 or #3158
Labels:
  • datapkg: Frictionless data package input, output, metadata, manipulation
  • metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
  • release: Tasks directly related to data and software releases.
  • zenodo: Issues having to do with Zenodo data archiving and retrieval.

Comments

@bendnorman (Member) commented Jul 26, 2023

Once #1973 and the implementation of #2517 are complete, we can move to data-only releases. There is some automation we'd like to create to make this process as smooth as possible.

The plan right now is to do a semi-manual data release based on the v2023.12.01 tag and take notes on the process, so that within the next 2 weeks we can do an automated release containing all the post-rename tables. It's just a draft at the moment, but the v2023.12.01 release will be available at:

Resolved Questions

  • The auto-generated release notes on the GitHub release got truncated, but regenerating them manually did not truncate them. I have modified our release configuration to exclude all of the bot PRs, and hopefully we will never wait a year between releases again, so this shouldn't be a problem anymore.

Open Questions

  • Do we want an "oops" wait time or required reviews on the release? I've already restricted it so that deploys are only possible from main, and only if the tag starts with v20*.
  • Do we or don't we want to push releases to PyPI? It's an obvious place to look for things, but we're publishing an application, not a library. If we do publish released software to PyPI, should we also continue updating on conda-forge? I'm inclined to find a low-overhead (but not guaranteed to work / be reproducible) way to do this, just so the package can be installed without git reference gymnastics.
  • If we continue doing software releases, note that PyPI removes leading zeroes from the calendar version, turning v2023.12.01 into v2023.12.1. However, GitHub and the outputs on GCS & S3 don't do this, and the tag we applied on GitHub is v2023.12.01, so this could result in some confusion / annoyance. Do we want to switch to the no-leading-zeroes form of CalVer? (The snippet after this list demonstrates the normalization.)
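
For reference, that renaming is PEP 440 normalization rather than a PyPI quirk, and it's easy to demonstrate with the packaging library (the same one pip uses):

```python
from packaging.version import Version

# PEP 440 treats each release segment as an integer, so the leading "v"
# and any leading zeroes are dropped when the version is normalized.
print(Version("v2023.12.01"))  # -> 2023.12.1
```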

For Future Consideration

  • Ideally we would have some more cautious flow control in the release process. As it is, once you push the tag, all of these things kick off, but there should really be dependencies between them. Right now if the build fails for some reason, the data won't be released, but the software would still be pushed out.
  • How can we make the builds more reproducible? And also faster? Really, EVERY versioned release should be based on a recent successful nightly build. That means that to do a release we don't actually need to do a build -- we just need to know which past nightly-YYYY-MM-DD commit to tag with vYYYY.MM.DD, and then we should be able to look up its nightly build outputs and distribute them without needing to do a build at all. Doing a release would then only take a few minutes, and could hopefully be done on a GitHub runner. The biggest single file we need to distribute right now is CEMS, at 6 GB, so if we're downloading from S3 and re-uploading one file at a time to Zenodo, that shouldn't be a problem. With the whole release being under 10 GB we could probably download it all and re-upload it, but that probably won't be true forever. Or we could just do releases on a bigger runner with more disk. They should only take as long to run as it takes to download and re-upload the data. Or maybe in the new Zenodo API there's some way to copy directly from cloud to cloud without downloading the files locally at all? A boy can dream.
  • If we tag the Docker containers that are used to run the nightly builds intelligently, we could also pull and archive the exact container that was used for a given release.
  • We could get rid of the need for the giant AWS CLI install in the Docker container if we instead wrote a small Python script that uses fsspec to transfer the local build outputs up to both S3 and GCS (see the sketch after this list). Seems like something to integrate into the pythonization of the build / deploy script.
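
A minimal sketch of that fsspec idea, assuming the s3fs and gcsfs backends are installed; the local path and bucket prefixes here are placeholders, not our real configuration:

```python
import fsspec

# Hypothetical locations; the real build / deploy script would take
# these as CLI arguments or configuration.
local_outputs = "pudl_output"
destinations = ["s3://pudl.catalyst.coop/nightly/", "gs://pudl.catalyst.coop/nightly/"]

for dest in destinations:
    fs = fsspec.filesystem(fsspec.utils.get_protocol(dest))
    # Recursively mirror the local build outputs under the remote prefix.
    fs.put(local_outputs, dest, recursive=True)
```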

Tasks

  1. release zenodo (zaneselvans)
  2. nightly-builds (zaneselvans)
  3. nightly-builds, 9 of 9 tasks done (jdangerx, zaneselvans)
  4. release zenodo, 0 of 4 tasks done (jdangerx)
@bendnorman (Member, Author) commented Jul 26, 2023

We currently have space limitations with the S3 bucket (100 GB), so we need to decide how often we will do data releases and how long we will retain old releases. I asked the AWS folks whether there is a ceiling on how much additional storage we can request. They didn't tell me the limit, but they said they can increase our storage to 1 TB given the size and frequency of our data releases.

Our data releases are currently about 20 GB, though it's likely they will get bigger! We could cut down on the size by:

  • converting tables created by simple joins to SQL views
  • distributing either the partitioned or the monolithic CEMS Parquet files, rather than both

If we assume our releases grow to 40 GB (more output tables and new datasets!), we could keep 25 data releases available at once -- roughly two data releases a month retained for a full year (quick arithmetic below). This sounds reasonable to me. We'll have to make it clear that releases expire from S3 after a year. If users need to depend on old versions, they can always pull the data from Zenodo, or from the GCS bucket using requester pays.
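
Spelling out that arithmetic, with the quota and projected release size from above as the assumptions:

```python
quota_gb, release_gb, releases_per_month = 1000, 40, 2
max_releases = quota_gb // release_gb     # 25 releases in the bucket at once
print(max_releases / releases_per_month)  # -> 12.5 months of retention
```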

@e-belfer e-belfer moved this from Icebox to Backlog in Catalyst Megaproject Aug 24, 2023
@zaneselvans zaneselvans added the datapkg, zenodo, release, and metadata labels and removed the members label Aug 24, 2023
@zaneselvans (Member) commented

I think if we get to putting out monthly (or even quarterly) data releases we will be in great shape! So it seems like we should easily be able to store at least a couple of years' worth of releases in free S3 buckets, with Zenodo as the free, public, citation-friendly cold storage for anything older.

To start testing a script for pushing releases to Zenodo, we might initially use their Sandbox server and, rather than only pushing on a tagged release, push on any nightly build that succeeds off of dev -- that way we'd have lots of opportunities for refinement. I guess the script itself should probably just take inputs of (a rough sketch follows the list):

  • a cloud storage location as the source (a prefix containing a "directory" of outputs)
  • necessary credentials for the cloud storage source and Zenodo destination
  • a Zenodo concept DOI indicating what archive should be updated with a new version of the new outputs
  • some deposition-level metadata that should be associated with the new Zenodo archive. Ideally I think this would be assembled programmatically based on the metadata that we're generating to annotate the data release itself, but maybe that's for later.
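
A rough sketch of what that script's core loop might look like against Zenodo's legacy deposit REST API. The endpoints are documented, but everything else (names, env vars, creating a brand-new deposition rather than versioning an existing concept DOI) is an assumption for illustration:

```python
import os

import fsspec
import requests

# Assumed inputs, per the list above; all values are placeholders.
SOURCE = "gs://builds.catalyst.coop/nightly-outputs"
ZENODO_URL = "https://sandbox.zenodo.org"  # Sandbox server for testing
TOKEN = os.environ["ZENODO_SANDBOX_TOKEN"]

# Create a new deposition. A real script would look up the concept DOI
# and POST to .../actions/newversion to version the existing archive.
resp = requests.post(
    f"{ZENODO_URL}/api/deposit/depositions",
    params={"access_token": TOKEN},
    json={},
)
resp.raise_for_status()
bucket_url = resp.json()["links"]["bucket"]

# Stream each build output from cloud storage straight into the deposition
# bucket one file at a time, so nothing has to fit on local disk. Assumes
# a flat prefix with no subdirectories.
fs = fsspec.filesystem("gs")
for path in fs.ls(SOURCE):
    fname = path.rsplit("/", 1)[-1]
    with fs.open(path, "rb") as f:
        requests.put(
            f"{bucket_url}/{fname}",
            data=f,
            params={"access_token": TOKEN},
        ).raise_for_status()
```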

We don't have much in the way of software distribution infrastructure right now. We'll keep tagging the commits associated with persistent data releases, and those tags will keep getting archived on Zenodo automatically. Do we want to keep pushing catalystcoop.pudl to PyPI / conda-forge if we really consider it an application? I guess maybe not.

I don't think generating datapackage.json for pudl.sqlite and all of our other outputs is a blocking concern for moving to data releases, but we should definitely do it at some point, and IIRC we will want to update to v5 of the Frictionless Framework to make use of its direct SQLite metadata annotations (a guess at what that could look like is sketched below).
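
For whenever we do get to it, a guess at what that might look like; I haven't verified that Frictionless v5's describe handles SQLite URLs exactly this way, so check the invocation against the v5 docs:

```python
from frictionless import describe

# Assumption: v5 can introspect all the tables in a SQLite database
# given a SQLAlchemy-style URL (its SQL support is plugin-based, so the
# exact control options may differ).
package = describe("sqlite:///pudl.sqlite", type="package")
package.to_json("datapackage.json")
```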

@zaneselvans zaneselvans self-assigned this Dec 4, 2023
@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Dec 5, 2023
@zaneselvans zaneselvans linked a pull request (8 tasks) Dec 5, 2023 that will close this issue
@zaneselvans zaneselvans changed the title Set up automation to handle data only releases Automate the data release process Dec 5, 2023
@zaneselvans zaneselvans removed a link to a pull request (8 tasks) Dec 6, 2023
@zaneselvans zaneselvans linked a pull request (8 tasks) Dec 6, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Dec 14, 2023
@zaneselvans zaneselvans reopened this Dec 14, 2023
@zaneselvans zaneselvans moved this from Done to In progress in Catalyst Megaproject Dec 14, 2023
@zaneselvans zaneselvans added this to the v2024.01 milestone Jan 12, 2024
@jdangerx (Member) commented

Minimal changes required:

  • make pypi-release depend on pypi-test-release
  • don't auto-publish to production Zenodo, but do try to upload a draft to production on a v202x build
  • create release runbook
    • close out release the night before
    • tell people you're planning on releasing
    • push tag, make sure it points at nightly-20whatever
    • verify: RTD, distribution buckets, Zenodo

Nice-to-have:

  • on push to v202x, don't trigger a whole ETL/test/build process; just go find the outputs from the corresponding nightly build in builds.catalyst.coop (sketched below)
  • create github issue to review the new prod deposition
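
A guess at how that lookup could work, treating builds.catalyst.coop as a GCS bucket and assuming nightly outputs live under nightly-YYYY-MM-DD/ prefixes, per the naming scheme discussed above:

```python
import fsspec

def nightly_prefix_for_tag(tag: str, bucket: str = "gs://builds.catalyst.coop") -> str:
    """Map a release tag like v2023.12.01 to its assumed nightly build prefix."""
    date = tag.lstrip("v").replace(".", "-")  # v2023.12.01 -> 2023-12-01
    return f"{bucket}/nightly-{date}"

# List the nightly outputs that a v2023.12.01 release would redistribute.
fs = fsspec.filesystem("gs")
outputs = fs.ls(nightly_prefix_for_tag("v2023.12.01"))
```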

@zaneselvans zaneselvans moved this from In progress to Backlog in Catalyst Megaproject Jan 24, 2024
@zaneselvans zaneselvans removed this from the v2024.01 milestone Jan 25, 2024
@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Jan 29, 2024
@zaneselvans (Member) commented

I've pulled the nice-to-haves into #3326 so we can close this issue.

@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Jan 31, 2024
@zaneselvans zaneselvans added this to the v2024.01 milestone Jan 31, 2024