
Automate the data release process #2756

Closed · 13 tasks done · Tracked by #3061
bendnorman opened this issue Jul 26, 2023 · 4 comments · Fixed by #3124 or #3158
Labels:
  • datapkg: Frictionless data package input, output, metadata, manipulation
  • metadata: Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages.
  • release: Tasks directly related to data and software releases.
  • zenodo: Issues having to do with Zenodo data archiving and retrieval.

Comments

@bendnorman (Member) commented Jul 26, 2023

Once #1973 and the implementation of #2517 are complete, we can move to data-only releases. There is some automation we'd like to create to make this process as smooth as possible.

The plan right now is to do a semi-manual data release based on the v2023.12.01 tag and take notes on the process, so that within the next 2 weeks we can do an automated release containing all the post-rename tables. It's just a draft at the moment, but the v2023.12.01 release will be available at:

Resolved Questions

  • The auto-generated release notes on the GitHub release got truncated, but regenerating them manually did not truncate them. I have modified our release configuration to exclude all of the bot PRs, and hopefully we will never wait a year between releases again, so this shouldn't be a problem anymore.

Open Questions

  • Do we want an "oops" wait time or required reviews on the release? I've already restricted it so that deploys are only possible from main, and only if the tag starts with v20*.
  • Do we or don't we want to push releases to PyPI? It's an obvious place to look for things, but we're publishing an application, not a library. If we do publish released software to PyPI, should we also continue updating on conda-forge? I'm inclined to find a low-overhead (but not guaranteed to work / be reproducible) way to do this, just so the package can be installed without git reference gymnastics.
  • If we continue doing software releases, note that PyPI removes leading zeroes from the calendar version, turning v2023.12.01 into v2023.12.1. However, GitHub and the outputs on GCS & S3 don't do this, and the tag we applied on GitHub is v2023.12.01, so this could result in some confusion / annoyance. Do we want to switch to the no-leading-zeroes form of CalVer? (The snippet after this list demonstrates the normalization.)
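
For reference, that renaming is PEP 440 normalization rather than a PyPI quirk, and it's easy to demonstrate with the packaging library (the same one pip uses):

```python
from packaging.version import Version

# PEP 440 treats each release segment as an integer, so the leading "v"
# and any leading zeroes are dropped when the version is normalized.
print(Version("v2023.12.01"))  # -> 2023.12.1
```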

For Future Consideration

  • Ideally we would have some more cautious flow control in the release process. As it is, once you push the tag, all of these things kick off, but there should really be dependencies between them. Right now if the build fails for some reason, the data won't be released, but the software would still be pushed out.
  • How can we make the builds more reproducible? And also faster? Really, EVERY versioned release should be based on a recent successful nightly build. That means that to do a release we don't actually need to do a build -- we just need to know which past nightly-YYYY-MM-DD commit to tag with vYYYY.MM.DD, and then we should be able to look up its nightly build outputs and distribute them without needing to do a build at all. Doing a release would then only take a few minutes, and could hopefully be done on a GitHub runner. The biggest single file we need to distribute right now is CEMS, at 6 GB, so if we're downloading from S3 and re-uploading one file at a time to Zenodo, that shouldn't be a problem. With the whole release being under 10 GB we could probably download it all and re-upload it, but that probably won't be true forever. Or we could just do releases on a bigger runner with more disk. They should only take as long to run as it takes to download and re-upload the data. Or maybe in the new Zenodo API there's some way to copy directly from cloud to cloud without downloading the files locally at all? A boy can dream.
  • If we tag the Docker containers that are used to run the nightly builds intelligently, we could also pull and archive the exact container that was used for a given release.
  • We could get rid of the need for the giant AWS CLI install in the Docker container if we instead wrote a small Python script that uses fsspec to transfer the local build outputs up to both S3 and GCS (see the sketch after this list). Seems like something to integrate into the pythonization of the build / deploy script.
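
A minimal sketch of that fsspec idea, assuming the s3fs and gcsfs backends are installed; the local path and bucket prefixes here are placeholders, not our real configuration:

```python
import fsspec

# Hypothetical locations; the real build / deploy script would take
# these as CLI arguments or configuration.
local_outputs = "pudl_output"
destinations = ["s3://pudl.catalyst.coop/nightly/", "gs://pudl.catalyst.coop/nightly/"]

for dest in destinations:
    fs = fsspec.filesystem(fsspec.utils.get_protocol(dest))
    # Recursively mirror the local build outputs under the remote prefix.
    fs.put(local_outputs, dest, recursive=True)
```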

Tasks

  1. release zenodo (zaneselvans)
  2. nightly-builds (zaneselvans)
  3. nightly-builds, 9 of 9 tasks done (jdangerx, zaneselvans)
  4. release zenodo, 0 of 4 tasks done (jdangerx)
@bendnorman (Member, Author) commented Jul 26, 2023

We currently have space limitations with the S3 bucket (100 GB), so we need to decide how often we will do data releases and how long we will retain old releases. I asked the AWS folks whether there is a ceiling on how much additional storage we can request. They didn't tell me the limit, but they said they can increase our storage to 1 TB given the size and frequency of our data releases.

Our data releases are currently about 20 GB, though it's likely they will get bigger! We could cut down on the size by:

  • converting tables created by simple joins to SQL views
  • distributing either the partitioned or the monolithic CEMS Parquet files, rather than both

If we assume our releases grow to 40 GB (more output tables and new datasets!), we could keep 25 data releases available at once -- roughly two data releases a month retained for a full year (quick arithmetic below). This sounds reasonable to me. We'll have to make it clear that releases expire from S3 after a year. If users need to depend on old versions, they can always pull the data from Zenodo, or from the GCS bucket using requester pays.
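
Spelling out that arithmetic, with the quota and projected release size from above as the assumptions:

```python
quota_gb, release_gb, releases_per_month = 1000, 40, 2
max_releases = quota_gb // release_gb     # 25 releases in the bucket at once
print(max_releases / releases_per_month)  # -> 12.5 months of retention
```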

@e-belfer e-belfer moved this from Icebox to Backlog in Catalyst Megaproject Aug 24, 2023
@zaneselvans zaneselvans added the datapkg, zenodo, release, and metadata labels and removed the members label Aug 24, 2023
@zaneselvans (Member) commented

I think if we get to putting out monthly (or even quarterly) data releases we will be in great shape! So it seems like we should easily be able to store at least a couple of years' worth of releases in free S3 buckets, with Zenodo as the free, public, citation-friendly cold storage for anything older.

To start testing a script for pushing releases to Zenodo, we might initially use their Sandbox server and, rather than only pushing on a tagged release, push on any nightly build that succeeds off of dev -- that way we'd have lots of opportunities for refinement. I guess the script itself should probably just take inputs of (a rough sketch follows the list):

  • a cloud storage location as the source (a prefix containing a "directory" of outputs)
  • necessary credentials for the cloud storage source and Zenodo destination
  • a Zenodo concept DOI indicating what archive should be updated with a new version of the new outputs
  • some deposition-level metadata that should be associated with the new Zenodo archive. Ideally I think this would be assembled programmatically based on the metadata that we're generating to annotate the data release itself, but maybe that's for later.
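
A rough sketch of what that script's core loop might look like against Zenodo's legacy deposit REST API. The endpoints are documented, but everything else (names, env vars, creating a brand-new deposition rather than versioning an existing concept DOI) is an assumption for illustration:

```python
import os

import fsspec
import requests

# Assumed inputs, per the list above; all values are placeholders.
SOURCE = "gs://builds.catalyst.coop/nightly-outputs"
ZENODO_URL = "https://sandbox.zenodo.org"  # Sandbox server for testing
TOKEN = os.environ["ZENODO_SANDBOX_TOKEN"]

# Create a new deposition. A real script would look up the concept DOI
# and POST to .../actions/newversion to version the existing archive.
resp = requests.post(
    f"{ZENODO_URL}/api/deposit/depositions",
    params={"access_token": TOKEN},
    json={},
)
resp.raise_for_status()
bucket_url = resp.json()["links"]["bucket"]

# Stream each build output from cloud storage straight into the deposition
# bucket one file at a time, so nothing has to fit on local disk. Assumes
# a flat prefix with no subdirectories.
fs = fsspec.filesystem("gs")
for path in fs.ls(SOURCE):
    fname = path.rsplit("/", 1)[-1]
    with fs.open(path, "rb") as f:
        requests.put(
            f"{bucket_url}/{fname}",
            data=f,
            params={"access_token": TOKEN},
        ).raise_for_status()
```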

We don't have much in the way of software distribution infrastructure right now. We'll keep tagging the commits associated with persistent data releases, and those tags will keep getting archived on Zenodo automatically. Do we want to keep pushing catalystcoop.pudl to PyPI / conda-forge if we really consider it an application? I guess maybe not.

I don't think generating datapackage.json for pudl.sqlite and all of our other outputs is a blocking concern for moving to data releases, but we should definitely do it at some point, and IIRC we will want to update to v5 of the Frictionless Framework to make use of its direct SQLite metadata annotations (a guess at what that could look like is sketched below).
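
For whenever we do get to it, a guess at what that might look like; I haven't verified that Frictionless v5's describe handles SQLite URLs exactly this way, so check the invocation against the v5 docs:

```python
from frictionless import describe

# Assumption: v5 can introspect all the tables in a SQLite database
# given a SQLAlchemy-style URL (its SQL support is plugin-based, so the
# exact control options may differ).
package = describe("sqlite:///pudl.sqlite", type="package")
package.to_json("datapackage.json")
```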

@zaneselvans zaneselvans self-assigned this Dec 4, 2023
@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Dec 5, 2023
@zaneselvans zaneselvans linked a pull request (8 tasks) Dec 5, 2023 that will close this issue
@zaneselvans zaneselvans changed the title Set up automation to handle data only releases Automate the data release process Dec 5, 2023
@zaneselvans zaneselvans removed a link to a pull request (8 tasks) Dec 6, 2023
@zaneselvans zaneselvans linked a pull request (8 tasks) Dec 6, 2023 that will close this issue
@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Dec 14, 2023
@zaneselvans zaneselvans reopened this Dec 14, 2023
@zaneselvans zaneselvans moved this from Done to In progress in Catalyst Megaproject Dec 14, 2023
@zaneselvans zaneselvans added this to the v2024.01 milestone Jan 12, 2024
@jdangerx (Member) commented

Minimal changes required:

  • make pypi-release depend on pypi-test-release
  • don't auto-publish to production Zenodo, but do try to upload a draft to production on a v202x build
  • create release runbook
    • close out release the night before
    • tell people you're planning on releasing
    • push tag, make sure it points at nightly-20whatever
    • verify: RTD, distribution buckets, Zenodo

Nice-to-have:

  • on push to v202x, don't trigger a whole ETL/test/build process; just go find the outputs from the corresponding nightly build in builds.catalyst.coop (sketched below)
  • create github issue to review the new prod deposition
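
A guess at how that lookup could work, treating builds.catalyst.coop as a GCS bucket and assuming nightly outputs live under nightly-YYYY-MM-DD/ prefixes, per the naming scheme discussed above:

```python
import fsspec

def nightly_prefix_for_tag(tag: str, bucket: str = "gs://builds.catalyst.coop") -> str:
    """Map a release tag like v2023.12.01 to its assumed nightly build prefix."""
    date = tag.lstrip("v").replace(".", "-")  # v2023.12.01 -> 2023-12-01
    return f"{bucket}/nightly-{date}"

# List the nightly outputs that a v2023.12.01 release would redistribute.
fs = fsspec.filesystem("gs")
outputs = fs.ls(nightly_prefix_for_tag("v2023.12.01"))
```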

@zaneselvans zaneselvans moved this from In progress to Backlog in Catalyst Megaproject Jan 24, 2024
@zaneselvans zaneselvans removed this from the v2024.01 milestone Jan 25, 2024
@jdangerx jdangerx moved this from Backlog to In progress in Catalyst Megaproject Jan 29, 2024
@zaneselvans (Member) commented

I've pulled the nice-to-haves into #3326 so we can close this issue.

@github-project-automation github-project-automation bot moved this from In progress to Done in Catalyst Megaproject Jan 31, 2024
@zaneselvans zaneselvans added this to the v2024.01 milestone Jan 31, 2024