Build & publish images through GHA #105
Conversation
One question that isn't really contained in these code changes: the current docker build commands we're using to make the images use --squash. Is there any way to mimic that through the docker/build-push-action step?
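For reference, a rough sketch of what the equivalent step might look like (paths, tags, and build args here are placeholders; as far as I can tell the action drives buildx, which doesn't take the classic builder's --squash flag, so a multi-stage Dockerfile that copies the final filesystem into a fresh stage is one common workaround):

- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    file: ./Dockerfile                      # placeholder path
    push: true
    tags: example-org/example-image:latest  # placeholder tag
    build-args: |
      RAPIDS_VER=24.12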
schedule:
  - cron: "0 1 * * *"  # Daily “At 01:00” UTC
workflow_dispatch:
Not sure if there's a preference on how something like this would get triggered on a schedule? Ideally would like to have the status checks broadcast on Slack to improve visibility, so happy to move in whatever direction makes that easiest with our current devops stack
This looks ok to me. cc @raydouglass for additional review
FWIW this post might be relevant for your slack notification needs: https://github.blog/changelog/2022-12-06-github-actions-workflow-notifications-in-slack-and-microsoft-teams/
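If that route works for you, IIRC the subscription is managed from the Slack side with something along these lines (the exact filter syntax is an assumption on my part - worth double-checking against that post):

/github subscribe rapidsai/dask-build-environment workflows:{event:"schedule"}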
Thanks for this! I left some comments for your consideration. I think there are some opportunities to significantly simplify this. Please let me know if you'd like to pair on that.
return_to_outputs: false
file: .github/workflows/build.yml
export_to_envs: false
return_to_outputs: true
This was the only use of ci/axis.dask.yaml (see git grep axis). That file should be deleted as part of this PR.
Probably should've provided more clarity around this in the PR description, but this file is actually getting pointed to by some of our Jenkins infra, which I'm assuming would start failing if we deleted it.
My thinking was that we would:
- merge this in, ensure that the GHA image publishing is working as expected alongside the Jenkins job
- make the necessary internal changes to drop the scheduled Jenkins runs, which should allow us to delete the above linked file
- drop the contents of ci/** in this repo as a follow up once the Jenkins runs have been turned off
Does that seem like a reasonable approach? Happy to do things a different way, and/or add more writing here to clarify the other ops work that should be considered a follow up to this PR.
ooooo thank you for that explanation! Yes I totally support that plan, I'd forgotten about the Jenkins half of this.
.github/workflows/update-gpuci.yml (outdated)
- name: Get old RAPIDS / UCX-Py versions
  env:
    OLD_RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_0 }}
    RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_1 }}
I'm uncomfortable with implicitly depending on the ordering of the matrix.rapids entry here. I could see someone changing ["24.10", "24.12"] to ["25.02", "24.12"] in a PR and not realizing that that was breaking some of the assumptions here.
I also think it's wasteful (and a source of unnecessary risk of failed builds due to network issues) to have all these calls to https://version.gpuci.io/rapids. The response from that endpoint should change only when the RAPIDS version changes. And if you're only tracking 2 RAPIDS versions, I think it'd be simpler to just hard-code the corresponding ucx-py version.
Would you consider something like the following over in build.yml? At the top level of the workflow:
env:
  OLD_RAPIDS_VER: "24.10"
  OLD_UCX_PY_VER: "0.40"
  RAPIDS_VER: "24.12"
  UCX_PY_VER: "0.41"

jobs:
  docker:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        cuda: ["11.8.0"]
        python: ["3.10", "3.11", "3.12"]
        linux: ["ubuntu20.04"]
        rapids: [${{ env.OLD_RAPIDS_VER }}, ${{ env.RAPIDS_VER }}]
And then passing those values as outputs through to whatever needs them in this update-gpuci.yml? That'd remove that source of network calls, and remove this implicit reliance on the ordering of the matrix entries for correctness.
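To make the outputs-passing part concrete, a rough sketch (job, step, and output names here are just illustrative, not what's in this PR):

env:
  RAPIDS_VER: "24.12"
  UCX_PY_VER: "0.41"

jobs:
  versions:
    runs-on: ubuntu-latest
    outputs:
      rapids_ver: ${{ steps.vars.outputs.rapids_ver }}
      ucx_py_ver: ${{ steps.vars.outputs.ucx_py_ver }}
    steps:
      - name: Export hard-coded versions
        id: vars
        run: |
          # write the top-level env values out as job outputs for downstream jobs
          echo "rapids_ver=${RAPIDS_VER}" >> "${GITHUB_OUTPUT}"
          echo "ucx_py_ver=${UCX_PY_VER}" >> "${GITHUB_OUTPUT}"

A downstream job could then read them via needs.versions.outputs.rapids_ver without any network calls.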
I think that'd also allow you to get rid of all this other machinery:
dask-build-environment/.github/workflows/update-gpuci.yml
Lines 38 to 44 in 8977069
- name: Get latest UCX-Py nightly version
  id: ucx_py_latest
  uses: jacobtomlinson/[email protected]
  with:
    org: "rapidsai-nightly"
    package: "ucx-py"
    version_system: "CalVer"
There's exactly 1 ucx-py version per RAPIDS release, and these builds seem to assume they'll only run for exactly 2 RAPIDS versions... seems like a great opportunity to replace some complicated and custom machinery with a little bit of hard-coding that'll be updated infrequently.
Yeah, I definitely think the ordering concern alone is sufficient to hardcode these values somewhere other than the build matrix itself; it shouldn't be too difficult to refactor the builds to rely on an internal version mapping instead of https://version.gpuci.io/rapids.

I think that'd also allow you to get rid of all this other machinery

Were you thinking of replacing this step with a request to https://version.gpuci.io/rapids to fetch the new UCX-Py version instead? That could work. IIRC I opted to do it this way because cuDF and UCX-Py nightlies became available at different times in the RAPIDS release process, so this ensured that both were available before opening a PR that would inevitably fail at the conda solve step (though this may be less of a problem now with PR checks running to make PRs like that visibly "blocked" from merging).
Another minor annoyance here I didn't foresee - it looks like the env context is not available within the matrix block, so the example you shared above triggers a syntax error:
https://github.com/rapidsai/dask-build-environment/actions/runs/11114625627/workflow
Was able to get around this by using some hardcoded representative values in the build matrix, which we then manually map to the hardcoded RAPIDS/ucx-py versions in 0af7f24.
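For anyone following along, the shape of that mapping is roughly the following (labels and version numbers here are illustrative - the real values are in 0af7f24):

strategy:
  matrix:
    rapids: ["old", "new"]   # representative labels, not real version numbers
steps:
  - name: Map matrix label to RAPIDS / UCX-Py versions
    run: |
      case "${{ matrix.rapids }}" in
        old)
          echo "RAPIDS_VER=24.10" >> "${GITHUB_ENV}"
          echo "UCX_PY_VER=0.40" >> "${GITHUB_ENV}"
          ;;
        new)
          echo "RAPIDS_VER=24.12" >> "${GITHUB_ENV}"
          echo "UCX_PY_VER=0.41" >> "${GITHUB_ENV}"
          ;;
        *)
          echo "Invalid RAPIDS version '${{ matrix.rapids }}'"
          exit 1
          ;;
      esac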
were you thinking of replacing this step

I looked back through this workflow more closely to be sure I understood what it was doing. I didn't understand that its job was to trigger new PRs here, so some of my earlier comments about it might not make sense.

Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected"?
dask-build-environment/.github/workflows/update-gpuci.yml
Lines 86 to 98 in 8977069
- name: Create pull request
  if: ${{ env.RAPIDS_VER != env.NEW_RAPIDS_VER && env.UCX_PY_VER != env.NEW_UCX_PY_VER }}
  uses: peter-evans/create-pull-request@v6
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
    commit-message: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`, `UCX_PY_VER` to `${{ env.NEW_UCX_PY_VER }}`"
    title: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`"
    author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
    branch: "upgrade-gpuci-rapids"
    body: |
      New cuDF and ucx-py nightly versions have been detected.
      Updated `dask.yaml` and `run.sh` to use `RAPIDS_VER=${{ env.NEW_RAPIDS_VER }}` and `UCX_PY_VER=${{ env.NEW_UCX_PY_VER }}`.
Like #104?

If so... I'd just remove this entire update-gpuci.yml workflow. Maybe things were different back when it was introduced in #10, but today the amount of manual work is so small and happens so infrequently that I just don't think all this complexity in GitHub Actions is worth it. There's a new RAPIDS release once every 2 months, and every RAPIDS release corresponds to exactly one ucx-py release.
Specifically, I'm proposing:
- remove the update-gpuci.yml workflow completely
- ask RAPIDS ops if updating this repo can be a part of the RAPIDS release checklist
  - (optional): automate those updates via a ci/release/update-version.sh, like most other RAPIDS repos do, e.g. https://github.com/rapidsai/docker/blob/branch-24.10/ci/release/update-version.sh

If you want to keep this automated, then ignore my other comments about update-gpuci and leave the network calls to get those versions as-is.
looks like the env context is not available within the matrix block
blegh ok, sorry for misleading you there. I like the solution you came up with!
Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected"?

Pretty much! Only thing I'd add is that it's specifically when both a new cuDF release and a new ucx-py release are detected, not just one or the other.
ask RAPIDS ops if updating this repo can be a part of the RAPIDS release checklist
Definitely a strong +1 on adding something like this to a checklist to increase the visibility of the ops work around Dask GPU CI. Worth noting that there are similar workflows in the respective Dask repos to automate this version bump too:
- https://github.com/dask/dask/blob/main/.github/workflows/update-gpuci.yml
- https://github.com/dask/distributed/blob/main/.github/workflows/update-gpuci.yaml
These reflect other steps that could probably be done manually as part of a checklist - maybe we should file an issue here to track this switch?
Sure! Up to you and the other Dask folks how you want to handle it. Also fine to tell me "nah we'll keep this as-is", I don't feel that strongly about it. Just right now, in this moment, this feels like a lot of complexity to automate something that changes very infrequently.
Now that I understand it more deeply, I really think that the update-gpuci.yml workflow should just be removed completely, in favor of manual action once every 2 months (each time there's a new RAPIDS release).

But that conversation doesn't need to block this PR. Changes look good to me, thanks for working through some of my suggestions.
echo "Invalid RAPIDS version '${{ matrix.rapids }}'" | ||
exit 1 | ||
;; | ||
esac |
Very nice way to solve this! I love this.
Would've thought the approval from @jameslamb would be sufficient to unblock merging, but it looks like he isn't a member of the
Approving the general direction of this PR. Deferring to @jameslamb 's technical approval.
Working on refactoring the existing infra in this repo to build and publish through a GHA run, which theoretically should give us a lot more flexibility in automating / publishing status checks of builds.