
Build & publish images through GHA #105

Merged · 27 commits into rapidsai:main · Oct 4, 2024

Conversation

charlesbluca (Member)

Working on refactoring the existing infra in this repo to build and publish through a GHA run, which should theoretically give us a lot more flexibility in automating builds and publishing their status checks.

@charlesbluca marked this pull request as ready for review September 27, 2024 18:19
@charlesbluca requested a review from a team as a code owner September 27, 2024 18:19
charlesbluca (Member Author) left a comment

One question that isn't really contained in these code changes: the current `docker build` commands we use to make the images pass `--squash`. Is there any way to mimic that through the `docker/build-push-action` step?
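(For context, one commonly suggested workaround - a sketch, not necessarily what we'd end up using here, since BuildKit-based builds don't appear to expose `--squash`: do the whole build in one stage, then copy the resulting filesystem into a fresh stage so the final image has a single layer. The base image below is purely illustrative.)

```dockerfile
# Illustrative only: build everything in one stage...
FROM ubuntu:20.04 AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# ...remainder of the environment build goes here...

# ...then flatten the whole filesystem into a single-layer final stage,
# roughly equivalent to `docker build --squash`. Note that image metadata
# (ENV, labels, etc.) from the builder stage must be re-declared here.
FROM scratch
COPY --from=builder / /
CMD ["/bin/bash"]
```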

Comment on lines +8 to +10
```yaml
schedule:
  - cron: "0 1 * * *" # daily at 01:00 UTC
workflow_dispatch:
```
charlesbluca (Member Author)

Not sure if there's a preference on how something like this should get triggered on a schedule? Ideally I'd like to have the status checks broadcast on Slack to improve visibility, so I'm happy to move in whatever direction makes that easiest with our current devops stack.


This looks ok to me. cc @raydouglass for additional review

FWIW this post might be relevant for your Slack notification needs: https://github.blog/changelog/2022-12-06-github-actions-workflow-notifications-in-slack-and-microsoft-teams/
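As a rough sketch of what that could look like - assuming a `SLACK_WEBHOOK_URL` repository secret wired to a Slack incoming webhook (the secret name and message text are assumptions here):

```yaml
- name: Notify Slack of scheduled build failure
  if: ${{ failure() && github.event_name == 'schedule' }}
  uses: slackapi/slack-github-action@v1.27.0
  with:
    # JSON payload posted to the incoming webhook
    payload: |
      {
        "text": "dask-build-environment image build failed: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
      }
  env:
    SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
    SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
```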

.github/workflows/build.yml (outdated, resolved)
dask/Dockerfile (outdated, resolved)
jameslamb (Member) left a comment

Thanks for this! I left some comments for your consideration. I think there are some opportunities to significantly simplify this. Please let me know if you'd like to pair on that.

.github/workflows/build.yml (three outdated, resolved review threads)
```yaml
  return_to_outputs: false
  file: .github/workflows/build.yml
  export_to_envs: false
  return_to_outputs: true
```
jameslamb (Member)

This was the only use of ci/axis.dask.yaml:

```shell
git grep axis
```

That file should be deleted as part of this PR.

charlesbluca (Member Author)

Probably should've provided more clarity around this in the PR description, but this file is actually referenced by some of our Jenkins infra, which I'm assuming would start failing if we deleted it:

https://github.com/rapidsai/gpuci-scripts/blob/aef3a3321f0393b2473214235b0c7b6e3950b1db/seed-job-dsl/jobs-other/dask/dask-build-environment/branch/dask_build_env_main.groovy#L1

My thinking was that we would:

  • merge this in, and ensure that the GHA image publishing works as expected alongside the Jenkins job
  • make the necessary internal changes to drop the scheduled Jenkins runs, which should allow us to delete the file linked above
  • drop the contents of ci/** in this repo as a follow-up once the Jenkins runs have been turned off

Does that seem like a reasonable approach? Happy to do things a different way, and/or add more writing here to clarify the other ops work that should be considered a follow-up to this PR.

jameslamb (Member)

ooooo thank you for that explanation! Yes I totally support that plan, I'd forgotten about the Jenkins half of this.


```yaml
- name: Get old RAPIDS / UCX-Py versions
  env:
    OLD_RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_0 }}
    RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_1 }}
```
jameslamb (Member)

I'm uncomfortable with implicitly depending on the ordering of the matrix.rapids entry here. I could see someone changing `["24.10", "24.12"]` to `["25.02", "24.12"]` in a PR and not realizing that it was breaking some of the assumptions here.

I also think it's wasteful (and a source of unnecessary risk of failed builds due to network issues) to have all these calls to https://version.gpuci.io/rapids. The response from that endpoint should change only when the RAPIDS version changes. And if you're only tracking 2 RAPIDS versions, I think it'd be simpler to just hard-code the corresponding ucx-py version.

Would you consider something like the following over in build.yml? At the top level of the workflow:

```yaml
env:
  OLD_RAPIDS_VER: "24.10"
  OLD_UCX_PY_VER: "0.40"
  RAPIDS_VER: "24.12"
  UCX_PY_VER: "0.41"

jobs:
  docker:
    runs-on: ubuntu-latest

    strategy:
      fail-fast: false
      matrix:
        cuda: ["11.8.0"]
        python: ["3.10", "3.11", "3.12"]
        linux: ["ubuntu20.04"]
        rapids: [${{ env.OLD_RAPIDS_VER }}, ${{ env.RAPIDS_VER }}]
```

And then pass those values as outputs through to whatever needs them in update-gpuci.yml? That'd remove that source of network calls, and remove the implicit reliance on the ordering of the matrix entries for correctness.
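A sketch of the "pass as outputs" half, with assumed job and step names - a small job exports the hard-coded versions so downstream jobs can read them without any network calls:

```yaml
jobs:
  versions:
    runs-on: ubuntu-latest
    outputs:
      rapids_ver: ${{ steps.vars.outputs.rapids_ver }}
      ucx_py_ver: ${{ steps.vars.outputs.ucx_py_ver }}
    steps:
      - name: Export hard-coded versions
        id: vars
        run: |
          echo "rapids_ver=24.12" >> "$GITHUB_OUTPUT"
          echo "ucx_py_ver=0.41" >> "$GITHUB_OUTPUT"
```

A consuming job would then declare `needs: versions` and read `${{ needs.versions.outputs.rapids_ver }}`.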

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd also allow you to get rid of all this other machinery:

```yaml
- name: Get latest UCX-Py nightly version
  id: ucx_py_latest
  uses: jacobtomlinson/[email protected]
  with:
    org: "rapidsai-nightly"
    package: "ucx-py"
    version_system: "CalVer"
```

There's exactly one ucx-py version per RAPIDS release, and these builds seem to assume they'll only run for exactly 2 RAPIDS versions... seems like a great opportunity to replace some complicated, custom machinery with a little bit of hard-coding that'll be updated infrequently.

charlesbluca (Member Author)

Yeah, I definitely think the ordering concern alone is sufficient reason to hardcode these values somewhere other than the build matrix itself - it shouldn't be too difficult to refactor the builds to rely on an internal version mapping instead of https://version.gpuci.io/rapids.

> I think that'd also allow you to get rid of all this other machinery

Were you thinking of replacing this step with a request to https://version.gpuci.io/rapids to fetch the new UCX-Py version instead? That could work - IIRC I opted to do it this way because cuDF and UCX-Py nightlies become available at different times in the RAPIDS release process, so this ensured that both were available before opening a PR that would otherwise inevitably fail at the conda solve step (though this may be less of a problem now that PR checks run, making PRs like that visibly "blocked" from merging).

charlesbluca (Member Author)

Another minor annoyance I didn't foresee here - it looks like the env context is not available within the matrix block, so the example you shared above triggers a syntax error:

https://github.com/rapidsai/dask-build-environment/actions/runs/11114625627/workflow

charlesbluca (Member Author)

Was able to get around this by using some hardcoded representative values in the build matrix, which we then manually map to the hardcoded RAPIDS/ucx-py versions in 0af7f24.
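Roughly, the approach looks like this (a sketch - the "old"/"new" labels and step layout are illustrative, with versions borrowed from the example above; see 0af7f24 for the real mapping):

```yaml
strategy:
  matrix:
    # representative labels, not literal versions, since ${{ env }} can't
    # be referenced inside the matrix block
    rapids: ["old", "new"]
steps:
  - name: Map matrix label to RAPIDS / ucx-py versions
    run: |
      case "${{ matrix.rapids }}" in
        "old")
          echo "RAPIDS_VER=24.10" >> "$GITHUB_ENV"
          echo "UCX_PY_VER=0.40" >> "$GITHUB_ENV"
          ;;
        "new")
          echo "RAPIDS_VER=24.12" >> "$GITHUB_ENV"
          echo "UCX_PY_VER=0.41" >> "$GITHUB_ENV"
          ;;
        *)
          echo "Invalid RAPIDS version '${{ matrix.rapids }}'"
          exit 1
          ;;
      esac
```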

jameslamb (Member)

> were you thinking of replacing this step

I looked back through this workflow more closely to be sure I understood what it was doing. I hadn't understood that its job was to trigger new PRs here, so some of my earlier comments about it might not make sense.

Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected"?

```yaml
- name: Create pull request
  if: ${{ env.RAPIDS_VER != env.NEW_RAPIDS_VER && env.UCX_PY_VER != env.NEW_UCX_PY_VER }}
  uses: peter-evans/create-pull-request@v6
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
    commit-message: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`, `UCX_PY_VER` to `${{ env.NEW_UCX_PY_VER }}`"
    title: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`"
    author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
    branch: "upgrade-gpuci-rapids"
    body: |
      New cuDF and ucx-py nightly versions have been detected.
      Updated `dask.yaml` and `run.sh` to use `RAPIDS_VER=${{ env.NEW_RAPIDS_VER }}` and `UCX_PY_VER=${{ env.NEW_UCX_PY_VER }}`.
```

Like #104?

If so... I'd just remove this entire update-gpuci.yml workflow. Maybe things were different back when it was introduced in #10, but today the amount of manual work is so small and happens so infrequently that I just don't think all this complexity in GitHub Actions is worth it.

There's a new RAPIDS release once every 2 months, and every RAPIDS release corresponds to exactly one ucx-py release.

Specifically, I'm proposing:

  • removing this entire update-gpuci.yml workflow, in favor of manual version bumps
  • asking RAPIDS ops if updating this repo can be a part of the RAPIDS release checklist

If you want to keep this automated, then ignore my other comments about update-gpuci and leave the network calls to get those versions as-is.
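For reference, the network lookup being kept in that case is essentially of this shape (a sketch - the exact response format of the endpoint is an assumption here):

```shell
# query the version service at run time instead of hard-coding
NEW_RAPIDS_VER="$(curl -sSL https://version.gpuci.io/rapids)"
echo "NEW_RAPIDS_VER=${NEW_RAPIDS_VER}" >> "$GITHUB_ENV"
```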

jameslamb (Member)

> looks like the env context is not available within the matrix block

blegh ok, sorry for misleading you there. I like the solution you came up with!

charlesbluca (Member Author)

> Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected"?

Pretty much! Only thing I'd add is that it's specifically when both a new cuDF release and a new ucx-py release are detected, not just one or the other.

> ask RAPIDS ops if updating this repo can be a part of the RAPIDS release checklist

Definitely a strong +1 on adding something like this to a checklist to increase the visibility of the ops work around Dask GPU CI. Worth noting that there are similar workflows in the respective Dask repos to automate this version bump too, which reflect other steps that could probably be done manually as part of a checklist - maybe we should file an issue here to track this switch?

jameslamb (Member)

Sure! Up to you and the other Dask folks how you want to handle it. Also fine to tell me "nah we'll keep this as-is", I don't feel that strongly about it. Just right now, in this moment, this feels like a lot of complexity to automate something that changes very infrequently.

.github/workflows/update-gpuci.yml (resolved)
jameslamb (Member) left a comment

Now that I understand it more deeply, I really think that the update-gpuci.yml workflow should just be removed completely, in favor of manual action once every 2 months (each time there's a new RAPIDS release).

But that conversation doesn't need to block this PR. Changes look good to me, thanks for working through some of my suggestions.

echo "Invalid RAPIDS version '${{ matrix.rapids }}'"
exit 1
;;
esac
jameslamb (Member)

Very nice way to solve this! I love this.

charlesbluca (Member Author)

Would've thought the approval from @jameslamb would be sufficient to unblock merging, but it looks like he isn't a member of the rapidsai/dask-build-environment-write team - bumped https://github.com/rapidsai/ops/issues/3481 to see if we want to add more people to this list.

rjzamora (Member) left a comment

Approving the general direction of this PR. Deferring to @jameslamb's technical approval.

@charlesbluca merged commit b128381 into rapidsai:main Oct 4, 2024
15 checks passed