
Build & publish images through GHA #105

Merged · 27 commits · Oct 4, 2024
(Changes shown are from 17 of the 27 commits.)

Commits:
27c529f
First pass at build/publishes through GHA
charlesbluca Sep 27, 2024
f7c702a
Merge RUN commands to reduce storage footprint of build
charlesbluca Sep 27, 2024
6469620
Modify trigger conditions
charlesbluca Sep 27, 2024
23e9fd9
Run builds on a schedule
charlesbluca Sep 27, 2024
200bfc1
Update GPU CI updating workflow to parse information from/update GHA …
charlesbluca Sep 27, 2024
d1141c1
Use heredocs to clean up multi-line RUN
charlesbluca Sep 30, 2024
2534e61
Fix typo in heredocs
charlesbluca Sep 30, 2024
4a0f0f3
Fix remaining typo in dask/distributed heredocs
charlesbluca Sep 30, 2024
1e7f552
Update dockerhub credentials
charlesbluca Sep 30, 2024
6376f91
Fix ucx-py step to write to GITHUB_OUTPUT
charlesbluca Sep 30, 2024
37a765f
Update image tags
charlesbluca Sep 30, 2024
909c17a
Update image publishing conditions
charlesbluca Sep 30, 2024
1c7f528
Explicit bool comparison instead of truthy value
charlesbluca Sep 30, 2024
fe60490
Test condition parsing
charlesbluca Sep 30, 2024
b9493d5
Use string 'true' instead of boolean
charlesbluca Sep 30, 2024
1b43c7f
Revert publishing condition
charlesbluca Sep 30, 2024
127da2a
Fix matrix exclude rule
charlesbluca Sep 30, 2024
5a31da0
Hardcode old and current RAPIDS/UCX-Py versions
charlesbluca Sep 30, 2024
bbf65f4
Update checkout step version
charlesbluca Sep 30, 2024
8cbe7da
Move build platform to GHA matrix
charlesbluca Sep 30, 2024
3a044dd
Clarify image tag in final report step
charlesbluca Sep 30, 2024
fec0a76
Ensure GPU CI updating workflow only touches .github/workflows/build.yml
charlesbluca Sep 30, 2024
f9dc7b4
Use fromJSON to construct RAPIDS matrix array
charlesbluca Sep 30, 2024
0af7f24
Use hard-coded RAPIDS versions in build matrix
charlesbluca Oct 2, 2024
ec564f3
Update PR message
charlesbluca Oct 2, 2024
f5e395a
Typo in updating workflow
charlesbluca Oct 2, 2024
2067ca9
Rename hardcoded matrix values
charlesbluca Oct 2, 2024
95 changes: 95 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,95 @@
name: Docker build

on:
  push:
    branches:
      - main
  pull_request:
  schedule:
    - cron: "0 1 * * *" # Daily "At 01:00" UTC
  workflow_dispatch:
Comment on lines +8 to +10
Member Author (charlesbluca):

Not sure if there's a preference on how something like this would get triggered on a schedule? Ideally would like to have the status checks broadcast on Slack to improve visibility, so happy to move in whatever direction makes that easiest with our current devops stack

Member:

This looks ok to me. cc @raydouglass for additional review

FWIW this post might be relevant for your slack notification needs: https://github.blog/changelog/2022-12-06-github-actions-workflow-notifications-in-slack-and-microsoft-teams/

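For reference, one shape such a Slack notification could take is a small follow-up job posting to an incoming webhook. This is purely illustrative and not part of the PR; the `notify` job name and the `SLACK_WEBHOOK_URL` secret are assumptions:

```yaml
# Hypothetical follow-up job: broadcast the scheduled build's result to Slack.
notify:
  needs: docker
  if: ${{ always() && github.event_name == 'schedule' }}
  runs-on: ubuntu-latest
  steps:
    - name: Post status to Slack
      run: |
        # needs.docker.result is "success", "failure", "cancelled", or "skipped"
        curl -sX POST -H 'Content-type: application/json' \
          --data "{\"text\": \"dask-build-environment nightly build: ${{ needs.docker.result }}\"}" \
          "${{ secrets.SLACK_WEBHOOK_URL }}"
```

Alternatively, the GitHub changelog post linked above describes subscribing a Slack channel to workflow notifications without any workflow changes.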

env:
  push: ${{ github.repository == 'rapidsai/dask-build-environment' && contains(fromJSON('["push", "schedule", "workflow_dispatch"]'), github.event_name) }}

jobs:
  docker:
    runs-on: ubuntu-latest

    strategy:
      fail-fast: false
      matrix:
        cuda: ["11.8.0"]
        python: ["3.10", "3.11", "3.12"]
        linux: ["ubuntu20.04"]
        rapids: ["24.10", "24.12"]
        image:
          - tag: "rapidsai/dask"
            context: "./dask"
          - tag: "rapidsai/distributed"
            context: "./distributed"
          - tag: "rapidsai/dask_image"
            context: "./dask_image"
        # dask-image gpuCI isn't dependent on RAPIDS
        exclude:
          - image:
              tag: "rapidsai/dask_image"
            rapids: "24.10"

    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to DockerHub
        uses: docker/login-action@v3
        if: ${{ env.push == 'true' }}
        with:
          username: ${{ secrets.GPUCIBOT_DOCKERHUB_USER }}
          password: ${{ secrets.GPUCIBOT_DOCKERHUB_TOKEN }}

      - name: Compute ucx-py version
        id: ucx_py
        env:
          rapids: ${{ matrix.rapids }}
        run: |
          echo "version=$(curl -sL https://version.gpuci.io/rapids/${rapids})" >> $GITHUB_OUTPUT

      - name: Generate tag
        id: tag
        env:
          cuda: ${{ matrix.cuda }}
          python: ${{ matrix.python }}
          linux: ${{ matrix.linux }}
          rapids: ${{ matrix.rapids }}
          image: ${{ matrix.image.tag }}
        run: |
          case ${image} in
            "rapidsai/dask_image") # doesn't depend on RAPIDS / ucx-py for gpuCI
              tag="${image}:cuda${cuda}-devel-${linux}-py${python}"
              ;;
            *)
              tag="${image}:${rapids}-cuda${cuda}-devel-${linux}-py${python}"
              ;;
          esac

          echo "tag=${tag}" >> $GITHUB_OUTPUT

      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: ${{ matrix.image.context }}
          push: ${{ env.push == 'true' }}
          platforms: linux/amd64
          tags: ${{ steps.tag.outputs.tag }}
          build-args: |
            RAPIDS_VER=${{ matrix.rapids }}
            UCX_PY_VER=${{ steps.ucx_py.outputs.version }}
            CUDA_VER=${{ matrix.cuda }}
            LINUX_VER=${{ matrix.linux }}
            PYTHON_VER=${{ matrix.python }}

      - name: Report
        run: echo Built ${{ steps.tag.outputs.tag }}
24 changes: 12 additions & 12 deletions .github/workflows/update-gpuci.yml
@@ -13,19 +13,23 @@ jobs:
     steps:
       - uses: actions/checkout@v4

-      - name: Parse current axis YAML
+      - name: Fetch current RAPIDS versions from build matrix
         id: parse_yaml
         uses: the-coding-turtle/[email protected]
         with:
-          file: ci/axis/dask.yaml
-          export_to_envs: true
-          return_to_outputs: false
+          file: .github/workflows/build.yml
+          export_to_envs: false
+          return_to_outputs: true
Member:

This was the only use of `ci/axis/dask.yaml`:

git grep axis

That file should be deleted as part of this PR.

Member Author (charlesbluca):

Probably should've provided more clarity around this in the PR description, but this file is actually getting pointed to by some of our Jenkins infra, which I'm assuming would start failing if we deleted this file:

https://github.com/rapidsai/gpuci-scripts/blob/aef3a3321f0393b2473214235b0c7b6e3950b1db/seed-job-dsl/jobs-other/dask/dask-build-environment/branch/dask_build_env_main.groovy#L1

My thinking was that we would:

  • merge this in, ensure that the GHA image publishing is working as expected alongside the Jenkins job
  • make the necessary internal changes to drop the scheduled Jenkins runs, which should allow us to delete the above linked file
  • drop the contents of ci/** in this repo as a follow up once the Jenkins runs have been turned off

Does that seem like a reasonable approach? Happy to do things a different way, and/or add more writing here to clarify the other ops work that should be considered a follow up to this PR.

Member:

ooooo thank you for that explanation! Yes I totally support that plan, I'd forgotten about the Jenkins half of this.


       - name: Get old RAPIDS / UCX-Py versions
         env:
           OLD_RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_0 }}
           RAPIDS_VER: ${{ steps.parse_yaml.outputs.jobs_docker_strategy_matrix_rapids_1 }}
Member:

I'm uncomfortable with implicitly depending on the ordering of the matrix.rapids entry here. I could see someone changing ["24.10", "24.12"] to ["25.02", "24.12"] in a PR and not realizing that that was breaking some of the assumptions here.

I also think it's wasteful (and a source of unnecessary risk of failed builds due to network issues) to have all these calls to https://version.gpuci.io/rapids. The response from that endpoint should change only when the RAPIDS version changes. And if you're only tracking 2 RAPIDS versions, I think it'd be simpler to just hard-code the corresponding ucx-py version.

Would you consider something like the following over in build.yml? At the top level of the workflow:

env:
  OLD_RAPIDS_VER: "24.10"
  OLD_UCX_PY_VER: "0.40"
  RAPIDS_VER: "24.12"
  UCX_PY_VER: "0.41"

jobs:
  docker:
    runs-on: ubuntu-latest

    strategy:
      fail-fast: false
      matrix:
        cuda: ["11.8.0"]
        python: ["3.10", "3.11", "3.12"]
        linux: ["ubuntu20.04"]
        rapids: [${{ env.OLD_RAPIDS_VER }}, ${{ env.NEW_RAPIDS_VER }}]

And then passing those values as outputs through to whatever needs it in this update-gpuci.yml? That'd remove that source of network calls, and remove this implicit reliance on the ordering of the matrix entries for correctness.

Member:

I think that'd also allow you to get rid of all this other machinery:

      - name: Get latest UCX-Py nightly version
        id: ucx_py_latest
        uses: jacobtomlinson/[email protected]
        with:
          org: "rapidsai-nightly"
          package: "ucx-py"
          version_system: "CalVer"

There's exactly 1 ucx-py version per RAPIDS release and these builds seem to assume they'll only run for exactly 2 RAPIDS versions... seems like a great opportunity to replace some complicated and custom machinery with a little bit of hard-coding that'll be updated infrequently.

Member Author (charlesbluca):

Yeah, I definitely think the ordering concern alone is sufficient to hardcode these values somewhere other than the build matrix itself; it shouldn't be too difficult to refactor the builds to rely on an internal version mapping instead of https://version.gpuci.io/rapids

> I think that'd also allow you to get rid of all this other machinery

Were you thinking of replacing this step with a request to https://version.gpuci.io/rapids to fetch the new UCX-Py version instead? That could work, IIRC I opted to do it this way because cuDF and UCX-Py nightlies became available at different times in the RAPIDS release process, and so this ensured that both were available before opening a PR that would inevitably fail at the conda solve step (though this may be less of a problem now with PR checks running to make PRs like that visibly "blocked" from merging)

Member Author (charlesbluca):

Another minor annoyance here I didn't foresee - it looks like the env context is not available within the matrix block, so the example you shared above triggers a syntax error:

https://github.com/rapidsai/dask-build-environment/actions/runs/11114625627/workflow

Member Author (charlesbluca):

Was able to get around this by using some hardcoded representative values in the build matrix, that we then later manually map to the hardcoded RAPIDS/ucx-py versions in 0af7f24

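A rough sketch of that workaround (the label values and step below are illustrative, not the exact contents of 0af7f24): the matrix carries stand-in labels, and an early step maps each label to the hard-coded RAPIDS/ucx-py versions, sidestepping the unavailable `env` context:

```yaml
strategy:
  matrix:
    rapids: ["old", "new"]  # representative labels, not real versions

steps:
  - name: Map RAPIDS label to concrete versions
    run: |
      case "${{ matrix.rapids }}" in
        old) echo "RAPIDS_VER=24.10" >> $GITHUB_ENV
             echo "UCX_PY_VER=0.40"  >> $GITHUB_ENV ;;
        new) echo "RAPIDS_VER=24.12" >> $GITHUB_ENV
             echo "UCX_PY_VER=0.41"  >> $GITHUB_ENV ;;
      esac
```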
Member:

> were you thinking of replacing this step

I looked back through this workflow more closely to be sure I understood what it was doing. I didn't understand that its job was to trigger new PRs here, so some of my earlier comments about it might not make sense.

Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected"?

      - name: Create pull request
        if: ${{ env.RAPIDS_VER != env.NEW_RAPIDS_VER && env.UCX_PY_VER != env.NEW_UCX_PY_VER }}
        uses: peter-evans/create-pull-request@v6
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          commit-message: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`, `UCX_PY_VER` to `${{ env.NEW_UCX_PY_VER }}`"
          title: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_RAPIDS_VER }}`"
          author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
          branch: "upgrade-gpuci-rapids"
          body: |
            New cuDF and ucx-py nightly versions have been detected.
            Updated `dask.yaml` and `run.sh` to use `RAPIDS_VER=${{ env.NEW_RAPIDS_VER }}` and `UCX_PY_VER=${{ env.NEW_UCX_PY_VER }}`.

Like #104?

If so... I'd just remove this entire update-gpuci.yml workflow. Maybe things were different back when it was introduced in #10, but today the amount of manual work is so small here and happens so infrequently, I just don't think all this complexity in GitHub Actions is worth it.

There's a new RAPIDS release once every 2 months, and every RAPIDS release corresponds to exactly one ucx-py release.

Specifically, I'm proposing:

If you want to keep this automated, then ignore my other comments about update-gpuci and leave the network calls to get those versions as-is.

Member:

> looks like the env context is not available within the matrix block

blegh ok, sorry for misleading you there. I like the solution you came up with!

Member Author (charlesbluca):

> Am I right that this update-gpuci workflow's role is "open a PR here in dask-build-environment whenever a new RAPIDS release or ucx-py release is detected?"

Pretty much! Only thing I'd add is that it's specifically when both a new cuDF release and a new ucx-py release are detected, not just one or the other.

> ask RAPIDS ops if updating this repo can be a part of the RAPIDS release checklist

Definitely a strong +1 on adding something like this to a checklist to increase the visibility of the ops work around Dask GPU CI. Worth noting that there are similar workflows in the respective Dask repos to automate this version bump too:

Which reflect other steps that could probably be done manually as part of a checklist - maybe we should file an issue here to track this switch?

Member:

Sure! Up to you and the other Dask folks how you want to handle it. Also fine to tell me "nah we'll keep this as-is", I don't feel that strongly about it. Just right now, in this moment, this feels like a lot of complexity to automate something that changes very infrequently.

         run: |
-          echo OLD_RAPIDS_VER=$RAPIDS_VER_0 >> $GITHUB_ENV
-          echo OLD_UCX_PY_VER=$(curl -sL https://version.gpuci.io/rapids/$RAPIDS_VER_0) >> $GITHUB_ENV
-          echo RAPIDS_VER=$RAPIDS_VER_1 >> $GITHUB_ENV
-          echo UCX_PY_VER=$(curl -sL https://version.gpuci.io/rapids/$RAPIDS_VER_1) >> $GITHUB_ENV
+          echo OLD_RAPIDS_VER=$OLD_RAPIDS_VER >> $GITHUB_ENV
+          echo OLD_UCX_PY_VER=$(curl -sL https://version.gpuci.io/rapids/$OLD_RAPIDS_VER) >> $GITHUB_ENV
+          echo RAPIDS_VER=$RAPIDS_VER >> $GITHUB_ENV
+          echo UCX_PY_VER=$(curl -sL https://version.gpuci.io/rapids/$RAPIDS_VER) >> $GITHUB_ENV
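For context on the mechanism above: `$GITHUB_ENV` is just a file of `KEY=value` lines that the runner re-exports as environment variables for later steps. A local simulation, with the `curl` calls to version.gpuci.io replaced by the hard-coded versions mentioned elsewhere in this thread (24.10 → ucx-py 0.40, 24.12 → ucx-py 0.41):

```shell
# Simulate GITHUB_ENV locally; in Actions the runner provides this path.
GITHUB_ENV=$(mktemp)

OLD_RAPIDS_VER="24.10"
RAPIDS_VER="24.12"
echo "OLD_RAPIDS_VER=${OLD_RAPIDS_VER}" >> "$GITHUB_ENV"
echo "OLD_UCX_PY_VER=0.40" >> "$GITHUB_ENV"  # stand-in for curl -sL https://version.gpuci.io/rapids/$OLD_RAPIDS_VER
echo "RAPIDS_VER=${RAPIDS_VER}" >> "$GITHUB_ENV"
echo "UCX_PY_VER=0.41" >> "$GITHUB_ENV"      # stand-in for the second curl call

cat "$GITHUB_ENV"
```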

       - name: Get latest cuDF nightly version
         id: cudf_latest
@@ -54,31 +58,27 @@
       - name: Update new RAPIDS versions
         uses: jacobtomlinson/gha-find-replace@v2
         with:
-          include: 'ci/**'
+          include: '.github/workflows/build.yml'
           find: "${{ env.RAPIDS_VER }}"
           replace: "${{ env.NEW_RAPIDS_VER }}"
           regex: false

       - name: Update old RAPIDS versions
         uses: jacobtomlinson/gha-find-replace@v3
         with:
-          include: 'ci/**'
+          include: '.github/workflows/build.yml'
           find: "${{ env.OLD_RAPIDS_VER }}"
           replace: "${{ env.RAPIDS_VER }}"
           regex: false

       - name: Update new UCX-Py versions
         uses: jacobtomlinson/gha-find-replace@v3
         with:
-          include: 'ci/**'
+          include: '.github/workflows/build.yml'
           find: "${{ env.UCX_PY_VER }}"
           replace: "${{ env.NEW_UCX_PY_VER }}"
           regex: false

       - name: Update old UCX-Py versions
         uses: jacobtomlinson/gha-find-replace@v3
         with:
-          include: 'ci/**'
+          include: '.github/workflows/build.yml'
           find: "${{ env.OLD_UCX_PY_VER }}"
           replace: "${{ env.UCX_PY_VER }}"
           regex: false
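A rough local equivalent of one of these find/replace steps (the file name is hypothetical), illustrating that `regex: false` requests a literal match — a naive `sed` would otherwise treat the `.` in the version string as a wildcard:

```shell
# Hypothetical local equivalent of one gha-find-replace step.
OLD_RAPIDS_VER="24.10"
RAPIDS_VER="24.12"

tmpfile=$(mktemp)
printf 'RAPIDS_VER=%s\n' "$OLD_RAPIDS_VER" > "$tmpfile"

# Escape the '.' so sed matches it literally, mirroring regex: false.
escaped=$(printf '%s' "$OLD_RAPIDS_VER" | sed 's/[.]/\\./g')
sed -i "s/${escaped}/${RAPIDS_VER}/g" "$tmpfile"

cat "$tmpfile"  # RAPIDS_VER=24.12
```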
37 changes: 14 additions & 23 deletions dask/Dockerfile
@@ -11,38 +11,29 @@ ARG UCX_PY_VER=unset

 COPY environment.yml /rapids.yml
 ADD https://raw.githubusercontent.com/dask/dask/main/continuous_integration/environment-$PYTHON_VER.yaml /dask.yml
 ADD https://github.com/rapidsai/gha-tools/releases/latest/download/tools.tar.gz /tools.tar.gz
-RUN tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1 \
-    && rm /tools.tar.gz
-
-RUN conda config --set ssl_verify false
-
-RUN rapids-mamba-retry install conda-merge git
-
-RUN cat /rapids.yml \
+RUN <<EOT
+tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1
+rm /tools.tar.gz
+conda config --set ssl_verify false
+rapids-mamba-retry install conda-merge git
+cat /rapids.yml \
   | sed -r "s/CUDA_VER/$(echo $CUDA_VER | cut -d. -f1,2)/g" \
   | sed -r "s/RAPIDS_VER/${RAPIDS_VER}/g" \
   | sed -r "s/UCX_PY_VER/${UCX_PY_VER}/g" \
   > /rapids_pinned.yml

 # unpin problematic CI dependencies
-RUN cat /dask.yml \
+cat /dask.yml \
   | sed -r "s/pyarrow=/pyarrow>=/g" \
   | sed -r "s/pandas=/pandas>=/g" \
   | sed -r "s/numpy=/numpy>=/g" \
   > /dask_unpinned.yml

-RUN conda-merge /rapids_pinned.yml /dask_unpinned.yml > /dask.yml
-
-RUN rapids-mamba-retry env create -n dask --file /dask.yml
-
-# Clean up pkgs to reduce image size and chmod for all users
-RUN chmod -R ugo+rw /opt/conda \
-    && conda clean -tipy \
-    && chmod -R ugo+rw /opt/conda
-
-# need a user with access to conda
-RUN useradd -r -g conda -u 10000 dask
+conda-merge /rapids_pinned.yml /dask_unpinned.yml > /dask.yml
+rapids-mamba-retry env create -n dask --file /dask.yml
+chmod -R ugo+rw /opt/conda
+conda clean -tipy
+chmod -R ugo+rw /opt/conda
+useradd -r -g conda -u 10000 dask
+EOT

CMD [ "/bin/bash" ]
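One caveat with the heredoc form worth keeping in mind (sketched below on a throwaway image, not this project's Dockerfile): BuildKit runs the heredoc body as a single shell script, so unlike `&&`-chained commands, a failing intermediate command does not necessarily abort the build unless the shell is told to exit on error — worth double-checking against the Dockerfile reference:

```dockerfile
# syntax=docker/dockerfile:1
FROM ubuntu:20.04
# Fail fast inside heredoc RUN bodies (assumes bash exists in the base image).
SHELL ["/bin/bash", "-euo", "pipefail", "-c"]
RUN <<EOT
echo "one layer..."
echo "...several commands"
EOT
```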
33 changes: 13 additions & 20 deletions dask_image/Dockerfile
@@ -9,29 +9,22 @@ ARG PYTHON_VER=unset

 COPY environment.yml /rapids.yml
 ADD https://raw.githubusercontent.com/dask/dask-image/main/continuous_integration/environment-$PYTHON_VER.yml /dask.yml
 ADD https://github.com/rapidsai/gha-tools/releases/latest/download/tools.tar.gz /tools.tar.gz
-RUN tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1 \
-    && rm /tools.tar.gz
-
-RUN conda config --set ssl_verify false
-
-RUN rapids-mamba-retry install conda-merge git
-
-RUN cat /rapids.yml \
+RUN <<EOT
+tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1
+rm /tools.tar.gz
+conda config --set ssl_verify false
+rapids-mamba-retry install conda-merge git
+cat /rapids.yml \
   | sed -r "s/CUDA_VER/$(echo $CUDA_VER | cut -d. -f1,2)/g" \
   > /rapids_pinned.yml

-RUN conda-merge /rapids_pinned.yml /dask.yml > /dask_image.yml
-
-RUN rapids-mamba-retry env create -n dask_image --file /dask_image.yml
-
-# Clean up pkgs to reduce image size and chmod for all users
-RUN chmod -R ugo+rw /opt/conda \
-    && conda clean -tipy \
-    && chmod -R ugo+rw /opt/conda
-
-# need a user with access to conda
-RUN useradd -r -g conda -u 10000 dask
+conda-merge /rapids_pinned.yml /dask.yml > /dask_image.yml
+rapids-mamba-retry env create -n dask_image --file /dask_image.yml
+chmod -R ugo+rw /opt/conda
+conda clean -tipy
+chmod -R ugo+rw /opt/conda
+useradd -r -g conda -u 10000 dask
+EOT

CMD [ "/bin/bash" ]
37 changes: 14 additions & 23 deletions distributed/Dockerfile
@@ -11,36 +11,27 @@ ARG UCX_PY_VER=unset

 COPY environment.yml /rapids.yml
 ADD https://raw.githubusercontent.com/dask/distributed/main/continuous_integration/environment-$PYTHON_VER.yaml /dask.yml
 ADD https://github.com/rapidsai/gha-tools/releases/latest/download/tools.tar.gz /tools.tar.gz
-RUN tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1 \
-    && rm /tools.tar.gz
-
-RUN conda config --set ssl_verify false
-
-RUN rapids-mamba-retry install conda-merge git
-
-RUN cat /rapids.yml \
+RUN <<EOT
+tar -xzvf /tools.tar.gz -C /usr/local/bin --strip-components=1
+rm /tools.tar.gz
+conda config --set ssl_verify false
+rapids-mamba-retry install conda-merge git
+cat /rapids.yml \
   | sed -r "s/CUDA_VER/$(echo $CUDA_VER | cut -d. -f1,2)/g" \
   | sed -r "s/RAPIDS_VER/${RAPIDS_VER}/g" \
   | sed -r "s/UCX_PY_VER/${UCX_PY_VER}/g" \
   > /rapids_pinned.yml

 # unpin problematic CI dependencies
-RUN cat /dask.yml \
+cat /dask.yml \
   | sed -r "s/pyarrow=/pyarrow>=/g" \
   > /dask_unpinned.yml

-RUN conda-merge /rapids_pinned.yml /dask_unpinned.yml > /distributed.yml
-
-RUN rapids-mamba-retry env create -n dask --file /distributed.yml
-
-# Clean up pkgs to reduce image size and chmod for all users
-RUN chmod -R ugo+rw /opt/conda \
-    && conda clean -tipy \
-    && chmod -R ugo+rw /opt/conda
-
-# need a user with access to conda
-RUN useradd -r -g conda -u 10000 dask
+conda-merge /rapids_pinned.yml /dask_unpinned.yml > /distributed.yml
+rapids-mamba-retry env create -n dask --file /distributed.yml
+chmod -R ugo+rw /opt/conda
+conda clean -tipy
+chmod -R ugo+rw /opt/conda
+useradd -r -g conda -u 10000 dask
+EOT

CMD [ "/bin/bash" ]