Migrating Dask GPU CI from Jenkins to GitHub Actions #348
Comments
I'm generally positive about this.
The major disadvantage here is:
This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.
Usually I use the github CLI to checkout specific pull request branches for review (
To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by whom, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees. It's a good overview, but I'm a little confused about what I would need to do and when, since it seems some steps depend on others too.
I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little "Verified" badge.
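For anyone setting this up, a minimal sketch of SSH commit signing (requires git 2.34+; the key path is a placeholder for your own public key):

```
# use an SSH key instead of GPG for signing (git 2.34+)
git config --global gpg.format ssh
# placeholder path; point this at your own public key
git config --global user.signingkey ~/.ssh/id_ed25519.pub
# sign all commits by default
git config --global commit.gpgsign true
```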
Maybe I misunderstood. Charles says this above:
So I thought that meant:
Is that incorrect? It's category 1 I'm thinking about. But perhaps newcomers might not actually feel frustrated they have to wait on us, and maybe it's totally fine for the GPU CI to run and pass only once at the end of the PR process.
Nope, your understanding of the system is correct
I (along with other Dask-RAPIDS developers) share this sentiment, though my concern lies less with confusion/frustration from potential newcomers (who I assume would largely ignore the automated messages) and more with the likelihood of GPU-breaking code going in due to GPU tests not getting run as often. For now, I've generally accepted these limitations as a necessity to assuage security concerns related to the migration with hopes that we can loosen these requirements later on, but ultimately can't make any solid promises there 😕
Good point, will give that section a second pass to make things clearer
I think for a first-time contributor it wouldn't be surprising to see that some CI approval is needed. I guess it may be more frustrating for a returning contributor who is not a Dask-org member. Maybe we should be more liberal and add all contributors to the org when their PR is merged? I expect we could automate that with a GitHub Action. We give out membership pretty openly anyway.
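A rough sketch of what that automation could look like, assuming an org-admin token stored as a repository secret (the secret name is a placeholder):

```yaml
# Sketch only: add the author of a merged PR to the dask org
name: add-contributor

on:
  pull_request_target:
    types: [closed]

jobs:
  add-to-org:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - name: Invite the PR author to the org
        env:
          GH_TOKEN: ${{ secrets.ORG_ADMIN_TOKEN }}  # hypothetical PAT with admin:org scope
        run: |
          gh api --method PUT \
            "orgs/dask/memberships/${{ github.event.pull_request.user.login }}"
```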
Currently, several of the Dask projects run a subset of their tests on GPUs, through a Jenkins-based infrastructure put together by the ops team at RAPIDS (relevant [archived] docs).
The RAPIDS projects on GitHub are currently undergoing a migration from this Jenkins setup to GitHub Actions (docs outlining this setup) for a variety of reasons, the most relevant to Dask developers being increased visibility/control over the files controlling the GPU build matrix - in short, public GHA workflow files instead of private Groovy files only accessible to members of RAPIDS. Once this migration is complete, RAPIDS intends to decommission all of its Jenkins-based testing infrastructure.
This migration would require several changes at both the Dask org and individual developer level to get things working smoothly, so I’m hoping to use this issue as a place to discuss those potential changes and coordinate the work needed to complete the migration.
Requirements
In short, to get things running we would need to:
- Have the `dask` and `dask-contrib` organizations install the GitHub applications used to manage GPU CI:
  - the `nvidia-runners` GitHub application, installed for each org we intend to migrate and enabled on all repos we intend to run GPU tests on; this will handle dispatching of self-hosted GPU runners
  - the `copy-pr-bot` GitHub application, installed for each org we intend to migrate and enabled on all repos we intend to run GPU tests on; this will automate the process of copying/deleting PR source code to/from the upstream repo
- Add a configuration file for `copy-pr-bot` to each repo we intend to run GPU tests on; this outlines the trusted NVIDIA users and admins for that repo.

Once this is done, we should be able to run jobs on RAPIDS self-hosted runners by adding a workflow file looking something like this:
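A minimal sketch of such a workflow, assuming a hypothetical self-hosted runner label and container image (the real values would come from the RAPIDS ops team):

```yaml
# Sketch only: runner label, image, and test command are placeholders
name: gpu-ci

on:
  push:
    branches:
      - "pull-request/[0-9]+"  # branches created by copy-pr-bot

jobs:
  gpu-tests:
    runs-on: linux-amd64-gpu-v100-latest-1  # hypothetical self-hosted runner label
    container:
      image: rapidsai/dask-gpu-ci:latest    # hypothetical GPU test image
    steps:
      - uses: actions/checkout@v4
      - name: Run GPU tests
        run: python -m pytest -v -m gpu
```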
And from there we can work on migrating the existing setup files in each repo over to GHA workflows.
Differences between Jenkins & GHA
Here’s a quick outline of some of the key differences between the Jenkins and GHA setups that will be relevant to developers:
- PR source code would be copied to a branch on the upstream repo named `pull-request/<PR_NUMBER>` as part of the process of triggering a self-hosted runner to run the tests; this branch would be automatically deleted once the PR is closed or merged
- `pull-request/*` branches can be ignored in local git operations by running the command sketched after this list
- PRs from contributors not listed as trusted users would require an `/ok to test` comment for approval
- `rerun tests` would no longer exist, being replaced with the standard method of rerunning GHA workflow runs
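One way to ignore those branches locally, assuming the Dask repo is configured as a remote named `upstream` and a git version with negative refspec support (2.29+); the exact recommended command would come from the copy-pr-bot docs:

```
# exclude pull-request/* branches when fetching from the upstream remote
git config --add remote.upstream.fetch '^refs/heads/pull-request/*'
```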
Migrating GPU image builds to GHA
An additional piece of Dask’s GPU CI infrastructure that runs on Jenkins and would need to be migrated to GHA is the automated building of the GPU containers used for each repo; some benefits I think would come with this migration are:
The main blocker here would be establishing where we would like these workflow files to live - my first thoughts would be either dask-docker, a standalone repo, or the respective repos the images are intended for.
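Wherever they end up living, an image-build workflow might look something like this sketch (the image name, tag, secrets, and Dockerfile path are all placeholders):

```yaml
# Sketch only: builds and pushes a GPU CI image on merges to main
name: build-gpu-image

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}  # hypothetical secrets
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          file: ./gpu.Dockerfile              # placeholder Dockerfile path
          push: true
          tags: daskdev/dask-gpu-ci:latest    # placeholder image name
```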
Thoughts?
I’m interested in soliciting opinions from Dask maintainers on what parts of this migration seem appealing/unappealing. My first impressions of some specific pain points are:
But it would be great to hear from others on how we can make this process relatively smooth.