Migrating Dask GPU CI from Jenkins to GitHub Actions #348
Comments
I'm generally positive about this.
The major disadvantage here is:
This is deeply, deeply annoying. I'd be ok if it was once per PR, but every commit is not practical. And yes, it would disincentivize me running GPU tests. I don't think it's an edge case either - I know dask-image has very low activity, but I'd guess almost all of the PRs it does get are from non-dask org members.
Usually I use the github CLI to checkout specific pull request branches for review (
To make the process smooth, I think it would be helpful to split the info you have above into (1) what needs to happen at the Dask org level, and by whom, (2) what needs to happen for each individual repository by the maintainers, and (3) what can only be done by NVIDIA employees. It's a good overview, but I'm a little confused about what I would need to do and when, since it seems some steps depend on others too.
I agree this is frustrating, but it's actually a pretty low bar. Signing your commits with your SSH keys is easy to set up. Your commits get that little "Verified" badge.
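For anyone setting this up, a minimal sketch of SSH commit signing (requires git 2.34+; the key path is a placeholder for your own public key):

```
# use an SSH key instead of GPG for signing (git 2.34+)
git config --global gpg.format ssh
# placeholder path; point this at your own public key
git config --global user.signingkey ~/.ssh/id_ed25519.pub
# sign all commits by default
git config --global commit.gpgsign true
```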
Maybe I misunderstood. Charles says this above:
So I thought that meant:
Is that incorrect? It's category 1 I'm thinking about. But perhaps newcomers might not actually feel frustrated they have to wait on us, and maybe it's totally fine for the GPU CI to run and pass only once at the end of the PR process.
Nope, your understanding of the system is correct
I (along with other Dask-RAPIDS developers) share this sentiment, though my concern lies less with confusion/frustration from potential newcomers (who I assume would largely ignore the automated messages) and more with the likelihood of GPU-breaking code going in due to GPU tests not getting run as often. For now, I've generally accepted these limitations as a necessity to assuage security concerns related to the migration with hopes that we can loosen these requirements later on, but ultimately can't make any solid promises there 😕
Good point, will give that section a second pass to make things clearer
I think for a first-time contributor it wouldn't be surprising to see that some CI approval is needed. I guess it may be more frustrating for a returning contributor who is not a Dask-org member. Maybe we should be more liberal and add all contributors to the org when their PR is merged? I expect we could automate that with a GitHub Action. We give out membership pretty openly anyway.
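A rough sketch of what that automation could look like, assuming an org-admin token stored as a repository secret (the secret name is a placeholder):

```yaml
# Sketch only: add the author of a merged PR to the dask org
name: add-contributor

on:
  pull_request_target:
    types: [closed]

jobs:
  add-to-org:
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - name: Invite the PR author to the org
        env:
          GH_TOKEN: ${{ secrets.ORG_ADMIN_TOKEN }}  # hypothetical PAT with admin:org scope
        run: |
          gh api --method PUT \
            "orgs/dask/memberships/${{ github.event.pull_request.user.login }}"
```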
Currently, several of the Dask projects run a subset of their tests on GPUs, through a Jenkins-based infrastructure put together by the ops team at RAPIDS (relevant [archived] docs).
The RAPIDS projects on GitHub are currently undergoing a migration from this Jenkins setup to GitHub Actions (docs outlining this setup) for a variety of reasons, the most relevant to Dask developers being increased visibility/control over the files controlling the GPU build matrix - in short, public GHA workflow files instead of private Groovy files only accessible to members of RAPIDS. Once this migration is complete, RAPIDS intends to decommission all of its Jenkins-based testing infrastructure.
This migration would require several changes at both the Dask org and individual developer level to get things working smoothly, so I’m hoping to use this issue as a place to discuss those potential changes and coordinate the work needed to complete the migration.
Requirements
In short, to get things running we would need to:
- Have the `dask` and `dask-contrib` organizations install the GitHub applications used to manage GPU CI:
  - the `nvidia-runners` GitHub application, installed for each org we intend to migrate and enabled on all repos we intend to run GPU tests on; this will handle dispatching of self-hosted GPU runners
  - the `copy-pr-bot` GitHub application, installed for each org we intend to migrate and enabled on all repos we intend to run GPU tests on; this will automate the process of copying/deleting PR source code to/from the upstream repo
- Add a configuration file for `copy-pr-bot` to each repo we intend to run GPU tests on; this outlines the trusted NVIDIA users and admins for that repo.

Once this is done, we should be able to run jobs on RAPIDS self-hosted runners by adding a workflow file looking something like this:
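A minimal sketch of such a workflow, assuming a hypothetical self-hosted runner label and container image (the real values would come from the RAPIDS ops team):

```yaml
# Sketch only: runner label, image, and test command are placeholders
name: gpu-ci

on:
  push:
    branches:
      - "pull-request/[0-9]+"  # branches created by copy-pr-bot

jobs:
  gpu-tests:
    runs-on: linux-amd64-gpu-v100-latest-1  # hypothetical self-hosted runner label
    container:
      image: rapidsai/dask-gpu-ci:latest    # hypothetical GPU test image
    steps:
      - uses: actions/checkout@v4
      - name: Run GPU tests
        run: python -m pytest -v -m gpu
```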
And from there we can work on migrating the existing setup files in each repo over to GHA workflows.
Differences between Jenkins & GHA
Here’s a quick outline of some of the key differences between the Jenkins and GHA setups that will be relevant to developers:
- PR source code would be copied to a branch on the upstream repo named `pull-request/<PR_NUMBER>` as part of the process of triggering a self-hosted runner to run the tests; this branch would be automatically deleted once the PR is closed or merged
- `pull-request/*` branches can be ignored in local git operations by running the command sketched after this list
- PRs from contributors not listed as trusted users would require an `/ok to test` comment for approval
- `rerun tests` would no longer exist, being replaced with the standard method of rerunning GHA workflow runs
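One way to ignore those branches locally, assuming the Dask repo is configured as a remote named `upstream` and a git version with negative refspec support (2.29+); the exact recommended command would come from the copy-pr-bot docs:

```
# exclude pull-request/* branches when fetching from the upstream remote
git config --add remote.upstream.fetch '^refs/heads/pull-request/*'
```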
Migrating GPU image builds to GHA
An additional piece of Dask’s GPU CI infrastructure that runs on Jenkins and would need to be migrated to GHA is the automated building of the GPU containers used for each repo; some benefits I think would come with this migration are:
The main blocker here would be establishing where we would like these workflow files to live - my first thoughts would be either dask-docker, a standalone repo, or the respective repos the images are intended for.
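Wherever they end up living, an image-build workflow might look something like this sketch (the image name, tag, secrets, and Dockerfile path are all placeholders):

```yaml
# Sketch only: builds and pushes a GPU CI image on merges to main
name: build-gpu-image

on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}  # hypothetical secrets
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          file: ./gpu.Dockerfile              # placeholder Dockerfile path
          push: true
          tags: daskdev/dask-gpu-ci:latest    # placeholder image name
```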
Thoughts?
I’m interested in soliciting opinions from Dask maintainers on what parts of this migration seem appealing/unappealing. My first impressions of some specific pain points are:
But it would be great to hear from others on how we can make this process relatively smooth.