This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Use dask cuda worker when GPU resource > 0 #339

Closed

Conversation

ljstrnadiii

TL;DR

If a GPU has been specified in the Dask plugin resources, assume we want to use dask-cuda-worker, i.e. that dask-cuda is present on the specified image.

Type

  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Currently we build an arg list to submit to the container, injecting values such as threads and memory from the limits specified in the resources object of the Dask plugin. Dask-CUDA gives us many built-in GPU tools, and all that is required is to ensure dask-cuda is installed and then to start our workers with dask-cuda-worker. So, to accommodate this, we simply check for GPU limits in the Dask plugin resources and overwrite the first arg with dask-cuda-worker.
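For illustration, a minimal Go sketch of what this amounts to; the function and resource names here are assumptions rather than the exact flyteplugins code (the real change is quoted in the review thread below).

package dask

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// buildWorkerArgs assembles the worker command from the resource limits and
// swaps the entrypoint to dask-cuda-worker when a GPU limit is set.
func buildWorkerArgs(limits v1.ResourceList) []string {
	args := []string{
		"dask-worker",
		"--nthreads", fmt.Sprintf("%d", limits.Cpu().Value()),
		"--memory-limit", limits.Memory().String(),
	}
	// ResourceList.Name returns a zero Quantity when the key is absent, so
	// IsZero covers both "no GPU requested" and an explicit limit of 0.
	if !limits.Name(v1.ResourceName("nvidia.com/gpu"), resource.DecimalSI).IsZero() {
		args[0] = "dask-cuda-worker"
	}
	return args
}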

@welcome

welcome bot commented Apr 3, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; please fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@ljstrnadiii ljstrnadiii force-pushed the dask_add_cuda_support branch from 288f088 to 6a82451 on April 3, 2023 03:45
@ljstrnadiii
Author

@bstadlbauer any thoughts would be greatly appreciated 🙏

@codecov

codecov bot commented Apr 3, 2023

Codecov Report

Merging #339 (d2f99c2) into master (18a594e) will increase coverage by 1.33%.
The diff coverage is 75.00%.

❗ Current head d2f99c2 differs from pull request most recent head 6a82451. Consider uploading reports for the commit 6a82451 to get more accurate results

@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
+ Coverage   62.65%   63.99%   +1.33%     
==========================================
  Files         146      146              
  Lines       12220     9910    -2310     
==========================================
- Hits         7657     6342    -1315     
+ Misses       3981     2986     -995     
  Partials      582      582              
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
go/tasks/plugins/array/k8s/subtask_exec_context.go 80.80% <66.66%> (+0.97%) ⬆️
go/tasks/plugins/k8s/dask/dask.go 86.72% <100.00%> (+2.80%) ⬆️

... and 128 files with indirect coverage changes


@pingsutw
Member

pingsutw commented Apr 10, 2023

@ljstrnadiii Is it ready for review?

@ljstrnadiii ljstrnadiii marked this pull request as ready for review April 10, 2023 14:51
Member

@bstadlbauer bstadlbauer left a comment


I'm generally in favor, but I am somewhat cautious about making this the default.

How about adding a configuration option to be able to specify the (GPU) worker command? You can see an example of how to do configuration in the Spark plugin.
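For reference, one possible shape such a config option could take; the names here are hypothetical, not the actual flyteplugins config, and the Spark plugin's config section is the pattern being suggested.

package dask

// Config is a hypothetical plugin-level configuration, loosely following how
// other Flyte backend plugins (e.g. Spark) expose settings.
type Config struct {
	// GPUWorkerCommand would replace the default "dask-worker" entrypoint for
	// worker groups that request GPUs, e.g. "dask-cuda-worker".
	GPUWorkerCommand string `json:"gpu-worker-command"`
}

var defaultConfig = Config{
	GPUWorkerCommand: "dask-cuda-worker",
}

// GetConfig stands in for the plugin config registry lookup.
func GetConfig() *Config {
	return &defaultConfig
}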

Comment on lines +190 to +194
// If limits includes gpu, assume dask cuda worker cli startup
// https://docs.rapids.ai/api/dask-cuda/nightly/quickstart.html#dask-cuda-worker
if !limits.Name(flytek8s.ResourceNvidiaGPU, "0").IsZero() {
workerArgs[0] = "dask-cuda-worker"
}
Member

After having taken a closer look, I am not so sure whether this is a good default behavior. A few questions I've got:

  • Is this the default behavior that every user wants? E.g. the docs state:

It only helps with deployment and management of Dask workers in multi-GPU systems.

and a lot of k8s deployments will have single GPU workers.

  • Should this command also be applied on GPU requests, not just limits? 🤔
  • Is this the desired behavior in case the default resources (e.g. the task resources) also have GPUs?

Author

Good point! I'll make changes to add an option to specify a GPU worker command instead.

Author

@ljstrnadiii ljstrnadiii May 1, 2023

OK, after thinking about this for a bit and getting somewhat more familiar with the code base(s), tell me if this makes sense:

  • Go to the flyteidl plugins dask.proto and add a "command" field.
  • Update the flytekit plugins Python code for the models and task (here and here) to include the "command" option.
  • Lastly, update dask.go to fall back to the default command if none is specified (rough sketch below).

Question: this seems like three separate PRs. Is this order correct: update the proto, update the models, then update dask.go? Also, "command" could be "startup_command"; I don't have a strong preference, but any suggestions are welcome.

@bstadlbauer if this sounds good to you I will try to get this in over the next week.
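
As a rough illustration of the last step, a hypothetical sketch of the dask.go side; the helper and field names are placeholders, and the proto field name is still undecided above.

package dask

// defaultWorkerCommand is what the plugin currently starts workers with.
var defaultWorkerCommand = []string{"dask-worker"}

// workerCommand prefers a command supplied via the new proto field and falls
// back to the existing default otherwise.
func workerCommand(specCommand []string) []string {
	if len(specCommand) > 0 {
		return specCommand
	}
	return defaultWorkerCommand
}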

@eapolinario
Contributor

Hi, we are moving all Flyte development to a monorepo. To help with the transition, we're moving open PRs to the monorepo automatically, and your PR was moved to flyteorg/flyte#4129. Note that any conflicts in the resulting PR most likely stem from the change in the import path of the Flyte components.

@eapolinario eapolinario closed this Oct 3, 2023