Use dask cuda worker when GPU resource > 0 #339
Conversation
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
Signed-off-by: ljstrnadiii <[email protected]>
Force-pushed from 288f088 to 6a82451 (Compare)
@bstadlbauer any thoughts would be greatly appreciated 🙏
Codecov Report
@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
+ Coverage   62.65%   63.99%   +1.33%
==========================================
  Files         146      146
  Lines       12220     9910    -2310
==========================================
- Hits         7657     6342    -1315
+ Misses       3981     2986     -995
  Partials      582      582
Flags with carried forward coverage won't be shown.
... and 128 files with indirect coverage changes
@ljstrnadiii Is it ready for review?
I'm generally in favor, but I am somewhat cautious about making this the default.
How about adding a configuration option to be able to specify the (GPU) worker command? You can see an example of how to do configuration in the spark plugin.
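For illustration, a rough sketch of what such a config section could look like, loosely modeled on how the spark plugin registers its configuration. The option name, defaults, and registration helper are assumptions, not the actual change made in this PR:

```go
package dask

import (
	pluginsConfig "github.com/flyteorg/flyteplugins/go/tasks/config"
)

// Config is a hypothetical plugin config section for the dask plugin.
type Config struct {
	// Hypothetical option: worker command to use when the task requests GPUs,
	// e.g. "dask-cuda-worker". Empty keeps the default "dask-worker".
	GPUWorkerCommand string `json:"gpu-worker-command" pflag:",Worker command used when GPUs are requested"`
}

var (
	defaultConfig = &Config{}
	configSection = pluginsConfig.MustRegisterSubSection("dask", defaultConfig)
)

// GetConfig returns the currently loaded dask plugin configuration.
func GetConfig() *Config {
	return configSection.GetConfig().(*Config)
}
```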
// If limits includes gpu, assume dask cuda worker cli startup
// https://docs.rapids.ai/api/dask-cuda/nightly/quickstart.html#dask-cuda-worker
if !limits.Name(flytek8s.ResourceNvidiaGPU, "0").IsZero() {
	workerArgs[0] = "dask-cuda-worker"
}
After having taken a closer look, I am not so sure whether this is a good default behavior. A few questions I've got:
- Is this the default behavior that every user wants? E.g. the dask-cuda docs state that "It only helps with deployment and management of Dask workers in multi-GPU systems", while a lot of k8s deployments will have single-GPU workers.
- Should this command also be applied on GPU requests, not just limits? 🤔
- Is this the desired behavior in case the default resources (e.g. the task resources) also have GPUs?
Good point! I'll make changes to add an option to specify a GPU worker command instead.
OK, after thinking about this for a little and getting somewhat more familiar with the code base(s), tell me if this makes sense:
- Go to the flyteidl plugins dask.proto and add a "command" field.
- Update the flytekit plugin's Python models and task (here and here) to include the "command" option.
- Lastly, update dask.go to fall back to the default command if none is specified (a rough sketch follows below).

Question: this seems like three separate PRs. Is this order correct: update the proto, update the models, then update dask.go? Also, "command" could be "startup_command". I don't care, but any suggestions are welcome.
@bstadlbauer if this sounds good to you I will try to get this in over the next week.
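As a rough illustration of the last step, assuming the proto gains a string "command" field, dask.go could fall back to the current default when it is unset. Function and getter names here are hypothetical, not the actual implementation:

```go
package dask

import "strings"

// buildWorkerCommand is a hypothetical sketch: use the (proposed) "command"
// field from the dask worker spec when it is set, otherwise keep today's
// "dask-worker" default.
func buildWorkerCommand(customCommand string) []string {
	if customCommand != "" {
		// e.g. "dask-cuda-worker" on images that ship dask-cuda
		return strings.Split(customCommand, " ")
	}
	return []string{"dask-worker"}
}
```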
Hi, we are moving all Flyte development to a monorepo. In order to help the transition period, we're moving open PRs to the monorepo automatically, and your PR was moved to flyteorg/flyte#4129. Note that if there are any conflicts in the resulting PR, they most likely happen due to the change in the import path of the flyte components.
TL;DR
If a GPU has been specified in the dask plugin resources, assume we want to use dask-cuda-worker, i.e. that dask-cuda is present on the specified image.
Type
Are all requirements met?
Complete description
Currently we build args to submit to the container and inject values such as threads and memory from the limits specified in the resources object of the dask plugin object. Dask-cuda gives us many built-in GPU tools; all that is required is to ensure dask-cuda is installed and to start our workers with dask-cuda-worker. So, to accommodate this, we simply check for GPU limits in the dask plugin resources and overwrite the first arg to dask-cuda-worker.
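A minimal sketch of the behavior described above; the flag names and the GPU check are simplified placeholders rather than the plugin's exact output:

```go
package dask

// buildWorkerArgs sketches the described behavior: worker args are derived
// from the plugin's resource limits, and the first arg is swapped to
// "dask-cuda-worker" when a GPU limit is present.
func buildWorkerArgs(nThreads, memoryLimit string, hasGPULimit bool) []string {
	args := []string{"dask-worker", "--nthreads", nThreads, "--memory-limit", memoryLimit}
	if hasGPULimit {
		// Assumes dask-cuda is installed on the task image.
		args[0] = "dask-cuda-worker"
	}
	return args
}
```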