[Extended Resources] GPU Accelerators #4172
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #4172      +/-   ##
===========================================
+ Coverage   58.95%   59.97%    +1.01%
===========================================
  Files         621      570       -51
  Lines       52932    41201    -11731
===========================================
- Hits        31206    24710     -6496
+ Misses      19229    14095     -5134
+ Partials     2497     2396      -101
Awesome! Thank you for putting this together. Monorepo is already paying dividends :-)
} else {
	partitionSizeTol = &v1.Toleration{
		Key:   config.GetK8sPluginConfig().GpuPartitionSizeNodeLabel,
		Value: GpuPartitionSizeNotSet,
Is that a k8s const value that means something specific?
No, it’s an arbitrary default that we set as the toleration value added to tasks that explicitly require unpartitioned GPUs. See:
const GpuPartitionSizeNotSet = "NotSet"
The toleration is useful when running multiple GPU node pools and the unpartitioned A100 node group is protected by a taint. See the scenario in the description. The toleration can also be overridden via plugin config.
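As a rough sketch of what such a plugin-config override could look like in Go: the `GpuUnpartitionedToleration` field name and import path below are assumptions, not confirmed by this diff; only the existence of a config override is established above.

```go
package example

import (
	v1 "k8s.io/api/core/v1"

	// Assumed import path for the plugin config package referenced as `config` above.
	"github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/flytek8s/config"
)

// configureUnpartitionedGpuToleration overrides the toleration that gets added to
// tasks explicitly requesting unpartitioned GPUs. The field name is an assumption.
func configureUnpartitionedGpuToleration() error {
	return config.SetK8sPluginConfig(&config.K8sPluginConfig{
		GpuUnpartitionedToleration: &v1.Toleration{
			Key:      "nvidia.com/gpu.unpartitioned", // illustrative taint key
			Operator: v1.TolerationOpEqual,
			Value:    "true",
			Effect:   v1.TaintEffectNoSchedule,
		},
	})
}
```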
Do you think this could be surprising behavior?
Possibly. We can either:
- Document this behavior. An extraneous toleration should generally be a no-op, unless it somehow conflicts with a real, but unrelated, taint.
- Drop the "default" toleration and require that one be specified explicitly by config, if unpartitioned GPU node groups are protected by a taint.
In retrospect, I like (2).
@EngHabu: Does (2) seem reasonable to you?
Went ahead and made the change! :)
func ApplyInterruptibleNodeAffinity(interruptible bool, podSpec *v1.PodSpec) {
func ApplyGPUNodeSelectors(podSpec *v1.PodSpec, gpuAccelerator *core.GPUAccelerator) {
	// Short circuit if pod spec does not contain any containers that use GPUs
	gpuResourceName := config.GetK8sPluginConfig().GpuResourceName
I could swear I left a comment about this but don't see it... can we avoid calling config.GetK8sPluginConfig()?
...
Can we call it once in the "root function" and just pass it down to all of these? It makes these functions more "obviously" testable...
Following this pattern here:
gpuResourceName := config.GetK8sPluginConfig().GpuResourceName
and setting up per-test config, per convention, like so:
assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{
It would be a non-trivial refactor to pass config down from the root via args in all these places. Happy to do it, but should that precede this PR?
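For reference, a minimal sketch of that per-test convention. The testify `assert` usage and `SetK8sPluginConfig` call appear in the snippets above; the package name and import path here are assumptions.

```go
package flytek8s_test

import (
	"testing"

	"github.com/stretchr/testify/assert"

	// Assumed import path for the `config` package used in the snippets above.
	"github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/flytek8s/config"
)

func TestWithCustomGpuResourceName(t *testing.T) {
	// Override the global k8s plugin config for this test only, per the existing convention.
	assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{
		GpuResourceName: "nvidia.com/gpu", // illustrative value
	}))

	// ... build a pod spec here and call ApplyGPUNodeSelectors, then assert on the
	// injected node selector terms and tolerations.
}
```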
We can always refactor to improve testability in a separate PR.
This PR introduces the concept of `ExtendedResources`, which is meant to encapsulate all specialized resources that are not already captured by container resources (`v1.ResourceRequirements`), and adds GPU accelerator as one such resource. We plan on leveraging `ExtendedResources` for other specialized resources such as shared memory (`/dev/shm`) in the future.

Implementation
In this proposed implementation, `ExtendedResources` is added as a field within `TaskTemplate` for task-level configuration, and within `TaskNodeOverrides` for node-level overrides. We considered adding `ExtendedResources` directly to the IDL `Resources` object, but found that this is highly specific to container tasks, and not propagated correctly for pod templates and Pod tasks. `TaskTemplate` seemed like the most reasonable candidate outside of `Resources`.

Generally, targeting tasks to specific GPU accelerators is a matter of setting the appropriate node affinities, and where relevant, tolerations. Mixed-partition multi-instance GPUs (MIGs) are an exception and have been excluded from the scope of this PR. We may add support for these in a future PR.
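For context, a rough usage sketch of the helper added in this PR (`ApplyGPUNodeSelectors`, whose signature appears in the diff above). The `Device` field name on `core.GPUAccelerator` and the import paths are assumptions.

```go
package example

import (
	v1 "k8s.io/api/core/v1"

	// Assumed import paths within the monorepo.
	"github.com/flyteorg/flyte/flyteidl/gen/pb-go/flyteidl/core"
	"github.com/flyteorg/flyte/flyteplugins/go/tasks/pluginmachinery/flytek8s"
)

// targetT4 mutates a pod spec so that it is scheduled onto nodes carrying a T4 device label.
func targetT4(podSpec *v1.PodSpec) {
	flytek8s.ApplyGPUNodeSelectors(podSpec, &core.GPUAccelerator{
		Device: "nvidia-tesla-t4", // assumed field name; illustrative device string
	})
}
```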
When building a resource for a task requesting a specific GPU accelerator, FlytePropeller now injects:

- A node selector requirement for the requested GPU device, keyed on the node label configured via the `gpu-device-node-label` variable in the FlytePropeller k8s plugin configuration. A request for a T4 GPU might add such a selector requirement to the resulting pod spec (see the sketch below).
- A node selector requirement and toleration for the requested GPU partition size, keyed on the node label configured via the `gpu-partition-size-node-label` variable in the FlytePropeller k8s plugin configuration. The node selector requirement and toleration to add for an unpartitioned MIG-capable GPU can be configured via `gpu-unpartitioned-node-selector-requirement` and `gpu-unpartitioned-toleration`, respectively. A request for a partitioned or unpartitioned A100 GPU might add analogous entries to the resulting pod spec.
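As a rough sketch of the shape of what gets injected: all label keys, values, and the taint key below are illustrative assumptions rather than defaults confirmed by this PR.

```go
package example

import v1 "k8s.io/api/core/v1"

// A T4 request might add a node selector requirement keyed on the configured
// gpu-device-node-label (shown here with an illustrative label key).
var t4Requirement = v1.NodeSelectorRequirement{
	Key:      "k8s.amazonaws.com/accelerator", // illustrative gpu-device-node-label value
	Operator: v1.NodeSelectorOpIn,
	Values:   []string{"nvidia-tesla-t4"},
}

// An unpartitioned A100 request might additionally add whatever selector requirement and
// toleration the operator configured via gpu-unpartitioned-node-selector-requirement and
// gpu-unpartitioned-toleration, for example:
var (
	unpartitionedRequirement = v1.NodeSelectorRequirement{
		Key:      "k8s.amazonaws.com/gpu-partition-size", // illustrative gpu-partition-size-node-label value
		Operator: v1.NodeSelectorOpDoesNotExist,
	}
	unpartitionedToleration = v1.Toleration{
		Key:      "nvidia.com/gpu.unpartitioned", // illustrative taint key
		Operator: v1.TolerationOpEqual,
		Value:    "true",
		Effect:   v1.TaintEffectNoSchedule,
	}
)
```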
Usage

`ExtendedResources` are applied to resources generated by all K8s plugins, with the exception of Spark, for now. We believe that this change gives operators good control over targeting workloads to specific GPU-accelerated nodes just by tuning the labels and taints attached to those nodes. Consider a deployment with 3 different GPU-accelerated node groups: T4, 2g.10gb A100, and unpartitioned A100. Operators may want to make the T4 node group the default, the 2g.10gb partitioned A100 node group the default A100 node group, and require that the unpartitioned A100 node group be explicitly requested for special workloads. This can be achieved by labeling and tainting the T4, 2g.10gb A100, and unpartitioned A100 node groups appropriately (an illustrative setup is sketched after this list) and then expressing the accelerator requirement per task:

- Request a GPU with no preference for device: scheduled on the T4 node group.
- Explicitly request a T4 GPU.
- Request an A100 GPU with no preference for partition size: scheduled on the default, partitioned 2g.10gb A100 node group.
- Explicitly request a partitioned A100 GPU.
- Explicitly request an unpartitioned A100 GPU.
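A minimal sketch of the cluster-side labeling and tainting that would produce the behavior above, expressed as Kubernetes node objects. Every label key, value, and taint here is an illustrative assumption rather than something prescribed by this PR.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var (
	// Default GPU node group: label-only, so GPU tasks with no device preference land here.
	t4Node = v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{
		"k8s.amazonaws.com/accelerator": "nvidia-tesla-t4",
	}}}

	// Default A100 node group: partitioned 2g.10gb, also label-only.
	partitionedA100Node = v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{
		"k8s.amazonaws.com/accelerator":        "nvidia-tesla-a100",
		"k8s.amazonaws.com/gpu-partition-size": "2g.10gb",
	}}}

	// Unpartitioned A100 node group: tainted, so only tasks that explicitly request an
	// unpartitioned A100 (and therefore receive the configured toleration) schedule here.
	unpartitionedA100Node = v1.Node{
		ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{
			"k8s.amazonaws.com/accelerator": "nvidia-tesla-a100",
		}},
		Spec: v1.NodeSpec{Taints: []v1.Taint{{
			Key:    "nvidia.com/gpu.unpartitioned",
			Value:  "true",
			Effect: v1.TaintEffectNoSchedule,
		}}},
	}
)
```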