[Extended Resources] GPU Accelerators #4172

Merged
jeevb merged 12 commits into master from jeev/gpu-selection on Oct 12, 2023

Conversation

jeevb
Contributor

@jeevb jeevb commented Oct 5, 2023

This PR introduces the concept of ExtendedResources, which is meant to encapsulate all specialized resources that are not already captured by container resources (v1.ResourceRequirements), and adds GPU accelerators as one such resource. We plan to leverage ExtendedResources for other specialized resources, such as shared memory (/dev/shm), in the future.

Implementation

In this proposed implementation, ExtendedResources is added as a field within TaskTemplate for task-level configuration and within TaskNodeOverrides for node-level overrides. We considered adding ExtendedResources directly to the IDL Resources object, but found that Resources is highly specific to container tasks and is not propagated correctly for pod templates and Pod tasks. TaskTemplate seemed like the most reasonable candidate outside of Resources.
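
To make the intended precedence concrete, here is a minimal, hypothetical Go sketch (not verbatim from this PR) of how a node-level override in TaskNodeOverrides could take priority over the task-level ExtendedResources in TaskTemplate; the getter names are assumed to follow the generated protobuf accessors:

// Hypothetical sketch: resolve the effective ExtendedResources,
// letting node-level overrides win over the task-level setting.
// Assumes generated getters GetExtendedResources() on core.TaskTemplate
// and core.TaskNodeOverrides.
func resolveExtendedResources(tmpl *core.TaskTemplate, overrides *core.TaskNodeOverrides) *core.ExtendedResources {
	if overrides != nil && overrides.GetExtendedResources() != nil {
		return overrides.GetExtendedResources()
	}
	if tmpl != nil {
		return tmpl.GetExtendedResources()
	}
	return nil
}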

Generally, targeting tasks to specific GPU accelerators is a matter of setting the appropriate node affinities and, where relevant, tolerations. Mixed-partition multi-instance GPUs (MIGs) are an exception and have been excluded from the scope of this PR; we may add support for them in a future PR.

When building a resource for a task requesting a specific GPU accelerator, FlytePropeller now injects the following (a rough code sketch follows this list):

  1. Paired node selector requirement and toleration for the requested GPU device. The key is configurable via the gpu-device-node-label variable in the FlytePropeller k8s plugin configuration. A request for a T4 GPU might add the following to the resulting pod spec:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-t4
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-t4
  2. Paired node selector requirement and toleration for the requested GPU partition size (or unpartitioned). The key is configurable via the gpu-partition-size-node-label variable in the FlytePropeller k8s plugin configuration, and the node selector requirement and toleration to add for an unpartitioned MIG-capable GPU can be configured via gpu-unpartitioned-node-selector-requirement and gpu-unpartitioned-toleration respectively. A request for a partitioned or unpartitioned A100 GPU might add the following to the resulting pod spec:
# 2g.10gb partition size
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: In
            values:
            - 2g.10gb
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
  - effect: NoSchedule
    key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: 2g.10gb

# Unpartitioned
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: DoesNotExist
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
  - effect: NoSchedule
    key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: DoesNotExist
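
As a rough illustration of how such a node selector requirement and matching toleration pair could be built, here is a hedged Go sketch; it is not the exact FlytePropeller implementation, and the GpuDeviceNodeLabel config field name is an assumption (only GpuPartitionSizeNodeLabel appears verbatim in this PR):

// Illustrative sketch only: construct the paired node selector requirement and
// toleration for a requested GPU device (e.g. "nvidia-tesla-t4"), keyed on the
// configured gpu-device-node-label.
func gpuDeviceSelectorAndToleration(cfg *config.K8sPluginConfig, device string) (v1.NodeSelectorRequirement, v1.Toleration) {
	req := v1.NodeSelectorRequirement{
		Key:      cfg.GpuDeviceNodeLabel, // assumed field name for gpu-device-node-label
		Operator: v1.NodeSelectorOpIn,
		Values:   []string{device},
	}
	tol := v1.Toleration{
		Key:      cfg.GpuDeviceNodeLabel,
		Operator: v1.TolerationOpEqual,
		Value:    device,
		Effect:   v1.TaintEffectNoSchedule,
	}
	return req, tol
}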

Usage

ExtendedResources are applied to the resources generated by all K8s plugins, with the exception of Spark for now. We believe this change gives operators good control over targeting workloads to specific GPU-accelerated nodes simply by tuning the labels and taints attached to those nodes. Consider a deployment with three different GPU-accelerated node groups: T4, 2g.10gb-partitioned A100, and unpartitioned A100. Operators may want to make the T4 node group the default, the 2g.10gb-partitioned A100 node group the default A100 node group, and require that the unpartitioned A100 node group be explicitly requested for special workloads. This can be achieved as follows:

  1. Create node groups with the following labels and taints:
T4 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
2g.10gb A100 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # auto
    cloud.google.com/gke-gpu-partition-size: 2g.10gb  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-accelerator
      value: nvidia-tesla-a100
      effect: "NO_SCHEDULE"
Unpartitioned A100 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-accelerator
      value: nvidia-tesla-a100
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-gpu-partition-size
      value: DoesNotExist
      effect: "NO_SCHEDULE"
  2. Add the following values to the FlytePropeller k8s plugin configuration:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
  3. Users may now do one of the following:
Request a GPU with no preference for device (scheduled on the T4 node group)
@task(limits=Resources(gpu="1"))
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
Explicitly request a T4 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=T4,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-t4
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-t4
    effect: NoSchedule
Request an A100 GPU with no preference for partition size (scheduled on the default, 2g.10gb-partitioned A100 node group)
@task(
    limits=Resources(gpu="1"),
    accelerator=A100,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
Explicitly request a partitioned A100 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=A100.partition_2g_10gb,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: In
            values:
            - 2g.10gb
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
  - key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: 2g.10gb
    effect: NoSchedule
Explicitly request an unpartitioned A100 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=A100.unpartitioned,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: DoesNotExist
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
  - key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: DoesNotExist
    effect: NoSchedule

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@jeevb jeevb force-pushed the jeev/gpu-selection branch 3 times, most recently from a126a85 to 6b62d65, on October 5, 2023 04:12
@codecov

codecov bot commented Oct 5, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (0ca2d22) 58.95% compared to head (4344ee4) 59.97%.

❗ Current head 4344ee4 differs from pull request most recent head f958e9c. Consider uploading reports for the commit f958e9c to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4172      +/-   ##
==========================================
+ Coverage   58.95%   59.97%   +1.01%     
==========================================
  Files         621      570      -51     
  Lines       52932    41201   -11731     
==========================================
- Hits        31206    24710    -6496     
+ Misses      19229    14095    -5134     
+ Partials     2497     2396     -101     
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...go/tasks/pluginmachinery/flytek8s/config/config.go 50.00% <ø> (ø)
...propeller/pkg/apis/flyteworkflow/v1alpha1/iface.go 0.00% <ø> (ø)
...ytepropeller/pkg/compiler/transformers/k8s/node.go 74.86% <100.00%> (+3.89%) ⬆️
...ns/go/tasks/pluginmachinery/flytek8s/pod_helper.go 75.72% <97.95%> (+2.54%) ⬆️
...propeller/pkg/apis/flyteworkflow/v1alpha1/nodes.go 6.97% <0.00%> (+0.15%) ⬆️

... and 572 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb force-pushed the jeev/gpu-selection branch from 6b62d65 to 57b5626 on October 5, 2023 04:24
@jeevb jeevb marked this pull request as ready for review October 6, 2023 03:30
eapolinario previously approved these changes Oct 6, 2023
EngHabu previously approved these changes Oct 7, 2023
@EngHabu (Contributor) left a comment

Awesome! Thank you for putting this together. MonoRepo is already paying dividends :-)

} else {
	partitionSizeTol = &v1.Toleration{
		Key:   config.GetK8sPluginConfig().GpuPartitionSizeNodeLabel,
		Value: GpuPartitionSizeNotSet,
Contributor

Is that a k8s const value to mean something?

@jeevb (Contributor, Author) commented Oct 7, 2023

No, it’s an arbitrary default that we set as the value of the toleration added to tasks that explicitly request unpartitioned GPUs. See:

const GpuPartitionSizeNotSet = "NotSet"

The toleration is useful if running multiple GPU node pools and an unpartitioned A100 node group is protected. See scenario in description. The toleration can also be overridden via plugin config.

Contributor

Do you think this can be a surprising behavior?

@jeevb (Contributor, Author) commented Oct 9, 2023

Possibly. We can either:

  1. Document this behavior. An extraneous toleration should generally be a no-op, unless it somehow conflicts with a real, but unrelated, taint.
  2. Drop the "default" toleration and require that one be specified explicitly by config, if unpartitioned GPU node groups are protected by a taint.

In retrospect, I like (2).

jeevb (Contributor, Author)

@EngHabu: Does (2) seem reasonable to you?

jeevb (Contributor, Author)

Went ahead and made the change! :)

@jeevb jeevb dismissed stale reviews from EngHabu and eapolinario via e35c601 October 9, 2023 17:11
Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb requested review from EngHabu and eapolinario October 9, 2023 18:21
func ApplyInterruptibleNodeAffinity(interruptible bool, podSpec *v1.PodSpec) {
func ApplyGPUNodeSelectors(podSpec *v1.PodSpec, gpuAccelerator *core.GPUAccelerator) {
	// Short circuit if pod spec does not contain any containers that use GPUs
	gpuResourceName := config.GetK8sPluginConfig().GpuResourceName
Contributor

I could swear I left a comment about this but don't see it... can we avoid calling config.GetK8sPluginConfig()...
Can we call it once in the "root function" and just pass it down to all of these... it makes these functions more "obviously" testable...

@jeevb (Contributor, Author) commented Oct 9, 2023

Following this pattern here:

gpuResourceName := config.GetK8sPluginConfig().GpuResourceName

and setting up per-test config, per convention, as such:

assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{

It would be a non-trivial refactor to pass config down from root via args in all these places. Happy to do it, but should that precede this PR?
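
For illustration, a minimal sketch of that per-test convention as it might look for the new GPU helpers; GpuDeviceNodeLabel and the GPUAccelerator field name are assumptions, and the assertions are only indicative:

// Illustrative test sketch, not verbatim from this PR: set the plugin config
// per test, build a GPU-requesting pod spec, and assert that selectors and
// tolerations were injected.
func TestApplyGPUNodeSelectors(t *testing.T) {
	assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{
		GpuDeviceNodeLabel:        "cloud.google.com/gke-accelerator", // assumed field name
		GpuPartitionSizeNodeLabel: "cloud.google.com/gke-gpu-partition-size",
	}))

	podSpec := &v1.PodSpec{
		Containers: []v1.Container{{
			Resources: v1.ResourceRequirements{
				Limits: v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			},
		}},
	}
	ApplyGPUNodeSelectors(podSpec, &core.GPUAccelerator{Device: "nvidia-tesla-t4"}) // Device is an assumed field name
	assert.NotNil(t, podSpec.Affinity)
	assert.NotEmpty(t, podSpec.Tolerations)
}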

@jeevb jeevb force-pushed the jeev/gpu-selection branch from 69cbd9c to f958e9c on October 12, 2023 03:22
@eapolinario (Contributor) left a comment

We can always refactor to improve testability in a separate PR.

@jeevb jeevb merged commit 737ef23 into master Oct 12, 2023
40 checks passed
@jeevb jeevb deleted the jeev/gpu-selection branch October 12, 2023 16:42