[Extended Resources] GPU Accelerators #4172

Merged
jeevb merged 12 commits into master from jeev/gpu-selection on Oct 12, 2023

Conversation

jeevb
Contributor

@jeevb jeevb commented Oct 5, 2023

This PR introduces the concept of ExtendedResources, which is meant to encapsulate all specialized resources that are not already captured by container resources (v1.ResourceRequirements), and adds GPU accelerators as one such resource. We plan to leverage ExtendedResources for other specialized resources, such as shared memory (/dev/shm), in the future.

Implementation

In this proposed implementation, ExtendedResources is added as a field within TaskTemplate for task-level configuration and within TaskNodeOverrides for node-level overrides. We considered adding ExtendedResources directly to the IDL Resources object, but found that Resources is highly specific to container tasks and is not propagated correctly for pod templates and Pod tasks. TaskTemplate seemed like the most reasonable candidate outside of Resources.
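
To make the intended precedence concrete, here is a minimal, hypothetical Go sketch (not verbatim from this PR) of how a node-level override in TaskNodeOverrides could take priority over the task-level ExtendedResources in TaskTemplate; the getter names are assumed to follow the generated protobuf accessors:

// Hypothetical sketch: resolve the effective ExtendedResources,
// letting node-level overrides win over the task-level setting.
// Assumes generated getters GetExtendedResources() on core.TaskTemplate
// and core.TaskNodeOverrides.
func resolveExtendedResources(tmpl *core.TaskTemplate, overrides *core.TaskNodeOverrides) *core.ExtendedResources {
	if overrides != nil && overrides.GetExtendedResources() != nil {
		return overrides.GetExtendedResources()
	}
	if tmpl != nil {
		return tmpl.GetExtendedResources()
	}
	return nil
}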

Generally, targeting tasks to specific GPU accelerators is a matter of setting the appropriate node affinities and, where relevant, tolerations. Mixed-partition multi-instance GPUs (MIGs) are an exception and have been excluded from the scope of this PR; we may add support for them in a future PR.

When building a resource for a task requesting a specific GPU accelerator, FlytePropeller now injects the following (a rough code sketch follows this list):

  1. Paired node selector requirement and toleration for the requested GPU device. The key is configurable via the gpu-device-node-label variable in the FlytePropeller k8s plugin configuration. A request for a T4 GPU might add the following to the resulting pod spec:
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-t4
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-t4
  2. Paired node selector requirement and toleration for the requested GPU partition size (or unpartitioned). The key is configurable via the gpu-partition-size-node-label variable in the FlytePropeller k8s plugin configuration, and the node selector requirement and toleration to add for an unpartitioned MIG-capable GPU can be configured via gpu-unpartitioned-node-selector-requirement and gpu-unpartitioned-toleration respectively. A request for a partitioned or unpartitioned A100 GPU might add the following to the resulting pod spec:
# 2g.10gb partition size
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: In
            values:
            - 2g.10gb
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
  - effect: NoSchedule
    key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: 2g.10gb

# Unpartitioned
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: DoesNotExist
...
  tolerations:
  - effect: NoSchedule
    key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
  - effect: NoSchedule
    key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: DoesNotExist
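
As a rough illustration of how such a node selector requirement and matching toleration pair could be built, here is a hedged Go sketch; it is not the exact FlytePropeller implementation, and the GpuDeviceNodeLabel config field name is an assumption (only GpuPartitionSizeNodeLabel appears verbatim in this PR):

// Illustrative sketch only: construct the paired node selector requirement and
// toleration for a requested GPU device (e.g. "nvidia-tesla-t4"), keyed on the
// configured gpu-device-node-label.
func gpuDeviceSelectorAndToleration(cfg *config.K8sPluginConfig, device string) (v1.NodeSelectorRequirement, v1.Toleration) {
	req := v1.NodeSelectorRequirement{
		Key:      cfg.GpuDeviceNodeLabel, // assumed field name for gpu-device-node-label
		Operator: v1.NodeSelectorOpIn,
		Values:   []string{device},
	}
	tol := v1.Toleration{
		Key:      cfg.GpuDeviceNodeLabel,
		Operator: v1.TolerationOpEqual,
		Value:    device,
		Effect:   v1.TaintEffectNoSchedule,
	}
	return req, tol
}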

Usage

ExtendedResources are applied to the resources generated by all K8s plugins, with the exception of Spark for now. We believe this change gives operators good control over targeting workloads to specific GPU-accelerated nodes simply by tuning the labels and taints attached to those nodes. Consider a deployment with three different GPU-accelerated node groups: T4, 2g.10gb-partitioned A100, and unpartitioned A100. Operators may want to make the T4 node group the default, the 2g.10gb-partitioned A100 node group the default A100 node group, and require that the unpartitioned A100 node group be explicitly requested for special workloads. This can be achieved as follows:

  1. Create node groups with the following labels and taints:
T4 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
2g.10gb A100 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # auto
    cloud.google.com/gke-gpu-partition-size: 2g.10gb  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-accelerator
      value: nvidia-tesla-a100
      effect: "NO_SCHEDULE"
Unpartitioned A100 node group
  labels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # auto
  taints:
    - key: nvidia.com/gpu  # auto
      value: present
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-accelerator
      value: nvidia-tesla-a100
      effect: "NO_SCHEDULE"
    - key: cloud.google.com/gke-gpu-partition-size
      value: DoesNotExist
      effect: "NO_SCHEDULE"
  2. Add the following values to the FlytePropeller k8s plugin configuration:
plugins:
  k8s:
    gpu-device-node-label: cloud.google.com/gke-accelerator
    gpu-partition-size-node-label: cloud.google.com/gke-gpu-partition-size
    gpu-unpartitioned-toleration:
      effect: NoSchedule
      key: cloud.google.com/gke-gpu-partition-size
      operator: Equal
      value: DoesNotExist
  3. Users may now do one of the following:
Request a GPU with no preference for device (scheduled on the T4 node group)
@task(limits=Resources(gpu="1"))
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
Explicitly request a T4 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=T4,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-t4
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-t4
    effect: NoSchedule
Request an A100 GPU with no preference for partition size (scheduled on the default, 2g.10gb-partitioned A100 node group)
@task(
    limits=Resources(gpu="1"),
    accelerator=A100,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
Explicitly request a partitioned A100 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=A100.partition_2g_10gb,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: In
            values:
            - 2g.10gb
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
  - key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: 2g.10gb
    effect: NoSchedule
Explicitly request an unpartitioned A100 GPU
@task(
    limits=Resources(gpu="1"),
    accelerator=A100.unpartitioned,
)
def my_task() -> None:
    ...

with pod spec:

apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: cloud.google.com/gke-gpu-partition-size
            operator: DoesNotExist
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu  # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-a100
    effect: NoSchedule
  - key: cloud.google.com/gke-gpu-partition-size
    operator: Equal
    value: DoesNotExist
    effect: NoSchedule

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

@jeevb jeevb force-pushed the jeev/gpu-selection branch 3 times, most recently from a126a85 to 6b62d65, on October 5, 2023 04:12
@codecov

codecov bot commented Oct 5, 2023

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (0ca2d22) 58.95% compared to head (4344ee4) 59.97%.

❗ Current head 4344ee4 differs from pull request most recent head f958e9c. Consider uploading reports for the commit f958e9c to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4172      +/-   ##
==========================================
+ Coverage   58.95%   59.97%   +1.01%     
==========================================
  Files         621      570      -51     
  Lines       52932    41201   -11731     
==========================================
- Hits        31206    24710    -6496     
+ Misses      19229    14095    -5134     
+ Partials     2497     2396     -101     
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...go/tasks/pluginmachinery/flytek8s/config/config.go 50.00% <ø> (ø)
...propeller/pkg/apis/flyteworkflow/v1alpha1/iface.go 0.00% <ø> (ø)
...ytepropeller/pkg/compiler/transformers/k8s/node.go 74.86% <100.00%> (+3.89%) ⬆️
...ns/go/tasks/pluginmachinery/flytek8s/pod_helper.go 75.72% <97.95%> (+2.54%) ⬆️
...propeller/pkg/apis/flyteworkflow/v1alpha1/nodes.go 6.97% <0.00%> (+0.15%) ⬆️

... and 572 files with indirect coverage changes

☔ View full report in Codecov by Sentry.

Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb force-pushed the jeev/gpu-selection branch from 6b62d65 to 57b5626 on October 5, 2023 04:24
@jeevb jeevb marked this pull request as ready for review October 6, 2023 03:30
eapolinario previously approved these changes Oct 6, 2023
EngHabu previously approved these changes Oct 7, 2023
@EngHabu (Contributor) left a comment

Awesome! Thank you for putting this together. MonoRepo is already paying dividends :-)

} else {
	partitionSizeTol = &v1.Toleration{
		Key:   config.GetK8sPluginConfig().GpuPartitionSizeNodeLabel,
		Value: GpuPartitionSizeNotSet,
Contributor

Is that a k8s const value to mean something?

@jeevb (Contributor, Author) commented Oct 7, 2023

No, it’s an arbitrary default that we set as the value of the toleration added to tasks that explicitly request unpartitioned GPUs. See:

const GpuPartitionSizeNotSet = "NotSet"

The toleration is useful if running multiple GPU node pools and an unpartitioned A100 node group is protected. See scenario in description. The toleration can also be overridden via plugin config.

Contributor

Do you think this can be a surprising behavior?

@jeevb (Contributor, Author) commented Oct 9, 2023

Possibly. We can either:

  1. Document this behavior. An extraneous toleration should generally be a no-op, unless it somehow conflicts with a real, but unrelated, taint.
  2. Drop the "default" toleration and require that one be specified explicitly by config, if unpartitioned GPU node groups are protected by a taint.

In retrospect, I like (2).

jeevb (Contributor, Author)

@EngHabu: Does (2) seem reasonable to you?

jeevb (Contributor, Author)

Went ahead and made the change! :)

@jeevb jeevb dismissed stale reviews from EngHabu and eapolinario via e35c601 October 9, 2023 17:11
Signed-off-by: Jeev B <[email protected]>
@jeevb jeevb requested review from EngHabu and eapolinario October 9, 2023 18:21
func ApplyInterruptibleNodeAffinity(interruptible bool, podSpec *v1.PodSpec) {
func ApplyGPUNodeSelectors(podSpec *v1.PodSpec, gpuAccelerator *core.GPUAccelerator) {
	// Short circuit if pod spec does not contain any containers that use GPUs
	gpuResourceName := config.GetK8sPluginConfig().GpuResourceName
Contributor

I could swear I left a comment about this but don't see it... can we avoid calling config.GetK8sPluginConfig()...
Can we call it once in the "root function" and just pass it down to all of these... it makes these functions more "obviously" testable...

@jeevb (Contributor, Author) commented Oct 9, 2023

Following this pattern here:

gpuResourceName := config.GetK8sPluginConfig().GpuResourceName

and setting up per-test config, per convention, as such:

assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{

It would be a non-trivial refactor to pass config down from root via args in all these places. Happy to do it, but should that precede this PR?
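
For illustration, a minimal sketch of that per-test convention as it might look for the new GPU helpers; GpuDeviceNodeLabel and the GPUAccelerator field name are assumptions, and the assertions are only indicative:

// Illustrative test sketch, not verbatim from this PR: set the plugin config
// per test, build a GPU-requesting pod spec, and assert that selectors and
// tolerations were injected.
func TestApplyGPUNodeSelectors(t *testing.T) {
	assert.NoError(t, config.SetK8sPluginConfig(&config.K8sPluginConfig{
		GpuDeviceNodeLabel:        "cloud.google.com/gke-accelerator", // assumed field name
		GpuPartitionSizeNodeLabel: "cloud.google.com/gke-gpu-partition-size",
	}))

	podSpec := &v1.PodSpec{
		Containers: []v1.Container{{
			Resources: v1.ResourceRequirements{
				Limits: v1.ResourceList{"nvidia.com/gpu": resource.MustParse("1")},
			},
		}},
	}
	ApplyGPUNodeSelectors(podSpec, &core.GPUAccelerator{Device: "nvidia-tesla-t4"}) // Device is an assumed field name
	assert.NotNil(t, podSpec.Affinity)
	assert.NotEmpty(t, podSpec.Tolerations)
}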

@jeevb jeevb force-pushed the jeev/gpu-selection branch from 69cbd9c to f958e9c on October 12, 2023 03:22
@eapolinario (Contributor) left a comment

We can always refactor to improve testability in a separate PR.

@jeevb jeevb merged commit 737ef23 into master Oct 12, 2023
40 checks passed
@jeevb jeevb deleted the jeev/gpu-selection branch October 12, 2023 16:42