From 07d3cfcd4c0a81b4b47ad73a77716e6a974a831c Mon Sep 17 00:00:00 2001
From: David Espejo <82604841+davidmirror-ops@users.noreply.github.com>
Date: Thu, 25 Jul 2024 15:29:25 -0500
Subject: [PATCH] Update GPU docs (#5515)

* Introduce 3 levels
* Fix ImageSpec config
* Rephrase 1st section and prereqs
* Expand 2nd section up to nodeSelector key
* Add partition scheduling info
* Reorganize instructions
* Improve clarity
* Apply reviews pt1
* Add note on default scheduling behavior
* Add missing YAML and rephrase full A100 behavior

---------

Signed-off-by: davidmirror-ops
---
 .../configuring_access_to_gpus.md | 409 +++++++++++++++++-
 1 file changed, 390 insertions(+), 19 deletions(-)

diff --git a/docs/user_guide/productionizing/configuring_access_to_gpus.md b/docs/user_guide/productionizing/configuring_access_to_gpus.md
index 60e4a35ced..7ae6213a5f 100644
--- a/docs/user_guide/productionizing/configuring_access_to_gpus.md
+++ b/docs/user_guide/productionizing/configuring_access_to_gpus.md
@@ -6,31 +6,402 @@
.. tags:: Deployment, Infrastructure, GPU, Intermediate

Along with compute resources like CPU and memory, you may want to configure and access GPU resources.

Flyte provides different ways to request accelerator resources directly from the task decorator.

> The examples in this section use [ImageSpec](https://docs.flyte.org/en/latest/user_guide/customizing_dependencies/imagespec.html#imagespec), a Flyte feature that builds a custom container image without a Dockerfile. Install it with `pip install flytekitplugins-envd`.
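For orientation, the sketch below previews the three request levels covered on this page: any available GPU, a specific device type, and a Multi-Instance GPU (MIG) partition. It is a minimal, hypothetical composite of the examples that follow, not part of the original page; the `container_image=image` wiring assumes the `ImageSpec` shown in those examples.

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100, V100

# Abbreviated version of the torch-enabled ImageSpec used in the examples below
image = ImageSpec(name="pytorch", packages=["torch"], registry="", builder="default")

# Level 1: any available GPU
@task(requests=Resources(gpu="1"), container_image=image)
def any_gpu() -> bool:
    return torch.cuda.is_available()

# Level 2: a specific device type
@task(requests=Resources(gpu="1"), accelerator=V100, container_image=image)
def v100_gpu() -> bool:
    return torch.cuda.is_available()

# Level 3: a MIG partition of an A100
@task(requests=Resources(gpu="1"), accelerator=A100.partition_2g_10gb, container_image=image)
def a100_slice() -> bool:
    return torch.cuda.is_available()
```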
## Requesting a GPU with no device preference

The goal in this example is to run the task on a single available GPU:

```python
import torch
from flytekit import ImageSpec, Resources, task

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry="",
)

@task(requests=Resources(gpu="1"), container_image=image)
def gpu_available() -> bool:
    return torch.cuda.is_available()  # returns True if CUDA (provided by a GPU) is available
```

### How it works

![](https://raw.githubusercontent.com/flyteorg/static-resources/main/flyte/deployment/gpus/generic_gpu_access.png)

When this task is evaluated, `flytepropeller` injects a [toleration](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) into the pod spec:

```yaml
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
```

The Kubernetes scheduler will admit the pod if there are worker nodes in the cluster with a matching taint and available resources.

The `nvidia.com/gpu` resource name is not arbitrary, though. It corresponds to the [Extended Resource](https://kubernetes.io/docs/tasks/administer-cluster/extended-resource-node/) that the Kubernetes worker nodes advertise to the API server through the [device plugin](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#using-device-plugins). Using the information provided by the device plugin, the Kubernetes scheduler allocates an available accelerator to the pod.

> NVIDIA maintains a [GPU operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) that automates the management of all software prerequisites on Kubernetes, including the device plugin.

By default, `flytekit` assumes that `nvidia.com/gpu` is the resource name for your GPUs. If your GPU accelerators expose a different resource name, adjust the following key in the Helm values file:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-resource-name:
```

**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-resource-name:
```

If your infrastructure requires additional tolerations for the scheduling of GPU resources to succeed, adjust the following section in the Helm values file:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
            - key: "mykey"
              operator: "Equal"
              value: "myvalue"
              effect: "NoSchedule"
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        resource-tolerations:
          - nvidia.com/gpu:
            - key: "mykey"
              operator: "Equal"
              value: "myvalue"
              effect: "NoSchedule"
```
> For the above configuration, your worker nodes should have a `mykey=myvalue:NoSchedule` [taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) configured.

## Requesting a specific GPU device

The goal is to run the task on a specific type of accelerator, an NVIDIA Tesla V100 in the following example:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import V100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry="",
)

@task(
    requests=Resources(gpu="1"),
    accelerator=V100,  # NVIDIA Tesla V100
    container_image=image,
)
def gpu_available() -> bool:
    return torch.cuda.is_available()
```
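If you want to confirm from inside the task which device was actually allocated, a minimal sketch (an illustrative helper, not part of the original page, assuming `torch` is installed in the image as above) can return the device name that CUDA reports:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import V100

# Same kind of torch-enabled image as in the example above
image = ImageSpec(name="pytorch", packages=["torch"], registry="", builder="default")

@task(requests=Resources(gpu="1"), accelerator=V100, container_image=image)
def allocated_device() -> str:
    # Report which accelerator CUDA sees inside the pod, e.g. "Tesla V100-SXM2-16GB"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    return torch.cuda.get_device_name(0)
```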
### How it works

When this task is evaluated, `flytepropeller` injects both a toleration and a [nodeSelector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector) for a more flexible scheduling configuration.

An example pod spec on GKE would include the following:

```yaml
apiVersion: v1
kind: Pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-accelerator
            operator: In
            values:
            - nvidia-tesla-v100
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - key: nvidia.com/gpu # auto
    operator: Equal
    value: present
    effect: NoSchedule
  - key: cloud.google.com/gke-accelerator
    operator: Equal
    value: nvidia-tesla-v100
    effect: NoSchedule
```
### Configuring the nodeSelector
The `key` that the injected node selector uses corresponds to an arbitrary label that your Kubernetes worker nodes should already have. In the above example it's `cloud.google.com/gke-accelerator`, but depending on your cloud provider it could be any other value. You can inform Flyte about the labels your worker nodes use by adjusting the Helm values:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your node's config
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-device-node-label: "cloud.google.com/gke-accelerator" # change to match your node's config
```
While the `key` is arbitrary, the value (`nvidia-tesla-v100`) is not. `flytekit` has a set of [predefined](https://docs.flyte.org/en/latest/api/flytekit/extras.accelerators.html#predefined-accelerator-constants) constants, and your node label has to use one of those values.

## Requesting a GPU partition

`flytekit` supports [Multi-Instance GPU partitioning](https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/#mig_partitioning_and_gpu_instance_profiles) on NVIDIA A100 devices for optimal resource utilization.

Example:
```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry="",
)

@task(
    requests=Resources(gpu="1"),
    accelerator=A100.partition_2g_10gb,  # 2 compute instances with a 10GB memory slice
    container_image=image,
)
def gpu_available() -> bool:
    return torch.cuda.is_available()
```
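To sanity-check that the task landed on the expected MIG slice, a minimal sketch (again an illustrative helper under the same assumptions as the example above) can report the memory visible to CUDA, which for a `2g.10gb` profile should be on the order of 10 GB:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(name="pytorch", packages=["torch"], registry="", builder="default")

@task(requests=Resources(gpu="1"), accelerator=A100.partition_2g_10gb, container_image=image)
def visible_gpu_memory_gb() -> float:
    # For a 2g.10gb MIG profile this should report roughly 10 GB of device memory
    props = torch.cuda.get_device_properties(0)
    return props.total_memory / 1e9
```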
### How it works
In this case, `flytepropeller` injects an additional node selector expression into the resulting pod spec, indicating the partition size:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: nvidia.com/gpu.partition-size
            operator: In
            values:
            - 2g.10gb
```

Plus an additional toleration:

```yaml
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu.accelerator
  operator: Equal
  value: nvidia-tesla-a100
- effect: NoSchedule
  key: nvidia.com/gpu.partition-size
  operator: Equal
  value: 2g.10gb
```
Consequently, your Kubernetes worker nodes should have matching labels so the Kubernetes scheduler can admit the pods:

Node labels (example):
```yaml
nvidia.com/gpu.partition-size: "2g.10gb"
nvidia.com/gpu.accelerator: "nvidia-tesla-a100"
```

If you want to better control scheduling, configure your worker nodes with taints that match the tolerations injected into the pods.

In the example, the `nvidia.com/gpu.partition-size` key is arbitrary and can be controlled from the Helm chart:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-partition-size-node-label: "nvidia.com/gpu.partition-size" # change to match your node's config
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-partition-size-node-label: "nvidia.com/gpu.partition-size" # change to match your node's config
```
The `2g.10gb` value comes from the [NVIDIA A100 supported instance profiles](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#concepts) and is controlled from the task decorator (`accelerator=A100.partition_2g_10gb` in the above example). Depending on the profile requested in the task, Flyte will inject the corresponding value for the node selector.

> Learn more about the full list of partition profiles and task decorator options that `flytekit` supports [here](https://docs.flyte.org/en/latest/api/flytekit/generated/flytekit.extras.accelerators.A100.html#flytekit.extras.accelerators.A100).

## Additional use cases

### Request an A100 device with no preference for partition configuration

Example:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry="",
)

@task(
    requests=Resources(gpu="1"),
    accelerator=A100,
    container_image=image,
)
def gpu_available() -> bool:
    return torch.cuda.is_available()
```
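The `accelerator` setting is resolved per task, so tasks with different GPU requirements can be mixed in a single workflow. A small usage sketch under the same assumptions as the examples above (the task names are hypothetical):

```python
import torch
from flytekit import ImageSpec, Resources, task, workflow
from flytekit.extras.accelerators import A100, V100

image = ImageSpec(name="pytorch", packages=["torch"], registry="", builder="default")

@task(requests=Resources(gpu="1"), accelerator=V100, container_image=image)
def train() -> bool:
    # Scheduled on a node that advertises a V100
    return torch.cuda.is_available()

@task(requests=Resources(gpu="1"), accelerator=A100, container_image=image)
def evaluate() -> bool:
    # Scheduled on an A100 node; the default partition size applies (see below)
    return torch.cuda.is_available()

@workflow
def gpu_wf() -> bool:
    # The two tasks have no data dependency, so they can run in parallel
    train()
    return evaluate()
```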
#### How it works

`flytekit` uses a default `2g.10gb` partition size, and `flytepropeller` injects a node selector that matches labels on nodes with an A100 device:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.accelerator
            operator: In
            values:
            - nvidia-tesla-a100
```

### Request an unpartitioned A100 device
The goal is to run the task using the resources of the entire A100 GPU:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(
    base_image="ghcr.io/flyteorg/flytekit:py3.10-1.10.2",
    name="pytorch",
    python_version="3.10",
    packages=["torch"],
    builder="default",
    registry="",
)

@task(
    requests=Resources(gpu="1"),
    accelerator=A100.unpartitioned,  # request the entire A100 device
    container_image=image,
)
def gpu_available() -> bool:
    return torch.cuda.is_available()
```

#### How it works

When this task is evaluated, `flytepropeller` injects a node selector expression that only matches nodes where the label specifying a partition size is **not** present:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.accelerator
            operator: In
            values:
            - nvidia-tesla-a100
          - key: nvidia.com/gpu.partition-size
            operator: DoesNotExist
```
The expression can be controlled from the Helm values:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-unpartitioned-node-selector-requirement:
          key: cloud.google.com/gke-gpu-partition-size # change to match your node label configuration
          operator: Equal
          value: DoesNotExist
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-unpartitioned-node-selector-requirement:
          key: cloud.google.com/gke-gpu-partition-size # change to match your node label configuration
          operator: Equal
          value: DoesNotExist
```

Scheduling can be further controlled by setting, in the Helm chart, a toleration that `flytepropeller` injects into the task pods:

**flyte-core**
```yaml
configmap:
  k8s:
    plugins:
      k8s:
        gpu-unpartitioned-toleration:
          effect: NoSchedule
          key: cloud.google.com/gke-gpu-partition-size
          operator: Equal
          value: DoesNotExist
```
**flyte-binary**
```yaml
configuration:
  inline:
    plugins:
      k8s:
        gpu-unpartitioned-toleration:
          effect: NoSchedule
          key: cloud.google.com/gke-gpu-partition-size
          operator: Equal
          value: DoesNotExist
```
If your Kubernetes worker nodes use taints, they need to match the above configuration.
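To confirm from inside the task whether it received a full device rather than a MIG slice, a small sketch (an illustrative helper under the same assumptions as above; note that CUDA usually appends the MIG profile to the reported device name) could look like this:

```python
import torch
from flytekit import ImageSpec, Resources, task
from flytekit.extras.accelerators import A100

image = ImageSpec(name="pytorch", packages=["torch"], registry="", builder="default")

@task(requests=Resources(gpu="1"), accelerator=A100.unpartitioned, container_image=image)
def full_device_check() -> str:
    # On a MIG slice the reported name usually includes the profile, e.g. "... MIG 2g.10gb";
    # on an unpartitioned device it is just the product name.
    name = torch.cuda.get_device_name(0)
    return f"running on: {name}"
```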