Claiming i915 resources in a single container has 2 cards accessible for Intel GPU Flex 140 #1377
Comments
Thanks for reaching out! In short, the scenario is by design. The best option for you is to modify the deployment to use balanced mode. As for Flex 140 vs. 170, the GPU plugin treats them the same. For now we are not planning to change the default allocation policy. Longer story: our GPU Aware Scheduler provides better granularity for reserving GPUs, but it's a bigger effort to set up.
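As a rough sketch of what switching to balanced mode can look like, the relevant part is the GPU plugin's command-line options; the DaemonSet container name and image below are placeholders, only the -allocation-policy and -shared-dev-num arguments are the point:

```yaml
# Hypothetical excerpt of the GPU plugin DaemonSet container spec;
# everything except the args is illustrative.
containers:
  - name: intel-gpu-plugin            # placeholder name
    image: intel/intel-gpu-plugin     # placeholder image reference
    args:
      - "-shared-dev-num=10"          # upper limit on containers sharing one GPU
      - "-allocation-policy=balanced" # none (default) | balanced | packed
```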
This tells more about the features supported by the GPU Aware Scheduler (GAS): https://github.com/intel/platform-aware-scheduling/blob/master/gpu-aware-scheduling/docs/usage.md Note that there are a few gotchas in installing GAS, see: intel/platform-aware-scheduling#126
Thanks for your quick response, we will have a look at how GAS handles it. However, the current design might cause an inconsistency issue. For example: the first container scheduled to access the i915 resource may only get access to a single card (card0). So with the same configuration, different containers might end up with different access to the two cards, which makes system results and behavior inconsistent. I think a possible solution is to limit the i915 resource to 1 for each container; that would resolve the potential issues. Thanks!
Thank you! So the none policy allocates a number of cards (in this case 1 or 2) to each pod arbitrarily, irrespective of how many resources are requested, and it can differ each time; is my understanding right? In this example with balanced mode, containers requesting <= 2 resources see one card, and containers requesting > 2 resources see 2 cards. Is this expected as well?
Yes, the "i915" resource is the number of GPU devices that should be visible [1] within the given container. The GPU plugin share count (CLI option) is just an upper limit on how many containers a given GPU device may be assigned to at the same time.

When using just the GPU plugin, you can limit each container to its own GPU device only by using a share count of 1. Whereas with GAS, you specify the sharing using GPU memory and millicore resource requests (and by having a large enough share count for the GPU plugin to support that sharing), like you would do with k8s CPU resources. If e.g. at most 2 containers should be given the same GPU, containers should request half a core, i.e. 500 millicores (see the linked GAS docs for pod spec examples).

Note that these GPU resource requests are used only for pod scheduling and device assignment decisions; there is no (cgroup based) enforcement for them, like there is for CPU resources [2]. To avoid the GPU being underutilized (when it does not have enough work to do), or container performance suffering (when the GPU has too much work), those requests should roughly correspond to how much GPU your workload actually uses. You can check that e.g. with the GPU metrics exporters [3].

[1] Whether those devices can actually be used when the container uses some other user ID than root depends on the user ID, device file access rights on the node hosts, and container runtime settings. For more info, see: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/

[2] Upstream kernel DRM infrastructure does not support cgroups for GPUs yet. Currently, enforcing things for GPU sharing would require e.g. a GPU that supports SR-IOV (as Flex does) and configuring the SR-IOV partitions beforehand. Each SR-IOV partition will then show up as a separate GPU.

[3] Most GPU metrics are available for the SR-IOV PF device, not for the VFs. The GPU plugin handles that by providing only VFs for "i915" resource requests, and an "i915_monitoring" resource (requested by GPU exporters) which includes the PF device in addition to all VFs. Note: the monitoring resource needs to be explicitly enabled with a GPU plugin option, it's not enabled by default.
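As a sketch of the GAS-style requests described above (the millicores and memory.max resource names follow the ones mentioned later in this thread; the values are examples only), a container that should share a GPU with at most one other container could request half a core:

```yaml
# Illustrative container resources for use with GAS; values are examples only.
resources:
  limits:
    gpu.intel.com/i915: 1          # one GPU device visible in the container
    gpu.intel.com/millicores: 500  # half a GPU "core" => at most 2 such containers per GPU
    gpu.intel.com/memory.max: 4G   # rough upper bound for the workload's GPU memory use
```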
None policy can result in 1 or 2 actual GPUs for the container if i915 == 2. Balanced policy should result in 2 actual GPUs for the container if i915 == 2. If this is not the case, can you supply GPU-plugin logs for me to check? You'll have to add "-v=5" to the gpu-plugin arguments, as the default doesn't print much of anything.
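For reference, the verbosity option goes into the plugin's argument list alongside the other options; the surrounding values here are just placeholders:

```yaml
# Hypothetical GPU plugin args with verbose logging enabled.
args:
  - "-shared-dev-num=10"
  - "-allocation-policy=none"
  - "-v=5"   # verbose logging, needed to see allocation decisions in the logs
```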
Yes, that's correct. Packed mode "packs" each GPU before moving to the next one.
That's a way to work around the limitations, yes.
Do we have a way to limit the i915 resource claim to 1 from the GPU device plugin side? Or is there some other way to achieve it? Currently we cannot use GAS yet.
I get the feeling we might be thinking about this differently. Do you have a workload that would require both GPUs from a single Flex 140 physical card? We don't currently have such an allocation policy, though I believe it could be implemented. The next best thing is to set shared-dev-num to > 1, set the allocation policy to balanced, and use i915 == 2 in containers. That should allocate GPUs so that each pod gets access to one physical GPU (with two logical devices). But in general, one should consider the Flex 140 logical cards as normal GPUs. They share power and bus, but other than that they work as independent entities.
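A minimal sketch of the container resources for that recommendation (only the limits matter; with shared-dev-num > 1 and the balanced policy, the two i915 resources should map to the two logical GPUs of one Flex 140 physical card):

```yaml
# Illustrative limits for a container that should get one whole Flex 140 card.
resources:
  limits:
    gpu.intel.com/i915: 2
```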
I think since we have several different combinations of these options, we also need to let users know the behavior and potential issues of each combination, and which combination we recommend.
IMO shared dev num is not really intended to give more than one GPU to a container. You could request 10 i915 resources on a node with only two physical GPUs and the container would receive those 10 resources, but the container would get only one or two actual GPU devices.
Yes, I agree. I created a ticket for improving the documentation: #1381
In general, sharing is more for things that are batched, rather than things that have specific latency requirements. If you do have latency requirements, then in addition to either using a small enough share count that more workloads cannot be scheduled to the same GPU, or using GAS + suitable Pod GPU resource requests to take care of that, you need to vet the workloads to make sure they fit on the same GPU without causing latency issues for the workloads running in parallel on it. (Reminder: the kernel does not provide enforcement for how much GPU time each client can use, except for the "reset the offending GPU context" hang check, which you may want to disable when running "random" HPC use-cases because their compute kernels may run long enough to trigger it.)
Thanks @vbedida79 for updating the "Possible solutions"
Thanks for your support!
Using shared-dev-num = 1 (=no sharing) is the simplest case. I'd say that's the best basic setup. You can then have as many containers using a GPU as you have GPUs (as long as each container requests one GPU).
Currently with only GPU-plugin, the only way to achieve that is to use shared-dev-num == 1. Though, I guess we could change the default functionality so that it tries to pick different GPUs, if possible. But even that would start returning duplicate GPUs when the available GPUs are scarce.
What would be the use cases for having shared-dev-num > 1?
Sure.
It's required to support GPU sharing with GAS resource management, as it's the upper limit on the sharing. It can also be used when you have only a single GPU per node, know all the workloads, and manually make sure that everything you schedule fits on the GPUs (heavier jobs have affinity rules to avoid them being scheduled to the same node, and if you have multiple heavy deployments, you either use labels to put them on different sets of nodes, or do not schedule them at the same time). It can also be useful for testing / development if you do not have that many GPUs in your cluster.
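One way to express the "heavier jobs avoid each other" part above is a pod anti-affinity rule; the label key and value below are made up for illustration:

```yaml
# Hypothetical anti-affinity keeping pods labeled workload=gpu-heavy
# off nodes that already run another such pod.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            workload: gpu-heavy          # made-up label
        topologyKey: kubernetes.io/hostname
```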
Please allow me to describe the "i915 resource set > 1" case more clearly:

```yaml
resources:
  limits:
    gpu.intel.com/i915: 1
    gpu.intel.com/tiles: 2
```

But we found that users can actually set gpu.intel.com/i915 to any number they want, like:

```yaml
resources:
  limits:
    gpu.intel.com/i915: 3
    gpu.intel.com/tiles: 2
```

That will cause the inconsistency issue we mentioned. We hope that users can only set gpu.intel.com/i915 to 1; all other numbers should not be allowed.
It's by design. While a single workload using multiple physical GPUs is uncommon, it's a valid "scale-up" use-case, for example AI training back propagation with GPU-to-GPU communication (e.g. using PVC XeLink). The PR linked by Tuomas makes a best effort to provide the workload with separate GPU devices when it requests multiple ones, even when the "none" policy is used and GAS is not in use.
Share count = 1 on nodes that have at most a single GPU would result in pods whose container(s) request more than one i915 resource remaining in the Pending state. If you'd like such pods to be kept in the Pending state in other situations too, or to be rejected outright, that's probably better done on the k8s control plane side, as generic resource management functionality, rather than something GPU specific.

There's already LimitRange for normal k8s resources (memory/CPU): https://kubernetes.io/docs/concepts/policy/limit-range/ Maybe that could be extended to support extended resource counts as well. It would be an upstream k8s feature, but you might want to file a separate ticket here as a reminder, if you really need such a thing. Getting it upstream would be a much longer process though.

PS. If you want a one-liner for finding out how many GPUs the containers currently running in your cluster have requested, you can read the requested gpu.intel.com/i915 counts from the pod specs with kubectl; including the pod names in the output shows which pods those resources belong to.
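For reference, this is roughly what the existing LimitRange mechanism looks like for CPU and memory; whether and how it could also cover extended resource counts such as gpu.intel.com/i915 is the open upstream question mentioned above:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: per-container-limits   # example name
spec:
  limits:
    - type: Container
      max:
        cpu: "2"       # no container may request/limit more than 2 CPUs
        memory: 4Gi    # ...or more than 4Gi of memory
```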
@eero-t
From https://github.com/intel/platform-aware-scheduling/tree/master/gpu-aware-scheduling
With shared-dev-num == 1 => no sharing

The i915 resource (gpu.intel.com/i915) represents a GPU device on a node. It is similar to k8s' traditional resources like cpu or memory, with the exception that i915 only supports full values (not 0.5 or 500m). In practice, when a Pod requests one i915 resource in its pod spec, the resulting container will see one GPU device. If the pod spec indicates 2 i915 resources, the resulting container will see two GPU devices. No other containers will have access to the GPUs that were allocated to the Pod.

With shared-dev-num > 1 => sharing

The difference here is that a GPU device on a node is not dedicated to any particular Pod. There can be multiple Pods using a GPU on a node (up to the shared-dev-num count). Let's say shared-dev-num is set to 2. A Pod requesting a single i915 resource can then end up using a GPU that one other Pod is also using.

Shared-dev-num selection

Shared-dev-num should be selected depending on the use case of the cluster. If you are doing heavy computing, shared-dev-num should probably be set to 1: you are using the whole GPU and there's no room for other users. Note: as you are not using GAS, you shouldn't mix millicores or memory.max resources into the equation. They don't apply when you are only using the GPU plugin.
FYI: When Tuomas says "Pod" above, he really means "container in a Pod". Resources are container attributes...
(Tuomas/Ukri, please correct me in case I've misunderstood / remember something wrong.)

Technically, it's a per-node list of resource IDs [1] and matching device file paths. The k8s scheduler knows just the size of that list and how many of the items are already allocated on each node; when scheduling a pod that requests the i915 resource, it simply picks a node with enough unallocated items. When the pod is deployed on that node, kubelet (the k8s node agent) asks the device plugin for the device paths matching the specified resource IDs. At that point the Intel GPU plugin can still affect them, based e.g. on pod annotations added by GAS. (When GAS is in use, it tells the GPU plugin which device should be mapped to which container using pod annotations. Only when GAS is not used does the GPU plugin make (some) independent decisions about that.) Kubelet adds the provided device paths to the pod container's OCI spec before giving it to the container runtime, which uses them to make the devices available inside the container [2].

[1] Different GPU plugins (Intel, AMD, Nvidia) use different resource IDs: BDF (PCI bus address), device UUID, or in the case of the Intel GPU plugin, the GPU device control file name (as that makes some things simpler with GAS).

[2] For non-root containers, see: https://kubernetes.io/blog/2021/11/09/non-root-containers-and-devices/

Btw, there's also a use-case where different containers want the same GPU device. For example, a cloud gaming pod where one of its containers is for the actual game, and another one for the media streaming functionality. GAS has functionality to handle these different kinds of use-cases. (IMHO using share count > 1 without GAS is a bit of a corner-case.)

PS. In the future, GPU workloads could also use DRA (dynamic resource allocation): https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation#summary With DRA, GPU device requests would work more like current k8s volume handling: you have separate device allocation objects (similar to PVs, Persistent Volumes) which pods can then claim (similar to PVCs, Persistent Volume Claims).
Discussed with Ukri, and there's a clear use-case for this: a dedicated cluster with a specific GPU workload pod, where it has been validated how many instances of that pod can work in parallel on the same GPU (without interfering with each other too much). That does not need GAS, just the GPU plugin with a suitable share count.
@mythi Could you help clarify the i915 resource definition in the GPU device plugin? We have to clearly understand this basic resource definition. The i915 resource definition @eero-t mentioned makes sense to me, but it is not aligned with what we observed in the current GPU device plugin.
I'm not the right person for this. The GPU maintainers in this thread can follow up on this if necessary.
Your equation is correct.
I've created two issues for the continuation: #1407 for the i915 documentation improvements and #1408 for the i915 resource limitation per Pod. I'd like the discussion to continue in those tickets. I hope our understanding of i915 and shared-dev-num is now aligned.
It is aligned with what I described.
Summary
With the default allocation policy "none", shared-dev-num=10 and Intel® Data Center GPU Flex 140, if a container is allocated 3 or more gpu.intel.com/i915 resources, 2 GPU cards are accessible in the pod.

Detail

On an OpenShift 4.12 cluster, created a GPU device plugin CR with allocation-policy none and shared-dev-num 10 on a node with an Intel GPU Flex 140 card. So the node has 2 cards and 20 gpu.intel.com/i915 resources. For a clinfo pod, if 1 or 2 gpu.intel.com/i915 resources are allocated, one card is accessible from the pod. If 3 or more resources are allocated, both cards are accessible inside the pod.
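A minimal sketch of the container resources that trigger the two-cards-visible behavior described above (pod metadata and the clinfo image are omitted; only the resource limit is taken from the report):

```yaml
# Illustrative limits for the clinfo test pod; with allocation policy "none"
# and shared-dev-num 10, requesting 3 i915 resources made both cards visible.
resources:
  limits:
    gpu.intel.com/i915: 3
```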
Is this expected by design?
Possible solutions
Card sharing: Card sharing from the drivers/hardware perspective is limited/not supported. Though 2 containers can share a card, there is no card isolation between containers, which can raise security concerns. As we cannot use GAS yet, can we assume there is no way to support card sharing among containers? Is it possible to disable the sharing of a card, i.e. change the resource-allocation-policy options, so that the user can only input shared-dev-num and one container uses the card exclusively?

Inconsistency between claimed resources and access to GPU devices: Do containers need multiple i915 resources? In the case of the packed policy with shared-dev-num: 3 and each container requesting 2 resources, one container has access to 1 card while another container has access to 2 cards. A container should claim 1 i915 resource, but containers can claim more than one.
To resolve it, can containers be limited to request only 1 i915 resource each?
Thanks!