DRA: Handle extended resource requests via DRA Driver #5004

klueska · 2024-12-17T09:02:19Z

Enhancement Description

One-line enhancement description (can be used as a release note):
Allow DRA drivers to honor requests made via the extended resource API (e.g. nvidia.com/gpu: 2) rather than requiring a standard device plugin be used.
Kubernetes Enhancement Proposal:
- Add KEP for DRA: Extended Resource #5136
- Incremental PRs:
  - TBD
Discussion Link:
- https://youtu.be/fKhX_lHK8Z0?si=gq5kIFHP3ve2TXyE&t=1822
Primary contact (assignee):
@klueska, @pohly, @johnbelamaric
Responsible SIGs:
/sig node
/wg device-management
Enhancement target (which target equals to which milestone):
- Alpha release target: 1.34
- Beta release target: 1.35
- Stable release target: 1.36
Alpha
- KEP (k/enhancements) update PR(s):
  - Add KEP for DRA: Extended Resource #5136
- Code (k/k) update PR(s):
  - TBD
- Docs (k/website) update PR(s):
  - TBD


johnbelamaric · 2024-12-17T14:46:50Z

+1 yes please!

johnbelamaric · 2024-12-17T18:58:18Z

We need to sort out the requirements. A few initial questions:

- For newly created pods, I think it's clear we want this to be transparent. Existing manifests that use the extended resource API should continue to work as before, without modification.
- Can we handle this invisibly in the driver layer, or do we need to have DRA invoked at the control plane level to select the specific devices? If we don't, we will likely have a race condition, unless the scheduler can do some magical accounting (which seems possible).
- How do we handle upgrades? If we have a node running a device plugin, and we switch to the DRA driver (or we upgrade to a driver that supports both), do you have to delete the pods? Do they automatically adopt the devices? If so, how do we write those back to the allocation logic (since no DRA claim exists)?
- What happens if there are pods in a deployment, and some land on nodes with a device plugin and some on nodes with DRA drivers?
- We talked about letting specific device classes be advertised as specific extended resources. This could mean the existing resource names get mapped to specific device classes by the admin. It could also mean we have a convention like deviceclass.k8s.io/foo: 4 for extended resource names. How do these choices interplay with the questions above?
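
To make the last question concrete, here is a minimal sketch of how the two naming options could be resolved to a device class. It is only an illustration: the function name `resolve_device_class`, the `admin_mapping` argument, and the device class name `gpu.example.com` are hypothetical, and the `deviceclass.k8s.io/` prefix is just the convention floated above, not an agreed API.

```python
from typing import Dict, Optional

# Hypothetical convention prefix, taken from the example in this comment.
CONVENTION_PREFIX = "deviceclass.k8s.io/"


def resolve_device_class(resource_name: str,
                         admin_mapping: Dict[str, str]) -> Optional[str]:
    """Return the device class backing an extended resource name, or None."""
    # Option 1: an admin explicitly maps an existing name to a device class.
    if resource_name in admin_mapping:
        return admin_mapping[resource_name]
    # Option 2: the name itself encodes the class (deviceclass.k8s.io/<class>).
    if resource_name.startswith(CONVENTION_PREFIX):
        class_name = resource_name[len(CONVENTION_PREFIX):]
        return class_name or None
    # Anything else is left to the classic device plugin path.
    return None


# Illustrative mapping; "gpu.example.com" is a made-up device class name.
mapping = {"nvidia.com/gpu": "gpu.example.com"}
print(resolve_device_class("nvidia.com/gpu", mapping))
print(resolve_device_class("deviceclass.k8s.io/foo", mapping))
print(resolve_device_class("example.com/other", mapping))
```

In this sketch the admin mapping wins over the convention, which is one possible answer to how the two options would interact; the thread does not settle that ordering.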

lengrongfu · 2024-12-27T14:05:18Z

Could each DRA driver implement a webhook that creates a ResourceClaimTemplate when a pod is created and rewrites the pod so that it requests its resources through that claim instead?

klueska · 2025-01-07T19:43:35Z

@lengrongfu that is what this KEP would be designed to avoid. There would be integrated scheduler support for all drivers, rather than requiring each DRA driver to provide a webhook.
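
For readers following this exchange, the per-driver translation being discussed could look roughly like the sketch below. It is an assumption-laden illustration, not part of the KEP: the helper `claim_templates_for`, the `class_for` mapping, and the name `gpu.example.com` are hypothetical, and the manifest shape assumes the `resource.k8s.io/v1beta1` ResourceClaimTemplate layout.

```python
from typing import Any, Dict, List


def claim_templates_for(pod_name: str,
                        limits: Dict[str, str],
                        class_for: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build one ResourceClaimTemplate manifest per DRA-backed extended resource."""
    templates = []
    for resource_name, quantity in sorted(limits.items()):
        device_class = class_for.get(resource_name)
        if device_class is None:
            continue  # not served by a DRA driver in this sketch (e.g. cpu)
        templates.append({
            # Assumed API version; adjust to whatever the cluster serves.
            "apiVersion": "resource.k8s.io/v1beta1",
            "kind": "ResourceClaimTemplate",
            "metadata": {"name": f"{pod_name}-{device_class.replace('.', '-')}"},
            "spec": {"spec": {"devices": {"requests": [{
                "name": "req-0",
                "deviceClassName": device_class,
                "allocationMode": "ExactCount",
                # Extended resources are whole numbers, so int() is enough here.
                "count": int(quantity),
            }]}}},
        })
    return templates


# Hypothetical input: a pod asking for nvidia.com/gpu: 2 plus ordinary CPU.
templates = claim_templates_for(
    "demo",
    {"nvidia.com/gpu": "2", "cpu": "500m"},
    {"nvidia.com/gpu": "gpu.example.com"},
)
print(templates[0]["spec"]["spec"]["devices"]["requests"][0])
```

Every driver would have to ship and operate logic like this in its own webhook, which is the duplication that integrated scheduler support is meant to remove.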

alculquicondor · 2025-01-09T18:34:50Z

Open questions (from SIG Scheduling meeting):

- How to handle resource quotas.
- Scheduling throughput (API requests and overall processing).

ffromani · 2025-01-21T18:13:20Z

/cc

yliaog · 2025-01-28T21:42:49Z

/cc

johnbelamaric · 2025-01-30T23:01:59Z

/sig scheduling

johnbelamaric · 2025-01-30T23:03:32Z

/assign @yliaog

Yu, I am assigning this to you; let me know if that's OK.

haircommander · 2025-02-05T21:23:20Z

/label lead-opted-in
/milestone v1.33

note: PRR freeze is tomorrow! you need to have a KEP update for this opened before then. Thanks!

johnbelamaric · 2025-02-05T22:04:37Z

/stage alpha

dipesh-rawat · 2025-02-06T10:08:25Z

Hello @klueska @pohly @johnbelamaric @yliaog 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage alpha for v1.33 (correct me, if otherwise)

Here's where this enhancement currently stands:

- KEP readme using the latest template has been merged into the k/enhancements repo.
- KEP status is marked as implementable for latest-milestone: v1.33.
- KEP readme has up-to-date graduation criteria.
- KEP has a production readiness review that has been completed and merged into k/enhancements. If your production readiness review is not completed yet, please make sure to fill in the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would need to update the following:

- Create the KEP readme using the latest template and merge it in the k/enhancements repo.
- Ensure that the KEP has undergone a production readiness review and has been merged into k/enhancements.

The status of this enhancement is marked as At risk for enhancements freeze. Please also keep the issue description up to date with the appropriate stages.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

dipesh-rawat · 2025-02-10T19:31:03Z

Hi @klueska @pohly @johnbelamaric @yliaog 👋, 1.33 Enhancements team here,

Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

The current status of this enhancement is marked as At risk for enhancements freeze. A few requirements mentioned in the earlier comment on #5004 still need to be completed.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

johnbelamaric · 2025-02-11T16:44:36Z

@dipesh-rawat we will be doing this in 1.34 instead - I updated the description above, can you do whatever else the release team needs to properly account for that?

haircommander · 2025-02-11T17:14:39Z

/remove-label lead-opted-in
/remove-milestone v1.33

dipesh-rawat · 2025-02-11T18:35:10Z

I see that this issue has been opted-out of v1.33 and is now planned for a future release. I will go ahead and mark it as Deferred on the v1.33 board for tracking purposes - do let the enhancement team know otherwise.

/milestone clear

k8s-ci-robot added the sig/node and wg/device-management labels Dec 17, 2024

github-project-automation bot added this to SIG Node: Dynamic Resource Allocation Dec 17, 2024

github-project-automation bot moved this to 🆕 New in SIG Node: Dynamic Resource Allocation Dec 17, 2024

haircommander added this to SIG Node 1.33 KEPs planning Jan 24, 2025

haircommander moved this to Triage in SIG Node 1.33 KEPs planning Jan 24, 2025

haircommander moved this from Triage to Draft Stage in SIG Node 1.33 KEPs planning Jan 28, 2025

k8s-ci-robot added the sig/scheduling label Jan 30, 2025

github-project-automation bot moved this to Needs Triage in SIG Scheduling Jan 30, 2025

github-project-automation bot added this to SIG Scheduling Jan 30, 2025

k8s-ci-robot assigned yliaog Jan 30, 2025

johnbelamaric moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Feb 4, 2025

k8s-ci-robot added this to the v1.33 milestone Feb 5, 2025

k8s-ci-robot added the lead-opted-in label Feb 5, 2025

k8s-ci-robot added the stage/alpha label Feb 5, 2025

yliaog mentioned this issue Feb 5, 2025 in Add KEP for DRA: Extended Resource #5136 (open)

k8s-infra-ci-robot added this to 1.33 Enhancements Tracking Feb 6, 2025

dipesh-rawat moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 6, 2025

k8s-ci-robot removed the lead-opted-in label Feb 11, 2025

k8s-ci-robot removed this from the v1.33 milestone Feb 11, 2025

dipesh-rawat moved this from At risk for enhancements freeze to Deferred in 1.33 Enhancements Tracking Feb 11, 2025
