DRA: Handle extended resource requests via DRA Driver #5004

Open
4 tasks
klueska opened this issue Dec 17, 2024 · 16 comments
Labels: sig/node, sig/scheduling, stage/alpha, wg/device-management

Comments

@klueska
Contributor

klueska commented Dec 17, 2024

Enhancement Description

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Dec 17, 2024
@johnbelamaric
Member

+1 yes please!

@johnbelamaric
Member

johnbelamaric commented Dec 17, 2024

We need to sort out the requirements. A few initial questions:

  1. For newly created pods, I think it's clear we want this to be transparent. Existing manifests that use the extended resource API should continue to work as before, without modification.
  2. Can we handle this invisibly in the driver layer, or do we need to have DRA invoked at the control plane level to select the specific devices? If we don't, we will likely have a race condition, unless the scheduler can do some magical accounting (which seems possible).
  3. How do we handle upgrades? If we have a node running a device plugin, and we switch to the DRA driver (or we upgrade to a driver that supports both), do you have to delete the pods? Do they automatically adopt the devices? If so, how do we write those back to the allocation logic (since no DRA claim exists)?
  4. What happens if there are pods in a deployment, and some land on nodes with a device plugin and some on nodes with DRA drivers?
  5. We talked about letting specific device classes be advertised as specific extended resources. This could mean the existing resource names get mapped to specific device classes by the admin. It could also mean we have a convention like deviceclass.k8s.io/foo: 4 for extended resource names. How do these choices interplay with the questions above?
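
To make question 5 concrete, here is a minimal Python sketch of the two naming options it mentions: an admin-defined mapping from existing extended resource names to device classes, and a `deviceclass.k8s.io/<class>` naming convention. Everything in it is an assumption for illustration only: the function names, the mapping table, the `example.com/gpu` resource, and the `gpu-class` device class are hypothetical, and nothing in this thread settles which option (if either) the KEP will adopt.

```python
from typing import Dict, Optional

# Prefix taken from the convention floated in question 5 (not a settled API).
DEVICE_CLASS_PREFIX = "deviceclass.k8s.io/"

# Hypothetical admin-provided mapping: extended resource name -> device class.
ADMIN_MAPPING: Dict[str, str] = {"example.com/gpu": "gpu-class"}


def device_class_for(resource_name: str,
                     admin_mapping: Dict[str, str]) -> Optional[str]:
    """Return the device class a resource name would resolve to, or None."""
    # Option A: the admin mapped an existing extended resource name.
    if resource_name in admin_mapping:
        return admin_mapping[resource_name]
    # Option B: the name itself encodes the device class by convention.
    if resource_name.startswith(DEVICE_CLASS_PREFIX):
        return resource_name[len(DEVICE_CLASS_PREFIX):] or None
    # Anything else (cpu, memory, unmapped extended resources) is left alone.
    return None


def device_requests(limits: Dict[str, str],
                    admin_mapping: Dict[str, str]) -> Dict[str, int]:
    """Collect per-device-class counts from a container's resource limits."""
    counts: Dict[str, int] = {}
    for name, quantity in limits.items():
        device_class = device_class_for(name, admin_mapping)
        if device_class is not None:
            # Extended resources are requested as whole integers.
            counts[device_class] = counts.get(device_class, 0) + int(quantity)
    return counts


limits = {"cpu": "2", "example.com/gpu": "1", "deviceclass.k8s.io/foo": "4"}
print(device_requests(limits, ADMIN_MAPPING))
```

With option A, existing manifests keep their resource names unchanged, which fits the transparency goal in question 1; option B would require manifests to adopt the new name.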

@lengrongfu
Member

Could each DRA driver implement a webhook that creates a ResourceClaimTemplate when a pod is created and rewrites the pod's resource requests to use it?
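
For reference, the per-driver webhook approach asked about here would amount to a pod mutation roughly like the following. This is only a sketch under assumptions: the resource name `example.com/gpu`, the template name `gpu-claim-template`, the claim name `gpu`, and the `mutate_pod` helper are all hypothetical; the pod is a plain dictionary rather than a real admission request; and the ResourceClaimTemplate is assumed to exist already. The pod fields used (`spec.resourceClaims`, `resourceClaimTemplateName`, `resources.claims`) reflect my understanding of the DRA pod API in recent Kubernetes versions and should be checked against the API reference.

```python
import copy
from typing import Any, Dict

EXTENDED_RESOURCE = "example.com/gpu"   # hypothetical extended resource name
CLAIM_TEMPLATE = "gpu-claim-template"   # hypothetical, assumed to exist already
CLAIM_NAME = "gpu"                      # hypothetical pod-local claim name


def mutate_pod(pod: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the pod with the extended resource replaced by a claim."""
    pod = copy.deepcopy(pod)
    spec = pod["spec"]
    uses_device = False
    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        found = False
        for section in ("requests", "limits"):
            # Drop the extended resource from both requests and limits.
            if resources.get(section, {}).pop(EXTENDED_RESOURCE, None) is not None:
                found = True
        if found:
            # Point the container at the pod-level claim instead.
            # NOTE: every matching container shares one claim here, which is
            # not the same as each container getting its own devices.
            resources.setdefault("claims", []).append({"name": CLAIM_NAME})
            container["resources"] = resources
            uses_device = True
    if uses_device:
        spec.setdefault("resourceClaims", []).append(
            {"name": CLAIM_NAME, "resourceClaimTemplateName": CLAIM_TEMPLATE}
        )
    return pod


pod = {"spec": {"containers": [{
    "name": "app",
    "resources": {"limits": {"cpu": "1", "example.com/gpu": "1"}},
}]}}
mutated = mutate_pod(pod)
print(mutated["spec"]["resourceClaims"])
```

The sketch ignores the requested count, which would have to be encoded in the template, and every driver would need to ship and maintain logic like this.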

@klueska
Contributor Author

klueska commented Jan 7, 2025

@lengrongfu that is what this KEP would be designed to avoid. There would be integrated scheduler support for all drivers, rather than requiring each DRA driver to provide a webhook.

@alculquicondor
Member

Open questions (from SIG Scheduling meeting):

  • How to handle resource quotas.
  • How scheduling throughput is affected (API requests and overall processing).

@ffromani
Contributor

/cc

@yliaog

yliaog commented Jan 28, 2025

/cc

@johnbelamaric
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 30, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jan 30, 2025
@johnbelamaric
Member

/assign @yliaog

Yu, I am assigning this to you; let me know if that's OK.

@johnbelamaric johnbelamaric moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Feb 4, 2025
@haircommander
Contributor

/label lead-opted-in
/milestone v1.33

Note: PRR freeze is tomorrow! You need to have a KEP update for this opened before then. Thanks!

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Feb 5, 2025
@johnbelamaric
Member

/stage alpha

@k8s-ci-robot k8s-ci-robot added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Feb 5, 2025
@dipesh-rawat
Member

Hello @klueska @pohly @johnbelamaric @yliaog 👋, v1.33 Enhancements team here.

Just checking in as we approach enhancements freeze on 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

This enhancement is targeting stage alpha for v1.33 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: v1.32.
  • KEP readme has up-to-date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements. (For more information on the PRR process, check here). If your production readiness review is not completed yet, please make sure to fill the production readiness questionnaire in your KEP by the PRR Freeze deadline on Thursday 6th February 2025 so that the PRR team has enough time to review your KEP.

For this KEP, we would need to update the following:

  • Create the KEP readme using the latest template and merge it in the k/enhancements repo.
  • Ensure that the KEP has undergone a production readiness review and has been merged into k/enhancements.

The status of this enhancement is marked as At risk for enhancements freeze. Please also keep the issue description up to date with the appropriate stages.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@dipesh-rawat dipesh-rawat moved this to At risk for enhancements freeze in 1.33 Enhancements Tracking Feb 6, 2025
@dipesh-rawat
Member

Hi @klueska @pohly @johnbelamaric @yliaog 👋, 1.33 Enhancements team here,

Just a quick friendly reminder as we approach the enhancements freeze later this week, at 02:00 UTC Friday 14th February 2025 / 19:00 PDT Thursday 13th February 2025.

The current status of this enhancement is marked as At risk for enhancements freeze. There are a few requirements mentioned in the comment #5004 (comment) that still need to be completed.

If you anticipate missing enhancements freeze, you can file an exception request in advance. Thank you!

@johnbelamaric
Member

johnbelamaric commented Feb 11, 2025

@dipesh-rawat we will be doing this in 1.34 instead - I updated the description above, can you do whatever else the release team needs to properly account for that?

@haircommander
Contributor

/remove-label lead-opted-in
/remove-milestone v1.33

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label Feb 11, 2025
@dipesh-rawat
Member

I see that this issue has been opted out of v1.33 and is now planned for a future release. I will go ahead and mark it as Deferred on the v1.33 board for tracking purposes. Please let the Enhancements team know if that is not correct.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.33 milestone Feb 11, 2025
@dipesh-rawat dipesh-rawat moved this from At risk for enhancements freeze to Deferred in 1.33 Enhancements Tracking Feb 11, 2025
Projects
Status: Deferred
Status: Draft Stage
Status: 🏗 In progress
Status: Needs Triage

9 participants