From 6e540a5cdf142b7a807f1c00ed9e7d305dc1cbc9 Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Tue, 27 May 2025 19:37:59 +0000 Subject: [PATCH 1/7] kep-3695-beta update Signed-off-by: Swati Gupta --- keps/prod-readiness/sig-node/3695.yaml | 2 ++ keps/sig-node/3695-pod-resources-for-dra/README.md | 9 +++++++-- keps/sig-node/3695-pod-resources-for-dra/kep.yaml | 4 ++-- 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/keps/prod-readiness/sig-node/3695.yaml b/keps/prod-readiness/sig-node/3695.yaml index 1a863dbcb4f..10323ab2f0e 100644 --- a/keps/prod-readiness/sig-node/3695.yaml +++ b/keps/prod-readiness/sig-node/3695.yaml @@ -4,3 +4,5 @@ kep-number: 3695 alpha: approver: "@johnbelamaric" +beta: + approver: "@johnbelamaric" diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index 93af0497133..6a40586b172 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests: #### Beta -- [ ] Gather feedback from consumers of the DRA feature. -- [ ] No major bugs reported in the previous cycle. +- [x] Gather feedback from consumers of the DRA feature. + - Integration with DCGM exporter (WIP) +- [x] No major bugs reported in the previous cycle. #### GA @@ -438,6 +439,10 @@ N/A. - 2024-09-10: KEP Updated to reflect the current state of the implementation. +- Kubernetes 1.27: Alpha version of the KEP. + +- Kubernetes 1.34: Beta version of the KEP. + ## Drawbacks ## Alternatives diff --git a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml index 25ae5196e34..0bb21c157c1 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml +++ b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml @@ -18,12 +18,12 @@ see-also: replaces: [] # The target maturity stage in the current dev cycle for this KEP. 
-stage: alpha +stage: beta # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.27" +latest-milestone: "v1.34" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: From 430351f47a014e189a174eeb2c95647107e6ac2a Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Tue, 27 May 2025 14:09:29 -0700 Subject: [PATCH 2/7] Update keps/sig-node/3695-pod-resources-for-dra/README.md Co-authored-by: Kevin Klues --- keps/sig-node/3695-pod-resources-for-dra/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index 6a40586b172..ecfff02ae7c 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -275,7 +275,7 @@ These cases will be added in the existing e2e tests: #### Beta - [x] Gather feedback from consumers of the DRA feature. - - Integration with DCGM exporter (WIP) + - Integration with the NVIDIA DCGM exporter (WIP) - [x] No major bugs reported in the previous cycle. 
#### GA From 72432d50ba6658ef55bbc77ad9d2a0b9405f64e9 Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Mon, 9 Jun 2025 20:44:13 +0000 Subject: [PATCH 3/7] update milestone version and beta template Signed-off-by: Swati Gupta --- keps/sig-node/3695-pod-resources-for-dra/README.md | 4 ++-- keps/sig-node/3695-pod-resources-for-dra/kep.yaml | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index ecfff02ae7c..a920bc6f4cd 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -274,8 +274,8 @@ These cases will be added in the existing e2e tests: #### Beta -- [x] Gather feedback from consumers of the DRA feature. - - Integration with the NVIDIA DCGM exporter (WIP) +- [] Gather feedback from consumers of the DRA feature. + - Integration with the NVIDIA DCGM exporter. - [x] No major bugs reported in the previous cycle. #### GA diff --git a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml index 0bb21c157c1..a2f1729cabb 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml +++ b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml @@ -28,7 +28,7 @@ latest-milestone: "v1.34" # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: alpha: "v1.27" - beta: "v1.33" + beta: "v1.34" stable: "v1.36" # The following PRR answers are required at alpha release From 3d1512b9e2ca9453f62425dbf917f820484aa052 Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Tue, 10 Jun 2025 21:29:34 +0000 Subject: [PATCH 4/7] Update approver and kep status Signed-off-by: Swati Gupta --- keps/prod-readiness/sig-node/3695.yaml | 2 +- keps/sig-node/3695-pod-resources-for-dra/kep.yaml | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/keps/prod-readiness/sig-node/3695.yaml b/keps/prod-readiness/sig-node/3695.yaml index 10323ab2f0e..741ef3e817f 100644 --- a/keps/prod-readiness/sig-node/3695.yaml +++ b/keps/prod-readiness/sig-node/3695.yaml @@ -5,4 +5,4 @@ kep-number: 3695 alpha: approver: "@johnbelamaric" beta: - approver: "@johnbelamaric" + approver: "@soltysh" diff --git a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml index a2f1729cabb..64e4f91204c 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/kep.yaml +++ b/keps/sig-node/3695-pod-resources-for-dra/kep.yaml @@ -4,8 +4,8 @@ authors: - "@moshe010" owning-sig: sig-node participating-sigs: [] -status: provisional -creation-date: implementable +status: implementable +creation-date: 2023-02-07 reviewers: - "@ffromani" - "@swatisehgal" From 2402b80109af6c918eb577ae50c038b5059b2f4f Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Mon, 16 Jun 2025 21:59:43 +0000 Subject: [PATCH 5/7] Update to Release Signoff Checklist Signed-off-by: Swati Gupta --- .../3695-pod-resources-for-dra/README.md | 22 +++++++++---------- 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index a920bc6f4cd..c34683a7146 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -1,4 +1,4 @@ -# KEP-3695: Extend the PodResources API to 
include resources allocated by DRA + KEP-3695: Extend the PodResources API to include resources allocated by DRA - [Release Signoff Checklist](#release-signoff-checklist) @@ -36,17 +36,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release*. - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place +- [x] (R) Graduation criteria is in place - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) -- [ ] (R) Production readiness review completed +- [x] (R) Production readiness review completed - [ ] (R) Production readiness review approved -- [ ] "Implementation History" section is up-to-date for milestone +- [x] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes @@ -274,8 +274,8 @@ These cases will be added in the existing 
e2e tests: #### Beta -- [] Gather feedback from consumers of the DRA feature. - - Integration with the NVIDIA DCGM exporter. +- [x] Gather feedback from consumers of the DRA feature. + - Integration with the NVIDIA DCGM exporter (https://github.com/NVIDIA/dcgm-exporter/pull/501) to gather per pod Dynamic Resources managed by [k8s-dra-driver-gpu](https://github.com/NVIDIA/k8s-dra-driver-gpu). - [x] No major bugs reported in the previous cycle. #### GA @@ -334,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed, ###### Are there any tests for feature enablement/disablement? -e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. +e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846) ### Rollout, Upgrade and Rollback Planning @@ -439,9 +439,7 @@ N/A. - 2024-09-10: KEP Updated to reflect the current state of the implementation. -- Kubernetes 1.27: Alpha version of the KEP. - -- Kubernetes 1.34: Beta version of the KEP. +- 2025-05-27: Beta version of the KEP. ## Drawbacks From 64302a33a0c801161f1fc2a52630d105fd6037a3 Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Wed, 18 Jun 2025 17:31:43 +0000 Subject: [PATCH 6/7] Add when Get() returns error Signed-off-by: Swati Gupta --- keps/sig-node/3695-pod-resources-for-dra/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index c34683a7146..e532cde22e8 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -107,7 +107,7 @@ to allow querying specific pods for their allocated resources. returns the list of PodResources for *all* pods across *all* namespaces in the cluster). 
That is, it allows one to specify a specific pod and namespace to retrieve PodResources from, rather than having to query all of them all at -once. +once. `Get()` returns an error if the pod is known to the kubelet but is terminated. The full PodResources API (including our proposed extensions) can be seen below: From ae728ef68c89d8d2471df5454247452401fdd9f9 Mon Sep 17 00:00:00 2001 From: Swati Gupta Date: Wed, 18 Jun 2025 18:54:48 +0000 Subject: [PATCH 7/7] Address upgrade, resource usage and other related comments Signed-off-by: Swati Gupta --- .../3695-pod-resources-for-dra/README.md | 31 ++++++++++++++----- 1 file changed, 24 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/3695-pod-resources-for-dra/README.md b/keps/sig-node/3695-pod-resources-for-dra/README.md index e532cde22e8..611219f1f95 100644 --- a/keps/sig-node/3695-pod-resources-for-dra/README.md +++ b/keps/sig-node/3695-pod-resources-for-dra/README.md @@ -348,7 +348,12 @@ Kubelet may fail to start. The new API may report inconsistent data, or may caus ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? -Not Applicable. +Not applicable, because this change: + +- Is read-only with respect to the kubelet’s in-memory state. +- Is behind a feature gate, so turning it off simply disables the new endpoints without affecting any existing behavior. + +In practice, restarting the kubelet with the gate disabled (rollback) or re-enabled (upgrade) reverts or restores the API behavior without loss of data or consistency. Therefore, we don’t need a special upgrade/downgrade test matrix for this KEP. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? @@ -373,7 +378,9 @@ Call the PodResources API and see the result. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? -N/A. +100% in normal operation. 
The proposed API exposes, in read-only mode, kubelet internal data that is critical for the functioning of the kubelet. +This data has to be available 100% of the time for the proper functioning of the kubelet, and thus is expected to be available 100% of the time. +The only possible error source is the API calls being throttled by the rate limiting introduced with the GA graduation of the parent KEP 606. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? @@ -409,29 +416,35 @@ No. ###### Will enabling / using this feature result in increasing size or count of the existing API objects? -No. +No. Enabling this feature does not change the number of API objects returned. However, it may increase the size of each object whenever there are Dynamic Resources to report, because each ContainerResources now has an extra dynamic_resources field. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? No. Feature is out of existing any paths in kubelet. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? +Negligible amount of CPU and memory. Because the API is purely read-only and piggy-backs on the kubelet’s existing cache and checkpointing machinery, exposing Dynamic Resources incurs only minimal serialization and storage overhead, similar to that of CPUManager and DeviceManager, so any extra CPU, memory, disk, or I/O impact is negligible. -DDOSing the API can lead to resource exhaustion. +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No, because the endpoint queries existing data structures inside the kubelet. ### Troubleshooting ###### How does this feature react if the API server and/or etcd is unavailable? -N/A. +No impact; the feature is node-local. ###### What are other known failure modes? 
-The API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may cause the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API. +Feature gate disabled: the API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. +Bugs may cause the API to return unexpected errors, or to return inconsistent data. +Consumers of the API should treat unexpected errors as bugs in this API. ###### What steps should be taken if SLOs are not being met to determine the problem? -N/A. +Check the error code to learn whether the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606. +Check the kubelet logs to learn about resource allocation errors. ## Implementation History @@ -443,4 +456,8 @@ N/A. ## Drawbacks +N/A + ## Alternatives + +N/A