kep-3695-beta update #5346

Merged
merged 7 commits into from
Jun 19, 2025

2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/3695.yaml
@@ -4,3 +4,5 @@
kep-number: 3695
alpha:
approver: "@johnbelamaric"
beta:
approver: "@soltysh"
54 changes: 37 additions & 17 deletions keps/sig-node/3695-pod-resources-for-dra/README.md
@@ -1,4 +1,4 @@
# KEP-3695: Extend the PodResources API to include resources allocated by DRA
KEP-3695: Extend the PodResources API to include resources allocated by DRA

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
@@ -36,17 +36,17 @@
Items marked with (R) are required *prior to targeting to a milestone / release*.

- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [x] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [x] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [x] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

@@ -107,7 +107,7 @@ to allow querying specific pods for their allocated resources.
returns the list of PodResources for *all* pods across *all* namespaces in the
cluster). That is, it allows one to specify a specific pod and namespace to
retrieve PodResources from, rather than having to query all of them all at
once.
once. `Get()` returns an error if the pod is known to the kubelet but has terminated.
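
The `Get()` semantics described here can be illustrated with a toy model. This is a minimal sketch, not kubelet code: `podKey`, `podState`, `get`, and both error values are hypothetical names, and the not-found error is an assumption, since the text only states the terminated case.

```go
package main

import (
	"errors"
	"fmt"
)

// podKey identifies a pod by namespace and name (hypothetical type).
type podKey struct{ Namespace, Name string }

// podState is a simplified, hypothetical view of what the kubelet tracks.
type podState struct {
	Terminated bool
	Resources  []string // stand-in for the allocated resources
}

var (
	errPodNotFound   = errors.New("pod not found")     // assumed behavior
	errPodTerminated = errors.New("pod is terminated") // stated in the text
)

// get returns the resources of a single pod instead of listing all pods.
func get(pods map[podKey]podState, namespace, name string) ([]string, error) {
	st, ok := pods[podKey{namespace, name}]
	if !ok {
		return nil, errPodNotFound
	}
	if st.Terminated {
		// The pod is known to the kubelet but is no longer running.
		return nil, errPodTerminated
	}
	return st.Resources, nil
}

func main() {
	pods := map[podKey]podState{
		{"default", "running"}: {Resources: []string{"gpu-0"}},
		{"default", "done"}:    {Terminated: true},
	}
	for _, name := range []string{"running", "done", "missing"} {
		res, err := get(pods, "default", name)
		fmt.Println(name, res, err)
	}
}
```

A consumer that previously filtered the full `List()` result would, under this model, need to handle an explicit error for terminated pods rather than an absent entry.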

The full PodResources API (including our proposed extensions) can be seen below:

@@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests:

Contributor

Earlier in the doc:

  1. Please make sure to check appropriate boxes in the ## Release Signoff Checklist.
  2. Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

Contributor

> Please make sure to check appropriate boxes in the ## Release Signoff Checklist.

This was addressed - thank you.

> Missing links in the integration tests section, see template, and in the e2e section as well, see template. Either of the two is required for beta promotion, and it looks like you had a requirement for e2e during alpha, so I expect those to be completed.

This one still holds. We need links for integration and e2e tests, based on the template, in the appropriate section. I believe the e2e tests were added in kubernetes/kubernetes#116846, so you should be able to quickly fill those in. Not sure if there are others.

Contributor

This still holds. I see Francesco mentioned several tests that were added; can we make sure they are explicitly linked in this document?

#### Beta

- [ ] Gather feedback from consumers of the DRA feature.
- [ ] No major bugs reported in the previous cycle.
- [x] Gather feedback from consumers of the DRA feature.
- Integration with the NVIDIA DCGM exporter (https://github.com/NVIDIA/dcgm-exporter/pull/501) to gather per-pod Dynamic Resources managed by [k8s-dra-driver-gpu](https://github.com/NVIDIA/k8s-dra-driver-gpu).
- [x] No major bugs reported in the previous cycle.

#### GA

@@ -333,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed,

###### Are there any tests for feature enablement/disablement?

e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code.
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846)

Contributor

The linked PR isn't testing feature enablement/disablement, or am I misreading it? The closest place where you test this feature gate is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/podresources/server_v1_test.go, but there you only turn the gate on; I don't see the requested on/off test.

Contributor

We have on/off tests scattered across the existing tests: https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L977 and https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L1066
We can use a PR to make the tests more explicit. Some changes are also needed if the FG goes to default-on: the FG status should be set explicitly.

Contributor

Great, can we make sure this is listed here?


### Rollout, Upgrade and Rollback Planning

@@ -347,7 +348,12 @@ Kubelet may fail to start. The new API may report inconsistent data, or may caus

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Not Applicable.
Not Applicable. Because this change:

- Is read-only in the kubelet’s in-memory state.
- Is behind a feature gate, so turning it off simply disables the new endpoints without affecting any existing behavior.

In practice, restart the kubelet with the gate disabled (rollback) or re-enabled (upgrade), and the API behavior reverts or returns without loss of data or consistency. Therefore we don’t need a special upgrade/downgrade test matrix for this KEP.
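
The rollback reasoning above can be sketched as a toy model: an endpoint that only reads existing state and is guarded by a gate. This is an illustration under stated assumptions, not kubelet code; `server`, `gateEnabled`, `allocations`, and `errGateDisabled` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errGateDisabled stands in for the well-known error returned while the gate is off.
var errGateDisabled = errors.New("feature gate disabled")

// server is a hypothetical model: the endpoint only reads allocations.
type server struct {
	gateEnabled bool
	allocations map[string][]string // pod name -> resources; never written by get
}

// get reads the existing state when the gate is on and fails otherwise.
func (s *server) get(pod string) ([]string, error) {
	if !s.gateEnabled {
		return nil, errGateDisabled
	}
	return s.allocations[pod], nil
}

func main() {
	s := &server{gateEnabled: true, allocations: map[string][]string{"pod-a": {"gpu-0"}}}
	// Simulate upgrade -> rollback -> upgrade by toggling the gate.
	for _, enabled := range []bool{true, false, true} {
		s.gateEnabled = enabled
		res, err := s.get("pod-a")
		fmt.Printf("gate=%v resources=%v err=%v\n", enabled, res, err)
	}
}
```

Because `get` never mutates `allocations`, toggling the gate off and on again yields the same answer as before the toggle, which is the property the explanation relies on.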

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

@@ -372,7 +378,9 @@ Call the PodResources API and see the result.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

N/A.
100% in normal operation. The proposed API exposes, in read-only mode, internal kubelet data that is critical to the functioning of the kubelet.
This data must be available 100% of the time for the kubelet to function properly, so the API is expected to be available 100% of the time.
The only possible error source is API calls being throttled by the rate limiting introduced with the GA graduation of the parent KEP 606.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

@@ -408,36 +416,48 @@ No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No.
No. Enabling this feature does not change the number of API objects returned, but it may increase the size of each object whenever there are Dynamic Resources to report, because each ContainerResources now has an extra `dynamic_resources` field.
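
To make the size effect concrete, here is a minimal sketch that compares the encoded size of one object with and without the extra field. It rests on assumptions: the struct shapes and the `claim_name` and `claim_namespace` fields are hypothetical, and JSON is used only as a stand-in for the real wire encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dynamicResource is a hypothetical, reduced shape of one reported claim.
type dynamicResource struct {
	ClaimName      string `json:"claim_name"`
	ClaimNamespace string `json:"claim_namespace"`
}

// containerResources models one object; only dynamic_resources is new.
type containerResources struct {
	Name             string            `json:"name"`
	DynamicResources []dynamicResource `json:"dynamic_resources,omitempty"`
}

// encodedSize returns the number of bytes in the JSON encoding.
func encodedSize(c containerResources) int {
	b, err := json.Marshal(c)
	if err != nil {
		panic(err)
	}
	return len(b)
}

func main() {
	plain := containerResources{Name: "app"}
	withDRA := containerResources{
		Name:             "app",
		DynamicResources: []dynamicResource{{ClaimName: "gpu-claim", ClaimNamespace: "default"}},
	}
	// Same number of objects (one each); only the per-object size differs.
	fmt.Println("without dynamic resources:", encodedSize(plain), "bytes")
	fmt.Println("with dynamic resources:   ", encodedSize(withDRA), "bytes")
}
```

A container with no Dynamic Resources encodes exactly as before in this model, so the growth is limited to objects that actually have something to report.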

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No. The feature is outside any existing paths in the kubelet.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Negligible amount of CPU and memory. Because the API is purely read-only and piggy-backs on the kubelet’s existing cache and checkpointing machinery, exposing Dynamic Resources incurs only minimal serialization and storage costs, similar to those of CPUManager and DeviceManager, so any extra CPU, memory, disk, or I/O impact is negligible.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

DDOSing the API can lead to resource exhaustion.
No, because the endpoint queries existing data structures inside the kubelet.

### Troubleshooting

###### How does this feature react if the API server and/or etcd is unavailable?

N/A.
No impact; the feature is node-local.

###### What are other known failure modes?

The API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may cause the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API.
When the feature gate is disabled, the API always returns a well-known error. In normal operation, the API is expected to never return an error and to always return a valid response, because it uses internal kubelet data which is always available.
Bugs may cause the API to return unexpected errors, or to return inconsistent data.
Consumers of the API should treat unexpected errors as bugs of this API.

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A.
Check the error code to learn whether the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606.
Check the kubelet logs to learn about resource allocation errors.

## Implementation History

- 2023-01-12: KEP created

- 2024-09-10: KEP Updated to reflect the current state of the implementation.

- 2025-05-27: Beta version of the KEP.

## Drawbacks

N/A

## Alternatives

N/A
10 changes: 5 additions & 5 deletions keps/sig-node/3695-pod-resources-for-dra/kep.yaml
@@ -4,8 +4,8 @@ authors:
- "@moshe010"
owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: implementable
status: implementable
creation-date: 2023-02-07
reviewers:
- "@ffromani"
- "@swatisehgal"
@@ -18,17 +18,17 @@ see-also:
replaces: []

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.27"
latest-milestone: "v1.34"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.27"
beta: "v1.33"
beta: "v1.34"
stable: "v1.36"

# The following PRR answers are required at alpha release