kep-3695-beta update #5346
Changes from all commits
6e540a5
430351f
72432d5
3d1512b
2402b80
64302a3
ae728ef
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,3 +4,5 @@ | |
kep-number: 3695 | ||
alpha: | ||
approver: "@johnbelamaric" | ||
beta: | ||
approver: "@soltysh" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
# KEP-3695: Extend the PodResources API to include resources allocated by DRA | ||
KEP-3695: Extend the PodResources API to include resources allocated by DRA | ||
|
||
<!-- toc --> | ||
- [Release Signoff Checklist](#release-signoff-checklist) | ||
|
@@ -36,17 +36,17 @@ | |
Items marked with (R) are required *prior to targeting to a milestone / release*. | ||
|
||
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) | ||
- [ ] (R) KEP approvers have approved the KEP status as `implementable` | ||
- [ ] (R) Design details are appropriately documented | ||
- [x] (R) KEP approvers have approved the KEP status as `implementable` | ||
- [x] (R) Design details are appropriately documented | ||
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) | ||
- [ ] e2e Tests for all Beta API Operations (endpoints) | ||
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free | ||
- [ ] (R) Graduation criteria is in place | ||
- [x] (R) Graduation criteria is in place | ||
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) | ||
- [ ] (R) Production readiness review completed | ||
- [x] (R) Production readiness review completed | ||
- [ ] (R) Production readiness review approved | ||
- [ ] "Implementation History" section is up-to-date for milestone | ||
- [x] "Implementation History" section is up-to-date for milestone | ||
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] | ||
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes | ||
|
||
|
@@ -107,7 +107,7 @@ to allow querying specific pods for their allocated resources. | |
returns the list of PodResources for *all* pods across *all* namespaces in the | ||
cluster). That is, it allows one to specify a specific pod and namespace to | ||
retrieve PodResources from, rather than having to query all of them all at | ||
once. | ||
once. `Get()` returns an error if the pod is known to the kubelet but has terminated. | ||
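
The `Get()` behavior described here can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `PodStore` class, the `Pod` fields, and both exception types are hypothetical and are not the kubelet's actual types. The real endpoint is a gRPC call served by the kubelet, and the "not found" branch is assumed rather than taken from the text.

```python
from dataclasses import dataclass, field


class PodNotFoundError(Exception):
    """Hypothetical error: the pod is unknown to the modelled kubelet (assumed case)."""


class PodTerminatedError(Exception):
    """Hypothetical error: the pod is known to the kubelet but has terminated."""


@dataclass
class Pod:
    name: str
    namespace: str
    terminated: bool = False
    # Illustrative payload only; the real response is a PodResources message.
    resources: dict = field(default_factory=dict)


class PodStore:
    """Toy stand-in for the kubelet's in-memory view of pods."""

    def __init__(self, pods):
        self._pods = {(p.namespace, p.name): p for p in pods}

    def get(self, namespace, name):
        # Get() targets one pod in one namespace instead of querying everything.
        pod = self._pods.get((namespace, name))
        if pod is None:
            raise PodNotFoundError(f"pod {namespace}/{name} not found")
        if pod.terminated:
            # Known to the kubelet, but terminated: report an error.
            raise PodTerminatedError(f"pod {namespace}/{name} is terminated")
        return pod.resources
```

A consumer built on this model would treat `PodTerminatedError` as an expected outcome for finished pods rather than as a bug in the API.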
|
||
The full PodResources API (including our proposed extensions) can be seen below: | ||
|
||
|
@@ -274,8 +274,9 @@ These cases will be added in the existing e2e tests: | |
|
||
#### Beta | ||
|
||
|
||
- [ ] Gather feedback from consumers of the DRA feature. | ||
- [ ] No major bugs reported in the previous cycle. | ||
- [x] Gather feedback from consumers of the DRA feature. | ||
- Integration with the NVIDIA DCGM exporter (https://github.com/NVIDIA/dcgm-exporter/pull/501) to gather per pod Dynamic Resources managed by [k8s-dra-driver-gpu](https://github.com/NVIDIA/k8s-dra-driver-gpu). | ||
- [x] No major bugs reported in the previous cycle. | ||
|
||
|
||
#### GA | ||
|
||
|
||
|
@@ -333,7 +334,7 @@ The API becomes available again. The API is stateless, so no recovery is needed, | |
|
||
###### Are there any tests for feature enablement/disablement? | ||
|
||
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. | ||
e2e test will demonstrate that when the feature gate is disabled, the API returns the appropriate error code. (https://github.com/kubernetes/kubernetes/pull/116846) | ||
The linked PR isn't testing feature enablement/disablement, or am I misreading it? The closest place where you test this feature gate is https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/podresources/server_v1_test.go, but there you only turn it on; I don't see the requested on/off test.
We have on/off tests scattered across the existing tests: https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L977 and https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.1/test/e2e_node/podresources_test.go#L1066
Great, can we make sure this is listed here?
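
As a rough illustration of what an enablement/disablement check looks like, the sketch below toggles a gate on a toy server and records the outcome in each state. The `ToyPodResourcesServer` class and the `FeatureGateDisabledError` exception are hypothetical stand-ins; the real tests live in the e2e_node suite and exercise the kubelet's gRPC endpoint, where changing the gate requires a kubelet restart rather than flipping an attribute.

```python
class FeatureGateDisabledError(Exception):
    """Hypothetical well-known error returned while the gate is off."""


class ToyPodResourcesServer:
    def __init__(self, gate_enabled):
        self.gate_enabled = gate_enabled
        # Read-only view of state owned elsewhere; toggling the gate never mutates it.
        self._state = {("default", "pod-a"): {"dynamic_resources": ["claim-1"]}}

    def get(self, namespace, name):
        if not self.gate_enabled:
            raise FeatureGateDisabledError("endpoint is disabled by feature gate")
        return self._state[(namespace, name)]


def check_enablement_and_disablement():
    """Exercise the endpoint with the gate off, on, and off again."""
    server = ToyPodResourcesServer(gate_enabled=False)
    results = []
    for enabled in (False, True, False):
        server.gate_enabled = enabled  # models a restart with a new gate value
        try:
            server.get("default", "pod-a")
            results.append("ok")
        except FeatureGateDisabledError:
            results.append("disabled")
    return results
```

The off, on, off sequence is the part that matters: it shows the error returning after the gate is disabled again, not just the happy path with the gate on.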
||
|
||
### Rollout, Upgrade and Rollback Planning | ||
|
||
|
@@ -347,7 +348,12 @@ Kubelet may fail to start. The new API may report inconsistent data, or may caus | |
|
||
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||
|
||
Not Applicable. | ||
Not Applicable. Because this change: | ||
|
||
- Is read-only in the kubelet’s in-memory state. | ||
- Is behind a feature gate, so turning it off simply disables the new endpoints without affecting any existing behavior. | ||
|
||
In practice, restarting the kubelet with the gate disabled (rollback) or re-enabled (upgrade) makes the API behavior revert or return without loss of data or consistency. Therefore, no special upgrade/downgrade test matrix is needed for this KEP. | ||
|
||
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||
|
||
|
@@ -372,7 +378,9 @@ Call the PodResources API and see the result. | |
|
||
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? | ||
|
||
N/A. | ||
100% in normal operation. The proposed API exposes, in read-only mode, internal kubelet data that is critical to the functioning of the kubelet. | ||
Because the kubelet itself needs this data at all times to function properly, the data is expected to be available 100% of the time. | ||
The only possible error source is the API calls being throttled by the rate-limiting introduced with the GA graduation of the parent KEP 606. | ||
|
||
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||
|
||
|
@@ -408,36 +416,48 @@ No. | |
|
||
###### Will enabling / using this feature result in increasing size or count of the existing API objects? | ||
|
||
No. | ||
No. Enabling this feature does not change the number of API objects returned. However, it may increase the size of each object whenever there are Dynamic Resources to report, because each ContainerResources now has an extra dynamic_resources field. | ||
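
The distinction between object count and object size can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration: the dictionary layout and the field names other than `dynamic_resources` are illustrative, and JSON is used only as a stand-in for the protobuf encoding the real API uses.

```python
import json


def container_resources(name, dynamic_resources=None):
    """Build a dict shaped loosely like a ContainerResources message."""
    msg = {"name": name, "devices": [], "cpu_ids": [0, 1]}
    if dynamic_resources:
        # Extra field present only when there are Dynamic Resources to report.
        msg["dynamic_resources"] = dynamic_resources
    return msg


def encoded_size(msg):
    # JSON is a stand-in here; the real API uses protobuf over gRPC.
    return len(json.dumps(msg, sort_keys=True).encode("utf-8"))


# Same number of objects in both cases; only the per-object size differs.
without_dra = [container_resources("ctr-0")]
with_dra = [
    container_resources(
        "ctr-0",
        dynamic_resources=[{"claim_name": "gpu-claim", "claim_namespace": "default"}],
    )
]
```

Comparing `len(without_dra)` with `len(with_dra)` and `encoded_size(...)` of each entry shows an unchanged count and a larger per-object payload.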
|
||
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? | ||
|
||
No. The feature is outside any existing paths in the kubelet. | ||
|
||
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? | ||
A negligible amount of CPU and memory. Because the API is purely read-only and piggy-backs on the kubelet’s existing cache and checkpointing machinery, exposing Dynamic Resources incurs only minimal serialization and storage overhead, similar to that of CPUManager and DeviceManager, so any extra CPU, memory, disk, or I/O impact is negligible. | ||
|
||
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? | ||
|
||
DDOSing the API can lead to resource exhaustion. | ||
No, because the endpoint queries existing data structures inside the kubelet. | ||
|
||
### Troubleshooting | ||
|
||
###### How does this feature react if the API server and/or etcd is unavailable? | ||
|
||
N/A. | ||
No impact, the feature is node-local. | ||
|
||
###### What are other known failure modes? | ||
|
||
The API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. Bugs may cause the API to return unexpected errors, or to return inconsistent data. Consumers of the API should treat unexpected errors as bugs of this API. | ||
Feature gate disabled: the API will always return a well-known error. In normal operation, the API is expected to never return an error and always return a valid response, because it utilizes internal kubelet data which is always available. | ||
Bugs may cause the API to return unexpected errors, or to return inconsistent data. | ||
Consumers of the API should treat unexpected errors as bugs of this API. | ||
|
||
###### What steps should be taken if SLOs are not being met to determine the problem? | ||
|
||
N/A. | ||
Check the error code to learn whether the consumer of the API is being throttled by the rate limiting introduced in the parent KEP 606. | ||
Check the kubelet logs to learn about resource allocation errors. | ||
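
On the consumer side, the first check could look roughly like the sketch below: retry with backoff only when the call was throttled, and surface every other error unchanged so it can be investigated in the kubelet logs. The `ThrottledError` exception and the `call_with_backoff` helper are hypothetical; which gRPC status code signals throttling is not stated here, so a real client would need to map that code to this branch.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a rate-limit rejection from the API."""


def call_with_backoff(call, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry `call` only when it is throttled; re-raise anything else."""
    for attempt in range(attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # still throttled after the final attempt
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the helper easy to test; errors other than `ThrottledError` propagate immediately, which matches the guidance to treat unexpected errors as bugs.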
|
||
## Implementation History | ||
|
||
- 2023-01-12: KEP created | ||
|
||
- 2024-09-10: KEP Updated to reflect the current state of the implementation. | ||
|
||
- 2025-05-27: Beta version of the KEP. | ||
|
||
## Drawbacks | ||
|
||
N/A | ||
|
||
## Alternatives | ||
|
||
N/A |
Earlier in the doc:
## Release Signoff Checklist
This was addressed - thank you.
This one still holds. We need links for integration and e2e based on the template in the appropriate section. I believe e2es were added in kubernetes/kubernetes#116846 so you should be able to quickly fill those in. Not sure if there are others.
This still holds, I see Francesco mentioned several tests that were added, can we make sure they are explicitly linked in this document?