10
10
- [Story 1](#story-1)
11
11
- [Risks and Mitigations](#risks-and-mitigations)
12
12
- [Design Details](#design-details)
13
- - [CPU](#cpu)
14
- - [Memory](#memory)
15
- - [IO](#io)
13
+ - [CPU](#cpu)
14
+ - [Memory](#memory)
15
+ - [IO](#io)
16
16
- [Test Plan](#test-plan)
17
17
- [Prerequisite testing updates](#prerequisite-testing-updates)
18
18
- [Unit tests](#unit-tests)
@@ -125,16 +125,25 @@ full avg10=0.00 avg60=0.00 avg300=0.00 total=0
125
125
```
126
126
127
127
``` go
128
+ // PSI data for an individual resource.
128
129
type PSIData struct {
129
- Avg10 *float64 `json:"avg10"`
130
- Avg60 *float64 `json:"avg60"`
131
- Avg300 *float64 `json:"avg300"`
132
- Total *float64 `json:"total"`
130
+ // Total time that tasks in the cgroup have waited due to congestion.
131
+ // Unit: nanoseconds.
132
+ Total uint64 `json:"total"`
133
+ // The average share of time (in %) that tasks have waited due to congestion over a 10 second window.
134
+ Avg10 float64 `json:"avg10"`
135
+ // The average share of time (in %) that tasks have waited due to congestion over a 60 second window.
136
+ Avg60 float64 `json:"avg60"`
137
+ // The average share of time (in %) that tasks have waited due to congestion over a 300 second window.
138
+ Avg300 float64 `json:"avg300"`
133
139
}
134
140
141
+ // PSI statistics for an individual resource.
135
142
type PSIStats struct {
136
- Some *PSIData `json:"some,omitempty"`
137
- Full *PSIData `json:"full,omitempty"`
143
+ // PSI data for some tasks in the cgroup.
144
+ Some PSIData `json:"some,omitempty"`
145
+ // PSI data for all tasks in the cgroup.
146
+ Full PSIData `json:"full,omitempty"`
138
147
}
139
148
```
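
As an illustration of how the `some` and `full` lines of a cgroup pressure file could map onto the new `PSIData` fields, here is a minimal sketch. The helper `parsePSILine` is hypothetical and is not kubelet or cAdvisor code. It assumes the kernel line format quoted earlier (`full avg10=0.00 avg60=0.00 avg300=0.00 total=0`) and assumes that the kernel reports `total` in microseconds, so the value is multiplied by 1000 to match the nanosecond unit documented on `Total`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PSIData mirrors the struct proposed in this change.
type PSIData struct {
	Total  uint64  `json:"total"`
	Avg10  float64 `json:"avg10"`
	Avg60  float64 `json:"avg60"`
	Avg300 float64 `json:"avg300"`
}

// parsePSILine is a hypothetical helper for illustration only.
// It parses one line such as "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
// and returns the line kind ("some" or "full") with the parsed values.
func parsePSILine(line string) (string, PSIData, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return "", PSIData{}, fmt.Errorf("unexpected PSI line %q", line)
	}
	var d PSIData
	for _, kv := range fields[1:] {
		key, val, ok := strings.Cut(kv, "=")
		if !ok {
			return "", PSIData{}, fmt.Errorf("malformed field %q", kv)
		}
		if key == "total" {
			us, err := strconv.ParseUint(val, 10, 64)
			if err != nil {
				return "", PSIData{}, err
			}
			// Assumption: the kernel value is in microseconds; store nanoseconds.
			d.Total = us * 1000
			continue
		}
		f, err := strconv.ParseFloat(val, 64)
		if err != nil {
			return "", PSIData{}, err
		}
		switch key {
		case "avg10":
			d.Avg10 = f
		case "avg60":
			d.Avg60 = f
		case "avg300":
			d.Avg300 = f
		default:
			return "", PSIData{}, fmt.Errorf("unknown key %q", key)
		}
	}
	return fields[0], d, nil
}

func main() {
	kind, data, err := parsePSILine("some avg10=1.50 avg60=0.25 avg300=0.00 total=42")
	fmt.Println(kind, data, err)
}
```

Using plain `uint64` and `float64` values instead of pointers means a parsed line always yields a complete record, which is consistent with the non-pointer fields in the new struct.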
140
149
@@ -146,15 +155,15 @@ metric data will be available through CRI instead.
146
155
``` go
147
156
type CPUStats struct {
148
157
// PSI stats of the overall node
149
- PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
158
+ PSI *PSIStats `json:"psi,omitempty"`
150
159
}
151
160
```
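
The move from a struct value to a pointer affects the JSON output. With Go's `encoding/json`, `omitempty` never drops a struct value, so the `psi` key is omitted only when the field is a nil pointer. The sketch below shows this with local copies of the types from this change. The `encode` helper is hypothetical, and whether kubelet leaves `PSI` nil when the feature gate is off is an assumption, not something this diff states.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the types proposed in this change.
type PSIData struct {
	Total  uint64  `json:"total"`
	Avg10  float64 `json:"avg10"`
	Avg60  float64 `json:"avg60"`
	Avg300 float64 `json:"avg300"`
}

type PSIStats struct {
	// omitempty has no effect on these struct values; both keys are always emitted.
	Some PSIData `json:"some,omitempty"`
	Full PSIData `json:"full,omitempty"`
}

type CPUStats struct {
	// A nil pointer is omitted from the output; a non-nil pointer is always encoded.
	PSI *PSIStats `json:"psi,omitempty"`
}

// encode is a hypothetical helper that returns the JSON form of s.
func encode(s CPUStats) string {
	b, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(encode(CPUStats{}))                 // prints {}
	fmt.Println(encode(CPUStats{PSI: &PSIStats{}})) // includes "psi" with zero-valued "some" and "full"
}
```

A client can therefore distinguish "PSI not reported" (key absent) from "PSI reported with zero pressure" (key present, all values zero).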
152
161
153
162
##### Memory
154
163
``` go
155
164
type MemoryStats struct {
156
165
// PSI stats of the overall node
157
- PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
166
+ PSI *PSIStats `json:"psi,omitempty"`
158
167
}
159
168
```
160
169
@@ -166,7 +175,7 @@ type IOStats struct {
166
175
Time metav1.Time `json:"time"`
167
176
168
177
// PSI stats of the overall node
169
- PSI cadvisorapi.PSIStats `json:"psi,omitempty"`
178
+ PSI *PSIStats `json:"psi,omitempty"`
170
179
}
171
180
172
181
type NodeStats struct {
@@ -220,6 +229,7 @@ This can inform certain test coverage improvements that we want to do before
220
229
extending the production code to implement this enhancement.
221
230
-->
222
231
- `k8s.io/kubernetes/pkg/kubelet/server/stats`: `2023-10-04` - `74.4%`
232
+ - `k8s.io/kubernetes/pkg/kubelet/stats`: `2025-06-10` - `77.4%`
223
233
224
234
##### Integration tests
225
235
@@ -238,6 +248,8 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
238
248
https://storage.googleapis.com/k8s-triage/index.html
239
249
-->
240
250
251
+ Within Kubernetes, the feature is implemented solely in kubelet. Therefore a Kubernetes integration test doesn't apply here.
252
+
241
253
Any identified external user of either of these endpoints (prometheus, metrics-server) should be tested to make sure they're not broken by new fields in the API response.
242
254
243
255
##### e2e tests
@@ -252,7 +264,7 @@ https://storage.googleapis.com/k8s-triage/index.html
252
264
We expect no non-infra related flakes in the last month as a GA graduation criteria.
253
265
-->
254
266
255
- - < test >: < link to test coverage >
267
+ - `test/e2e_node/summary_test.go`: `https://storage.googleapis.com/k8s-triage/index.html?test=test%2Fe2e_node%2Fsummary_test.go`
256
268
257
269
### Graduation Criteria
258
270
@@ -269,7 +281,8 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
269
281
- Allowing time for feedback.
270
282
271
283
#### GA
272
- - TBD
284
+ - Gather evidence of real-world usage.
285
+ - No major issues reported.
273
286
274
287
#### Deprecation
275
288
@@ -338,12 +351,6 @@ well as the [existing list] of feature gates.
338
351
- [X] Feature gate (also fill in values in `kep.yaml`)
339
352
- Feature gate name: KubeletPSI
340
353
- Components depending on the feature gate: kubelet
341
- - [ ] Other
342
- - Describe the mechanism:
343
- - Will enabling / disabling the feature require downtime of the control
344
- plane?
345
- - Will enabling / disabling the feature require downtime or reprovisioning
346
- of a node?
347
354
348
355
###### Does enabling the feature change any default behavior?
349
356
@@ -368,7 +375,7 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
368
375
Yes
369
376
370
377
###### What happens if we reenable the feature if it was previously rolled back?
371
- No PSI metrics will be availabe in kubelet Summary API nor Prometheus metrics if the
378
+ No PSI metrics will be available in kubelet Summary API nor Prometheus metrics if the
372
379
feature was rolled back.
373
380
374
381
###### Are there any tests for feature enablement/disablement?
@@ -405,13 +412,34 @@ rollout. Similarly, consider large clusters and how enablement/disablement
405
412
will rollout across nodes.
406
413
-->
407
414
415
+ The PSI metrics in the kubelet Summary API and Prometheus metrics are for monitoring purposes,
416
+ and are not used by Kubernetes itself to inform workload lifecycle decisions. Therefore, enabling the feature should
417
+ not impact running workloads.
418
+
419
+ If there is a bug and kubelet fails to serve the metrics during rollout, the kubelet Summary API
420
+ and Prometheus metrics could be corrupted, and other components that depend on those metrics could
421
+ be impacted. Disabling the feature gate / rolling back the feature should be safe.
422
+
408
423
###### What specific metrics should inform a rollback?
409
424
410
425
<!--
411
426
What signals should users be paying attention to when the feature is young
412
427
that might indicate a serious problem?
413
428
-->
414
429
430
+ PSI metrics exposed at the kubelet `/metrics/cadvisor` endpoint:
431
+
432
+ ```
433
+ container_pressure_cpu_stalled_seconds_total
434
+ container_pressure_cpu_waiting_seconds_total
435
+ container_pressure_memory_stalled_seconds_total
436
+ container_pressure_memory_waiting_seconds_total
437
+ container_pressure_io_stalled_seconds_total
438
+ container_pressure_io_waiting_seconds_total
439
+ ```
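
One way to read these counters when watching a rollout is as a rate: the increase of a `*_seconds_total` counter over a scrape interval, divided by the interval length, approximates the fraction of that interval spent under pressure. The sketch below assumes two samples of the same counter taken a known number of seconds apart. The function `pressureFraction` is a hypothetical name used for illustration and is not part of kubelet or Prometheus.

```go
package main

import (
	"errors"
	"fmt"
)

// pressureFraction is a hypothetical helper for illustration only.
// Given two samples of a cumulative *_seconds_total counter taken
// intervalSeconds apart, it returns the fraction of the interval that
// tasks spent waiting or stalled.
func pressureFraction(prevSeconds, currSeconds, intervalSeconds float64) (float64, error) {
	if intervalSeconds <= 0 {
		return 0, errors.New("interval must be positive")
	}
	if currSeconds < prevSeconds {
		// A cumulative counter that decreases was reset (for example, after a restart).
		return 0, errors.New("counter reset detected")
	}
	return (currSeconds - prevSeconds) / intervalSeconds, nil
}

func main() {
	// Example: the counter grew by 15 seconds over a 60 second interval.
	frac, err := pressureFraction(10, 25, 60)
	fmt.Println(frac, err)
}
```

A sudden disappearance of these series, rather than a high value, is the more likely sign of a problem with the feature itself, since the metrics only report pressure and do not cause it.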
440
+
441
+ kubelet Summary API at the `/stats/summary` endpoint.
442
+
415
443
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
416
444
417
445
<!--
@@ -420,12 +448,23 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
420
448
are missing a bunch of machinery and tooling and can't do that now.
421
449
-->
422
450
451
+ Test plan:
452
+ - Create pods when the feature is alpha and disabled
453
+ - Upgrade kubelet so the feature is beta and enabled
454
+ - Pods should continue to run
455
+ - PSI metrics should be reported in kubelet Summary API and Prometheus metrics
456
+ - Roll back kubelet to previous version
457
+ - Pods should continue to run
458
+ - PSI metrics should no longer be reported
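
The "PSI metrics should be reported" and "should no longer be reported" steps above could be checked against a Summary API response along these lines. This is a sketch under stated assumptions: it assumes the response nests node CPU stats under the JSON keys `node` and `cpu`, and that the `psi` key is absent when PSI is not reported, as the `omitempty` pointer field in this change suggests. The type `summary` and the function `nodeCPUPSIReported` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary captures only the fields this check needs.
// Assumption: the response layout is "node" -> "cpu" -> "psi".
type summary struct {
	Node struct {
		CPU *struct {
			PSI map[string]any `json:"psi,omitempty"`
		} `json:"cpu,omitempty"`
	} `json:"node"`
}

// nodeCPUPSIReported is a hypothetical helper that reports whether the
// node-level CPU stats in a Summary API response body carry a "psi" object.
func nodeCPUPSIReported(body []byte) (bool, error) {
	var s summary
	if err := json.Unmarshal(body, &s); err != nil {
		return false, err
	}
	return s.Node.CPU != nil && s.Node.CPU.PSI != nil, nil
}

func main() {
	withPSI := []byte(`{"node":{"cpu":{"psi":{"some":{"total":0},"full":{"total":0}}}}}`)
	withoutPSI := []byte(`{"node":{"cpu":{}}}`)

	fmt.Println(nodeCPUPSIReported(withPSI))
	fmt.Println(nodeCPUPSIReported(withoutPSI))
}
```

The same check would need to be repeated for memory and IO stats, and for pod and container level stats, to cover the whole test plan.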
459
+
423
460
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
424
461
425
462
<!--
426
463
Even if applying deprecation policies, they may still surprise some users.
427
464
-->
428
465
466
+ No
467
+
429
468
### Monitoring Requirements
430
469
431
470
<!--
@@ -456,13 +495,8 @@ and operation of this feature.
456
495
Recall that end users cannot usually observe component logs or access metrics.
457
496
-->
458
497
459
- - [ ] Events
460
- - Event Reason:
461
- - [ ] API .status
462
- - Condition name:
463
- - Other field:
464
- - [ ] Other (treat as last resort)
465
- - Details:
498
+ - [x] Other (treat as last resort)
499
+ - Details: The feature only surfaces metrics. Operators can confirm that it is working by reading the metrics.
466
500
467
501
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
468
502
@@ -481,6 +515,8 @@ These goals will help you determine what you need to measure (SLIs) in the next
481
515
question.
482
516
-->
483
517
518
+ The kubelet Summary API and Prometheus metrics should continue serving traffic while meeting their originally targeted SLOs.
519
+
484
520
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
485
521
486
522
<!--
584
620
585
621
- 2023/09/13: Initial proposal
586
622
- 2025/06/10: Drop Phase 2 from this KEP. Phase 2 will be tracked in its own KEP to allow separate milestone tracking
623
+ - 2025/06/10: Update the proposal with Beta requirements
587
624
588
625
## Drawbacks
589
626