KEP 1287: Instrumentation for in-place pod resize #5340
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,12 @@ | |
- [QOS Class](#qos-class) | ||
- [Resource Quota](#resource-quota) | ||
- [Affected Components](#affected-components) | ||
- [Instrumentation](#instrumentation) | ||
- [<code>kubelet_pod_resize_requests_total</code>](#kubelet_pod_resize_requests_total) | ||
- [<code>kubelet_container_resize_requests_total</code>](#kubelet_container_resize_requests_total) | ||
- [<code>kubelet_pod_resize_sli_duration_seconds</code>](#kubelet_pod_resize_sli_duration_seconds) | ||
- [<code>kubelet_pod_infeasible_resize_total</code>](#kubelet_pod_infeasible_resize_total) | ||
- [<code>kubelet_pod_deferred_resize_accepted_total</code>](#kubelet_pod_deferred_resize_accepted_total) | ||
- [Static CPU & Memory Policy](#static-cpu--memory-policy) | ||
- [Future Enhancements](#future-enhancements) | ||
- [Mutable QOS Class "Shape"](#mutable-qos-class-shape) | ||
|
@@ -881,6 +887,80 @@ Other components: | |
* check how the change of meaning of resource requests influence other | ||
Kubernetes components. | ||
|
||
### Instrumentation | ||
|
||
The kubelet will record the following metrics: | ||
|
||
#### `kubelet_pod_resize_requests_total` | ||
|
||
This metric tracks the total number of resize requests observed by the Kubelet, counted at the pod level. | ||
A single pod update changing multiple containers will be considered a single resize request. | ||
|
||
Labels: | ||
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, | ||
we increment the counter multiple times, once for each. This means that a single pod update changing multiple | ||
resource types will be considered multiple requests for this metric. | ||
ig this is a little weird but I'm not sure if the alternatives are better. We also already have |
||
- `operation_type` - whether the resize is a net increase or a decrease (taken as an aggregate across | ||
all containers in the pod). Possible values: `increase`, `decrease`, `add`, or `remove`. | ||
|
||
This metric is recorded as a counter. | ||
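
The labeling rule above (one increment per changed resource type, each tagged with an operation type) can be sketched as follows. This is only an illustration, not the kubelet's implementation: `labelKey`, `operationType`, `resizeCounter`, and `quantity` are hypothetical names, a plain map stands in for a real counter, and the inputs are assumed to be pod-level aggregates that have already been summed across containers.

```go
package main

// labelKey is the (resource_type, operation_type) label pair of one counter series.
// Hypothetical type, used only for this sketch.
type labelKey struct {
	ResourceType  string
	OperationType string
}

// operationType classifies a change to a single resource value.
// A nil pointer means the value is unset. It returns "" when nothing changed.
func operationType(before, after *int64) string {
	switch {
	case before == nil && after == nil:
		return ""
	case before == nil:
		return "add"
	case after == nil:
		return "remove"
	case *after > *before:
		return "increase"
	case *after < *before:
		return "decrease"
	default:
		return ""
	}
}

// resizeCounter is a map-based stand-in for a labeled counter.
type resizeCounter map[labelKey]int

// observe increments the counter once for every resource type that changed,
// so one pod update touching several resource types counts several times.
// A missing map key yields a nil pointer, which is treated as "unset".
func (c resizeCounter) observe(before, after map[string]*int64) {
	resourceTypes := []string{"cpu_limits", "cpu_requests", "memory_limits", "memory_requests"}
	for _, rt := range resourceTypes {
		if op := operationType(before[rt], after[rt]); op != "" {
			c[labelKey{ResourceType: rt, OperationType: op}]++
		}
	}
}

// quantity returns a pointer to v, for building example inputs.
func quantity(v int64) *int64 { return &v }
```

For example, an update that raises `cpu_requests`, sets a previously unset `cpu_limits`, and drops `memory_limits` would increment three separate series in this sketch.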
|
||
#### `kubelet_container_resize_requests_total` | ||
|
||
This metric tracks the total number of resize requests observed by the Kubelet, counted at the container level. | ||
A single pod update changing multiple containers will be considered separate resize requests. | ||
|
||
Labels: | ||
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, | ||
we increment the counter multiple times, once for each. This means that a single pod update changing multiple | ||
resource types will be considered multiple requests for this metric. | ||
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`. | ||
I'm assuming requests / limits can be added to a container but I don't actually know if that's true?? (I know kubernetes/kubernetes#127143 is adding support to remove them) |
||
|
||
This metric is recorded as a counter. | ||
|
||
#### `kubelet_pod_resize_sli_duration_seconds` | ||
|
||
This metric tracks the latency between when the kubelet accepts a resize request and when it finishes actuating | ||
the request. More precisely, this metric tracks the total amount of time that the `PodResizeInProgress` condition | ||
is present on a pod. | ||
|
||
Labels: | ||
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, | ||
we record the metric multiple times, once for each. | ||
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`. | ||
|
||
This metric is recorded as a gauge. | ||
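
One way to read the definition above is that the measured duration starts when the `PodResizeInProgress` condition is added and ends when it is removed. The sketch below illustrates only that bookkeeping, under that assumption. `resizeTimer`, `conditionAdded`, `conditionRemoved`, and `unixTime` are hypothetical names, and how the resulting duration is exported is left out.

```go
package main

import "time"

// resizeTimer remembers when the PodResizeInProgress condition first
// appeared on each pod. Hypothetical type, used only for this sketch.
type resizeTimer struct {
	started map[string]time.Time // keyed by pod UID
}

func newResizeTimer() *resizeTimer {
	return &resizeTimer{started: make(map[string]time.Time)}
}

// conditionAdded records the start time. Repeated calls while the condition
// is still present keep the original start time.
func (t *resizeTimer) conditionAdded(podUID string, now time.Time) {
	if _, ok := t.started[podUID]; !ok {
		t.started[podUID] = now
	}
}

// conditionRemoved returns how long the condition was present and forgets
// the pod. The boolean is false if no start time was recorded.
func (t *resizeTimer) conditionRemoved(podUID string, now time.Time) (time.Duration, bool) {
	start, ok := t.started[podUID]
	if !ok {
		return 0, false
	}
	delete(t.started, podUID)
	return now.Sub(start), true
}

// unixTime builds a timestamp from whole seconds, for example inputs.
func unixTime(sec int64) time.Time { return time.Unix(sec, 0) }
```

Passing the timestamp in explicitly, rather than calling `time.Now()` inside the methods, keeps the sketch deterministic and easy to test.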
|
||
#### `kubelet_pod_infeasible_resize_total` | ||
|
||
This metric tracks the total count of resize requests that the kubelet marks as infeasible. This will make it | ||
easier for us to see which of the current limitations users are running into the most. | ||
|
||
Labels: | ||
- `reason` - why the resize is infeasible. Although a more detailed "reason" will be provided in the `PodResizePending` | ||
condition in the pod, we limit this label to only the following possible values to keep cardinality low: | ||
- `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy. | ||
- `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside Memory Manager static policy. | ||
- `static_pod` - In-place resize is not supported for static pods. | ||
- `swap_limitation` - In-place resize is not supported for containers with swap. | ||
- `node_capacity` - The node doesn't have enough capacity for this resize request. | ||
|
||
This list of possible reasons may shrink or grow depending on limitations that are added or removed in the future. | ||
|
||
This metric is recorded as a counter. | ||
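
To illustrate how restricting the `reason` label to a fixed set keeps cardinality bounded, here is a minimal sketch. It is not the kubelet's code: `infeasibleReasons`, `infeasibleCounter`, and `record` are hypothetical names, a map stands in for a real counter, and rejecting unknown values with an error is an assumption made for the example.

```go
package main

import "fmt"

// infeasibleReasons is the bounded set of values allowed for the reason
// label, copied from the list above.
var infeasibleReasons = map[string]bool{
	"guaranteed_pod_cpu_manager_static_policy":    true,
	"guaranteed_pod_memory_manager_static_policy": true,
	"static_pod":      true,
	"swap_limitation": true,
	"node_capacity":   true,
}

// infeasibleCounter is a map-based stand-in for a labeled counter.
type infeasibleCounter map[string]int

// record increments the series for reason. Values outside the bounded set
// are rejected, so free-form condition messages can never create new series.
func (c infeasibleCounter) record(reason string) error {
	if !infeasibleReasons[reason] {
		return fmt.Errorf("unknown infeasible resize reason %q", reason)
	}
	c[reason]++
	return nil
}
```

With this shape, the number of series can never exceed the number of entries in the allowed set, regardless of how detailed the message in the `PodResizePending` condition is.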
|
||
#### `kubelet_pod_deferred_resize_accepted_total` | ||
|
||
This metric tracks the total number of resize requests that the Kubelet originally marked as deferred but | ||
later accepted. This metric primarily exists because if a deferred resize is accepted through the timed retry as | ||
opposed to being explicitly signaled, it indicates an issue in the Kubelet's logic for handling deferred | ||
resizes that we should fix. | ||
|
||
Labels: | ||
- `retry_reason` - whether the resize was accepted through the timed retry or explicitly signaled. Possible values: `timed`, `signaled`. | ||
|
||
This metric is recorded as a counter. | ||
|
||
### Static CPU & Memory Policy | ||
|
||
Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of | ||
|
cc @ndixita
I don't have all the context but we might want to revisit or reuse this metric in the context of pod-level resources when resize of pod-level resources is supported