Commit a893ad0

KEP-3926: updating the PRR questionnaire

1 parent ad03a49 commit a893ad0

File tree

  • keps/sig-auth/3926-handling-undecryptable-resources

1 file changed: +77 -8 lines changed

keps/sig-auth/3926-handling-undecryptable-resources/README.md

Lines changed: 77 additions & 8 deletions
@@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
+- [Beta](#beta)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -558,6 +559,11 @@ in back-to-back releases.
 - Error type is implemented
 - Deletion of malformed etcd objects and its admission can be enabled via a feature flag
 
+#### Beta
+
+- Extended testing is available
+- Dry-Run is implemented
+
 ### Upgrade / Downgrade Strategy
 
 <!--
@@ -679,7 +685,8 @@ feature gate after having objects written with the new field) are also critical.
 You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->
-The implementation, including tests, is waiting for an approval of this enhancement.
+All tests verify feature enablement / disablement to ensure backwards
+compatibility.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -698,15 +705,19 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
+No impact on rollout or rollback.
 
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
-increases greatly, this feature might have caused a performance regression.
+If the average time of `apiserver_request_duration_seconds{verb="delete"}` or
+`apiserver_request_duration_seconds{verb="list"}`, or the amount of
+`apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`,
+increases greatly over an extended period of time, this feature might have
+caused a performance regression.
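
To make the rollback signal in the answer above concrete, here is a minimal Prometheus alerting-rule sketch over `apiserver_request_duration_seconds`; the group/alert names, threshold, and time windows are illustrative assumptions, not something this KEP prescribes.

```yaml
# Illustrative sketch only: names, threshold, and windows are assumptions.
groups:
  - name: kep-3926-rollback-signals
    rules:
      - alert: ApiserverDeleteListLatencyIncreased
        # Average DELETE/LIST latency over the last 30 minutes; a sustained jump
        # after enabling the feature is the kind of signal that should inform a
        # rollback. Note: the kube-apiserver typically exports verb label values
        # in upper case (DELETE, LIST).
        expr: |
          sum(rate(apiserver_request_duration_seconds_sum{verb=~"DELETE|LIST"}[30m]))
          /
          sum(rate(apiserver_request_duration_seconds_count{verb=~"DELETE|LIST"}[30m]))
          > 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Sustained increase in average kube-apiserver DELETE/LIST latency
```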
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
+No testing of upgrade->downgrade->upgrade necessary.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
+No deprecations.
 
 ### Monitoring Requirements
 
@@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+This feature is for cluster administrators performing emergency recovery, not for workload automation.
+
+To detect actual usage (i.e., unsafe deletions being performed):
+
+- Audit logs: search for the annotation `apiserver.k8s.io/unsafe-delete-ignore-read-error`.
+- RBAC: check RoleBindings/ClusterRoleBindings granting the `unsafe-delete-ignore-read-errors` verb.
+
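
As a sketch of the RBAC check mentioned above, a role granting the `unsafe-delete-ignore-read-errors` verb could look roughly like the following; the role name, subject, apiGroups, and resources are hypothetical and only illustrate what to look for when auditing RoleBindings/ClusterRoleBindings.

```yaml
# Hypothetical example of what to look for: a role carrying the unsafe-delete
# verb named in the answer above. Names, apiGroups, and resources are
# assumptions for illustration only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: corrupt-object-unsafe-deleter      # illustrative name
rules:
  - apiGroups: [""]
    resources: ["secrets"]                 # scope to the resources that are actually corrupt
    verbs: ["delete", "unsafe-delete-ignore-read-errors"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: corrupt-object-unsafe-deleter      # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: corrupt-object-unsafe-deleter
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: cluster-recovery-admin           # illustrative subject
```

An operator can then check which subjects are bound to any role that contains that verb.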
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -775,25 +795,38 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
+All corrupt object DELETEs complete, when the feature is enabled, the option is
+set, and the user is authorized.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 <!--
 Pick one more of these and delete the rest.
 -->
 
 - [x] Metrics
-  - Metric name: `apiserver_request_duration_seconds`
-  - [Optional] Aggregation method: `verb=delete`
-  - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_request_duration_seconds`
+    [Optional] Aggregation method:
+    - `verb=delete` > track the latency change from deleting corrupt objects (the latency should actually shorten)
+    - `verb=list` > track the latency increase from client re-lists
+    Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_current_inqueue_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect apiserver overload from request queueing
+  - Metric name: `apiserver_current_inflight_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect the apiserver being maxed out on requests consistently
 - [ ] Other (treat as last resort)
-  - Details:
+  - Details:
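
As a companion to the latency sketch earlier, the two gauge metrics listed above could feed a saturation-oriented alert along these lines; thresholds and names are assumptions and would need tuning to the cluster's normal in-queue / in-flight levels.

```yaml
# Illustrative sketch only: thresholds and names are assumptions.
groups:
  - name: kep-3926-apiserver-saturation
    rules:
      - alert: ApiserverRequestQueueing
        # Requests waiting in the queue indicate an overloaded apiserver,
        # e.g. due to re-lists after corrupt objects were deleted.
        expr: sum(apiserver_current_inqueue_requests) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: kube-apiserver requests are queueing for an extended period
      - alert: ApiserverInflightSaturation
        # Consistently high in-flight requests suggest the apiserver is running
        # at its request-handling limit.
        expr: sum(apiserver_current_inflight_requests) > 400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: kube-apiserver in-flight requests are consistently high
```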
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
+- No metric tracks when unsafe deletions actually occur; they are not assumed to
+  happen often.
 
 ### Dependencies
 
@@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:
 -->
+- kube-apiserver
 
 ### Scalability
 
@@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 The feature itself should not bring any concerns in terms of performance at scale.
+In particular, it is expected to be used only on potentially broken clusters.
 
-The only issue in terms of scaling comes with the error that attempts to list all
+An issue in terms of scaling comes with the error that attempts to list all
 resources that appeared to be malformed while reading from the storage. A limit
 of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
 
+Another scaling issue arises when the corrupt objects are deleted: client
+reflectors re-list to recover, which temporarily increases load on both the
+clients and the kube-apiserver.
+
 ###### Will enabling / using this feature result in any new API calls?
 
 <!--
@@ -849,6 +888,7 @@ Focusing mostly on:
 - periodic API calls to reconcile state (e.g. periodic fetching state,
   heartbeats, leader election, etc.)
 -->
+No.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -858,6 +898,7 @@ Describe them, providing:
 - Supported number of objects per cluster
 - Supported number of objects per namespace (for namespace-scoped objects)
 -->
+No.
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
@@ -866,6 +907,7 @@ Describe them, providing:
 - Which API(s):
 - Estimated increase:
 -->
+No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
@@ -875,6 +917,8 @@ Describe them, providing:
 - Estimated increase in size: (e.g., new annotation of size 32B)
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
+DeleteOptions gets a new boolean field, but it is transient: no persistence in
+etcd.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
 
+DELETE operations:
+
+- Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
+- Decreases latency for the unsafe delete itself
+
+LIST operations:
+
+- Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
+- Temporarily increases LIST request volume to apiserver
+- Latency increase depends on: number of watching clients × object count × apiserver resources
+
+Expected impact:
+
+- Negligible under the circumstance that the cluster is in a potentially broken
+  state.
+- Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 <!--
@@ -899,6 +960,12 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+Temporary increase during cleanup, dependent on object and resource type
+popularity:
+
+- apiserver: CPU / network during re-lists
+- client-side: CPU / memory / network during re-lists / rebuilding cache
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
 <!--
@@ -911,6 +978,8 @@ Are there any tests that were run/should be run to understand performance charac
 and validate the declared limits?
 -->
 
+No.
+
 ### Troubleshooting
 
 <!--
