@@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`.
100100      - [e2e tests](#e2e-tests)
101101  - [Graduation Criteria](#graduation-criteria)
102102    - [Alpha](#alpha)
103+     - [Beta](#beta)
103104  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
104105  - [Version Skew Strategy](#version-skew-strategy)
105106- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -558,6 +559,11 @@ in back-to-back releases.
558559-  Error type is implemented
559560-  Deletion of malformed etcd objects and its admission can be enabled via a feature flag
560561
562+ #### Beta
563+ 
564+ - Extended testing is available
565+ - Dry-run is implemented (a request sketch follows this list)
566+ 
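As a hedged sketch of the dry-run flow, the request could be expressed as a `DeleteOptions` body like the following; the field name `ignoreStoreReadErrorWithClusterBreakingPotential` is an assumption about the proposed DeleteOptions extension and may differ in the final API:

```yaml
# Sketch only: a dry-run unsafe delete expressed as a DeleteOptions body.
# `dryRun` is the existing standard DeleteOptions mechanism; the boolean
# field below is the assumed name of the new option from this enhancement.
apiVersion: meta.k8s.io/v1
kind: DeleteOptions
dryRun:
  - All
ignoreStoreReadErrorWithClusterBreakingPotential: true
```
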
561567### Upgrade / Downgrade Strategy  
562568
563569<!-- 
@@ -679,7 +685,8 @@ feature gate after having objects written with the new field) are also critical.
679685You can take a look at one potential example of such test in: 
680686https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 
681687--> 
682- The implementation, including tests, is waiting for an approval of this enhancement.
688+ All tests verify feature enablement / disablement to ensure backwards
689+ compatibility.
683690
684691### Rollout, Upgrade and Rollback Planning  
685692
@@ -698,15 +705,19 @@ feature flags will be enabled on some API servers and not others during the
698705rollout. Similarly, consider large clusters and how enablement/disablement 
699706will rollout across nodes. 
700707--> 
708+ No impact on rollout or rollback: the feature only affects DELETE requests that explicitly set the new option, so already running workloads are unaffected.
701709
702710###### What specific metrics should inform a rollback?  
703711
704712<!-- 
705713What signals should users be paying attention to when the feature is young 
706714that might indicate a serious problem? 
707715--> 
708- If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
709- increases greatly, this feature might have caused a performance regression.
716+ If the average latency of `apiserver_request_duration_seconds{verb="delete"}` or
717+ `apiserver_request_duration_seconds{verb="list"}`, or the amount of
718+ `apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`,
719+ increases greatly over an extended period of time, this feature might have
720+ caused a performance regression.
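One way to watch these signals is a Prometheus alerting rule. The sketch below assumes the Prometheus Operator's `PrometheusRule` CRD is available; the threshold is illustrative, and the `verb` label values follow the examples above and should be checked against what your apiserver version actually exposes:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: unsafe-delete-regression   # illustrative name
spec:
  groups:
    - name: apiserver-unsafe-delete
      rules:
        - alert: APIServerListLatencyHigh
          # Average LIST latency over 30 minutes; the 1s threshold is an
          # assumption and should be tuned to the cluster's baseline.
          expr: |
            sum(rate(apiserver_request_duration_seconds_sum{verb="list"}[30m]))
              / sum(rate(apiserver_request_duration_seconds_count{verb="list"}[30m])) > 1
          for: 30m
```
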
710721
711722###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?  
712723
@@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes.
715726Longer term, we may want to require automated upgrade/rollback tests, but we 
716727are missing a bunch of machinery and tooling and can't do that now. 
717728--> 
729+ No upgrade->downgrade->upgrade testing is necessary: the feature keeps no persistent state (the new DeleteOptions field is transient), so toggling it across versions has no lasting effect.
718730
719731###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?  
720732
721733<!-- 
722734Even if applying deprecation policies, they may still surprise some users. 
723735--> 
736+ No deprecations.
724737
725738### Monitoring Requirements  
726739
@@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
739752logs or events for this purpose. 
740753--> 
741754
755+ This feature is for cluster administrators performing emergency recovery, not for workload automation.
756+ 
757+ To detect actual usage (i.e., unsafe deletions being performed):
758+ 
759+ - Audit logs: search for the annotation `apiserver.k8s.io/unsafe-delete-ignore-read-error`.
760+ - RBAC: check for RoleBindings/ClusterRoleBindings that grant the `unsafe-delete-ignore-read-errors` verb (a sketch of such a role follows this list).
761+ 
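As a sketch of the RBAC side (the role name and resource scope are illustrative; the verb is the one named above), a ClusterRole granting unsafe deletes of pods could look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: emergency-unsafe-delete   # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    # The regular delete verb is listed alongside the new verb on the
    # assumption that an unsafe delete is still a delete.
    verbs: ["delete", "unsafe-delete-ignore-read-errors"]
```

Any binding that grants such a role is then a signal that unsafe deletion has been enabled for some principal.
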
742762###### How can someone using this feature know that it is working for their instance?  
743763
744764<!-- 
@@ -775,25 +795,38 @@ These goals will help you determine what you need to measure (SLIs) in the next
775795question. 
776796--> 
777797
798+ All DELETEs of corrupt objects complete successfully when the feature is
799+ enabled, the option is set, and the user is authorized.
800+ 
778801###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?  
779802
780803<!-- 
781804Pick one more of these and delete the rest. 
782805--> 
783806
784807- [x] Metrics
785-   - Metric name: `apiserver_request_duration_seconds`
786-   - [Optional] Aggregation method: `verb=delete`
787-   - Components exposing the metric: kube-apiserver
808+   - Metric name: `apiserver_request_duration_seconds`
809+     - [Optional] Aggregation method:
810+       - `verb=delete`: tracks the latency change from deleting corrupt objects (unsafe deletes should actually be faster)
811+       - `verb=list`: tracks latency increases caused by client re-lists
812+     - Components exposing the metric: kube-apiserver
813+   - Metric name: `apiserver_current_inqueue_requests`
814+     - Components exposing the metric: kube-apiserver
815+     - Details: detects apiserver overload from request queueing
816+   - Metric name: `apiserver_current_inflight_requests`
817+     - Components exposing the metric: kube-apiserver
818+     - Details: detects the apiserver being consistently maxed out on in-flight requests
788819- [ ] Other (treat as last resort)
789-   - Details:
820+   - Details:
790821
791822###### Are there any missing metrics that would be useful to have to improve observability of this feature?  
792823
793824<!-- 
794825Describe the metrics themselves and the reasons why they weren't added (e.g., cost, 
795826implementation difficulties, etc.). 
796827--> 
828+ - There is no metric tracking when unsafe deletions actually occur; they are
829+   not expected to happen often, and audit logs (see above) cover this need.
797830
798831### Dependencies  
799832
@@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
817850      - Impact of its outage on the feature: 
818851      - Impact of its degraded performance or high-error rates on the feature: 
819852--> 
853+ -  kube-apiserver
820854
821855### Scalability  
822856
@@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the
830864previous answers based on experience in the field. 
831865--> 
832866The feature itself should not bring any concerns in terms of performance at scale.
867+ In particular, it is intended for use on clusters that are already in a potentially broken state.
833868
834- The only  issue in terms of scaling comes with the error that attempts to list all
869+ One scaling issue comes from the error message that attempts to list all
835870resources that appeared to be malformed while reading from the storage. A limit
836871of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
837872
873+ Another scaling issue arises when the corrupt objects are deleted: client
874+ reflectors re-list to recover, which temporarily increases load on both the
875+ clients and the kube-apiserver.
876+ 
838877###### Will enabling / using this feature result in any new API calls?  
839878
840879<!-- 
@@ -849,6 +888,7 @@ Focusing mostly on:
849888  - periodic API calls to reconcile state (e.g. periodic fetching state, 
850889    heartbeats, leader election, etc.) 
851890--> 
891+ No.
852892
853893###### Will enabling / using this feature result in introducing new API types?  
854894
@@ -858,6 +898,7 @@ Describe them, providing:
858898  - Supported number of objects per cluster 
859899  - Supported number of objects per namespace (for namespace-scoped objects) 
860900--> 
901+ No.
861902
862903###### Will enabling / using this feature result in any new calls to the cloud provider?  
863904
@@ -866,6 +907,7 @@ Describe them, providing:
866907  - Which API(s): 
867908  - Estimated increase: 
868909--> 
910+ No.
869911
870912###### Will enabling / using this feature result in increasing size or count of the existing API objects?  
871913
@@ -875,6 +917,8 @@ Describe them, providing:
875917  - Estimated increase in size: (e.g., new annotation of size 32B) 
876918  - Estimated amount of new objects: (e.g., new Object X for every existing Pod) 
877919--> 
920+ DeleteOptions gets a new boolean field, but it is transient: no persistence in
921+ etcd.
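Concretely (reusing the assumed field name from the earlier sketch), the option travels only in the DELETE request body and is never written to storage:

```yaml
# Request-body-only document: consumed by the kube-apiserver when handling
# the DELETE call and never persisted, so object size in etcd is unchanged.
apiVersion: meta.k8s.io/v1
kind: DeleteOptions
ignoreStoreReadErrorWithClusterBreakingPotential: true
```
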
878922
879923###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?  
880924
@@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between
887931[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos 
888932--> 
889933
934+ DELETE operations:
935+ 
936+ -  Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
937+ -  Decreases latency for the unsafe delete itself
938+ 
939+ LIST operations:
940+ 
941+ -  Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
942+ -  Temporarily increases LIST request volume to apiserver
943+ -  Latency increase depends on: number of watching clients × object count × apiserver resources
944+ 
945+ Expected impact:
946+ 
947+ - Negligible in most cases, given that the feature is meant for clusters that
948+   are already in a potentially broken state.
949+ - Potentially noticeable for a popular resource (many watchers) with many objects on a resource-constrained apiserver.
950+ 
890951###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?  
891952
892953<!-- 
@@ -899,6 +960,12 @@ This through this both in small and large cases, again with respect to the
899960[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md 
900961--> 
901962
963+ A temporary increase during cleanup, depending on the popularity of the
964+ affected objects and resource type:
965+ 
966+ -  apiserver: CPU / network during re-lists
967+ -  client-side: CPU / memory / network during re-lists / rebuilding cache
968+ 
902969###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?  
903970
904971<!-- 
@@ -911,6 +978,8 @@ Are there any tests that were run/should be run to understand performance charac
911978and validate the declared limits? 
912979--> 
913980
981+ No.
982+ 
914983### Troubleshooting  
915984
916985<!-- 