@@ -558,6 +558,11 @@ in back-to-back releases.
 - Error type is implemented
 - Deletion of malformed etcd objects and its admission can be enabled via a feature flag
 
+#### Beta
+
+- Extended testing is available
+- Dry-Run is implemented
+
 ### Upgrade / Downgrade Strategy
 
 <!--
@@ -679,7 +684,8 @@ feature gate after having objects written with the new field) are also critical.
 You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->
-The implementation, including tests, is waiting for an approval of this enhancement.
+All tests verify feature enablement / disablement to ensure backwards
+compatibility.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -698,15 +704,19 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
+No impact. A rollout or rollback of this feature does not affect already
+running workloads.
 
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
-increases greatly, this feature might have caused a performance regression.
+If the average time of `apiserver_request_duration_seconds{verb="delete"}` or
+`apiserver_request_duration_seconds{verb="list"}`, or the amount of
+`apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`,
+increases greatly over an extended period of time, this feature might have
+caused a performance regression.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -715,12 +725,14 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
+No testing of the upgrade->downgrade->upgrade path is necessary.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
+No deprecations.
 
 ### Monitoring Requirements
 
@@ -739,6 +751,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+This feature is for cluster administrators performing emergency recovery, not
+for workload automation.
+
+To detect actual usage (i.e., unsafe deletions being performed):
+
+- Audit logs: search for the annotation `apiserver.k8s.io/unsafe-delete-ignore-read-error`.
+- RBAC: check which Roles/ClusterRoles (and their bindings) grant the
+  `unsafe-delete-ignore-read-errors` verb; a sketch of such a check follows this list.
+
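+One illustrative way to perform that RBAC check (not part of the enhancement) is
+to scan ClusterRoles with client-go; kubeconfig loading and error handling are
+kept deliberately minimal in this sketch:
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/clientcmd"
+)
+
+func main() {
+	// Load the default kubeconfig; in-cluster config would work equally well.
+	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+	if err != nil {
+		panic(err)
+	}
+	clientset, err := kubernetes.NewForConfig(config)
+	if err != nil {
+		panic(err)
+	}
+
+	// Report every ClusterRole rule that grants the unsafe-delete verb
+	// (or a wildcard verb, which implicitly includes it).
+	roles, err := clientset.RbacV1().ClusterRoles().List(context.TODO(), metav1.ListOptions{})
+	if err != nil {
+		panic(err)
+	}
+	for _, role := range roles.Items {
+		for _, rule := range role.Rules {
+			for _, verb := range rule.Verbs {
+				if verb == "unsafe-delete-ignore-read-errors" || verb == "*" {
+					fmt.Printf("ClusterRole %q grants %q on %v\n", role.Name, verb, rule.Resources)
+				}
+			}
+		}
+	}
+}
+```
+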
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -775,25 +794,38 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
+All DELETE requests for corrupt objects complete successfully when the feature
+is enabled, the option is set, and the user is authorized.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 <!--
 Pick one more of these and delete the rest.
 -->
 
 - [x] Metrics
-  - Metric name: `apiserver_request_duration_seconds`
-  - [Optional] Aggregation method: `verb=delete`
-  - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_request_duration_seconds`
+    - [Optional] Aggregation method:
+      - `verb=delete`: track the latency of deleting corrupt objects (unsafe deletes should actually be faster)
+      - `verb=list`: track the latency increase from client re-lists
+    - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_current_inqueue_requests`
+    - Components exposing the metric: kube-apiserver
+    - Details: detect apiserver overload from request queueing
+  - Metric name: `apiserver_current_inflight_requests`
+    - Components exposing the metric: kube-apiserver
+    - Details: detect the apiserver being consistently maxed out on in-flight requests
 - [ ] Other (treat as last resort)
-  - Details:
+  - Details:
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
+- There is no metric tracking when unsafe deletions actually occur; such
+  deletions are not expected to happen often.
 
 ### Dependencies
 
@@ -817,6 +849,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
   - Impact of its outage on the feature:
   - Impact of its degraded performance or high-error rates on the feature:
 -->
+- kube-apiserver
 
 ### Scalability
 
@@ -830,11 +863,16 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 The feature itself should not bring any concerns in terms of performance at scale.
+In particular, the feature is expected to be used only on clusters that are already broken.
 
-The only issue in terms of scaling comes with the error that attempts to list all
+An issue in terms of scaling comes with the error that attempts to list all
 resources that appeared to be malformed while reading from the storage. A limit
 of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
 
+Another scaling issue arises when the corrupt objects are deleted: client
+reflectors re-list to recover, which causes a temporary increase in load on
+both the clients and the kube-apiserver.
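+
+As a purely illustrative sketch (assuming an already constructed clientset), a
+client can observe these watch breaks through client-go's watch error handler;
+each invocation is typically followed by a fresh list against the kube-apiserver:
+
+```go
+package sketch
+
+import (
+	"log"
+
+	"k8s.io/client-go/informers"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/cache"
+)
+
+// logWatchBreaks wires a handler that fires whenever the informer's watch ends
+// with an error (for example after the ERROR event sent when a corrupt object
+// is unsafely deleted); the reflector then re-lists to rebuild its cache.
+func logWatchBreaks(clientset kubernetes.Interface, stopCh <-chan struct{}) {
+	factory := informers.NewSharedInformerFactory(clientset, 0)
+	inf := factory.Core().V1().ConfigMaps().Informer()
+	// Must be registered before the informer is started.
+	_ = inf.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
+		log.Printf("watch ended: %v; reflector will re-list", err)
+	})
+	factory.Start(stopCh)
+}
+```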
+
 ###### Will enabling / using this feature result in any new API calls?
 
 <!--
@@ -849,6 +887,7 @@ Focusing mostly on:
   - periodic API calls to reconcile state (e.g. periodic fetching state,
     heartbeats, leader election, etc.)
 -->
+No.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -858,6 +897,7 @@ Describe them, providing:
   - Supported number of objects per cluster
   - Supported number of objects per namespace (for namespace-scoped objects)
 -->
+No.
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
@@ -866,6 +906,7 @@ Describe them, providing:
   - Which API(s):
   - Estimated increase:
 -->
+No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
@@ -875,6 +916,8 @@ Describe them, providing:
   - Estimated increase in size: (e.g., new annotation of size 32B)
   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
+DeleteOptions gets a new boolean field, but it is transient: it is never
+persisted to etcd.
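+
+A minimal sketch of what issuing such a delete could look like from client-go,
+assuming the transient field is exposed on `metav1.DeleteOptions` under a name
+like `IgnoreStoreReadErrorWithClusterBreakingPotential` (the exact field name is
+defined by the API change itself, not by this example):
+
+```go
+package sketch
+
+import (
+	"context"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+)
+
+// unsafeDelete asks the apiserver to delete an object whose stored data can no
+// longer be read (e.g. undecryptable). The option only travels with this single
+// request; nothing is ever persisted to etcd.
+func unsafeDelete(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
+	ignoreReadErr := true // assumed field name below, shown for illustration only
+	return cs.CoreV1().Secrets(ns).Delete(ctx, name, metav1.DeleteOptions{
+		IgnoreStoreReadErrorWithClusterBreakingPotential: &ignoreReadErr,
+	})
+}
+```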
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -887,6 +930,23 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
 
+DELETE operations:
+
+- The unsafe DELETE path is faster (it skips preconditions, validation, and finalizers)
+- This decreases latency for the unsafe delete itself
+
+LIST operations:
+
+- Client-side reflectors re-list when their watch breaks (after the ERROR event for a corrupt object deletion)
+- This temporarily increases LIST request volume to the apiserver
+- The latency increase depends on: number of watching clients × object count × available apiserver resources
+
+Expected impact:
+
+- Negligible, given that the feature is used only when the cluster is already in
+  a potentially broken state.
+- Potentially noticeable for a popular resource (many watchers) with many objects on a resource-constrained apiserver
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 <!--
@@ -899,6 +959,12 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+A temporary increase during cleanup, dependent on object count and resource type
+popularity:
+
+- apiserver: CPU / network during re-lists
+- client-side: CPU / memory / network during re-lists and cache rebuilding
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
 <!--
@@ -911,6 +977,8 @@ Are there any tests that were run/should be run to understand performance charac
 and validate the declared limits?
 -->
 
+No.
+
 ### Troubleshooting
 
 <!--