Commit a893ad0

KEP-3926: updating the PRR questionnaire

1 parent ad03a49 commit a893ad0

File tree

  • keps/sig-auth/3926-handling-undecryptable-resources

1 file changed: +77 -8 lines changed

keps/sig-auth/3926-handling-undecryptable-resources/README.md

Lines changed: 77 additions & 8 deletions
@@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
 - [Alpha](#alpha)
+- [Beta](#beta)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -558,6 +559,11 @@ in back-to-back releases.
 - Error type is implemented
 - Deletion of malformed etcd objects and its admission can be enabled via a feature flag
 
+#### Beta
+
+- Extended testing is available
+- Dry-Run is implemented
+
 ### Upgrade / Downgrade Strategy
 
 <!--
@@ -679,7 +685,8 @@ feature gate after having objects written with the new field) are also critical.
 You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->
-The implementation, including tests, is waiting for an approval of this enhancement.
+All tests verify feature enablement / disablement to ensure backwards
+compatibility.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -698,15 +705,19 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
+No impact on rollout or rollback.
 
 ###### What specific metrics should inform a rollback?
 
 <!--
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
-increases greatly, this feature might have caused a performance regression.
+If the average time of `apiserver_request_duration_seconds{verb="delete"}` or
+`apiserver_request_duration_seconds{verb="list"}`, or the amount of
+`apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`,
+increases greatly over an extended period of time, this feature might have
+caused a performance regression.
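
To make the rollback signal in the answer above concrete, here is a minimal Prometheus alerting-rule sketch over `apiserver_request_duration_seconds`; the group/alert names, threshold, and time windows are illustrative assumptions, not something this KEP prescribes.

```yaml
# Illustrative sketch only: names, threshold, and windows are assumptions.
groups:
  - name: kep-3926-rollback-signals
    rules:
      - alert: ApiserverDeleteListLatencyIncreased
        # Average DELETE/LIST latency over the last 30 minutes; a sustained jump
        # after enabling the feature is the kind of signal that should inform a
        # rollback. Note: the kube-apiserver typically exports verb label values
        # in upper case (DELETE, LIST).
        expr: |
          sum(rate(apiserver_request_duration_seconds_sum{verb=~"DELETE|LIST"}[30m]))
          /
          sum(rate(apiserver_request_duration_seconds_count{verb=~"DELETE|LIST"}[30m]))
          > 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Sustained increase in average kube-apiserver DELETE/LIST latency
```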
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
+No testing of upgrade->downgrade->upgrade necessary.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
 <!--
 Even if applying deprecation policies, they may still surprise some users.
 -->
+No deprecations.
 
 ### Monitoring Requirements
 
@@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
 
+This feature is for cluster administrators performing emergency recovery, not for workload automation.
+
+To detect actual usage (i.e., unsafe deletions being performed):
+
+- Audit logs: search for the annotation `apiserver.k8s.io/unsafe-delete-ignore-read-error`.
+- RBAC: check RoleBindings/ClusterRoleBindings granting the `unsafe-delete-ignore-read-errors` verb.
+
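
As a sketch of the RBAC check mentioned above, a role granting the `unsafe-delete-ignore-read-errors` verb could look roughly like the following; the role name, subject, apiGroups, and resources are hypothetical and only illustrate what to look for when auditing RoleBindings/ClusterRoleBindings.

```yaml
# Hypothetical example of what to look for: a role carrying the unsafe-delete
# verb named in the answer above. Names, apiGroups, and resources are
# assumptions for illustration only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: corrupt-object-unsafe-deleter      # illustrative name
rules:
  - apiGroups: [""]
    resources: ["secrets"]                 # scope to the resources that are actually corrupt
    verbs: ["delete", "unsafe-delete-ignore-read-errors"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: corrupt-object-unsafe-deleter      # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: corrupt-object-unsafe-deleter
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: cluster-recovery-admin           # illustrative subject
```

An operator can then check which subjects are bound to any role that contains that verb.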
 ###### How can someone using this feature know that it is working for their instance?
 
 <!--
@@ -775,25 +795,38 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
 
+All corrupt object DELETEs complete, when the feature is enabled, the option is
+set, and the user is authorized.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 <!--
 Pick one more of these and delete the rest.
 -->
 
 - [x] Metrics
-  - Metric name: `apiserver_request_duration_seconds`
-  - [Optional] Aggregation method: `verb=delete`
-  - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_request_duration_seconds`
+    [Optional] Aggregation method:
+    - `verb=delete` > track the latency change from deleting corrupt objects (the latency should actually shorten)
+    - `verb=list` > track the latency increase from client re-lists
+    Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_current_inqueue_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect apiserver overload from request queueing
+  - Metric name: `apiserver_current_inflight_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect the apiserver being maxed out on requests consistently
 - [ ] Other (treat as last resort)
-  - Details:
+  - Details:
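
As a companion to the latency sketch earlier, the two gauge metrics listed above could feed a saturation-oriented alert along these lines; thresholds and names are assumptions and would need tuning to the cluster's normal in-queue / in-flight levels.

```yaml
# Illustrative sketch only: thresholds and names are assumptions.
groups:
  - name: kep-3926-apiserver-saturation
    rules:
      - alert: ApiserverRequestQueueing
        # Requests waiting in the queue indicate an overloaded apiserver,
        # e.g. due to re-lists after corrupt objects were deleted.
        expr: sum(apiserver_current_inqueue_requests) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: kube-apiserver requests are queueing for an extended period
      - alert: ApiserverInflightSaturation
        # Consistently high in-flight requests suggest the apiserver is running
        # at its request-handling limit.
        expr: sum(apiserver_current_inflight_requests) > 400
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: kube-apiserver in-flight requests are consistently high
```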
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
+- No metric tracks when unsafe deletions actually occur; they are not assumed to
+  happen often.
 
 ### Dependencies
 
@@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
 - Impact of its outage on the feature:
 - Impact of its degraded performance or high-error rates on the feature:
 -->
+- kube-apiserver
 
 ### Scalability
 
@@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 The feature itself should not bring any concerns in terms of performance at scale.
+In particular, it is expected to be used only on potentially broken clusters.
 
-The only issue in terms of scaling comes with the error that attempts to list all
+An issue in terms of scaling comes with the error that attempts to list all
 resources that appeared to be malformed while reading from the storage. A limit
 of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
 
+Another scaling issue arises when the corrupt objects are deleted: client
+reflectors re-list to recover, which temporarily increases load on both the
+clients and the kube-apiserver.
+
 ###### Will enabling / using this feature result in any new API calls?
 
 <!--
@@ -849,6 +888,7 @@ Focusing mostly on:
 - periodic API calls to reconcile state (e.g. periodic fetching state,
   heartbeats, leader election, etc.)
 -->
+No.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -858,6 +898,7 @@ Describe them, providing:
 - Supported number of objects per cluster
 - Supported number of objects per namespace (for namespace-scoped objects)
 -->
+No.
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
@@ -866,6 +907,7 @@ Describe them, providing:
 - Which API(s):
 - Estimated increase:
 -->
+No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 
@@ -875,6 +917,8 @@ Describe them, providing:
 - Estimated increase in size: (e.g., new annotation of size 32B)
 - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
+DeleteOptions gets a new boolean field, but it is transient: no persistence in
+etcd.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
 
+DELETE operations:
+
+- Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
+- Decreases latency for the unsafe delete itself
+
+LIST operations:
+
+- Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
+- Temporarily increases LIST request volume to apiserver
+- Latency increase depends on: number of watching clients × object count × apiserver resources
+
+Expected impact:
+
+- Negligible under the circumstance that the cluster is in a potentially broken
+  state.
+- Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
 <!--
@@ -899,6 +960,12 @@ This through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+Temporary increase during cleanup, dependent on object and resource type
+popularity:
+
+- apiserver: CPU / network during re-lists
+- client-side: CPU / memory / network during re-lists / rebuilding cache
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
 <!--
@@ -911,6 +978,8 @@ Are there any tests that were run/should be run to understand performance charac
 and validate the declared limits?
 -->
 
+No.
+
 ### Troubleshooting
 
 <!--
