diff --git a/keps/prod-readiness/sig-auth/3926.yaml b/keps/prod-readiness/sig-auth/3926.yaml
index a7890981385..5d7e9fec1a8 100644
--- a/keps/prod-readiness/sig-auth/3926.yaml
+++ b/keps/prod-readiness/sig-auth/3926.yaml
@@ -1,3 +1,5 @@
 kep-number: 3926
 alpha:
-  approver: "@deads2k"
\ No newline at end of file
+  approver: "@deads2k"
+beta:
+  approver: "@deads2k"
diff --git a/keps/sig-auth/3926-handling-undecryptable-resources/README.md b/keps/sig-auth/3926-handling-undecryptable-resources/README.md
index c2e16960fcf..f1012f1b427 100644
--- a/keps/sig-auth/3926-handling-undecryptable-resources/README.md
+++ b/keps/sig-auth/3926-handling-undecryptable-resources/README.md
@@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`.
 - [e2e tests](#e2e-tests)
 - [Graduation Criteria](#graduation-criteria)
   - [Alpha](#alpha)
+  - [Beta](#beta)
 - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
 - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -558,6 +559,11 @@ in back-to-back releases.
 - Error type is implemented
 - Deletion of malformed etcd objects and its admission can be enabled via a feature flag
 
+#### Beta
+
+- Extended testing is available
+- Dry-Run is implemented
+
 ### Upgrade / Downgrade Strategy
 
-The implementation, including tests, is waiting for an approval of this enhancement.
+All tests verify feature enablement / disablement to ensure backwards
+compatibility.
 
 ### Rollout, Upgrade and Rollback Planning
 
@@ -698,6 +705,7 @@ feature flags will be enabled on some API servers and not others during the
 rollout. Similarly, consider large clusters and how enablement/disablement
 will rollout across nodes.
 -->
+No impact on rollout or rollback.
 
 ###### What specific metrics should inform a rollback?
 
@@ -705,8 +713,11 @@ will rollout across nodes.
 What signals should users be paying attention to when the feature is young
 that might indicate a serious problem?
 -->
-If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
-increases greatly, this feature might have caused a performance regression.
+If the average time of `apiserver_request_duration_seconds{verb="delete"}` or
+`apiserver_request_duration_seconds{verb="list"}`, or the amount of
+`apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`,
+increases greatly over an extended period of time, this feature might have
+caused a performance regression.
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
@@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
+No testing of upgrade->downgrade->upgrade is necessary.
 
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 
+No deprecations.
 
 ### Monitoring Requirements
 
@@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->
+This feature is for cluster administrators performing emergency recovery, not for workload automation.
+
+To detect actual usage (i.e., unsafe deletions being performed):
+
+- Audit logs: search for the `apiserver.k8s.io/unsafe-delete-ignore-read-error` annotation.
+- RBAC: check for RoleBindings/ClusterRoleBindings granting the `unsafe-delete-ignore-read-errors` verb (see the sketch below).
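+
+For illustration only (not part of this KEP), a minimal client-go sketch of the
+RBAC check; scanning only ClusterRoles and loading the default kubeconfig are
+simplifying assumptions:
+
+```go
+// Report ClusterRoles whose rules grant the unsafe-delete-ignore-read-errors
+// verb (or a wildcard), as a starting point for auditing who could perform
+// unsafe deletions.
+package main
+
+import (
+	"context"
+	"fmt"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/clientcmd"
+)
+
+func main() {
+	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+	if err != nil {
+		panic(err)
+	}
+	client := kubernetes.NewForConfigOrDie(cfg)
+
+	roles, err := client.RbacV1().ClusterRoles().List(context.TODO(), metav1.ListOptions{})
+	if err != nil {
+		panic(err)
+	}
+	for _, role := range roles.Items {
+		for _, rule := range role.Rules {
+			for _, verb := range rule.Verbs {
+				if verb == "unsafe-delete-ignore-read-errors" || verb == "*" {
+					fmt.Printf("ClusterRole %q grants %q on %v\n", role.Name, verb, rule.Resources)
+				}
+			}
+		}
+	}
+}
+```
+
+A second pass over ClusterRoleBindings and RoleBindings would show which
+subjects are actually bound to such a role.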
+
 ###### How can someone using this feature know that it is working for their instance?
 
+All DELETE requests for corrupt objects complete when the feature is enabled,
+the option is set, and the user is authorized.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
 - [x] Metrics
-  - Metric name: `apiserver_request_duration_seconds`
-  - [Optional] Aggregation method: `verb=delete`
-  - Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_request_duration_seconds`
+    [Optional] Aggregation method:
+    - `verb=delete`: track the latency of deleting corrupt objects (this latency should actually decrease)
+    - `verb=list`: track the latency increase caused by client re-lists
+    Components exposing the metric: kube-apiserver
+  - Metric name: `apiserver_current_inqueue_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect apiserver overload from request queueing
+  - Metric name: `apiserver_current_inflight_requests`
+    Components exposing the metric: kube-apiserver
+    Details: detect the apiserver being consistently maxed out on in-flight requests
 - [ ] Other (treat as last resort)
-  - Details:
+  - Details:
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
@@ -794,6 +825,8 @@ Pick one more of these and delete the rest.
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
+- There is no metric tracking when unsafe deletions actually occur; they are
+  not expected to happen often.
 
 ### Dependencies
 
@@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
   - Impact of its outage on the feature:
   - Impact of its degraded performance or high-error rates on the feature:
 -->
+- kube-apiserver
 
 ### Scalability
 
@@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the
 previous answers based on experience in the field.
 -->
 The feature itself should not bring any concerns in terms of performance at scale.
+This holds in particular because it is intended to be used on potentially broken clusters.
 
-The only issue in terms of scaling comes with the error that attempts to list all
+One scaling issue comes from the error message, which attempts to list all
 resources that appeared to be malformed while reading from the storage. A limit
 of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.
 
+Another scaling issue arises when the corrupt objects are deleted: client
+reflectors re-list to recover, which temporarily increases load on both the
+clients and the kube-apiserver.
+
 ###### Will enabling / using this feature result in any new API calls?
 
+No.
 
 ###### Will enabling / using this feature result in introducing new API types?
 
@@ -858,6 +898,7 @@ Describe them, providing:
   - Supported number of objects per cluster
   - Supported number of objects per namespace (for namespace-scoped objects)
 -->
+No.
 
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 
@@ -866,6 +907,7 @@ Describe them, providing:
   - Which API(s):
   - Estimated increase:
 -->
+No.
 
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
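+
+As an aside illustrating the Scalability answers above, a minimal client-go
+informer sketch (not part of this KEP); the choice of Secrets and the resync
+period are arbitrary, and nothing feature-specific is needed on the client
+side:
+
+```go
+// A standard shared informer on Secrets. Per the Scalability answer above,
+// deleting a corrupt object breaks watches and the reflector recovers by
+// re-listing, which is the temporary LIST load this questionnaire calls out.
+package main
+
+import (
+	"time"
+
+	"k8s.io/client-go/informers"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/clientcmd"
+)
+
+func main() {
+	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+	if err != nil {
+		panic(err)
+	}
+	client := kubernetes.NewForConfigOrDie(cfg)
+
+	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
+	secretInformer := factory.Core().V1().Secrets().Informer()
+
+	stop := make(chan struct{})
+	factory.Start(stop)
+	factory.WaitForCacheSync(stop)
+
+	// Event handlers and controller logic are omitted; the point is only that
+	// recovery after an unsafe delete is handled by the normal reflector
+	// machinery, at the cost of the extra LIST traffic tracked by the metrics
+	// listed in the SLI section.
+	_ = secretInformer
+
+	close(stop)
+}
+```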
@@ -875,6 +917,8 @@ Describe them, providing:
   - Estimated increase in size: (e.g., new annotation of size 32B)
   - Estimated amount of new objects: (e.g., new Object X for every existing Pod)
 -->
+DeleteOptions gets a new boolean field, but it is transient: it is never
+persisted in etcd.
 
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
 
@@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
+DELETE operations:
+
+- The unsafe DELETE path is faster (it skips preconditions, validation, and finalizers)
+- Latency of the unsafe delete itself therefore decreases
+
+LIST operations:
+
+- Client-side reflectors re-list when their watch breaks (after the ERROR event caused by deleting a corrupt object)
+- This temporarily increases the LIST request volume on the apiserver
+- The latency increase depends on the number of watching clients, the object count, and available apiserver resources
+
+Expected impact:
+
+- Negligible in most cases, given that the cluster is already in a potentially
+  broken state.
+- Potentially noticeable for a popular resource (many watchers) with many objects on a resource-constrained apiserver
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
 
+Temporary increase during cleanup, dependent on the number of objects and how
+popular the affected resource type is:
+
+- apiserver: CPU / network during re-lists
+- client-side: CPU / memory / network during re-lists and cache rebuilding
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
 
+No.
+
 ### Troubleshooting
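+
+For troubleshooting context, a hedged sketch of how an administrator's tooling
+might issue the unsafe delete once a corrupt object has been identified. The
+namespace and object name are placeholders, and the
+`IgnoreStoreReadErrorWithClusterBreakingPotential` field name is an assumption
+about the new DeleteOptions boolean mentioned above (the text here does not
+name the field); it requires an apimachinery/client-go version that ships it:
+
+```go
+// Issue a DELETE with the unsafe option set. A normal GET on the corrupt
+// object would fail, so the request goes straight to DELETE. The request is
+// only honored when the feature gate is enabled and the caller is granted the
+// unsafe-delete-ignore-read-errors verb (see the RBAC notes earlier in this
+// questionnaire).
+package main
+
+import (
+	"context"
+
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/client-go/kubernetes"
+	"k8s.io/client-go/tools/clientcmd"
+	"k8s.io/utils/ptr"
+)
+
+func main() {
+	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
+	if err != nil {
+		panic(err)
+	}
+	client := kubernetes.NewForConfigOrDie(cfg)
+
+	opts := metav1.DeleteOptions{
+		// Assumed name of the new transient boolean field on DeleteOptions.
+		IgnoreStoreReadErrorWithClusterBreakingPotential: ptr.To(true),
+	}
+	if err := client.CoreV1().Secrets("example-ns").Delete(context.TODO(), "corrupt-secret", opts); err != nil {
+		panic(err)
+	}
+}
+```
+
+The corresponding audit event should carry the
+`apiserver.k8s.io/unsafe-delete-ignore-read-error` annotation noted under
+Monitoring Requirements.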