Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion keps/prod-readiness/sig-auth/3926.yaml
Original file line number Diff line number Diff line change
@@ -1,3 +1,5 @@
kep-number: 3926
alpha:
approver: "@deads2k"
approver: "@deads2k"
beta:
approver: "@deads2k"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put my name here, since I'm looking at this one.

85 changes: 77 additions & 8 deletions keps/sig-auth/3926-handling-undecryptable-resources/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`.
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
Expand Down Expand Up @@ -558,6 +559,11 @@ in back-to-back releases.
- Error type is implemented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above, missing links to integration tests, since you're not planning e2e at all, that's a major blocker for promotion.

- Deletion of malformed etcd objects and its admission can be enabled via a feature flag

#### Beta

- Extended testing is available
- Dry-Run is implemented

### Upgrade / Downgrade Strategy

<!--
Expand Down Expand Up @@ -679,7 +685,8 @@ feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->
The implementation, including tests, is waiting for an approval of this enhancement.
All tests verify feature enablement / disablement to ensure backwards
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, links for the tests, or make sure they are included in the earlier sections.

compatibility.

### Rollout, Upgrade and Rollback Planning

Expand All @@ -698,15 +705,19 @@ feature flags will be enabled on some API servers and not others during the
rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->
No impact on rollout or rollback.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver
increases greatly, this feature might have caused a performance regression.
If the average time of `apiserver_request_duration_seconds{verb="delete"}` or
`apiserver_request_duration_seconds{verb="list"}` the amount of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, can you make sure these metrics are mentioned at the end of kep.yaml in metrics section, please?

`apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests`
increases greatly over an extended period of time this feature might have
caused a performance regression.

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Expand All @@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes.
Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->
No testing of upgrade->downgrade->upgrade necessary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why no such tests are necessary?


###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->
No deprecations.

### Monitoring Requirements

Expand All @@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

This feature is for cluster administrators performing emergency recovery, not for workload automation.

To detect actual usage (i.e., unsafe deletions being performed):

- Audit logs: Search for annotation: `apiserver.k8s.io/unsafe-delete-ignore-read-error`.
- RBAC: Check RoleBindings/ClusterRoleBindings granting `unsafe-delete-ignore-read-errors` verb.

###### How can someone using this feature know that it is working for their instance?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question is missing answers.


<!--
Expand Down Expand Up @@ -775,25 +795,38 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

All corrupt object DELETEs complete, when feature is enabled, option is set and
the user is authorized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is about SLO, iow. what is the excpected time for delete completion? Check https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md for suggestions.


###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

- [x] Metrics
- Metric name: `apiserver_request_duration_seconds`
- [Optional] Aggregation method: `verb=delete`
- Components exposing the metric: kube-apiserver
- Metric name: `apiserver_request_duration_seconds`
[Optional] Aggregation method:
- `verb=delete` > track latency increase from deleting corrupt objects (the latency should actually shorten)
- `verb=list` > track latency increase from client re-lists
Components exposing the metric: kube-apiserver
- Metric name: `apiserver_current_inqueue_requests`
Components exposing the metric: kube-apiserver
Details: Detect apiserver overload from request queueing
- Metric name: `apiserver_current_inflight_requests`
Components exposing the metric: kube-apiserver
Details: Detect apiserver being maxed out on requests consistently
- [ ] Other (treat as last resort)
- Details:
- Details:

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->
- No metric tracking when unsafe deletions actually occur. Not assumed to happen
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
- No metric tracking when unsafe deletions actually occur. Not assumed to happen
- No new metric for tracking unsafe deletions is required. The available metrics (`apiserver_request_duration_seconds`) are sufficient to track the functionality of this feature.

often.

### Dependencies

Expand All @@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its outage on the feature:
- Impact of its degraded performance or high-error rates on the feature:
-->
- kube-apiserver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a question about external services to kubernetes. So in your case No is sufficient answer.


### Scalability

Expand All @@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the
previous answers based on experience in the field.
-->
The feature itself should not bring any concerns in terms of performance at scale.
In particular as its usage is supposed to run on potentially broken clusters.

The only issue in terms of scaling comes with the error that attempts to list all
An issue in terms of scaling comes with the error that attempts to list all
resources that appeared to be malformed while reading from the storage. A limit
of 100 presented resources was arbitrarily picked to prevent huge HTTP responses.

Another issue in terms of scaling happens when the corrupt objects are deleted.
Client reflectors re-list to recover, this causes temporarily increased load on
the client-side and the kube-apiserver.

###### Will enabling / using this feature result in any new API calls?

<!--
Expand All @@ -849,6 +888,7 @@ Focusing mostly on:
- periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.)
-->
No.

###### Will enabling / using this feature result in introducing new API types?

Expand All @@ -858,6 +898,7 @@ Describe them, providing:
- Supported number of objects per cluster
- Supported number of objects per namespace (for namespace-scoped objects)
-->
No.

###### Will enabling / using this feature result in any new calls to the cloud provider?

Expand All @@ -866,6 +907,7 @@ Describe them, providing:
- Which API(s):
- Estimated increase:
-->
No.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

Expand All @@ -875,6 +917,8 @@ Describe them, providing:
- Estimated increase in size: (e.g., new annotation of size 32B)
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->
DeleteOptions gets a new boolean field, but it is transient: no persistence in
etcd.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Expand All @@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

DELETE operations:

- Unsafe DELETE path is faster (skips preconditions, validation, finalizers)
- Decreases latency for the unsafe delete itself

LIST operations:

- Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event)
- Temporarily increases LIST request volume to apiserver
- Latency increase depends on: number of watching clients × object count × apiserver resources

Expected impact:

- Negligible under the circumstance that the cluster is in a potentially broken
state.
- Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
Expand All @@ -899,6 +960,12 @@ This through this both in small and large cases, again with respect to the
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

Temporary increase during cleanup, dependent on object and resource type
popularity:

- apiserver: CPU / network during re-lists
- client-side: CPU / memory / network during re-lists / rebuilding cache

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
Expand All @@ -911,6 +978,8 @@ Are there any tests that were run/should be run to understand performance charac
and validate the declared limits?
-->

No.

### Troubleshooting

<!--
Expand Down
8 changes: 1 addition & 7 deletions keps/sig-auth/3926-handling-undecryptable-resources/kep.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@ authors:
- "@stlaz"
owning-sig: sig-auth
participating-sigs:
- sig-auth
- sig-api-machinery
status: implementable
creation-date: 2023-03-27
Expand All @@ -14,13 +13,8 @@ reviewers:
approvers:
- "@deads2k"

see-also:
-
replaces:
-

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
Expand Down