- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.6k
KEP-3926: updating the PRR questionnaire #5645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
| @@ -1,3 +1,5 @@ | ||
| kep-number: 3926 | ||
| alpha: | ||
| approver: "@deads2k" | ||
| approver: "@deads2k" | ||
| beta: | ||
| approver: "@deads2k" | ||
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|  | @@ -100,6 +100,7 @@ tags, and then generate with `hack/update-toc.sh`. | |||||
| - [e2e tests](#e2e-tests) | ||||||
| - [Graduation Criteria](#graduation-criteria) | ||||||
| - [Alpha](#alpha) | ||||||
| - [Beta](#beta) | ||||||
| - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) | ||||||
| - [Version Skew Strategy](#version-skew-strategy) | ||||||
| - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) | ||||||
|  | @@ -558,6 +559,11 @@ in back-to-back releases. | |||||
| - Error type is implemented | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Above, missing links to integration tests, since you're not planning e2e at all, that's a major blocker for promotion. | ||||||
| - Deletion of malformed etcd objects and its admission can be enabled via a feature flag | ||||||
|  | ||||||
| #### Beta | ||||||
|  | ||||||
| - Extended testing is available | ||||||
| - Dry-Run is implemented | ||||||
|  | ||||||
| ### Upgrade / Downgrade Strategy | ||||||
|  | ||||||
| <!-- | ||||||
|  | @@ -679,7 +685,8 @@ feature gate after having objects written with the new field) are also critical. | |||||
| You can take a look at one potential example of such test in: | ||||||
| https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 | ||||||
| --> | ||||||
| The implementation, including tests, is waiting for an approval of this enhancement. | ||||||
| All tests verify feature enablement / disablement to ensure backwards | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, links for the tests, or make sure they are included in the earlier sections. | ||||||
| compatibility. | ||||||
|  | ||||||
| ### Rollout, Upgrade and Rollback Planning | ||||||
|  | ||||||
|  | @@ -698,15 +705,19 @@ feature flags will be enabled on some API servers and not others during the | |||||
| rollout. Similarly, consider large clusters and how enablement/disablement | ||||||
| will rollout across nodes. | ||||||
| --> | ||||||
| No impact on rollout or rollback. | ||||||
|  | ||||||
| ###### What specific metrics should inform a rollback? | ||||||
|  | ||||||
| <!-- | ||||||
| What signals should users be paying attention to when the feature is young | ||||||
| that might indicate a serious problem? | ||||||
| --> | ||||||
| If the average time of `apiserver_request_duration_seconds{verb="delete"}` of the kube-apiserver | ||||||
| increases greatly, this feature might have caused a performance regression. | ||||||
| If the average time of `apiserver_request_duration_seconds{verb="delete"}` or | ||||||
| `apiserver_request_duration_seconds{verb="list"}` the amount of | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit, can you make sure these metrics are mentioned at the end of  | ||||||
| `apiserver_current_inqueue_requests` or `apiserver_current_inflight_requests` | ||||||
| increases greatly over an extended period of time this feature might have | ||||||
| caused a performance regression. | ||||||
|  | ||||||
| ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||||||
|  | ||||||
|  | @@ -715,12 +726,14 @@ Describe manual testing that was done and the outcomes. | |||||
| Longer term, we may want to require automated upgrade/rollback tests, but we | ||||||
| are missing a bunch of machinery and tooling and can't do that now. | ||||||
| --> | ||||||
| No testing of upgrade->downgrade->upgrade necessary. | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you explain why no such tests are necessary? | ||||||
|  | ||||||
| ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? | ||||||
|  | ||||||
| <!-- | ||||||
| Even if applying deprecation policies, they may still surprise some users. | ||||||
| --> | ||||||
| No deprecations. | ||||||
|  | ||||||
| ### Monitoring Requirements | ||||||
|  | ||||||
|  | @@ -739,6 +752,13 @@ checking if there are objects with field X set) may be a last resort. Avoid | |||||
| logs or events for this purpose. | ||||||
| --> | ||||||
|  | ||||||
| This feature is for cluster administrators performing emergency recovery, not for workload automation. | ||||||
|  | ||||||
| To detect actual usage (i.e., unsafe deletions being performed): | ||||||
|  | ||||||
| - Audit logs: Search for annotation: `apiserver.k8s.io/unsafe-delete-ignore-read-error`. | ||||||
| - RBAC: Check RoleBindings/ClusterRoleBindings granting `unsafe-delete-ignore-read-errors` verb. | ||||||
|  | ||||||
| ###### How can someone using this feature know that it is working for their instance? | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This question is missing answers. | ||||||
|  | ||||||
| <!-- | ||||||
|  | @@ -775,25 +795,38 @@ These goals will help you determine what you need to measure (SLIs) in the next | |||||
| question. | ||||||
| --> | ||||||
|  | ||||||
| All corrupt object DELETEs complete, when feature is enabled, option is set and | ||||||
| the user is authorized. | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The question is about SLO, iow. what is the excpected time for delete completion? Check https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md for suggestions. | ||||||
|  | ||||||
| ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? | ||||||
|  | ||||||
| <!-- | ||||||
| Pick one more of these and delete the rest. | ||||||
| --> | ||||||
|  | ||||||
| - [x] Metrics | ||||||
| - Metric name: `apiserver_request_duration_seconds` | ||||||
| - [Optional] Aggregation method: `verb=delete` | ||||||
| - Components exposing the metric: kube-apiserver | ||||||
| - Metric name: `apiserver_request_duration_seconds` | ||||||
| [Optional] Aggregation method: | ||||||
| - `verb=delete` > track latency increase from deleting corrupt objects (the latency should actually shorten) | ||||||
| - `verb=list` > track latency increase from client re-lists | ||||||
| Components exposing the metric: kube-apiserver | ||||||
| - Metric name: `apiserver_current_inqueue_requests` | ||||||
| Components exposing the metric: kube-apiserver | ||||||
| Details: Detect apiserver overload from request queueing | ||||||
| - Metric name: `apiserver_current_inflight_requests` | ||||||
| Components exposing the metric: kube-apiserver | ||||||
| Details: Detect apiserver being maxed out on requests consistently | ||||||
| - [ ] Other (treat as last resort) | ||||||
| - Details: | ||||||
| - Details: | ||||||
|  | ||||||
| ###### Are there any missing metrics that would be useful to have to improve observability of this feature? | ||||||
|  | ||||||
| <!-- | ||||||
| Describe the metrics themselves and the reasons why they weren't added (e.g., cost, | ||||||
| implementation difficulties, etc.). | ||||||
| --> | ||||||
| - No metric tracking when unsafe deletions actually occur. Not assumed to happen | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit: 
        Suggested change
       
 | ||||||
| often. | ||||||
|  | ||||||
| ### Dependencies | ||||||
|  | ||||||
|  | @@ -817,6 +850,7 @@ and creating new ones, as well as about cluster-level services (e.g. DNS): | |||||
| - Impact of its outage on the feature: | ||||||
| - Impact of its degraded performance or high-error rates on the feature: | ||||||
| --> | ||||||
| - kube-apiserver | ||||||
| There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a question about external services to kubernetes. So in your case  | ||||||
|  | ||||||
| ### Scalability | ||||||
|  | ||||||
|  | @@ -830,11 +864,16 @@ For GA, this section is required: approvers should be able to confirm the | |||||
| previous answers based on experience in the field. | ||||||
| --> | ||||||
| The feature itself should not bring any concerns in terms of performance at scale. | ||||||
| In particular as its usage is supposed to run on potentially broken clusters. | ||||||
|  | ||||||
| The only issue in terms of scaling comes with the error that attempts to list all | ||||||
| An issue in terms of scaling comes with the error that attempts to list all | ||||||
| resources that appeared to be malformed while reading from the storage. A limit | ||||||
| of 100 presented resources was arbitrarily picked to prevent huge HTTP responses. | ||||||
|  | ||||||
| Another issue in terms of scaling happens when the corrupt objects are deleted. | ||||||
| Client reflectors re-list to recover, this causes temporarily increased load on | ||||||
| the client-side and the kube-apiserver. | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in any new API calls? | ||||||
|  | ||||||
| <!-- | ||||||
|  | @@ -849,6 +888,7 @@ Focusing mostly on: | |||||
| - periodic API calls to reconcile state (e.g. periodic fetching state, | ||||||
| heartbeats, leader election, etc.) | ||||||
| --> | ||||||
| No. | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in introducing new API types? | ||||||
|  | ||||||
|  | @@ -858,6 +898,7 @@ Describe them, providing: | |||||
| - Supported number of objects per cluster | ||||||
| - Supported number of objects per namespace (for namespace-scoped objects) | ||||||
| --> | ||||||
| No. | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in any new calls to the cloud provider? | ||||||
|  | ||||||
|  | @@ -866,6 +907,7 @@ Describe them, providing: | |||||
| - Which API(s): | ||||||
| - Estimated increase: | ||||||
| --> | ||||||
| No. | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in increasing size or count of the existing API objects? | ||||||
|  | ||||||
|  | @@ -875,6 +917,8 @@ Describe them, providing: | |||||
| - Estimated increase in size: (e.g., new annotation of size 32B) | ||||||
| - Estimated amount of new objects: (e.g., new Object X for every existing Pod) | ||||||
| --> | ||||||
| DeleteOptions gets a new boolean field, but it is transient: no persistence in | ||||||
| etcd. | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? | ||||||
|  | ||||||
|  | @@ -887,6 +931,23 @@ Think about adding additional work or introducing new steps in between | |||||
| [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos | ||||||
| --> | ||||||
|  | ||||||
| DELETE operations: | ||||||
|  | ||||||
| - Unsafe DELETE path is faster (skips preconditions, validation, finalizers) | ||||||
| - Decreases latency for the unsafe delete itself | ||||||
|  | ||||||
| LIST operations: | ||||||
|  | ||||||
| - Client-side reflectors re-list when their watch breaks (after corrupt object deletion ERROR event) | ||||||
| - Temporarily increases LIST request volume to apiserver | ||||||
| - Latency increase depends on: number of watching clients × object count × apiserver resources | ||||||
|  | ||||||
| Expected impact: | ||||||
|  | ||||||
| - Negligible under the circumstance that the cluster is in a potentially broken | ||||||
| state. | ||||||
| - Potentially noticeable if: popular resource (many watchers) × many objects × resource-constrained apiserver | ||||||
|  | ||||||
| ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? | ||||||
|  | ||||||
| <!-- | ||||||
|  | @@ -899,6 +960,12 @@ This through this both in small and large cases, again with respect to the | |||||
| [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md | ||||||
| --> | ||||||
|  | ||||||
| Temporary increase during cleanup, dependent on object and resource type | ||||||
| popularity: | ||||||
|  | ||||||
| - apiserver: CPU / network during re-lists | ||||||
| - client-side: CPU / memory / network during re-lists / rebuilding cache | ||||||
|  | ||||||
| ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? | ||||||
|  | ||||||
| <!-- | ||||||
|  | @@ -911,6 +978,8 @@ Are there any tests that were run/should be run to understand performance charac | |||||
| and validate the declared limits? | ||||||
| --> | ||||||
|  | ||||||
| No. | ||||||
|  | ||||||
| ### Troubleshooting | ||||||
|  | ||||||
| <!-- | ||||||
|  | ||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put my name here, since I'm looking at this one.