-
Notifications
You must be signed in to change notification settings - Fork 573
CNTRLPLANE-1576: add event-ttl configuration to kube-apiserver #2520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@tjungblu: This pull request references CNTRLPLANE-1576 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Hello @tjungblu! Some important instructions when contributing to openshift/api: |
b24d8da
to
e816e63
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
operator/v1/types_kubeapiserver.go
Outdated
// The value must be parseable as a time duration value; | ||
// see <https://pkg.go.dev/time#ParseDuration>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer not to use duration values anymore. Instead, create a int32 type, with units in the name
For example, this should be eventTTLMinutes
.
We do this because not all clients are built in Go, and building a Go compatible duration parsing in other languages is not going to be fun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for reminding me. I've changed this to be configurable in minutes via int32.
operator/v1/types_kubeapiserver.go
Outdated
// If configured, it must be a value of 1m (one minute) or greater, we only allow setting | ||
// minute and hour durations (e.g. 5m or 5h). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should set a sensible maximum value for this TTL. Can you suggest a sensible maximum value for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, made it 2x the default of 3h now
operator/v1/types_kubeapiserver.go
Outdated
// +kubebuilder:validation:Format=duration | ||
// +kubebuilder:validation:Pattern=^(0|([0-9]+(\.[0-9]+)?(m|h))+)$ | ||
// +kubebuilder:validation:Type:=string | ||
// +default="3h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a configuration API, we want to reserve the right to change this value over time. This will be easier if we don't set the value through openapi defaulting.
It's better here to set the value in the controller by detecting that it is omitted.
We generally add a comment to the godoc to explain this scenario
// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time.
// The current default value is 3h.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added this, great idea
operator/v1/types_kubeapiserver.go
Outdated
type KubeAPIServerSpec struct { | ||
StaticPodOperatorSpec `json:",inline"` | ||
|
||
// eventTTL specifies the amount of time that the events are stored before being deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would a user choose either a larger or smaller value for this? Is there a performance benefit? What is too large? What are the impacts of choosing something very small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How deep do we want to document this on the godoc vs. the actual openshift documentation?
Or is this just for your personal curiosity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should give at least some guidance as to what the effect is of choosing a smaller or larger value, doesn't have to be super deep, but if we can make the API self service that's preferable
My main concern is that someone sees they can change it to 1m in the oc explain
, so does so, without us giving them any hint as to potential issues this short feedback loop could introduce
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few words about lowering it, since in the other thread we left the 3h default as the maximum.
Let me know if a deeper explanation is required here, I'd rather not duplicate content too much and it diverges real quick as the docs team will not update the API along any docs bug coming in...
e816e63
to
6054d58
Compare
operator/v1/types_kubeapiserver.go
Outdated
// | ||
// +kubebuilder:default:=180 | ||
// +kubebuilder:validation:Minimum=30 | ||
// +kubebuilder:validation:Maximum=360 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have users that need to increase the event TTL above 3h? Choosing 3h as the maximum would preserve the upper bound when it comes to benchmarking and publishing supported cluster sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, upstream default TTL is 1h and it is already considering very high for today's cluster sizes.
The only reason we have 3h downstream as the default value is because origin jobs have a 3h timeout and this allows to retain the events for the entire duration of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would also be a case-in-point to have it configurable for longer periods in our CI. I've changed the upper bound to 3h again.
operator/v1/types_kubeapiserver.go
Outdated
StaticPodOperatorSpec `json:",inline"` | ||
|
||
// eventTTLMinutes specifies the amount of time that the events are stored before being deleted. | ||
// This setting is allowed between 30 minutes minimum up to 6h (360 minutes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the field value is an integer, it would be better to reverse these
// This setting is allowed between 30 minutes minimum up to 6h (360 minutes). | |
// The TTL is allowed between 30 minutes minimum up to a maximum of 360 minutes (6 hours). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
operator/v1/types_kubeapiserver.go
Outdated
// When omitted this means no opinion, and the platform is left to choose a reasonable default, which is subject to change over time. | ||
// The current default value is 3h (180 minutes). | ||
// | ||
// +kubebuilder:default:=180 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment on the old default, we shouldn't have the openapi default like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so removing ALL defaults or leaving this default and adding the other default back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed all defaults now
Is there an EP for this new field? It should be added behind a feature gate |
operator/v1/types_kubeapiserver.go
Outdated
// The current default value is 3h (180 minutes). | ||
// | ||
// +kubebuilder:default:=180 | ||
// +kubebuilder:validation:Minimum=30 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can go as far down as 5 minutes. This would allow customers that persists events externally to reduce the amount of events stored in their cluster as much as possible while still being able to know what is happening right now in their cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when there is an event spew, the only way to recover the cluster quickly is to reduce the event-ttl as much as possible. For that some people set the ttl to 0 in vanilla Kubernetes, but I think that for debug purposes, it would still be useful to have some events around, so 5 minutes is better IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For that some people set the ttl to 0 in vanilla Kubernetes
That would be bad, because 0 will not attach a lease on the object, so it will never be deleted :)
But I'm cool with 5 minutes, updating the api and enhancement accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that we’ll assess the trade-off of a lower TTL on resources such as CPU before promoting this setting to GA. If we get it wrong, we can always bump the value later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I'm looking into this right now, there's going to be a section in the enhancement. There are also some data points already, if you want to take a look
@JoelSpeed yes, there is: openshift/enhancements#1857 We can add it behind a feature gate, if required. |
This adds a minute based configuration to configure the event ttl setting in kube-apiserver. Default will stay 3h, as currently defined in KAS-O. Signed-off-by: Thomas Jungblut <[email protected]>
6054d58
to
61c6605
Compare
@tjungblu: This pull request references CNTRLPLANE-1576 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Ok, API changes LGTM, but lets get this behind a feature gate and go through that process |
Cool thanks, I'll open a separate one for the feature gate. /hold |
added the gate in here: #2525 |
@tjungblu: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
API PR in openshift/api#2520 Feature Gate PR in openshift/api#2525 Signed-off-by: Thomas Jungblut <[email protected]>
API PR in openshift/api#2520 Feature Gate PR in openshift/api#2525 Signed-off-by: Thomas Jungblut <[email protected]>
// +kubebuilder:validation:Minimum=5 | ||
// +kubebuilder:validation:Maximum=180 | ||
// +optional | ||
EventTTLMinutes int32 `json:"eventTTLMinutes,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mhm, what happens if an admin sets that field, but there is a controller that is using an old generated client that doesn’t know about it and that controller updates the spec?
is this something we should/usually take into account?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a general issue with adding new fields to any API. In this case, depending on the type of request they make (update vs patch), they would likely clear the field and then the behaviour would be reverted, it's not a catastrophic change of the behaviour so isn't going to burn the cluster to the ground.
If you wanted to, you could make this field un-removable, but I don't think that's really the right use case for this field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity, how could one make a field unremovable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need a chain of "un-removability" from the nearest ancestor that is always required
// +kubebuilder:validation:XValidation:rule="!oldSelf.?child.hasValue() || self.?child.hasValue()",optionalOldSelf=true,message="child may not be removed once set"
This would go on the struct definition of each ancestor til you find the nearest always required ancestor
This adds a minute/hour configuration duration to configure the event ttl setting in kube-apiserver. Default will stay 3h, as currently defined in KAS-O.
Corresponding enhancement in
openshift/enhancements#1857