Prevent unnecessary reconciles on provider restarts #696
Comments
I have done a basic implementation of this on a custom provider by adding the following fields to our MR CRDs:

	SpecHash string `json:"specHash,omitempty"`
	LastExternalReconcile metav1.Time `json:"lastExternalReconcile,omitempty"`

The following logic was done in the reconciler:

	func (c *External) Observe(ctx context.Context, mg resource.Managed) (managed.ExternalObservation, error) {
		// Calculate the hash of the spec of the MR
		hash, err := ComputeHash(mg)
		if err != nil {
			return managed.ExternalObservation{}, err
		}
		// On first run, the saved hash will be empty and/or the LastExternalReconcile will be empty
		firstRun := GetSpecHash(mg) == "" || GetLastExternalReconcile(mg).IsZero()
		// If the spec has changed or the sync period has passed, we need to resync
		needsResync := GetSpecHash(mg) != hash || GetLastExternalReconcile(mg).Add(*constants.SyncPeriod).Before(time.Now()) || mg.GetCondition(xpv1.TypeReady).Reason == xpv1.Deleting().Reason
		if firstRun || needsResync {
			result, err := c.Handler.Observe(ctx, mg)
			if err != nil {
				return result, errors.Wrap(err, "error in fetching resource")
			}
			if !result.ResourceExists {
				return result, nil
			}
			// Set the calculated hash in the MR status
			SetSpecHash(mg, hash)
			// Set the last external reconcile time in the MR status
			SetLastExternalReconcile(mg, time.Now())
			mg.SetConditions(xpv1.Available())
			return result, nil
		}
		return managed.ExternalObservation{ResourceExists: true, ResourceUpToDate: true}, nil
	}

This makes it so a reconcile is done if the MR spec changes or if the
Thanks for your efforts to describe this issue and for showing up to community meetings to advocate for it. We appreciate the way you're engaging on this @gravufo 🙇‍♂️ The general idea seems reasonable to me, so that large deployments can make the conscious trade-off of sync frequency (and drift detection) vs. control plane performance and health. A few thoughts:
I wonder how big the hash actually looks? I think we have moved to SSA, so things like annotation size shouldn't really matter, but I do wonder if we would hinder the experience of dealing with annotations relating to external-name by adding the hash in annotations.
Thank you for your feedback @jbw976! Here are my thoughts about every point you brought up:
I'm definitely not against making this behaviour opt-in, but I would argue that this would benefit everyone, including small-scale users. As a user (large or small scale), I expect my resource to be synced immediately when initially created or when I update the spec and I expect it to be regularly reconciled based on the From there on, it would be clear that the Not having this guarantee makes the
I think this is a possibility, however I see it more like a Condition, just as the Synced and Ready states are set. That is just my impression though, no strong opinion on where it is saved.
It's actually the
About the comment from @blakeromano:
In our implementation, we used hashstructure which generates small hashes that look like this:
I appreciate the feedback and I hope we can have a good discussion about this! Thanks everyone 👍
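For readers following along, here is a minimal sketch of what a spec-hash helper could look like. It is not the implementation described above: it swaps hashstructure for the standard library (SHA-256 over the JSON encoding), and `exampleSpec` and `computeSpecHash` are hypothetical names invented for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// exampleSpec is a hypothetical stand-in for the spec of a managed resource.
type exampleSpec struct {
	Location string            `json:"location"`
	Tags     map[string]string `json:"tags,omitempty"`
}

// computeSpecHash returns a hex-encoded SHA-256 digest of the JSON encoding
// of spec. encoding/json writes map keys in sorted order, so two specs with
// the same content produce the same digest.
func computeSpecHash(spec interface{}) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := computeSpecHash(exampleSpec{
		Location: "eastus",
		Tags:     map[string]string{"env": "prod", "team": "infra"},
	})
	b, _ := computeSpecHash(exampleSpec{
		Location: "eastus",
		Tags:     map[string]string{"team": "infra", "env": "prod"},
	})
	c, _ := computeSpecHash(exampleSpec{Location: "westus"})

	// Same content gives the same hash; a changed field gives a different one.
	fmt.Println(a == b, a == c) // true false
}
```

A SHA-256 digest is 64 hex characters, which is longer than a 64-bit hash, so the storage concern raised earlier in the thread would need to be weighed if something like this were used.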
I'm supportive of addressing the issue. I'm not sure about the proposed solution though. Some initial thoughts:
Is there anything we could do to the existing controller-runtime / client-go reconcile rate-limiting machinery to smooth out large bursts of reconciles on startup without needing to add tracking fields, additional logic, etc? I wonder if there's any prior art for addressing this issue in other Kubernetes controllers, especially controller-runtime based controllers. I imagine others must have faced similar problems.
You're right, it is indeed a behaviour change. I may have been too optimistic about it being enabled by default. I really don't mind if there's a flag, but aiming to have this behaviour by default would make more sense to me, because that's how I initially thought it would behave before I realized it didn't. Not sure how we could go about that though... feature-flag it and change the default in a major release?
I actually don't save it as a condition for that specific reason, I simply saved it as a field straight under
I agree that it would be nice to avoid the extra reconcile loop if possible, but exiting early mitigates most of the drawbacks, especially considering that the extra reconcile would only happen when the external API is called, which is not that frequent (depending on I tried to think of other products/controllers in the Kubernetes ecosystem that could have the same type of problem and the only thing I thought of is maybe ArgoCD since it reconciles apps every 3 minutes by default and has to sync external git repos and/or helm repos. Not sure if it's exactly the same though.
What problem are you facing?
Today, when provider pods are restarted (whether it's due to updates to the provider version or simply k8s nodes shuffling), the cache is completely lost, forcing the controllers to reconcile all MRs regardless of the last time they were synced and the configured poll frequency.
Depending on the provider, this can be fine and fast, but can sometimes be very slow and even problematic (think external APIs rate limiting). The specific use-case I am referring to is provider-upjet-azure which easily gets rate-limited by the Azure RM API if it gets restarted when managing a large number of MRs.
How could Crossplane help solve your problem?
Since we have no choice but to requeue all objects (this makes sense), I think a nice improvement would be to check the last time this object (assuming Synced and Ready are both true) was reconciled and calculate, based on the poll frequency, whether it needs to be reconciled again right now or should be requeued at the time the poll frequency would be hit.
This would prevent mega bursts of external calls every time we need to restart provider pods, regardless of the reason.