Feature: Deferred queue for no-op TGB #3861
Conversation
/ok-to-test
/retest
/retest
/lgtm
/lgtm
/approve
We plan to do a release as it is. However, I had a discussion with Zac offline and we can make the following changes as a follow-up post release (a hedged sketch of the consolidated annotation idea is included after this list):
- We can handle the requeue and skip the hash check within the controller instead of via k8s API calls (adding a reset hash), by returning a NewRequeueNeededAfter error and recording the hash to ignore on the next check.
- We can consolidate the annotations to reduce the footprint on TGB resources, e.g. a single elbv2.k8s.aws/checkpoint: <hash>/<timestamp> annotation instead of two. We can stay backwards compatible by simply ignoring hashes that don't match the new format.
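For illustration only, here is a minimal Go sketch of what the consolidated checkpoint annotation could look like. The annotation key matches the one proposed above; the package and helper names (formatCheckpoint, parseCheckpoint) are hypothetical and not part of the controller's actual code.

```go
// Hypothetical sketch of the single consolidated annotation suggested above,
// storing "<hash>/<timestamp>" under one key. Helper names are illustrative.
package checkpoint

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

const checkpointAnnotation = "elbv2.k8s.aws/checkpoint"

// formatCheckpoint renders the combined "<hash>/<timestamp>" value.
func formatCheckpoint(hash string, ts time.Time) string {
	return fmt.Sprintf("%s/%d", hash, ts.Unix())
}

// parseCheckpoint splits "<hash>/<timestamp>" back into its parts. Values that do
// not match the new format are ignored, keeping the change backwards compatible
// with the older two-annotation scheme.
func parseCheckpoint(annotations map[string]string) (hash string, ts time.Time, ok bool) {
	raw, exists := annotations[checkpointAnnotation]
	if !exists {
		return "", time.Time{}, false
	}
	parts := strings.SplitN(raw, "/", 2)
	if len(parts) != 2 {
		return "", time.Time{}, false // old/unknown format: treat as no checkpoint
	}
	unix, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return "", time.Time{}, false
	}
	return parts[0], time.Unix(unix, 0), true
}
```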
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: M00nF1sh, zac-nixon.
Issue
#3326
This is a continuation of #3821; I removed the ingress / svc hashing due to complexities in ensuring the ingress / svc hash is stable.
Description
As outlined in #3326, during controller restarts it can take a long time for the new controller to start processing meaningful changes, because the controller has no record of which resources are fresh and which are stale. This PR improves startup time for customers with many TargetGroupBindings by caching the last state of the endpoints / TGB spec as a SHA256 hash stored in the TGB annotations. When the controller detects that a reconcile is a no-op, by recomputing the SHA256 hash and comparing it against the stored value, it can short-circuit the reconciliation logic and quickly get the no-op TGB out of the queue.
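As a rough illustration of the checkpointing described above (not the PR's actual implementation), the following sketch hashes the serialized TGB spec plus endpoint data with SHA256 and compares it to the value stored in an annotation. The annotation key and function names here are assumptions.

```go
// Illustrative sketch of the checkpoint hash: a SHA256 digest over the TGB spec
// and the current endpoints, compared against the hash stored in an annotation.
// The annotation key and function names are assumptions for this example.
package checkpoint

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

const checkpointHashAnnotation = "elbv2.k8s.aws/checkpoint-hash" // assumed key

// calculateCheckpointHash returns a hex-encoded SHA256 digest over the TGB spec
// and endpoint data. Both inputs are serialized to JSON before hashing.
func calculateCheckpointHash(tgbSpec, endpoints interface{}) (string, error) {
	payload, err := json.Marshal(struct {
		Spec      interface{} `json:"spec"`
		Endpoints interface{} `json:"endpoints"`
	}{Spec: tgbSpec, Endpoints: endpoints})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:]), nil
}

// isNoOpReconcile reports whether the freshly computed hash matches the checkpoint
// already recorded on the TGB, in which case the reconcile can be short-circuited.
func isNoOpReconcile(annotations map[string]string, currentHash string) bool {
	return annotations[checkpointHashAnnotation] == currentHash
}
```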
Any TGBs that are deemed no-ops are put into a different queue that sidelines them for a "safe" amount of time to let all other reconciliations happen. After the safe period, these TGBs have their annotations reset, which triggers the main reconcile loop to reconcile them completely. The safe time is jittered so that all TGBs are not reconciled together: I noticed the standard reconcile loop will reconcile all TGBs at once, which can temporarily drain AWS API throttle limits. Adding jitter to the reconcile should help with general throttling issues.
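A minimal sketch of the jittered deferral, assuming placeholder values for the base delay and jitter bound; the real controller's constants and queue wiring may differ.

```go
// Hypothetical sketch of the jittered "safe" delay applied to no-op TGBs before
// they are requeued for a full reconcile. The base delay and jitter bound are
// placeholders, not the controller's actual values.
package checkpoint

import (
	"math/rand"
	"time"
)

// deferredRequeueDelay spreads sidelined TGBs over [baseDelay, baseDelay+maxJitter)
// so they do not all come back for a full reconcile at the same moment.
func deferredRequeueDelay(baseDelay, maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return baseDelay
	}
	return baseDelay + time.Duration(rand.Int63n(int64(maxJitter)))
}
```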
logs:
Commit 2
TL;DR
The original approach did not consider the case of containers restarting while keeping the same pod object. The reconcile logic is to deregister the pod when the container stops, then re-register it when the container starts up again. That part worked fine. The issue is that the solution would not re-run the reconcile loop to flip the readiness gate from false back to true.
I've added logic that clears out any checkpoints during register / deregister calls to ensure that we keep running the reconcile loop until all readiness gates are flipped to true. For controllers WITHOUT readiness gates, no extra reconciles are run.
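As an illustration of this fix (the names are assumptions, not the controller's real API), clearing the checkpoint on register / deregister forces the next reconcile to run fully instead of short-circuiting:

```go
// Hypothetical sketch: drop the stored checkpoint whenever targets are registered
// or deregistered, so the reconcile loop keeps running until all readiness gates
// have flipped to true. The annotation key is assumed.
package checkpoint

const checkpointHashAnnotation = "elbv2.k8s.aws/checkpoint-hash" // assumed key

// clearCheckpoint removes the checkpoint annotation in place; without a stored
// hash, the next reconcile cannot be classified as a no-op and runs fully.
func clearCheckpoint(annotations map[string]string) {
	delete(annotations, checkpointHashAnnotation)
}
```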