Feature: Deferred queue for no-op TGB #3861

zac-nixon · 2024-09-23T20:11:11Z

Issue

Continuation of #3821, I removed ingress / svc hashing due to some complexities with ensuring the ingress / svc hash is stable.

Description

As outlined in #3326, during controller restarts it might take a long time for the new controller to start processing meaningful changes. This is because the controller has no state of what is fresh and stale. This PR improves start time for customers with lots of TargetGroupBindings by caching the last state of the endpoints / tgb spec in a SHA256 hash which is stored within the TGB annotations. When the controller detects that the reconcile is a no-op, by recomputing the SHA256 on, it can short circuit the reconciliation logic to quickly get the no-op TGB out of the queue.

For any TGBs that are deemed no-ops they are put into a different queue that sideline the TGB for a "safe" amount of time to let all reconciliations happen. After the safe amount of time, these TGBs have their annotations reset which will trigger the main reconcile loop to reconcile completely. This safe time is jittered as to not reconcile all TGBs together, I noticed the standard reconcile loop will reconcile all TGB together which might drain AWS API throttle limits temporarily. By adding jitter to the reconcile this should help general throttling issues.

logs:

{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-mc"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"},"calculated hash":"IxSgm9GyLMW40ZYIQvig7HMFICZYqIaEuXkzi2O3deg/8VJ5SF6mKRO_9WTA0uPGEReYMZT1d2Piq0STqZxR7ls"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"k8s-esing-echoserv-3c797ed157"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-instance","namespace":"echoserver"},"calculated hash":"5SXxZawWspjdLtEOu46ZTAkDxWYc4LxAC2P_GFE7skw/9Ja0pzCp6YecOAJjY2TqwYoPEwZEb04tn3BiBHneLe8"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-instance"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-instance","namespace":"echoserver"},"calculated hash":"5SXxZawWspjdLtEOu46ZTAkDxWYc4LxAC2P_GFE7skw/9Ja0pzCp6YecOAJjY2TqwYoPEwZEb04tn3BiBHneLe8"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-instance"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"},"calculated hash":"IxSgm9GyLMW40ZYIQvig7HMFICZYqIaEuXkzi2O3deg/8VJ5SF6mKRO_9WTA0uPGEReYMZT1d2Piq0STqZxR7ls"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"k8s-esing-echoserv-3c797ed157"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-mc"}
....
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"echoserver-mc","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"echoserver-mc","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"echoserver-instance","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"echoserver-instance","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","msg":"authorizing securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a","permission":[{"FromPort":8081,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":8081,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-0016497699787096a","GroupName":null,"PeeringStatus":null,"UserId":null,"VpcId":null,"VpcPeeringConnectionId":null}]}]}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","msg":"authorized securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a"}
{"level":"info","ts":"2024-09-23T19:31:57Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:31:57Z","msg":"revoking securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a","permission":[{"FromPort":8081,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":8081,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-0016497699787096a","GroupName":null,"PeeringStatus":null,"UserId":"565768096483","VpcId":null,"VpcPeeringConnectionId":null}]}]}

Commit 2

TL;DR
The approach did not consider the case of containers restarting, while keeping the same pod object. The reconcile logic is to deregister the pod on container stop, then when the container starts up again re-register it. This part worked fine. The issue is that the solution would not re-run the reconcile loop again to flip the readiness gate from false to true.
I've added logic that will clear out any checkpoints during register / deregister calls to ensure that we will run the reconcile loop until all readiness gates are flipped to true. For controllers WITHOUT readiness gates, no extra reconciles are ran.

Checklist

Added tests that cover your change (if possible)
Added/modified documentation as required (such as the README.md, or the docs directory)
Manually tested
Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

Backfilled missing tests for code in same general area 🎉
Refactored something and made the world a better place 🌟

k8s-ci-robot · 2024-09-23T20:11:21Z

Hi @zac-nixon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

shraddhabang · 2024-09-23T20:20:48Z

/ok-to-test

k8s-ci-robot · 2024-09-24T05:51:51Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

zac-nixon · 2024-09-30T22:52:52Z

/retest

zac-nixon · 2024-10-04T16:41:12Z

/retest

shraddhabang · 2024-10-07T20:03:48Z

/lgtm

pkg/targetgroupbinding/resource_manager.go

M00nF1sh

/lgtm
/approve
We plan to do a release as it is. However, i have discussion with Zac offline and we can do below changes as follow up post release:

we can handle the requeue & skip hash check within the controller instead of via k8s api calls(adding reset hash) by
- return a NewRequeueNeededAfter error, and record the hash to ignore check.
we can consolidate the annotations to reduce footprint on TGB resources. e.g. a single elbv2.k8s.aws/checkpoint: <hash>/<timestamp> instead of two annotation. We can be backwards compatible by simply ignore hash don't match new format.

k8s-ci-robot · 2024-10-11T20:45:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: M00nF1sh, zac-nixon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [M00nF1sh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 23, 2024

k8s-ci-robot requested review from johngmyers and oliviassss September 23, 2024 20:11

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 23, 2024

k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 23, 2024

zac-nixon changed the title ~~Feature: Defered queue for no-op TGB~~ Feature: Deferred queue for no-op TGB Sep 23, 2024

zac-nixon force-pushed the znixon/perf-just-tgb branch from c8f5815 to f5b7fc4 Compare September 23, 2024 20:19

zac-nixon force-pushed the znixon/perf-just-tgb branch 4 times, most recently from 9db2194 to 5f6606c Compare September 24, 2024 20:43

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2024

zac-nixon force-pushed the znixon/perf-just-tgb branch 2 times, most recently from d5a1f17 to 8fbabba Compare September 30, 2024 18:57

zac-nixon added 2 commits September 30, 2024 17:49

Feature: Deferred TGB queue for no-op reconciles

43ea5c7

account for pods that go unready and revive themself

37eb983

zac-nixon force-pushed the znixon/perf-just-tgb branch from 8fbabba to 37eb983 Compare October 1, 2024 00:49

improve logging

01fac95

k8s-ci-robot assigned shraddhabang Oct 7, 2024

k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 7, 2024

shraddhabang reviewed Oct 8, 2024

View reviewed changes

pkg/targetgroupbinding/resource_manager.go Outdated Show resolved Hide resolved

preempt pod readiness gate changes by clearing check point

8313c91

zac-nixon force-pushed the znixon/perf-just-tgb branch from 8fda896 to 8313c91 Compare October 10, 2024 23:25

M00nF1sh approved these changes Oct 11, 2024

View reviewed changes

k8s-ci-robot assigned M00nF1sh Oct 11, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2024

k8s-ci-robot merged commit ebc3c25 into kubernetes-sigs:main Oct 11, 2024
9 checks passed

zac-nixon deleted the znixon/perf-just-tgb branch October 14, 2024 16:49

zac-nixon mentioned this pull request Oct 22, 2024

problem wait for podInfo repo sync, times out waiting for condition #3902

Closed

zac-nixon mentioned this pull request Dec 5, 2024

REQUEST: New membership for zac-nixon kubernetes/org#5292

Closed

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Deferred queue for no-op TGB #3861

Feature: Deferred queue for no-op TGB #3861

zac-nixon commented Sep 23, 2024 •

edited

Loading

k8s-ci-robot commented Sep 23, 2024

shraddhabang commented Sep 23, 2024

k8s-ci-robot commented Sep 24, 2024

zac-nixon commented Sep 30, 2024

zac-nixon commented Oct 4, 2024

shraddhabang commented Oct 7, 2024

M00nF1sh left a comment

k8s-ci-robot commented Oct 11, 2024

Feature: Deferred queue for no-op TGB #3861

Feature: Deferred queue for no-op TGB #3861

Conversation

zac-nixon commented Sep 23, 2024 • edited Loading

Issue

Description

Checklist

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

k8s-ci-robot commented Sep 23, 2024

shraddhabang commented Sep 23, 2024

k8s-ci-robot commented Sep 24, 2024

zac-nixon commented Sep 30, 2024

zac-nixon commented Oct 4, 2024

shraddhabang commented Oct 7, 2024

M00nF1sh left a comment

Choose a reason for hiding this comment

k8s-ci-robot commented Oct 11, 2024

zac-nixon commented Sep 23, 2024 •

edited

Loading