Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Deferred queue for no-op TGB #3861

Merged
merged 4 commits into from
Oct 11, 2024

Conversation

zac-nixon
Copy link
Collaborator

@zac-nixon zac-nixon commented Sep 23, 2024

Issue

#3326

Continuation of #3821, I removed ingress / svc hashing due to some complexities with ensuring the ingress / svc hash is stable.

Description

As outlined in #3326, during controller restarts it might take a long time for the new controller to start processing meaningful changes. This is because the controller has no state of what is fresh and stale. This PR improves start time for customers with lots of TargetGroupBindings by caching the last state of the endpoints / tgb spec in a SHA256 hash which is stored within the TGB annotations. When the controller detects that the reconcile is a no-op, by recomputing the SHA256 on, it can short circuit the reconciliation logic to quickly get the no-op TGB out of the queue.

For any TGBs that are deemed no-ops they are put into a different queue that sideline the TGB for a "safe" amount of time to let all reconciliations happen. After the safe amount of time, these TGBs have their annotations reset which will trigger the main reconcile loop to reconcile completely. This safe time is jittered as to not reconcile all TGBs together, I noticed the standard reconcile loop will reconcile all TGB together which might drain AWS API throttle limits temporarily. By adding jitter to the reconcile this should help general throttling issues.

logs:

{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-mc"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"},"calculated hash":"IxSgm9GyLMW40ZYIQvig7HMFICZYqIaEuXkzi2O3deg/8VJ5SF6mKRO_9WTA0uPGEReYMZT1d2Piq0STqZxR7ls"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"k8s-esing-echoserv-3c797ed157"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-instance","namespace":"echoserver"},"calculated hash":"5SXxZawWspjdLtEOu46ZTAkDxWYc4LxAC2P_GFE7skw/9Ja0pzCp6YecOAJjY2TqwYoPEwZEb04tn3BiBHneLe8"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-instance"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-instance","namespace":"echoserver"},"calculated hash":"5SXxZawWspjdLtEOu46ZTAkDxWYc4LxAC2P_GFE7skw/9Ja0pzCp6YecOAJjY2TqwYoPEwZEb04tn3BiBHneLe8"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-instance"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"},"calculated hash":"IxSgm9GyLMW40ZYIQvig7HMFICZYqIaEuXkzi2O3deg/8VJ5SF6mKRO_9WTA0uPGEReYMZT1d2Piq0STqZxR7ls"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"k8s-esing-echoserv-3c797ed157"}
{"level":"info","ts":"2024-09-23T19:30:56Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:30:56Z","logger":"deferredTGBQueue","msg":"enqueued new deferred TGB","TGB":"echoserver-mc"}
....
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"echoserver-mc","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"echoserver-mc","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"echoserver-instance","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"echoserver-instance","namespace":"echoserver"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"Processing deferred TGB","item":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","msg":"authorizing securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a","permission":[{"FromPort":8081,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":8081,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-0016497699787096a","GroupName":null,"PeeringStatus":null,"UserId":null,"VpcId":null,"VpcPeeringConnectionId":null}]}]}
{"level":"info","ts":"2024-09-23T19:31:56Z","logger":"deferredTGBQueue","msg":"TGB checkpoint reset","TGB":{"name":"k8s-esing-echoserv-3c797ed157","namespace":"es-ing"}}
{"level":"info","ts":"2024-09-23T19:31:56Z","msg":"authorized securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a"}
{"level":"info","ts":"2024-09-23T19:31:57Z","msg":"Skipping targetgroupbinding reconcile","TGB":{"name":"echoserver-mc","namespace":"echoserver"},"calculated hash":"ymUzgdqlKVVW0ZY2YGTF7OTKMiqHT002RcJLRXWDols/1iL7S-XTGHlK2M1d0-P0eu-rdNyvq0X9fwxUyz9mWa4"}
{"level":"info","ts":"2024-09-23T19:31:57Z","msg":"revoking securityGroup ingress","securityGroupID":"sg-086f1ccc63c13108a","permission":[{"FromPort":8081,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":8081,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-0016497699787096a","GroupName":null,"PeeringStatus":null,"UserId":"565768096483","VpcId":null,"VpcPeeringConnectionId":null}]}]}

Commit 2

TL;DR
The approach did not consider the case of containers restarting, while keeping the same pod object. The reconcile logic is to deregister the pod on container stop, then when the container starts up again re-register it. This part worked fine. The issue is that the solution would not re-run the reconcile loop again to flip the readiness gate from false to true.
I've added logic that will clear out any checkpoints during register / deregister calls to ensure that we will run the reconcile loop until all readiness gates are flipped to true. For controllers WITHOUT readiness gates, no extra reconciles are ran.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 23, 2024
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 23, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @zac-nixon. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 23, 2024
@zac-nixon zac-nixon changed the title Feature: Defered queue for no-op TGB Feature: Deferred queue for no-op TGB Sep 23, 2024
@shraddhabang
Copy link
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 23, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zac-nixon zac-nixon force-pushed the znixon/perf-just-tgb branch 4 times, most recently from 9db2194 to 5f6606c Compare September 24, 2024 20:43
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2024
@zac-nixon zac-nixon force-pushed the znixon/perf-just-tgb branch 2 times, most recently from d5a1f17 to 8fbabba Compare September 30, 2024 18:57
@zac-nixon
Copy link
Collaborator Author

/retest

@zac-nixon zac-nixon force-pushed the znixon/perf-just-tgb branch from 8fbabba to 37eb983 Compare October 1, 2024 00:49
@zac-nixon
Copy link
Collaborator Author

/retest

@shraddhabang
Copy link
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 7, 2024
@zac-nixon zac-nixon force-pushed the znixon/perf-just-tgb branch from 8fda896 to 8313c91 Compare October 10, 2024 23:25
Copy link
Collaborator

@M00nF1sh M00nF1sh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
We plan to do a release as it is. However, i have discussion with Zac offline and we can do below changes as follow up post release:

  1. we can handle the requeue & skip hash check within the controller instead of via k8s api calls(adding reset hash) by
    • return a NewRequeueNeededAfter error, and record the hash to ignore check.
  2. we can consolidate the annotations to reduce footprint on TGB resources. e.g. a single elbv2.k8s.aws/checkpoint: <hash>/<timestamp> instead of two annotation. We can be backwards compatible by simply ignore hash don't match new format.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 11, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: M00nF1sh, zac-nixon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit ebc3c25 into kubernetes-sigs:main Oct 11, 2024
9 checks passed
@zac-nixon zac-nixon deleted the znixon/perf-just-tgb branch October 14, 2024 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants