Increase default QPS and Burst value to improve slow startup duration. #1066

Closed

Conversation

nak3
Contributor

@nak3 nak3 commented Jun 21, 2023

🐛 This patch increases the default QPS and Burst values to shorten the slow startup duration.

Release Note

The defaults for `kube-api-qps` and `kube-api-burst` are now set to `200`. You can adjust them via environment variables if necessary.
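
As an illustration of the release note, here is a minimal Go sketch of how a controller could fall back to `200` for both limits while still honouring an override from the environment. The variable names `KUBE_API_QPS` and `KUBE_API_BURST` and the helper `rateLimits` are hypothetical and chosen only for this example; the thread does not name the actual variables, so check the net-kourier manifests before relying on them.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Hypothetical variable names, used only for illustration; the PR thread
// does not state which names net-kourier actually reads.
const (
	envQPS   = "KUBE_API_QPS"
	envBurst = "KUBE_API_BURST"
)

// Defaults taken from the release note.
const (
	defaultQPS   = 200.0
	defaultBurst = 200
)

// rateLimits returns the client-side QPS and burst. Valid positive values
// from the environment win; anything missing or malformed falls back to
// the defaults. The lookup function is injected so the logic can be
// exercised without touching the real process environment.
func rateLimits(lookup func(string) (string, bool)) (float64, int) {
	qps, burst := defaultQPS, defaultBurst
	if v, ok := lookup(envQPS); ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil && f > 0 {
			qps = f
		}
	}
	if v, ok := lookup(envBurst); ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			burst = n
		}
	}
	return qps, burst
}

func main() {
	qps, burst := rateLimits(os.LookupEnv)
	fmt.Printf("kube-api-qps=%v kube-api-burst=%d\n", qps, burst)
}
```

In a real controller the two values would typically be copied onto the `QPS` and `Burst` fields of the client-go `rest.Config` before any clients are built.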

/cc @skonto @maschmid

@knative-prow knative-prow bot requested review from maschmid and skonto June 21, 2023 11:58
@knative-prow

knative-prow bot commented Jun 21, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nak3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added labels on Jun 21, 2023: approved (indicates a PR has been approved by an approver from all required OWNERS files) and size/XS (denotes a PR that changes 0-9 lines, ignoring generated files).
@codecov

codecov bot commented Jun 21, 2023

Codecov Report

Merging #1066 (c564ebf) into main (2090f5b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main    #1066   +/-   ##
=======================================
  Coverage   80.95%   80.95%           
=======================================
  Files          18       18           
  Lines        1355     1355           
=======================================
  Hits         1097     1097           
  Misses        205      205           
  Partials       53       53           

@nak3
Contributor Author

nak3 commented Jun 23, 2023

/assign @skonto

@dprotaso
Contributor

Do you have a linked issue describing the slow startup?

@maschmid

There was originally #879 , which was eventually closed as stale.

@dprotaso
Contributor

Out of curiosity which outbound calls to the K8s API server are being throttled?

@maschmid

I believe it was services and endpoints gets, done by the IngressTranslator

@dprotaso
Contributor

> I believe it was services and endpoints gets, done by the IngressTranslator

Is there a reason those calls aren't using informers?

@nak3
Contributor Author

nak3 commented Jun 28, 2023

I was not involved in the discussion in #276, but the client is used so that all resources are listed correctly at startup.

https://github.com/knative-sandbox/net-kourier/blob/790358f73d01d37c908df5c7c54a7e275c22219e/pkg/reconciler/ingress/controller.go#L213-L214

I researched the context earlier in #968 (comment), but it was a little complicated, and #968 is still open for that (it was not intended to solve this informer vs. client question, but I think it is related).

@skonto
Contributor

skonto commented Jun 29, 2023

Finding good default values is an open question to me (it depends on the scale we support). If we can't avoid the direct K8s API server calls, then I don't see many options here. In general, this is a problem faced elsewhere too; see for example kedacore/keda#3730 and the fix kedacore/keda#3731.

@nak3
Contributor Author

nak3 commented Jun 29, 2023

We will actually turn them off in the future (knative/pkg#2756), so using a bigger value like 200 and letting users adjust it via APF or these env values should be alright for now, I think.

@nak3
Contributor Author

nak3 commented Jun 30, 2023

@skonto @dprotaso

Could you merge this before the 1.11 cut if you are alright with it?

@dprotaso
Contributor

I guess, to confirm my understanding: the listing to fill the informer cache is the slow piece right now? And increasing the QPS/Burst settings will help?

@nak3
Contributor Author

nak3 commented Jul 3, 2023

Yes, that's correct. As for "the listing to fill the informer cache is the slow piece right now", I need to test and confirm it myself, but I believe nothing has changed since #276.

@skonto
Contributor

skonto commented Jul 4, 2023

@dprotaso AFAIK getReadyIngresses is the slow part that lists ingresses (with a large number of them, things get stuck). This call accesses the API server directly: https://github.com/knative-sandbox/net-kourier/blob/main/pkg/reconciler/ingress/controller.go#L317

@nak3 could you confirm the last bits so we can merge?

@nak3
Contributor Author

nak3 commented Jul 4, 2023

getReadyIngresses also calls the k8s client, but the critical part is startupTranslator, which lists services, endpoints, etc., as @maschmid answered above in #1066 (comment).

Please see the log in #840 (or SRVKS-1078, our internal issue). It prints `Priming the config with ...`, so getReadyIngresses completed without getting stuck.

@skonto
Contributor

skonto commented Jul 6, 2023

Yeah, correct, now I remember ;) OK, so should we merge? Are there any objections?

@dprotaso
Contributor

dprotaso commented Jul 10, 2023

I'm confused: why doesn't startupTranslator use an informer? Aren't we then fetching the data twice?

@nak3
Contributor Author

nak3 commented Jul 11, 2023

Hmm... I tested #1075. It works, and startup is fast with 200 ksvcs.

I'm not sure what issue was observed with the informer when #276 was created (the code comment clearly says "The startup translator uses clients instead of listeners to correctly list all resources at startup."), but it works on the current net-kourier.

Should we merge #1075 instead of here?

@dprotaso
Contributor

> Should we merge #1075 instead of here?

Yeah - if it's fast then overall it's fetching less data from the API server
