Restart of autoscaler causes Knative to perform four status update calls per active revision #14669

Closed
SaschaSchwarze0 opened this issue Nov 24, 2023 · 6 comments · Fixed by #14866
Labels: area/autoscale, kind/bug
Milestone: v1.13.0

Comments

@SaschaSchwarze0 (Contributor)

When restarting the autoscaler component, Knative keeps itself busy for a while.

For every PodAutoscaler that relates to an active revision, I see the following happen:

  1. The PodAutoscaler status is updated to desiredScale=-1 here
  2. The Revision status is updated to desiredReplicas=nil here
  3. The PodAutoscaler status is updated to desiredScale=1 (for my revision with minScale=1) at the same code location as in (1)
  4. The Revision status is updated to desiredReplicas=1 at the same code location as in (2)

If you have just a few revisions in the system, this does not really matter. If you have 1,000 active revisions, it does: both the autoscaler and the controller must each perform two Kubernetes API calls per active revision. Assuming a client-side QPS limit of 50, that is 1,000 × 2 / 50 per second = 40 s. For that duration each component effectively cannot handle any other Knative-related operation (creation of a new KService, etc.) because it is throttled on the number of Kubernetes API calls.
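For illustration, a minimal sketch of that back-of-envelope math in Go. The `rest.Config` QPS/Burst fields are client-go's real client-side throttling knobs, but the concrete values (50 QPS, Burst 100) are assumptions taken from the scenario above, not Knative's actual configuration:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/rest"
)

func main() {
	// Numbers from the scenario above: 1,000 active revisions, two status
	// updates per revision in each component, client-side limit of 50 QPS.
	const (
		activeRevisions = 1000
		callsPerRev     = 2
		clientQPS       = 50.0
	)

	// client-go throttles on the client side via rest.Config.QPS/Burst; the
	// concrete values used by the autoscaler/controller are assumptions here.
	cfg := &rest.Config{QPS: clientQPS, Burst: 100}
	_ = cfg

	drain := time.Duration(float64(activeRevisions*callsPerRev)/clientQPS) * time.Second
	fmt.Printf("time spent draining restart-induced status updates: %v\n", drain) // 40s
}
```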

With a code change like this, I can easily prevent this, but I do not know whether it has negative side effects.

In what area(s)?

/area autoscale

Other classifications:

What version of Knative?

All recent versions.

Expected Behavior

A restart of the autoscaler should not cause unnecessary Kubernetes API calls for each active revision.

Actual Behavior

A restart of the autoscaler causes two Kubernetes API calls per active revision in both autoscaler and controller.

Steps to Reproduce the Problem

Have a KSvc with minScale=maxScale=1 in the system. Then restart the autoscaler.

@SaschaSchwarze0 added the kind/bug label on Nov 24, 2023
@dprotaso (Member)

Unsure if you have any experience with API Priority and Fairness?

If so, can you comment on knative/pkg#2756?

@dprotaso added this to the v1.13.0 milestone on Nov 24, 2023
@SaschaSchwarze0 (Contributor, Author) commented Nov 24, 2023

> Unsure if you have any experience with API Priority and Fairness?
>
> If so, can you comment on knative/pkg#2756?

Only to a certain degree. One thing that would be particularly interesting (and I am not sure whether it is possible): could Knative itself request different priorities for its API calls depending on the queue that is being processed? I think when it starts, everything is on the slow queue.

But anyway, the best option is always to omit requests entirely, no matter how they are prioritized, and that would be my preference here. As far as I understand it, the autoscaler sets desiredScale to -1 at startup because it has no metrics yet. Simply doing nothing instead sounds better and may have no side effects at all.
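For illustration only, a rough sketch of the kind of guard described here (not the actual change linked above); the types and names are stand-ins, assuming the scaler reports -1 while metrics are still missing:

```go
package main

import "log"

// Minimal stand-in types for illustration; these are not Knative's real API.
type PodAutoscalerStatus struct {
	DesiredScale *int32
}

type PodAutoscaler struct {
	Name   string
	Status PodAutoscalerStatus
}

// applyScale sketches the guard discussed above: when the freshly restarted
// autoscaler has no metrics yet (desiredScale == -1), keep the previously
// reported status instead of resetting it and triggering a second update.
func applyScale(pa *PodAutoscaler, desiredScale int32) {
	if desiredScale == -1 && pa.Status.DesiredScale != nil {
		log.Printf("no metrics yet, keeping desiredScale=%d for %s",
			*pa.Status.DesiredScale, pa.Name)
		return
	}
	pa.Status.DesiredScale = &desiredScale
}

func main() {
	one := int32(1)
	pa := &PodAutoscaler{Name: "my-revision", Status: PodAutoscalerStatus{DesiredScale: &one}}
	applyScale(pa, -1) // restart with no metrics: status stays at 1, no API call needed
	applyScale(pa, 1)  // metrics available again: status is confirmed at 1
}
```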

@dprotaso (Member)

Yeah, I agree. I was just highlighting that turning off client-side limiting could be a workaround, since you have some guards on the server side.
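For reference, a minimal sketch of what turning off client-go's client-side limiting can look like, assuming server-side API Priority and Fairness provides the actual protection; this is an illustration, not Knative's configuration:

```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// newUnthrottledClient builds a clientset whose client-side rate limiter is a
// no-op, leaving throttling entirely to the API server (APF). Sketch only.
func newUnthrottledClient(cfg *rest.Config) (kubernetes.Interface, error) {
	cfg = rest.CopyConfig(cfg)
	// An explicit no-op RateLimiter bypasses client-go's token-bucket limiter;
	// the QPS/Burst fields are then no longer consulted.
	cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	return kubernetes.NewForConfig(cfg)
}
```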

@psschwei (Contributor)

> With a code change like this, I can easily prevent this, but I do not know whether it has negative side effects.

I looked into this a little bit, and while I couldn't see any obvious side effects, that's far from a guarantee 😄 😟

I wonder if one possible way to move forward (assuming we want to add this) would be to:

  • put the change behind a gate (so that it would be opt-in, and if there were issues folks could turn it off)
  • regardless of gate status, log when this condition occurs so that folks can get some anecdotal data (and potentially report situations where it happens when we don't want it to); a rough sketch of this gate-plus-log combination follows below

Just thinking out loud a bit...
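For illustration, a rough sketch of that gate-plus-log idea; the flag, field, and function names are hypothetical, not existing Knative configuration:

```go
package example

import "log"

// Config stands in for wherever such an opt-in flag would live; the field and
// the notion of a "skip initial scale reset" gate are hypothetical.
type Config struct {
	SkipInitialScaleReset bool
}

// shouldKeepPreviousScale logs whenever the condition occurs (regardless of the
// gate) and only skips the reset when the gate is enabled.
func shouldKeepPreviousScale(cfg Config, desiredScale int32, hasPreviousScale bool) bool {
	if desiredScale != -1 || !hasPreviousScale {
		return false
	}
	log.Printf("autoscaler has no metrics yet; gate enabled=%v", cfg.SkipInitialScaleReset)
	return cfg.SkipInitialScaleReset
}
```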

@SaschaSchwarze0 (Contributor, Author) commented Feb 5, 2024

We have been running with this change patched in (and activated) in production for a couple of weeks now and have not observed any issues. I will open a PR.

@SaschaSchwarze0 (Contributor, Author)

Opened the PR without any configuration option, but with the log statement.
