Restart of autoscaler causes Knative to perform four status update calls per active revision #14669

Closed
SaschaSchwarze0 opened this issue Nov 24, 2023 · 6 comments · Fixed by #14866
Labels: area/autoscale, kind/bug
Milestone: v1.13.0

Comments

@SaschaSchwarze0 (Contributor)

When restarting the autoscaler component, Knative keeps itself busy for a while.

For every PodAutoscaler that relates to an active revision, I see the following happen:

  1. The PodAutoscaler status is updated to desiredScale=-1 here
  2. The Revision status is updated to desiredReplicas=nil here
  3. The PodAutoscaler status is updated to desiredScale=1 (for my revision with minScale=1) at the same code location as in (1)
  4. The Revision status is updated to desiredReplicas=1 at the same code location as in (2)

If you have just a few revisions in the system, this does not really matter. If you have 1,000 active revisions, it does: both the autoscaler and the controller must each perform two Kubernetes API calls per active revision. Assuming a client-side QPS limit of 50, that is 1,000 × 2 / 50 per second = 40 s. For that duration each component effectively cannot handle any other Knative-related operation (creation of a new KService, etc.) because it is throttled on the number of Kubernetes API calls.
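For illustration, a minimal sketch of that back-of-envelope math in Go. The `rest.Config` QPS/Burst fields are client-go's real client-side throttling knobs, but the concrete values (50 QPS, Burst 100) are assumptions taken from the scenario above, not Knative's actual configuration:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/rest"
)

func main() {
	// Numbers from the scenario above: 1,000 active revisions, two status
	// updates per revision in each component, client-side limit of 50 QPS.
	const (
		activeRevisions = 1000
		callsPerRev     = 2
		clientQPS       = 50.0
	)

	// client-go throttles on the client side via rest.Config.QPS/Burst; the
	// concrete values used by the autoscaler/controller are assumptions here.
	cfg := &rest.Config{QPS: clientQPS, Burst: 100}
	_ = cfg

	drain := time.Duration(float64(activeRevisions*callsPerRev)/clientQPS) * time.Second
	fmt.Printf("time spent draining restart-induced status updates: %v\n", drain) // 40s
}
```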

With a code change like this, I can easily prevent this, but I do not know whether it has negative side effects.

In what area(s)?

/area autoscale

Other classifications:

What version of Knative?

All recent versions.

Expected Behavior

A restart of the autoscaler should not cause unnecessary Kubernetes API calls for each active revision.

Actual Behavior

A restart of the autoscaler causes two Kubernetes API calls per active revision in both autoscaler and controller.

Steps to Reproduce the Problem

Have a KSvc with minScale=maxScale=1 in the system. Then restart the autoscaler.

@SaschaSchwarze0 added the kind/bug label on Nov 24, 2023
@dprotaso (Member)

Unsure if you have any experience with API Priority and Fairness?

If so, can you comment on knative/pkg#2756?

@dprotaso added this to the v1.13.0 milestone on Nov 24, 2023
@SaschaSchwarze0 (Contributor, Author) commented Nov 24, 2023

> Unsure if you have any experience with API Priority and Fairness?
>
> If so, can you comment on knative/pkg#2756?

Only to a certain degree. One thing that would be particularly interesting (and I am not sure whether it is possible): could Knative itself request different priorities for its API calls depending on the queue that is being processed? I think when it starts, everything is on the slow queue.

But anyway, the best option is always to omit requests entirely, no matter how they are prioritized, and that would be my preference here. As far as I understand it, the autoscaler sets desiredScale to -1 at startup because it has no metrics yet. Simply doing nothing instead sounds better and may have no side effects at all.
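For illustration only, a rough sketch of the kind of guard described here (not the actual change linked above); the types and names are stand-ins, assuming the scaler reports -1 while metrics are still missing:

```go
package main

import "log"

// Minimal stand-in types for illustration; these are not Knative's real API.
type PodAutoscalerStatus struct {
	DesiredScale *int32
}

type PodAutoscaler struct {
	Name   string
	Status PodAutoscalerStatus
}

// applyScale sketches the guard discussed above: when the freshly restarted
// autoscaler has no metrics yet (desiredScale == -1), keep the previously
// reported status instead of resetting it and triggering a second update.
func applyScale(pa *PodAutoscaler, desiredScale int32) {
	if desiredScale == -1 && pa.Status.DesiredScale != nil {
		log.Printf("no metrics yet, keeping desiredScale=%d for %s",
			*pa.Status.DesiredScale, pa.Name)
		return
	}
	pa.Status.DesiredScale = &desiredScale
}

func main() {
	one := int32(1)
	pa := &PodAutoscaler{Name: "my-revision", Status: PodAutoscalerStatus{DesiredScale: &one}}
	applyScale(pa, -1) // restart with no metrics: status stays at 1, no API call needed
	applyScale(pa, 1)  // metrics available again: status is confirmed at 1
}
```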

@dprotaso (Member)

Yeah, I agree. I was just highlighting that turning off client-side limiting could be a workaround, since you have some guards on the server side.
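For reference, a minimal sketch of what turning off client-go's client-side limiting can look like, assuming server-side API Priority and Fairness provides the actual protection; this is an illustration, not Knative's configuration:

```go
package example

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// newUnthrottledClient builds a clientset whose client-side rate limiter is a
// no-op, leaving throttling entirely to the API server (APF). Sketch only.
func newUnthrottledClient(cfg *rest.Config) (kubernetes.Interface, error) {
	cfg = rest.CopyConfig(cfg)
	// An explicit no-op RateLimiter bypasses client-go's token-bucket limiter;
	// the QPS/Burst fields are then no longer consulted.
	cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()
	return kubernetes.NewForConfig(cfg)
}
```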

@psschwei (Contributor)

> With a code change like this, I can easily prevent this, but I do not know whether it has negative side effects.

I looked into this a little bit, and while I couldn't see any obvious side effects, that's far from a guarantee 😄 😟

I wonder if one possible way to move forward (assuming we want to add this) would be to:

  • put the change behind a gate (so that it would be opt-in, and if there were issues folks could turn it off)
  • regardless of gate status, log when this condition occurs so that folks can get some anecdotal data (and potentially report situations where it happens when we don't want it to); a rough sketch of this gate-plus-log combination follows below

Just thinking out loud a bit...
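For illustration, a rough sketch of that gate-plus-log idea; the flag, field, and function names are hypothetical, not existing Knative configuration:

```go
package example

import "log"

// Config stands in for wherever such an opt-in flag would live; the field and
// the notion of a "skip initial scale reset" gate are hypothetical.
type Config struct {
	SkipInitialScaleReset bool
}

// shouldKeepPreviousScale logs whenever the condition occurs (regardless of the
// gate) and only skips the reset when the gate is enabled.
func shouldKeepPreviousScale(cfg Config, desiredScale int32, hasPreviousScale bool) bool {
	if desiredScale != -1 || !hasPreviousScale {
		return false
	}
	log.Printf("autoscaler has no metrics yet; gate enabled=%v", cfg.SkipInitialScaleReset)
	return cfg.SkipInitialScaleReset
}
```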

@SaschaSchwarze0 (Contributor, Author) commented Feb 5, 2024

We have been running with this change patched in (and activated) in production for a couple of weeks now and have not observed any issues. I will open a PR.

@SaschaSchwarze0 (Contributor, Author)

Opened the PR without any configuration option, but with the log statement.
