
improve the stability of autoscaling when HPA is enabled #450

Merged — merged 7 commits into streamnative:master on Sep 7, 2022

Conversation

tpiperatgod
Contributor


Fixes #444


Motivation

When HPA is enabled, reconciliation is triggered far too often: status changes on the HPA and the StatefulSet retrigger it, and spec.replicas values written by the HPA conflict with values re-applied from manifests. This destabilizes autoscaling (see #444).

Modifications

  • add spec.minReplicas to indicate the minimum number of replicas for the workloads
  • change spec.replicas to an optional field
  • no longer watch statefulSet.status change events (see the event-filter sketch after this list)
  • tighten the conditions for deciding whether a resource needs to be created or updated, so that frequent updates without substantive changes are avoided
  • update CRDs (also in helm charts)
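
A minimal sketch of the event filtering mentioned above, assuming the controller is built with controller-runtime; the v1alpha1 import path and the exact predicate used in the merged code are assumptions, not the actual implementation:

package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	"github.com/streamnative/function-mesh/api/v1alpha1" // assumed import path
)

func (r *FunctionReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Function{}).
		// GenerationChangedPredicate only lets spec changes through
		// (metadata.generation is not bumped by status-only updates), so
		// statefulSet.status churn no longer retriggers reconciliation.
		Owns(&appsv1.StatefulSet{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}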

Verifying this change

  • Make sure that the change passes the CI checks.


Documentation

Need to update docs?

  • doc-required

@tpiperatgod tpiperatgod requested review from nlu90, freeznet and a team as code owners August 17, 2022 01:27
@github-actions github-actions bot added the doc-required This pr needs a document label Aug 17, 2022
r.Spec.MinReplicas = new(int32)
*r.Spec.MinReplicas = *r.Spec.Replicas
} else {
r.Spec.Replicas = new(int32)
Member

Will FM still hit the HPA error, since replicas gets set to 1 even when the user doesn't set it?

Contributor Author

In this PR, spec.replicas becomes optional, so the user does not need to configure it (e.g. in an Argo template) and only needs to configure spec.minReplicas as the minimum.

If spec.replicas is changed under HPA control, applying the template file again (without spec.replicas configured in it) will not trigger reconciliation.
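
An illustrative sketch of the defaulting described here; the field names follow the diff excerpt below, but the surrounding type and the exact defaulting rules in the merged webhook may differ:

func (r *Function) Default() {
	if r.Spec.Replicas != nil && r.Spec.MinReplicas == nil {
		// replicas was given explicitly: use it as the floor as well
		r.Spec.MinReplicas = new(int32)
		*r.Spec.MinReplicas = *r.Spec.Replicas
	} else if r.Spec.Replicas == nil {
		// replicas omitted (e.g. an Argo/GitOps template): start from minReplicas,
		// or 1 if that is also unset; afterwards the HPA is free to adjust replicas
		r.Spec.Replicas = new(int32)
		if r.Spec.MinReplicas != nil {
			*r.Spec.Replicas = *r.Spec.MinReplicas
		} else {
			*r.Spec.Replicas = 1
		}
	}
}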

Log logr.Logger
Scheme *runtime.Scheme
functionGenerations *sync.Map
isFunctionGenerationIncreased bool
Member

Looks like functionGenerations stores each function's generation separately in a map, while isFunctionGenerationIncreased is shared across all functions. Will this cause any conflicts?

And when the FM controller pod is recreated, this data will be lost; does that have any effect on the reconciliation?

Contributor Author

You're right, I need to recheck the isFunctionGenerationIncreased issue.

If the controller restarts, a routine reconciliation is performed and the Generations state of all resources is restored.

@tpiperatgod
Contributor Author

Simple case:

  1. Create a Function with a configuration like the one below:
spec:
  minReplicas: 1
  maxReplicas: 10
  pod:
    autoScalingMetrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
        selectPolicy: Max
  2. Before the first HPA period begins, the Function's spec.replicas is the same as spec.minReplicas and is passed to the StatefulSet's spec.replicas
  3. The HPA triggers autoscaling, for example scaling the Function to 2 replicas
  4. The Function's spec.replicas is changed to 2 by the HPA
  5. The increase of the Function's Generation in the reconciliation logic triggers an update of the StatefulSet, changing its spec.replicas to 2
  6. At this point, if you apply the Function again (using the configuration above), the Function does not trigger a new reconciliation process (see the generation-check sketch below)
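
A minimal sketch of the generation check referenced in the last step, using the functionGenerations sync.Map from the reconciler excerpt above; the real checkIfFunctionGenerationsIsIncreased in this PR may differ (it was later reworked around an ObservedGeneration status field, see the review thread further down), and the types import is assumed:

import "k8s.io/apimachinery/pkg/types"

func (r *FunctionReconciler) checkIfFunctionGenerationsIsIncreased(function *v1alpha1.Function) bool {
	key := types.NamespacedName{Namespace: function.Namespace, Name: function.Name}.String()
	last, ok := r.functionGenerations.Load(key)
	if !ok || last.(int64) < function.Generation {
		// metadata.generation only moves on spec changes, so this is a real update
		r.functionGenerations.Store(key, function.Generation)
		return true
	}
	// re-applying an identical manifest (or one without spec.replicas) leaves the
	// generation unchanged, so the StatefulSet is not touched again
	return false
}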

@armangurkan

(screenshot of the operator error message)
We have deployed the code in this branch; this is the error message we get.

@tpiperatgod
Contributor Author

We have deployed the code in this branch; this is the error message we get.

This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.

Can you describe which resources are currently not working as expected, and how? And can you paste the current configuration?

@armangurkan

This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.

Can you describe which resources are currently not working as expected, and how? And can you paste the current configuration?

What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods being shut down and re-created. At first we thought the function was actually processing data and being triggered every time there was a new message on the topic it was subscribed to, so to test this we shut down all the traffic, but the behavior persisted. I then checked the function-mesh operator logs and saw this error message.

@tpiperatgod
Contributor Author

What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods being shut down and re-created. At first we thought the function was actually processing data and being triggered every time there was a new message on the topic it was subscribed to, so to test this we shut down all the traffic, but the behavior persisted. I then checked the function-mesh operator logs and saw this error message.

This is consistent with what I observed prior to fixing this issue, which was mainly caused by:

  1. FunctionMesh listens to too many events (HPA, StatefulSet), including events generated by changes to HPA.status and StatefulSet.status. These constantly trigger new reconciliation processes, and the changes made by each reconciliation in turn trigger further reconciliations...
  2. There is currently an issue with FunctionMesh: when a Function/Sink/Source is automatically scaled by the HPA, all pods are rescheduled every time, which causes the metrics (CPU/memory) to climb over time, so the HPA keeps changing the number of replicas of the Function/Sink/Source.

For reason 1, this PR filters out the events generated by StatefulSet.status and refines the reconciliation trigger conditions for FunctionMesh.

For reason 2, I suggest increasing the stabilizationWindowSeconds of the HPA; see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window

Can you paste the configuration of spec.pod.autoScalingBehavior and spec.pod.autoScalingMetrics?

And for resources that don't need HPA, you can remove spec.maxReplicas.

@armangurkan

Can you paste the configuration of spec.pod.autoScalingBehavior and spec.pod.autoScalingMetrics?

And for resources that don't need HPA, you can remove spec.maxReplicas.

At this point all our functions and sinks need scaling in our function mesh.

        autoScalingMetrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 60

This is currently all we have, but we can define autoScalingBehavior if necessary. Do you have benchmarks in mind so we can take a shot at the scale-up and scale-down parameters?

@tpiperatgod
Contributor Author

tpiperatgod commented Aug 19, 2022

        autoScalingMetrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 60

This is currently all we have, but we can define autoScalingBehavior if necessary. Do you have benchmarks in mind so we can take a shot at the scale-up and scale-down parameters?

My preferred configuration is as follows.

  pod:
    autoScalingMetrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80 # I have observed that the cpu usage of the pod stays around 57% when idle
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 120
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 120
        policies:
        - type: Percent
          value: 50
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
        selectPolicy: Max

By default, autoScalingBehavior.scaleUp.stabilizationWindowSeconds is 0, which means that as soon as the HPA's scaling algorithm meets a condition (e.g., CPU usage exceeds 80%), the HPA immediately increases the number of pods. Because of this issue, every time the number of replicas of a workload changes, all pods are rescheduled (i.e., rebuilt), which makes the CPU usage metric spike, which in turn keeps pushing up the HPA's desired replica count.

I suggest setting autoScalingBehavior.scaleUp.stabilizationWindowSeconds to 120 or another reasonable value, which makes the HPA less sensitive.

In addition, autoScalingBehavior.scaleDown.policies and autoScalingBehavior.scaleUp.policies can be used to control the magnitude of scaling.
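
As a rough illustration of how the two scaleUp policies in the configuration above combine under selectPolicy: Max (this mirrors the documented HPA semantics, not the HPA controller's actual code):

package main

import (
	"fmt"
	"math"
)

// With the policies above (Percent: 50 and Pods: 2, each per 15s period) and
// selectPolicy: Max, the HPA may grow the workload per period to whichever
// target is larger.
func maxScaleUpTarget(current int32) int32 {
	byPercent := current + int32(math.Ceil(float64(current)*0.50))
	byPods := current + 2
	if byPercent > byPods {
		return byPercent
	}
	return byPods
}

func main() {
	fmt.Println(maxScaleUpTarget(3))  // 5 (both policies allow +2)
	fmt.Println(maxScaleUpTarget(10)) // 15 (the Percent policy allows +5)
}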

@nlu90
Contributor

nlu90 commented Aug 22, 2022

add spec.minReplicas to indicate the minimum number of replicas for the workloads
change spec.replicas to an optional field

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

@alperencelik

add spec.minReplicas to indicate the minimum number of replicas for the workloads
change spec.replicas to an optional field

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

This flow does not work with ArgoCD: the HPA can't scale when the StatefulSet has a replicas count set in the manifest. To work with ArgoCD, replicas should be left empty so the HPA can scale the StatefulSet. You can refer to this documentation for more detailed information.

@alperencelik

alperencelik commented Aug 22, 2022


It seems that setting autoScalingBehavior fixed the issue mentioned above, but, as you mentioned, all pods are still rescheduled whenever the HPA scales up or down. I know you have put a lot of effort into this work and I am grateful; I'm also very excited to see this feature in an upcoming release.

@tpiperatgod
Contributor Author

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

@nlu90
The targetRef of the HPA is the Function/Sink/Source, so when HPA is enabled, spec.replicas of the Function/Sink/Source is controlled by the HPA. So we need a new field that is not affected by the HPA, i.e. spec.minReplicas.

Also, the webhook will help the user maintain the rule minReplicas <= replicas <= maxReplicas and will set initial values for spec.replicas and spec.minReplicas when they are empty.
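
A hedged sketch of that invariant check; validateReplicaBounds is a hypothetical helper name, and the real webhook in this PR (e.g. its ValidateCreate/ValidateUpdate methods) may implement it differently:

import "fmt"

// validateReplicaBounds enforces minReplicas <= replicas <= maxReplicas when the
// fields are set; nil fields are left for the defaulting webhook to fill in.
func validateReplicaBounds(spec *v1alpha1.FunctionSpec) error {
	if spec.MinReplicas != nil && spec.Replicas != nil && *spec.Replicas < *spec.MinReplicas {
		return fmt.Errorf("spec.replicas (%d) must not be less than spec.minReplicas (%d)",
			*spec.Replicas, *spec.MinReplicas)
	}
	if spec.MaxReplicas != nil && spec.Replicas != nil && *spec.Replicas > *spec.MaxReplicas {
		return fmt.Errorf("spec.replicas (%d) must not be greater than spec.maxReplicas (%d)",
			*spec.Replicas, *spec.MaxReplicas)
	}
	return nil
}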

It seems that setting autoScalingBehavior fixed the issue mentioned above, but, as you mentioned, all pods are still rescheduled whenever the HPA scales up or down. I know you have put a lot of effort into this work and I am grateful; I'm also very excited to see this feature in an upcoming release.

@alperencelik
Thanks. By milestone, this change will be included in the mid-September release, v0.6.0.

Member

@freeznet freeznet left a comment

Overall LGTM; left one comment, PTAL. Thanks.

return ctrl.Result{}, nil
}

func (r *FunctionReconciler) checkIfFunctionGenerationsIsIncreased(function *v1alpha1.Function) bool {
Member

Seems like the cached functionGenerations will always treat the current generation as new when the controller restarts. Is that desired? Or can we move the observed generation into the function's status, e.g. add an ObservedGeneration field to FunctionStatus?

Member

Please refer to https://github.com/streamnative/pulsar-operators/pull/365 for more details about the ObservedGeneration idea.

Contributor Author

good idea, I will update it.

Contributor Author

done
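
A minimal sketch of the ObservedGeneration approach agreed on above, assuming a FunctionStatus.ObservedGeneration field and that the reconciler embeds a controller-runtime client; names are illustrative, imports (context, the v1alpha1 package) are elided, and the merged implementation may differ:

// Gate the reconcile on status.observedGeneration instead of an in-memory map,
// so the recorded state survives controller restarts.
func specChanged(function *v1alpha1.Function) bool {
	return function.Generation != function.Status.ObservedGeneration
}

// After the owned resources have been updated successfully, record the
// generation that was acted on.
func (r *FunctionReconciler) markObserved(ctx context.Context, function *v1alpha1.Function) error {
	function.Status.ObservedGeneration = function.Generation
	return r.Status().Update(ctx, function)
}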

@freeznet freeznet added type/enhancement Indicates an improvement to an existing feature component/controller m/2022-09 labels Aug 23, 2022
freeznet
freeznet previously approved these changes Aug 23, 2022
@armangurkan

We have deployed the code in this branch; this is the error message we get.

@tpiperatgod, somehow the issue is back. The functions cannot scale; in fact they get deleted as soon as they scale (we observe this in the kube-controller-manager logs, where every successful create is followed by a successful delete). The previous function-controller error is thrown frequently, and in the kube-scheduler we also see volume mounts constantly erroring out for the newly created pods, as follows:

E0827 00:04:44.373395       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.373456       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.389374       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 00:33:34.582859       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:33:34.582923       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:35:21.047823       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:35:21.047877       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643549       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643628       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 09:39:13.643661       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 09:39:13.659803       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:53:04.134059       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.134104       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.171206       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:56:33.602585       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.602650       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.637281       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 10:43:47.118786       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 10:43:47.118845       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 10:43:47.118864       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 10:43:47.151607       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 11:10:05.507240       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.507312       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.542283       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 16:50:17.789041       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 17:36:47.796489       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 23:07:54.580377       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:07:54.580431       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:07:54.580447       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:07:54.606340       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:19:56.126715       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.126767       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.133013       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-0.170f56f49cdd8c10" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0827 23:41:13.876819       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:13.876860       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:41:13.876873       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:41:13.878275       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:41:44.039050       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:44.039089       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242083       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242149       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.254107       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:06:50.788980       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.789034       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.801204       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:09:21.319015       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:21.319052       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834663       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834705       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.868887       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:15:37.074735       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.074813       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.076709       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:25:53.094399       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.094457       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.119481       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:47:17.269534       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 01:47:17.269590       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 03:41:43.951020       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.951061       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.969157       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 03:49:44.570421       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:49:44.570477       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845382       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845430       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.870361       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 04:40:50.285779       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.285822       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.298786       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 05:40:10.830900       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 05:40:10.830939       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 05:40:10.830951       1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 05:40:10.832292       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 08:08:26.497729       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.497986       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.506780       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 09:06:16.771165       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 09:06:16.771209       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929578       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929654       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.934027       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-3.170f7b3012f49f86" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0828 10:45:10.646185       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:45:10.646229       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037715       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037777       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.955955       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.956012       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.993038       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 12:14:02.968369       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:14:02.968418       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 12:14:02.968432       1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 12:14:02.969748       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 13:00:41.608169       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:00:41.608220       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:21:23.711091       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 13:21:23.711137       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279071       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279115       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962893       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962960       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"

Also please refer to the error event
(screenshot of the error event)

@tpiperatgod
Contributor Author

Hi @armangurkan, can you show the status of the HPA? I have tested it in my cluster and I found that the cause is still the HPA trigger condition.

And I don't think this PR solves the root cause; it only improves the stability of the reconciliation (which also depends on the function's HPA configuration).

@armangurkan

@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it into a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we would be happy to contribute as a team once we can get into the details of the project over Slack.

@tpiperatgod
Contributor Author

tpiperatgod commented Aug 29, 2022

@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it into a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we would be happy to contribute as a team once we can get into the details of the project over Slack.

Welcome! Please give me some time to find a suitable Slack channel.

You can also join the Apache Pulsar community here: https://pulsar.apache.org/community/

Comment on lines 239 to 244
if functionSpec.MaxReplicas != nil && condition.Status == metav1.ConditionTrue && condition.Action == v1alpha1.NoAction {
continue
}
Contributor Author

@armangurkan as mentioned, I added this logic to skip the reconciliation for functions/sinks/sources that are already controlled by the HPA.

freeznet
freeznet previously approved these changes Aug 31, 2022
jiangpengcheng
jiangpengcheng previously approved these changes Sep 5, 2022
nlu90
nlu90 previously approved these changes Sep 6, 2022
Contributor

@nlu90 nlu90 left a comment

You can merge after resolving the conflicts.

@tpiperatgod tpiperatgod merged commit 1a56afb into streamnative:master Sep 7, 2022
@tpiperatgod tpiperatgod deleted the issue-444 branch September 7, 2022 02:45
@Huanli-Meng Huanli-Meng added doc-added and removed doc-required This pr needs a document labels Sep 20, 2022
Labels
component/controller doc-added m/2022-09 type/enhancement Indicates an improvement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FunctionMesh Horizontal Scaling Collusion with Immutable Deployments
7 participants