
improve the stability of autoscaling when HPA is enabled #450

Merged — merged 7 commits into streamnative:master on Sep 7, 2022

Conversation

tpiperatgod
Contributor


Fixes #444


Motivation

When HPA is enabled, reconciliation is triggered far too often: status changes on the HPA and the StatefulSet retrigger it, and spec.replicas values written by the HPA conflict with values re-applied from manifests. This destabilizes autoscaling (see #444).

Modifications

  • add spec.minReplicas to indicate the minimum number of replicas for the workloads
  • change spec.replicas to an optional field
  • no longer watch statefulSet.status change events (see the event-filter sketch after this list)
  • tighten the conditions for deciding whether a resource needs to be created or updated, so that frequent updates without substantive changes are avoided
  • update CRDs (also in helm charts)
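
A minimal sketch of the event filtering mentioned above, assuming the controller is built with controller-runtime; the v1alpha1 import path and the exact predicate used in the merged code are assumptions, not the actual implementation:

package controllers

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	"github.com/streamnative/function-mesh/api/v1alpha1" // assumed import path
)

func (r *FunctionReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1alpha1.Function{}).
		// GenerationChangedPredicate only lets spec changes through
		// (metadata.generation is not bumped by status-only updates), so
		// statefulSet.status churn no longer retriggers reconciliation.
		Owns(&appsv1.StatefulSet{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}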

Verifying this change

  • Make sure that the change passes the CI checks.


Documentation

Need to update docs?

  • doc-required

@tpiperatgod tpiperatgod requested review from nlu90, freeznet and a team as code owners August 17, 2022 01:27
@github-actions github-actions bot added the doc-required This pr needs a document label Aug 17, 2022
r.Spec.MinReplicas = new(int32)
*r.Spec.MinReplicas = *r.Spec.Replicas
} else {
r.Spec.Replicas = new(int32)
Member

Will FM still hit the HPA error, since replicas gets set to 1 even when the user doesn't set it?

Contributor Author

In this PR, spec.replicas becomes optional, so the user does not need to configure it (e.g. in an Argo template) and only needs to configure spec.minReplicas as the minimum.

If spec.replicas is changed under HPA control, applying the template file again (without spec.replicas configured in it) will not trigger reconciliation.
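
An illustrative sketch of the defaulting described here; the field names follow the diff excerpt below, but the surrounding type and the exact defaulting rules in the merged webhook may differ:

func (r *Function) Default() {
	if r.Spec.Replicas != nil && r.Spec.MinReplicas == nil {
		// replicas was given explicitly: use it as the floor as well
		r.Spec.MinReplicas = new(int32)
		*r.Spec.MinReplicas = *r.Spec.Replicas
	} else if r.Spec.Replicas == nil {
		// replicas omitted (e.g. an Argo/GitOps template): start from minReplicas,
		// or 1 if that is also unset; afterwards the HPA is free to adjust replicas
		r.Spec.Replicas = new(int32)
		if r.Spec.MinReplicas != nil {
			*r.Spec.Replicas = *r.Spec.MinReplicas
		} else {
			*r.Spec.Replicas = 1
		}
	}
}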

Log logr.Logger
Scheme *runtime.Scheme
functionGenerations *sync.Map
isFunctionGenerationIncreased bool
Member

Looks like functionGenerations stores each function's generation separately in a map, while isFunctionGenerationIncreased is shared across all functions. Will this cause any conflicts?

And when the FM controller pod is recreated, this data will be lost; does that have any effect on the reconciliation?

Contributor Author

You're right, I need to recheck the isFunctionGenerationIncreased issue.

If the controller restarts, a routine reconciliation is performed and the Generations state of all resources is restored.

@tpiperatgod
Contributor Author

Simple case:

  1. Create a Function with a configuration like the one below:
spec:
  minReplicas: 1
  maxReplicas: 10
  pod:
    autoScalingMetrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 60
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
        selectPolicy: Max
  2. Before the first HPA period begins, the Function's spec.replicas is the same as spec.minReplicas and is passed to the StatefulSet's spec.replicas
  3. The HPA triggers autoscaling, for example scaling the Function to 2 replicas
  4. The Function's spec.replicas is changed to 2 by the HPA
  5. The increase of the Function's Generation in the reconciliation logic triggers an update of the StatefulSet, changing its spec.replicas to 2
  6. At this point, if you apply the Function again (using the configuration above), the Function does not trigger a new reconciliation process (see the generation-check sketch below)
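
A minimal sketch of the generation check referenced in the last step, using the functionGenerations sync.Map from the reconciler excerpt above; the real checkIfFunctionGenerationsIsIncreased in this PR may differ (it was later reworked around an ObservedGeneration status field, see the review thread further down), and the types import is assumed:

import "k8s.io/apimachinery/pkg/types"

func (r *FunctionReconciler) checkIfFunctionGenerationsIsIncreased(function *v1alpha1.Function) bool {
	key := types.NamespacedName{Namespace: function.Namespace, Name: function.Name}.String()
	last, ok := r.functionGenerations.Load(key)
	if !ok || last.(int64) < function.Generation {
		// metadata.generation only moves on spec changes, so this is a real update
		r.functionGenerations.Store(key, function.Generation)
		return true
	}
	// re-applying an identical manifest (or one without spec.replicas) leaves the
	// generation unchanged, so the StatefulSet is not touched again
	return false
}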

@armangurkan

(screenshot of the operator error message)
We have deployed the code in this branch; this is the error message we get.

@tpiperatgod
Contributor Author

We have deployed the code in this branch; this is the error message we get.

This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.

Can you describe which resources are currently not working as expected, and how? And can you paste the current configuration?

@armangurkan

This error means that the previous update has not been completed in this reconciliation. In most cases, this can be considered a warning.

Can you describe which resources are currently not working as expected, and how? And can you paste the current configuration?

What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods being shut down and re-created. At first we thought the function was actually processing data and being triggered every time there was a new message on the topic it was subscribed to, so to test this we shut down all the traffic, but the behavior persisted. I then checked the function-mesh operator logs and saw this error message.

@tpiperatgod
Contributor Author

What we experienced was that the StatefulSet replicas constantly dropped to 0, and we observed a constant cycle of pods being shut down and re-created. At first we thought the function was actually processing data and being triggered every time there was a new message on the topic it was subscribed to, so to test this we shut down all the traffic, but the behavior persisted. I then checked the function-mesh operator logs and saw this error message.

This is consistent with what I observed prior to fixing this issue, which was mainly caused by:

  1. FunctionMesh listens to too many events (HPA, StatefulSet), including events generated by changes to HPA.status and StatefulSet.status. These constantly trigger new reconciliation processes, and the changes made by each reconciliation in turn trigger further reconciliations...
  2. There is currently an issue with FunctionMesh: when a Function/Sink/Source is automatically scaled by the HPA, all pods are rescheduled every time, which causes the metrics (CPU/memory) to climb over time, so the HPA keeps changing the number of replicas of the Function/Sink/Source.

For reason 1, this PR filters out the events generated by StatefulSet.status and refines the reconciliation trigger conditions for FunctionMesh.

For reason 2, I suggest increasing the stabilizationWindowSeconds of the HPA; see https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#stabilization-window

Can you paste the configuration of spec.pod.autoScalingBehavior and spec.pod.autoScalingMetrics?

And for resources that don't need HPA, you can remove spec.maxReplicas.

@armangurkan

Can you paste the configuration of spec.pod.autoScalingBehavior and spec.pod.autoScalingMetrics?

And for resources that don't need HPA, you can remove spec.maxReplicas.

At this point all our functions and sinks need scaling in our function mesh.

        autoScalingMetrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 60

This is currently all we have, but we can define autoScalingBehavior if necessary. Do you have benchmarks in mind so we can take a shot at the scale-up and scale-down parameters?

@tpiperatgod
Contributor Author

tpiperatgod commented Aug 19, 2022

        autoScalingMetrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 60

This is currently all we have, but we can define autoScalingBehavior if necessary. Do you have benchmarks in mind so we can take a shot at the scale-up and scale-down parameters?

My preferred configuration is as follows.

  pod:
    autoScalingMetrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80 # I have observed that the cpu usage of the pod stays around 57% when idle
    autoScalingBehavior:
      scaleDown:
        stabilizationWindowSeconds: 120
        policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      scaleUp:
        stabilizationWindowSeconds: 120
        policies:
        - type: Percent
          value: 50
          periodSeconds: 15
        - type: Pods
          value: 2
          periodSeconds: 15
        selectPolicy: Max

By default, autoScalingBehavior.scaleUp.stabilizationWindowSeconds is 0, which means that as soon as the HPA's scaling algorithm meets a condition (e.g., CPU usage exceeds 80%), the HPA immediately increases the number of pods. Because of this issue, every time the number of replicas of a workload changes, all pods are rescheduled (i.e., rebuilt), which makes the CPU usage metric spike, which in turn keeps pushing up the HPA's desired replica count.

I suggest setting autoScalingBehavior.scaleUp.stabilizationWindowSeconds to 120 or another reasonable value, which makes the HPA less sensitive.

In addition, autoScalingBehavior.scaleDown.policies and autoScalingBehavior.scaleUp.policies can be used to control the magnitude of scaling.
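
As a rough illustration of how the two scaleUp policies in the configuration above combine under selectPolicy: Max (this mirrors the documented HPA semantics, not the HPA controller's actual code):

package main

import (
	"fmt"
	"math"
)

// With the policies above (Percent: 50 and Pods: 2, each per 15s period) and
// selectPolicy: Max, the HPA may grow the workload per period to whichever
// target is larger.
func maxScaleUpTarget(current int32) int32 {
	byPercent := current + int32(math.Ceil(float64(current)*0.50))
	byPods := current + 2
	if byPercent > byPods {
		return byPercent
	}
	return byPods
}

func main() {
	fmt.Println(maxScaleUpTarget(3))  // 5 (both policies allow +2)
	fmt.Println(maxScaleUpTarget(10)) // 15 (the Percent policy allows +5)
}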

@nlu90
Contributor

nlu90 commented Aug 22, 2022

add spec.minReplicas to indicate the minimum number of replicas for the workloads
change spec.replicas to an optional field

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

@alperencelik

add spec.minReplicas to indicate the minimum number of replicas for the workloads
change spec.replicas to an optional field

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

This flow does not work with ArgoCD: the HPA can't scale when the StatefulSet has a replicas count set in the manifest. To work with ArgoCD, replicas should be left empty so the HPA can scale the StatefulSet. You can refer to this documentation for more detailed information.

@alperencelik

alperencelik commented Aug 22, 2022


It seems that setting autoScalingBehavior fixed the issue mentioned above, but, as you mentioned, all pods are still rescheduled whenever the HPA scales up or down. I know you have put a lot of effort into this work and I am grateful; I'm also very excited to see this feature in an upcoming release.

@tpiperatgod
Contributor Author

why not use spec.replicas as the default minReplicas?

spec.replicas should serve as the initial value, and the following invariant must hold: "minReplicas <= replicas <= maxReplicas"

@nlu90
The targetRef of the HPA is the Function/Sink/Source, so when HPA is enabled, spec.replicas of the Function/Sink/Source is controlled by the HPA. So we need a new field that is not affected by the HPA, i.e. spec.minReplicas.

Also, the webhook will help the user maintain the rule minReplicas <= replicas <= maxReplicas and will set initial values for spec.replicas and spec.minReplicas when they are empty.
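
A hedged sketch of that invariant check; validateReplicaBounds is a hypothetical helper name, and the real webhook in this PR (e.g. its ValidateCreate/ValidateUpdate methods) may implement it differently:

import "fmt"

// validateReplicaBounds enforces minReplicas <= replicas <= maxReplicas when the
// fields are set; nil fields are left for the defaulting webhook to fill in.
func validateReplicaBounds(spec *v1alpha1.FunctionSpec) error {
	if spec.MinReplicas != nil && spec.Replicas != nil && *spec.Replicas < *spec.MinReplicas {
		return fmt.Errorf("spec.replicas (%d) must not be less than spec.minReplicas (%d)",
			*spec.Replicas, *spec.MinReplicas)
	}
	if spec.MaxReplicas != nil && spec.Replicas != nil && *spec.Replicas > *spec.MaxReplicas {
		return fmt.Errorf("spec.replicas (%d) must not be greater than spec.maxReplicas (%d)",
			*spec.Replicas, *spec.MaxReplicas)
	}
	return nil
}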

It seems that setting autoScalingBehavior fixed the issue mentioned above, but, as you mentioned, all pods are still rescheduled whenever the HPA scales up or down. I know you have put a lot of effort into this work and I am grateful; I'm also very excited to see this feature in an upcoming release.

@alperencelik
Thanks. By milestone, this change will be included in the mid-September release, v0.6.0.

Member

@freeznet freeznet left a comment

Overall LGTM; left one comment, PTAL. Thanks.

return ctrl.Result{}, nil
}

func (r *FunctionReconciler) checkIfFunctionGenerationsIsIncreased(function *v1alpha1.Function) bool {
Member

Seems like the cached functionGenerations will always treat the current generation as new when the controller restarts. Is that desired? Or can we move the observed generation into the function's status, e.g. add an ObservedGeneration field to FunctionStatus?

Member

Please refer to https://github.com/streamnative/pulsar-operators/pull/365 for more details about the ObservedGeneration idea.

Contributor Author

good idea, I will update it.

Contributor Author

done
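
A minimal sketch of the ObservedGeneration approach agreed on above, assuming a FunctionStatus.ObservedGeneration field and that the reconciler embeds a controller-runtime client; names are illustrative, imports (context, the v1alpha1 package) are elided, and the merged implementation may differ:

// Gate the reconcile on status.observedGeneration instead of an in-memory map,
// so the recorded state survives controller restarts.
func specChanged(function *v1alpha1.Function) bool {
	return function.Generation != function.Status.ObservedGeneration
}

// After the owned resources have been updated successfully, record the
// generation that was acted on.
func (r *FunctionReconciler) markObserved(ctx context.Context, function *v1alpha1.Function) error {
	function.Status.ObservedGeneration = function.Generation
	return r.Status().Update(ctx, function)
}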

@freeznet freeznet added type/enhancement Indicates an improvement to an existing feature component/controller m/2022-09 labels Aug 23, 2022
freeznet
freeznet previously approved these changes Aug 23, 2022
@armangurkan

We have deployed the code in this branch; this is the error message we get.

@tpiperatgod, somehow the issue is back. The functions cannot scale; in fact they get deleted as soon as they scale (we observe this in the kube-controller-manager logs, where every successful create is followed by a successful delete). The previous function-controller error is thrown frequently, and in the kube-scheduler we also see volume mounts constantly erroring out for the newly created pods, as follows:

E0827 00:04:44.373395       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.373456       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:04:44.389374       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 00:33:34.582859       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:33:34.582923       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0827 00:35:21.047823       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 00:35:21.047877       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643549       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:39:13.643628       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 09:39:13.643661       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 09:39:13.659803       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:53:04.134059       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.134104       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:53:04.171206       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 09:56:33.602585       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.602650       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 09:56:33.637281       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 10:43:47.118786       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 10:43:47.118845       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 10:43:47.118864       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 10:43:47.151607       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 11:10:05.507240       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.507312       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 11:10:05.542283       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 16:50:17.789041       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 17:36:47.796489       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, [invalid bearer token, serviceaccounts \"kube-prometheus-stack-in-c-prometheus\" not found]]"
E0827 23:07:54.580377       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:07:54.580431       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:07:54.580447       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:07:54.606340       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:19:56.126715       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.126767       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:19:56.133013       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-0.170f56f49cdd8c10" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0827 23:41:13.876819       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:13.876860       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
I0827 23:41:13.876873       1 factory.go:231] "Pod some-function-function-0\" not found"
E0827 23:41:13.878275       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0827 23:41:44.039050       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0827 23:41:44.039089       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242083       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.242149       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 00:46:03.254107       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:06:50.788980       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.789034       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:06:50.801204       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:09:21.319015       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:21.319052       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834663       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.834705       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:09:22.868887       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:15:37.074735       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.074813       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:15:37.076709       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:25:53.094399       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.094457       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 01:25:53.119481       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 01:47:17.269534       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 01:47:17.269590       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-1"
E0828 03:41:43.951020       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.951061       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 03:41:43.969157       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-0"
E0828 03:49:44.570421       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:49:44.570477       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845382       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.845430       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 03:51:29.870361       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 04:40:50.285779       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.285822       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 04:40:50.298786       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 05:40:10.830900       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 05:40:10.830939       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 05:40:10.830951       1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 05:40:10.832292       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 08:08:26.497729       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.497986       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 08:08:26.506780       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 09:06:16.771165       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 09:06:16.771209       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929578       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.929654       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:23:53.934027       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"fm-some-function-function-3.170f7b3012f49f86" is invalid: series.count: Invalid value: "": should be at least 2' (will not retry!)
E0828 10:45:10.646185       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 10:45:10.646229       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037715       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 11:23:59.037777       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.955955       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.956012       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:01:47.993038       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 12:14:02.968369       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 12:14:02.968418       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
I0828 12:14:02.968432       1 factory.go:231] "Pod some-function-function-3\" not found"
E0828 12:14:02.969748       1 scheduler.go:322] "Error updating pod" err="pods \"fm-some-function-function-3"
E0828 13:00:41.608169       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:00:41.608220       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-0"
E0828 13:21:23.711091       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 13:21:23.711137       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279071       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 15:14:18.279115       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962893       1 framework.go:1000] "Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"
E0828 16:03:36.962960       1 factory.go:225] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": Operation cannot be fulfilled on pods/binding \"fm-some-function-function-3"

Also please refer to the error event
(screenshot of the error event)

@tpiperatgod
Contributor Author

Hi @armangurkan, can you show the status of the HPA? I have tested it in my cluster and I found that the cause is still the HPA trigger condition.

And I don't think this PR solves the root cause; it only improves the stability of the reconciliation (which also depends on the function's HPA configuration).

@armangurkan

@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it into a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we would be happy to contribute as a team once we can get into the details of the project over Slack.

@tpiperatgod
Contributor Author

tpiperatgod commented Aug 29, 2022

@tpiperatgod The weird thing is that, as a result of the colliding deployments, the autoscaling worked for some reason, and I have tried every possible combination but cannot get it into a working state. Do you have a Slack channel you could invite me to? This feature is very important for us, and we would be happy to contribute as a team once we can get into the details of the project over Slack.

Welcome! Please give me some time to find a suitable Slack channel.

You can also join the Apache Pulsar community here: https://pulsar.apache.org/community/

Comment on lines 239 to 244
if functionSpec.MaxReplicas != nil && condition.Status == metav1.ConditionTrue && condition.Action == v1alpha1.NoAction {
continue
}
Contributor Author

@armangurkan as mentioned, I added this logic to skip the reconciliation for functions/sinks/sources that are already controlled by the HPA.

freeznet
freeznet previously approved these changes Aug 31, 2022
jiangpengcheng
jiangpengcheng previously approved these changes Sep 5, 2022
nlu90
nlu90 previously approved these changes Sep 6, 2022
Contributor

@nlu90 nlu90 left a comment

You can merge after resolving the conflicts.

@tpiperatgod tpiperatgod merged commit 1a56afb into streamnative:master Sep 7, 2022
@tpiperatgod tpiperatgod deleted the issue-444 branch September 7, 2022 02:45
@Huanli-Meng Huanli-Meng added doc-added and removed doc-required This pr needs a document labels Sep 20, 2022
Labels
component/controller doc-added m/2022-09 type/enhancement Indicates an improvement to an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FunctionMesh Horizontal Scaling Collusion with Immutable Deployments
7 participants