GRPC traffic is not load balanced correctly #142

Open
deenav opened this issue Mar 25, 2019 · 16 comments
Labels
kind/enhancement Enhancement of an existing feature priority/P2 Slight inconvenience or annoyance to applications, system continues to function status/needs-investigation Further investigation is required

Comments

@deenav

deenav commented Mar 25, 2019

While running a longevity test in a PKS cluster with a moderate IO (medium-scale) workload, I am observing time lapses in the controller logs whenever there is more than one controller in the PKS cluster.

Environment details: PKS / K8s with a medium cluster:

3 master: xlarge: 4 CPU, 16 GB Ram, 32 GB Disk
5 worker: 2xlarge: 8 CPU, 32 GB Ram, 64 GB Disk
Tier-1 storage is from VSAN datastore
Tier-2 storage carved out on the NFS Client provisioner using Isilon as the backend
Pravega version: zk-closed-client-issue-0.5.0-2161.60655bf
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: adrianmo/pravega-operator:issue-134-5

Steps to reproduce

  1. Deploy the Pravega operator and a Pravega cluster with a minimum of 2 controller instances on the PKS cluster.
  2. Start the longevity test with a moderate IO (medium-scale) workload and observe that both controllers are serving read/write requests from the longevity run. Use the kubectl logs -f po/<controller-pod-name> command to monitor the controller logs.
  3. Induce a restart of one controller using the kubectl exec -it <controller1-pod-name> reboot command.
  4. After the reboot, once the controller completes its initial configuration stages such as ControllerServiceStarter STARTING, it stops logging anything and does not serve read/write requests from the client.
  5. The restarted pod remains idle until the active controller pod hits some kind of error/event (I tested this by triggering a reboot on the active controller), which makes the idle controller resume its services.

Snippet of the time lapses in the controller log:

2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller event processors
2019-03-21 10:20:56,286 3023 [ControllerServiceStarter STARTING] INFO  i.p.c.s.ControllerServiceStarter - Awaiting start of controller cluster listener
2019-03-21 10:20:57,456 4193 [controllerpool-16] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders]
2019-03-21 10:20:57,456 4193 [controllerpool-19] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[scaleGroup]
2019-03-21 10:20:57,456 4193 [controllerpool-30] INFO  i.p.c.e.i.EventProcessorGroupImpl - Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[commitStreamReaders]
2019-03-21 11:53:37,489 5564226 [controllerpool-18] INFO  i.p.c.cluster.zkImpl.ClusterZKImpl - Node 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4 removed from cluster
2019-03-21 11:53:37,492 5564229 [controllerpool-18] INFO  i.p.c.f.ControllerClusterListener - Received controller cluster event: HOST_REMOVED for host: 172.24.4.6:9090:eff7c8dd-8794-4dc4-8dc0-c361cb2b2eb4
2019-03-21 11:53:37,494 5564231 [Curator-LeaderSelector-0] INFO  i.p.c.fault.SegmentMonitorLeader - Obtained leadership to monitor the Host to Segment Container Mapping

Problem location
My observation is that load balancing does not seem to happen properly when a restarted/failed controller resumes its operation and there is more than one controller in the cluster.

@deenav

deenav commented Mar 25, 2019

Attaching controller logs
controller-logs.zip

@sumit-bm sumit-bm added the kind/bug Something isn't working label Mar 25, 2019
@adrianmo

adrianmo commented Mar 25, 2019

@deenav can you share the client configuration? I would like to know how the Pravega endpoint is used in the client application. Thanks.

@deenav

deenav commented Mar 26, 2019

@adrianmo I was running the pravega-longevity medium-scale workload on my cluster using:

./tools/longevity_tests.py --image-repository=devops-repo.isus.emc.com:8116/pravega-longevity --image-tag=18.ed0bc5e-0.5.0-2151.2e8f0b7 --controller-uri tcp://10.100.200.126:9090 --test=medium-scale deploy

@adrianmo

@deenav please try to reproduce the error using the Controller service name instead of the IP address in the --controller-uri parameter.

@Tristan1900

I was able to reproduce a similar error yesterday, but it suddenly worked today and I have not been able to reproduce it again over the past few hours. It will be harder to debug if this error happens intermittently.
It does seem to be a load-balancing problem, as deenav suggests, because I have been looking into the controller logs and nothing suspicious shows up.
I was misled by the messages Notifying failure of process 172.24.4.7-e1620132-57fb-4e3f-a74b-4a359aca27c6 participating in reader group EventProcessorGroup[abortStreamReaders], but that failed process had already been handled by another controller; it isn't the cause of our problem.

@deenav

deenav commented Mar 29, 2019

@adrianmo It seems pravega-longevity does not work with the Controller service name as --controller-uri. It throws a Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list exception and does not start IOs.

Exception in thread "Thread-32" io.pravega.common.util.RetriesExhaustedException: java.util.concurrent.CompletionException: io.grpc.StatusRuntimeException: UNAVAILABLE: Name resolver io.pravega.client.stream.impl.ControllerResolverFactory$ControllerNameResolver@3f5a124d returned an empty list
        at io.pravega.common.util.Retry$RetryAndThrowBase.lambda$null$6(Retry.java:245)
        at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source)
        at java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
        at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
        at io.pravega.client.stream.impl.ControllerImpl$RPCAsyncCallback.onError(ControllerImpl.java:1016)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:694)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

@Raghuveera-Rao

Raghuveera-Rao commented May 7, 2019

One Controller pod is in an inactive state after deleting one of the controllers (out of 2).

Pravega version: 0.5.0-2236.5228e2d
Zookeeper Operator : pravega/zookeeper-operator:0.2.1
Pravega Operator: pravega/pravega-operator:0.3.2

If you delete the controller pod that is running properly, then the other controller (which was stuck) starts to take the new I/O.

@Raghuveera-Rao

contr2_1.log

@adrianmo

adrianmo commented May 30, 2019

After investigating, I'm quite sure that the following is happening:

1- When the longevity process starts, it establishes multiple gRPC (HTTP/2) connections to the Controller service, which are forwarded to the backend pods in a round-robin fashion. Therefore, some connections are established with Controller 1, and others with Controller 2.
2- When you restart Controller 1, all the connections to that pod are dropped and re-established through the service, which forwards them to the only available pod, Controller 2. At this point, Controller 2 is handling all connections from the longevity process.
3- When Controller 1 is up and running again, it does not receive any traffic because all connections are already established with Controller 2. This is why Controller 1 doesn't show any activity and seems idle.
4- Eventually, when Controller 2 faces an issue or is deleted, connections are re-established with the service and forwarded to Controller 1. Then you start seeing activity again.

This happens because HTTP/2 multiplexes requests into a single TCP connection, as opposed to HTTP/1, which establishes a connection for each request.
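To make points 1-3 concrete, here is a minimal grpc-java sketch (this is not Pravega's actual client code, and the service name below is an assumption) of why a channel pointed at the Service address sticks to a single pod:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class SingleChannelSketch {
    public static void main(String[] args) {
        // Assumed service name for illustration; the longevity test in this issue
        // actually used --controller-uri tcp://10.100.200.126:9090.
        // With grpc-java's default pick_first policy, this channel opens a single
        // TCP/HTTP2 connection and multiplexes every RPC over it. kube-proxy only
        // balances at connection-establishment time (L4), so all RPCs keep hitting
        // whichever Controller pod accepted the connection, even after a restarted
        // pod rejoins the Service endpoints.
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("pravega-pravega-controller:9090")
                .usePlaintext()
                .build();

        // All stubs created on this channel share that one connection.
        channel.shutdownNow();
    }
}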

Related documents:

@adrianmo adrianmo changed the title time lapses between log lines seen in the controller logs when there is more than one controller in the PKS cluster GRPC traffic is not load balanced correctly May 30, 2019
@adrianmo adrianmo added priority/P1 Recoverable error, functionality/performance impaired but not lost, no permanent damage status/needs-investigation Further investigation is required labels May 30, 2019
@adrianmo adrianmo added priority/P2 Slight inconvenience or annoyance to applications, system continues to function and removed priority/P1 Recoverable error, functionality/performance impaired but not lost, no permanent damage labels May 30, 2019
@pbelgundi pbelgundi added the version 0.3.2 This is an issue with operator version 0.3.2 label Jul 2, 2019
@pbelgundi pbelgundi self-assigned this Jul 2, 2019
@pbelgundi

pbelgundi commented Jul 3, 2019

Based on the current operator/controller code, the behavior exhibited by the controller is the expected behavior. With 2 controllers, say 'a' and 'b', if 'a' goes down, all new client connections are redirected by the load balancer to controller 'b'. Later, when 'a' comes up, since the connections to 'b' are persistent, there is no way to automatically 'shift' them from 'b' to 'a'. Currently, no attempt to do this is made by the operator/controller, by design.

If such load balancing is a requirement, please file a new feature request for the same.

@ravisharda

ravisharda commented Jul 3, 2019

@pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.

The traffic to Controller pods in Pravega Operator-managed Kubernetes deployments is not load balanced due to the choice of load-balancing mechanism used in the Operator: a Service of type LoadBalancer, which represents a layer 4 load balancer (L4 LB).

An L4 LB cannot load balance gRPC traffic, as pointed out by the references listed by @adrianmo above. Load balancing gRPC traffic requires request-level load balancing, which is supported by some Ingress controllers such as ingress-nginx (which in turn spin up application-level/layer 7 load balancers).

Whether to use sticky sessions or load balance across all nodes is usually a policy decision represented as rules/configuration in a load balancer. In this case, the choice of the load balancer itself has restricted the options.

It is not by design that gRPC load balancing was left out; it is just that the problem was realized later. @adrianmo and I discussed this at length some time back, and he did some investigation and posted his findings above.

I just want to share the context. In my view too, this is an enhancement.
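For context, one request-level alternative to an L7 proxy is gRPC client-side load balancing: point the channel at a headless Service so DNS resolves to every Controller pod IP, and select the round_robin policy. A minimal grpc-java sketch, assuming a headless Service named pravega-pravega-controller-headless (not something the operator creates today):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class ClientSideRoundRobinSketch {
    public static void main(String[] args) {
        // The headless Service name below is an assumption. With clusterIP: None,
        // the DNS name resolves to all Controller pod IPs instead of a single
        // ClusterIP, and the round_robin policy spreads RPCs across every resolved
        // address, picking up pods that rejoin after a restart on re-resolution.
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("dns:///pravega-pravega-controller-headless:9090")
                .defaultLoadBalancingPolicy("round_robin")
                .usePlaintext()
                .build();

        channel.shutdownNow();
    }
}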

@pbelgundi

@pbelgundi the behavior you are describing represents "failover" for High Availability. The problem here is w.r.t. load balancing gRPC traffic across the Controller nodes.

No. What I have described is not about "failover"; that is working as expected.
This issue is about uneven load balancing post-"failover".
The current operator takes care of load balancing correctly in a normal (non-failover) situation.

There surely are better ways to do gRPC load balancing so that it works post-"failover" as well. It's just that the current operator does not attempt to do it, and this needs to be taken up as a feature/enhancement.

@RaulGracia

@pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and it is probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for now (relabeling it), or to replace it with another one, so that the knowledge about the problem accumulated so far is preserved instead of just being closed out.

@pbelgundi

@pbelgundi I'm not sure that just closing this issue is the right thing to do. You are right, this is an enhancement to be done, and it is probably not for 0.3.2. What's more, it is still unclear whether this needs to be fixed in the operator or in Pravega. But it would be great to keep this issue open for now (relabeling it), or to replace it with another one, so that the knowledge about the problem accumulated so far is preserved instead of just being closed out.

I'm fine with re-labelling too. This would definitely require a fix in the operator.

@pbelgundi pbelgundi reopened this Jul 3, 2019
@shiveshr

shiveshr commented Jul 3, 2019

Right now, by virtue of using gRPC, we load balance requests on a per-client basis. This follows from the design choice of using gRPC, which uses HTTP/2 as its transport and multiplexes multiple requests over a persistent connection.
So this design choice has the limitation that once a client connects to a controller, it is a sticky connection until that connection is disrupted.

As an improvement, we should definitely load balance the "requests", and not just the connections, across the available controller instances. But that will require a broader discussion. We can keep this issue and hold that discussion in this context, but we should move it to 0.6 as a new feature.
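As a sketch of one possible mitigation (not something the Controller does today, only an illustration of the grpc-java knobs involved), the server could cap connection lifetime so that clients periodically reconnect through the Service and get redistributed across whichever controller pods are currently available:

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class MaxConnectionAgeSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // maxConnectionAge forces clients to re-establish their connections
        // periodically, sending them back through the (L4) Service and onto the
        // pods that are currently up; maxConnectionAgeGrace lets in-flight RPCs
        // finish before the old connection is closed. The values are arbitrary.
        Server server = NettyServerBuilder.forPort(9090)
                .maxConnectionAge(5, TimeUnit.MINUTES)
                .maxConnectionAgeGrace(30, TimeUnit.SECONDS)
                // .addService(...) // the Controller's gRPC services would be registered here
                .build()
                .start();
        server.awaitTermination();
    }
}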

@shiveshr shiveshr added kind/enhancement Enhancement of an existing feature and removed kind/bug Something isn't working labels Jul 3, 2019
@shiveshr shiveshr added version 0.4.0 Issue with Operator 0.4.0 and removed version 0.3.2 This is an issue with operator version 0.3.2 labels Jul 3, 2019
@pbelgundi pbelgundi removed the version 0.4.0 Issue with Operator 0.4.0 label May 6, 2020
@pbelgundi pbelgundi removed their assignment May 6, 2020