
Identify slow points in invocation path #465

Open · jd0-sag opened this issue Jan 31, 2017 · 5 comments

jd0-sag commented Jan 31, 2017

We need to improve the overall performance of the system (especially in the case of multi-server stripes). This requires measuring the current cost of the invocation path so we have some guidance on where to drill down further.

For these initial numbers, I created a single-client test which sends a sequence of 10000 invokes that do effectively nothing (they manually request replication; implicit replication is disabled). For each invoke, I measured the time from just before the send until the RECEIVED ack returned, and until the COMPLETED or RETIRED ack returned (10000 runs of each variant, interleaved to avoid skew due to trends over time). This was then tested on a single-server stripe versus a 2-server stripe. NOTE: client and server(s) run on the same machine (this avoids network overhead and the test isn't CPU heavy, so it is just a measure of total path length).
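For reference, the timing loop has roughly the following shape. This is only a sketch, not the actual test code: the InvokeHandle/TestEntity types below are hypothetical stand-ins for whatever the real client API exposes for the per-ack futures.

   import java.util.concurrent.Future;
   import java.util.concurrent.TimeUnit;

   public class AckLatencyBench {
     // Hypothetical stand-in for the real client API: send one no-op invoke and
     // expose futures for the RECEIVED ack and the final (COMPLETED or RETIRED) ack.
     interface InvokeHandle {
       Future<Void> receivedAck();
       Future<Void> finalAck();
     }

     interface TestEntity {
       InvokeHandle invokeNoOp();
     }

     static final int RUNS = 10_000;

     // Returns { average send->RECEIVED, average send->COMPLETED/RETIRED } in nanoseconds.
     static long[] measure(TestEntity entity) throws Exception {
       long receivedTotal = 0;
       long finalTotal = 0;
       for (int i = 0; i < RUNS; i++) {
         long start = System.nanoTime();
         InvokeHandle handle = entity.invokeNoOp();
         handle.receivedAck().get(30, TimeUnit.SECONDS);
         receivedTotal += System.nanoTime() - start;
         handle.finalAck().get(30, TimeUnit.SECONDS);
         finalTotal += System.nanoTime() - start;
       }
       return new long[] { receivedTotal / RUNS, finalTotal / RUNS };
     }
   }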

Each run produced an average of the various ack timings for all configurations. I ran 3 sets and averaged the corresponding results (there was relatively little variability between them).

Single-server:

  • RECEIVED to COMPLETED: 25% longer
  • RECEIVED to RETIRED: 54% longer

2-server:

  • RECEIVED to COMPLETED: 77% longer
  • RECEIVED to RETIRED: 177% longer

Overall cost of adding a passive:

  • RECEIVED when waiting for COMPLETED: 9% longer
  • RECEIVED when waiting for RETIRED: 13% longer
  • COMPLETED: 55% longer
  • RETIRED: 104% longer

Overall cost of waiting for RETIRED (above just waiting for COMPLETED):

  • versus single-server RECEIVED: 5%
  • versus single-server COMPLETED: 30%
  • versus 2-server RECEIVED: 9%
  • versus 2-server COMPLETED: 70%

Oddities:

  • In all runs, waiting for RETIRED instead of COMPLETED made the time until RECEIVED arrived take longer, which wasn't expected.
  • If the tests aren't interleaved, the set which runs later performs far better. This remained true even after introducing a "warming" run before the collection runs; I had expected the warming to minimize such an effect, but it didn't.

Consistencies with expectations:

  • Waiting for RETIRED when a passive is present effectively doubles the path length (since the manual replication means the passive can only start after the active has mostly run the message), so the 70% overhead is believable. The single-server cost of 30% is believable since it only adds another invoke, not another network round-trip.
  • Waiting for RECEIVED is roughly 10% more expensive when a passive is added (since this means waiting for the message to be replicated).

Next steps in investigation:

  • Determine why waiting for RETIRED versus COMPLETED changes how long it takes until the RECEIVED arrives (this should be purely client-side).
  • Determine why adding a passive increases COMPLETED cost so much more than RECEIVED (since they should both be effectively no-ops with a network message).
jd0-sag self-assigned this Jan 31, 2017

jd0-sag commented Jan 31, 2017

@myronkscott suspects that waiting for the RETIRED slows down the RECEIVED because of client<->active batching: sending the next message only after COMPLETED makes it more likely that its RECEIVED and the previous message's RETIRED arrive in the same batch, which saves one blocking call across the stacks.

This also implies that we should investigate if the client<->server batching can benefit any further from some of the ideas applied in our recent active<->passive batching.
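To make the batching effect concrete, here is an illustrative sketch (not the actual TC transport code) of why two acks that land in the same batch cost one write instead of two; the real client<->server batching lives in the messaging layer.

   import java.util.ArrayList;
   import java.util.List;

   // Illustrative only: acks queued before the next flush ride together in a
   // single network write, so a RETIRED that lands in the same batch as the
   // next RECEIVED does not cost an extra blocking call.
   final class AckBatcher {
     private final List<String> pending = new ArrayList<String>();

     synchronized void enqueue(String ack) {
       pending.add(ack);
     }

     // Called by the flush thread: one write per flush, however many acks are queued.
     synchronized List<String> drainForSingleWrite() {
       List<String> batch = new ArrayList<String>(pending);
       pending.clear();
       return batch;
     }
   }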

jd0-sag commented Jan 31, 2017

The previous results should be taken with a grain of salt since I realized that they were running on concurrency key 0. I changed the test to use key 1 and got numbers which look somewhat better (running on key 0 causes a lot of pipeline flush messages, which cloud the picture).
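For context, the change amounts to routing the test invokes to a fixed non-zero concurrency key in the entity's server-side ConcurrencyStrategy. A minimal sketch (interface names from the org.terracotta.entity API; exact signatures and key constants may differ by version):

   import java.util.Collections;
   import java.util.Set;

   import org.terracotta.entity.ConcurrencyStrategy;
   import org.terracotta.entity.EntityMessage;

   // Sketch: route every test invoke to concurrency key 1 instead of key 0,
   // so the no-op invokes no longer trigger pipeline flushes.
   class FixedKeyConcurrencyStrategy<M extends EntityMessage> implements ConcurrencyStrategy<M> {
     @Override
     public int concurrencyKey(M message) {
       return 1;  // the first round of numbers was effectively measured on key 0
     }

     @Override
     public Set<Integer> getKeysForSynchronization() {
       return Collections.singleton(1);
     }
   }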

Single-server:

  • RECEIVED to COMPLETED: 17% longer
  • RECEIVED to RETIRED: 32% longer

2-server:

  • RECEIVED to COMPLETED: 74% longer
  • RECEIVED to RETIRED: 86% longer

Overall cost of adding a passive:

  • RECEIVED when waiting for COMPLETED: 19% longer
  • RECEIVED when waiting for RETIRED: 18% longer
  • COMPLETED: 77% longer
  • RETIRED: 67% longer

Overall cost of waiting for RETIRED (above just waiting for COMPLETED):

  • versus single-server RECEIVED: 3%
  • versus single-server COMPLETED: 16%
  • versus 2-server RECEIVED: 2%
  • versus 2-server COMPLETED: 9%

The main question still remains: Why does adding a passive increase COMPLETED cost so much more than RECEIVED (since they should both be effectively no-ops with a network message)?

jd0-sag commented Feb 1, 2017

A clear hot-point is ManagedEntityImpl.scheduleInOrder(). In my tests, adding a passive roughly doubles the cost of this method, which already accounts for roughly 90% of the time spent handling a message within the ProcessTransactionHandler thread.

There are also some message decoding operations around there but that is a secondary investigation.

@myronkscott suspects that the added cost in this method is related to creating the replicated message (per-passive). If so, is it possible to push this off to the batching mechanism for the group channel or the RequestProcessor? (sending early is preferable so the batching mechanism is probably the best bet)
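If that is the case, one possible shape for the fix (a pure sketch with hypothetical types, not existing code) is to enqueue a cheap descriptor on the ProcessTransactionHandler thread and let the batching/flush thread do the per-passive message construction:

   import java.util.Set;
   import java.util.concurrent.ConcurrentLinkedQueue;

   // Hypothetical sketch: move the per-passive replication-message creation off
   // the ProcessTransactionHandler thread and into the thread that flushes the
   // group-channel batch.
   final class DeferredReplicationQueue {

     // Whatever minimally identifies the activity to replicate.
     static final class PendingReplication {
       final long transactionID;
       final byte[] payload;
       final Set<String> passives;
       PendingReplication(long transactionID, byte[] payload, Set<String> passives) {
         this.transactionID = transactionID;
         this.payload = payload;
         this.passives = passives;
       }
     }

     interface ReplicationChannel {
       void send(String passiveNode, byte[] encodedMessage);
     }

     private final ConcurrentLinkedQueue<PendingReplication> queue = new ConcurrentLinkedQueue<PendingReplication>();

     // Cheap: called on the ProcessTransactionHandler thread.
     void submit(PendingReplication pending) {
       queue.add(pending);
     }

     // Expensive part happens here: called on the batching/flush thread.
     void flush(ReplicationChannel channel) {
       PendingReplication pending;
       while ((pending = queue.poll()) != null) {
         for (String passive : pending.passives) {
           channel.send(passive, encodeFor(passive, pending));
         }
       }
     }

     private byte[] encodeFor(String passive, PendingReplication pending) {
       // Stand-in for building the real per-passive replication message.
       return pending.payload;
     }
   }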

Note that the lion's share of the path is most likely in the RequestProcessor (since we seem to have some waits there which can likely be eliminated or changed). This is just an anomaly I noticed while starting to dig into the path. Additionally, ProcessTransactionHandler is the critical path, so any small win there should improve overall scalability.

jd0-sag commented Feb 1, 2017

It seems like the expensive part is in RequestProcessor.scheduleRequest() (still in the ProcessTransactionHandler thread). 71% of that time is spent in ActiveToPassiveReplication.replicateActivity() (which is effectively a no-op in the active-only case).

Digging further into that, we spend 66% of that time in ReplicationSender.replicateMessage().

This makes sense, and we might want to look into optimizing that method, at some point, but this is so many levels deep that it is a very thin sliver of the total time. I am going to consider this part of the investigation done with this conclusion:

Investigate optimization of ReplicationSender.replicateMessage()

On to the next part of this investigation, which is looking at activity on the RequestProcessor thread. In this case, the hot-point can sometimes be seen directly in a profiler (observed in Java Mission Control): ActivePassiveAckWaiter.waitForReceived().

This is the problem I had initially suspected was still somewhere in the stack: blocking active activity on passive replication for non-lifecycle operations.

Lifecycle operations (specifically CREATE or RECONFIGURE) require consensus across the stripe but normal INVOKE calls do not. This means that active-side execution, for invokes, never needs to block on anything related to the passive.

The only thing which needs to block on the passive acknowledgements, in the invoke case, is the sending of the same acks back to the client. That is, we don't need to block on passive RECEIVED before running the active invoke but we do need to wait for the passive RECEIVED before we can send the RECEIVED to the client.

To illustrate this further, it is perfectly acceptable for the active or the passive to be any number of invokes ahead of the rest of the stripe so long as the client isn't told about RECEIVED until all the servers have received or told about COMPLETED until all the servers have completed.

RETIRED complicates this a bit, since that is active-side decision-making, but it should be possible for the active to optimistically decide that an invoke is retired, in response to its own completion, so long as it doesn't send the RETIRED to the client before the matching COMPLETED that the retirement depends on comes back from the passive.

Actually implementing these ideas will be a bit tricky, and potentially ugly, since the thread which delivers the last required ack will be the one which sends the ack message back to the client. The active->client message batching will hopefully mitigate this (so that we still have a specific thread doing the serialize+send) and, if not, we should be able to modify it to do so.
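As a rough sketch of that non-blocking shape (the types here are hypothetical, not the existing ActivePassiveAckWaiter or client-batch APIs): each ack level keeps a count of outstanding passives, and whichever thread delivers the last passive ack enqueues the client-facing ack into the outgoing batch instead of waking a parked RequestProcessor thread.

   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical sketch of the callback-based alternative to blocking in
   // ActivePassiveAckWaiter.waitForReceived()/waitForCompleted().
   final class NonBlockingAckRelay {

     interface ClientAckBatch {
       void enqueueReceived(long transactionID);
       void enqueueCompleted(long transactionID);
     }

     private final long transactionID;
     private final ClientAckBatch clientBatch;
     private final AtomicInteger outstandingReceived;
     private final AtomicInteger outstandingCompleted;

     // (The active-only case, passiveCount == 0, would enqueue immediately and is omitted here.)
     NonBlockingAckRelay(long transactionID, int passiveCount, ClientAckBatch clientBatch) {
       this.transactionID = transactionID;
       this.clientBatch = clientBatch;
       this.outstandingReceived = new AtomicInteger(passiveCount);
       this.outstandingCompleted = new AtomicInteger(passiveCount);
     }

     // Called from whatever thread delivers a passive's RECEIVED ack; no thread is blocked.
     void passiveReceived() {
       if (outstandingReceived.decrementAndGet() == 0) {
         clientBatch.enqueueReceived(transactionID);
       }
     }

     // Called from whatever thread delivers a passive's COMPLETED ack.
     void passiveCompleted() {
       if (outstandingCompleted.decrementAndGet() == 0) {
         clientBatch.enqueueCompleted(transactionID);
       }
     }
   }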

jd0-sag commented Feb 1, 2017

Looking at the profile, we do often see stacks where the RequestProcessor is waiting for passive acks, both RECEIVED and COMPLETED:

RECEIVED

   com.tc.objectserver.entity.ActivePassiveAckWaiter.waitForReceived line: 57 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.waitForPassives line: 1048 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.access$1500 line: 976 
   com.tc.objectserver.entity.ManagedEntityImpl$ResultCapture.complete line: 1201 
   com.tc.objectserver.entity.ManagedEntityImpl.performAction line: 718 
   com.tc.objectserver.entity.ManagedEntityImpl.invoke line: 473 
   com.tc.objectserver.entity.ManagedEntityImpl.lambda$processInvokeRequest$2 line: 309 
   com.tc.objectserver.entity.ManagedEntityImpl$$Lambda$42/500080127.run line: not available 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.run line: 1025 
   com.tc.objectserver.entity.RequestProcessor$EntityRequest.invoke line: 170 
   com.tc.objectserver.entity.RequestProcessor$EntityRequest.run line: 161 
   com.tc.objectserver.entity.RequestProcessorHandler.handleEvent line: 27 
   com.tc.objectserver.entity.RequestProcessorHandler.handleEvent line: 23 
   com.tc.async.impl.StageQueueImpl$HandledContext.runWithHandler line: 502 
   com.tc.async.impl.StageImpl$WorkerThread.run line: 197 

COMPLETED

   com.tc.objectserver.entity.ActivePassiveAckWaiter.waitForCompleted line: 67 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.waitForPassives line: 1046 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.access$1500 line: 976 
   com.tc.objectserver.entity.ManagedEntityImpl$ResultCapture.complete line: 1201 
   com.tc.objectserver.entity.ManagedEntityImpl.performAction line: 720 
   com.tc.objectserver.entity.ManagedEntityImpl.invoke line: 473 
   com.tc.objectserver.entity.ManagedEntityImpl.lambda$processInvokeRequest$2 line: 309 
   com.tc.objectserver.entity.ManagedEntityImpl$$Lambda$42/500080127.run line: not available 
   com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.run line: 1025 
   com.tc.objectserver.entity.RequestProcessor$EntityRequest.invoke line: 170 
   com.tc.objectserver.entity.RequestProcessor$EntityRequest.run line: 161 
   com.tc.objectserver.entity.RequestProcessorHandler.handleEvent line: 27 
   com.tc.objectserver.entity.RequestProcessorHandler.handleEvent line: 23 
   com.tc.async.impl.StageQueueImpl$HandledContext.runWithHandler line: 502 
   com.tc.async.impl.StageImpl$WorkerThread.run line: 197 

What is interesting is that both of these are called from ManagedEntityImpl.performAction(): if we want to run the invoke on the active, we invoke and then wait for RECEIVED. If we don't, then we only wait for COMPLETED. In either case, the call then eventually makes it to the point where it adds the response to the outgoing batch for the client.

This is the extraneous blocking mentioned in the previous comment (we don't need to wait for the COMPLETED at all, just build the right dependent action for when to relay the COMPLETED, and perhaps the RETIRED, on to the client), but it also seems incorrect as it stands: we should be waiting for COMPLETED in both cases.
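A sketch of that "dependent action" (again hypothetical types, not the current ResultCapture code): the client-facing COMPLETED is released by whichever event arrives last, the active's own completion or the final passive COMPLETED, with no thread parked in between.

   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical sketch of the dependent action gating the client-facing COMPLETED
   // (and, by extension, RETIRED) on both the active's own completion and the
   // passives' COMPLETED acks.
   final class CompletedGate {

     private final AtomicInteger remaining;          // 1 for the active + one per passive
     private final Runnable relayCompletedToClient;  // e.g. add the ack to the outgoing client batch

     CompletedGate(int passiveCount, Runnable relayCompletedToClient) {
       this.remaining = new AtomicInteger(1 + passiveCount);
       this.relayCompletedToClient = relayCompletedToClient;
     }

     void activeCompleted() {
       release();
     }

     void passiveCompleted() {
       release();
     }

     private void release() {
       // Runs on whichever thread delivered the last event; nothing ever blocks.
       if (remaining.decrementAndGet() == 0) {
         relayCompletedToClient.run();
       }
     }
   }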
