-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spool] Actual implementation #14507
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14507 +/- ##
============================================
+ Coverage 61.75% 64.02% +2.27%
- Complexity 207 1602 +1395
============================================
Files 2436 2707 +271
Lines 133233 149187 +15954
Branches 20636 22860 +2224
============================================
+ Hits 82274 95519 +13245
- Misses 44911 46656 +1745
- Partials 6048 7012 +964
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
rename some test methods Co-authored-by: Yash Mayya <[email protected]>
Explain implementation is working, but not actual queries.
I think this PR is ready to review |
Map<Integer, QueryServerInstance> receiverServerMap = receiverMetadata.getWorkerIdToServerInstanceMap(); | ||
Map<Integer, Map<Integer, MailboxInfos>> senderMailboxesMap = senderMetadata.getWorkerIdToMailboxesMap(); | ||
Map<Integer, Map<Integer, MailboxInfos>> receiverMailboxesMap = receiverMetadata.getWorkerIdToMailboxesMap(); | ||
for (Integer receiverStageId : sendNode.getReceiverStageIds()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend to review this class with the hide whitespace
option in GH
if (protoReceiverIds == null || protoReceiverIds.isEmpty()) { | ||
// This should only happen if a not updated broker sends the request | ||
receiverIds = List.of(protoMailboxSendNode.getReceiverStageId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and the change in proto) is what keeps backward compatibility
|
||
@Override | ||
public String toString() { | ||
return "g" + _id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some toString to mailboxes to make it easier to understand them in debug/trace logs
if (LOGGER.isTraceEnabled()) { | ||
LOGGER.trace("Sending EOS metadata. Only mailbox #{} will get stats", mailboxIdToSendMetadata); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember that this verbose way to log (first check then log) is used to avoid the unnecessary varargs array allocation in LOGGER.trace(string, object...)
when the log is not going to be printed. Given this is the hotpath, we are extra careful trying to not include extra allocations.
*/ | ||
public boolean send(TransferableBlock block) | ||
throws Exception { | ||
throws IOException, TimeoutException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is more precise and IIRC we need to limit the scope of the exception in order to create the BlockExchangeSendingMailbox
Improve PR description |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be very useful to have a couple of integration tests for some basic spool use cases (verifying both the execution output as well as the plan potentially)?
@@ -207,13 +207,17 @@ protected TransferableBlock getNextBlock() | |||
buildBroadcastHashTable(); | |||
} | |||
if (_upstreamErrorBlock != null) { | |||
LOGGER.trace("Returning upstream error block for join operator"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all the changes in this file unrelated to spools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, all the changes are just new logs I've added while debuging some blocking issues. These logs are disabled by default and should be easily removed by either the jit given they are not allocating. I recommend to keep them in case we need to debug something in the future.
pinot-query-planner/src/main/java/org/apache/pinot/query/QueryEnvironment.java
Show resolved
Hide resolved
...planner/src/main/java/org/apache/pinot/query/planner/explain/PhysicalExplainPlanVisitor.java
Show resolved
Hide resolved
...y-planner/src/main/java/org/apache/pinot/query/planner/physical/DispatchablePlanVisitor.java
Show resolved
Hide resolved
...t-query-planner/src/main/java/org/apache/pinot/query/planner/serde/PlanNodeDeserializer.java
Show resolved
Hide resolved
pinot-query-runtime/src/main/java/org/apache/pinot/query/mailbox/InMemorySendingMailbox.java
Show resolved
Hide resolved
if (context.isJoinStage(receiverStageId)) { | ||
sendsToJoin = true; | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're only checking if at least one receiver stage is a join stage here?
LOGGER.debug("==[RECEIVE]== EOS received : " + _id + " in mailbox: " + removed.getId() | ||
+ " (" + _mailboxes.size() + " mailboxes alive)"); | ||
+ " (" + ids + " mailboxes alive)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+ " (" + ids + " mailboxes alive)"); | |
+ " (Mailboxes alive: " + ids + ")"); |
...query-runtime/src/main/java/org/apache/pinot/query/runtime/operator/MailboxSendOperator.java
Show resolved
Hide resolved
...ry-runtime/src/main/java/org/apache/pinot/query/runtime/operator/exchange/BlockExchange.java
Show resolved
Hide resolved
...ry-runtime/src/main/java/org/apache/pinot/query/runtime/operator/exchange/BlockExchange.java
Show resolved
Hide resolved
...ry-runtime/src/main/java/org/apache/pinot/query/runtime/operator/exchange/BlockExchange.java
Show resolved
Hide resolved
We have some tests that verify that, but in different files. For example:
TBH there is no test verifying with the plan that spools are used AND the result is correct (ie comparing with h2), although the query in these two cases is the same. |
This PR is a continuation of #14495 and the next step on #14196.
While previous PRs were focused the machinery to find the common stages and how to replace them in the plan, this one is focused on sending these plans to the servers and how servers deal with them.
The main change is that now send operator may need to send to multiple stages. That implies a change in the protocol, which I made backward compatible by deprecating the old attribute and adding a new one that is repeated. A deprecated attribute can still be used and reserves the id in protobuf. Servers still recognize the old attribute (required in case a new server talks with an old broker) but they first try to read the new attribute (which will be null/empty if sent by an old broker) and in case it is empty, use the old attribute.
To understand the change in the operators is important to understand the topology used by BlockExchange and Mailbox. Mailboxes receive blocks and send them to exactly one worker. They could send the block through memory or GRPC. BlockExchanges are the ones that decide how to distribute the block between different mailboxes. For example BroadcastExchange send the block to all Mailboxes while RandomExchange sends the block to a random mailbox.
With the multi-sender operators we need to send to broadcast the blocks to multiple stages, which they may need to apply another distribution for each worker. What this PR does is to create a new layer of BlockExchanges when a sender is multi-sender. The root of this layer is always a standard
BroadcastExchange
, but it sends to a new type ofSendingMailbox
: ABlockExchangeSendingMailbox
. This new mailbox is just an adaptor pattern that wraps aBlockExchange
and shows it as aSendingMailbox
. TheBlockExchange
s used in this second layer is the one normally used for each stage. Remember that this distribution must the same for all stages in the same multi-sender.As an example of how it works, you can run the following in ColocatedJoinEngineQuickStart
which returns:
TODO: