-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
exec: hash joiner #1
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very very cool stuff. I think this is looking good for a first cut, but I don't fully understand everything in the implementation yet. Many of my comments are about improved documentation.
I would recommend going through and adding documentation to each of the main components - cleaning things up. Then, I think you should switch the implementation to use the exec.Operator
version to make sure we don't keep diverging from what's in the main repo, and PR it. We can continue the review there.
const hashTableBucketSize = 1 << 16 | ||
|
||
type hashTableInt struct { | ||
first []int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the other code in this repo isn't well commented, but we'll need to add comments to all of these fields once we productionize it - otherwise, it'll be impossible for people to understand what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should describe the contract of hashTableInt
as well. How is it used? What does it do exactly? At least a few sentences would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, what's first and next? How do they work? What's the overall structure of the hash table, what guarantees does it provide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I've started migrating this code out to cockroachdb and I promise there is better documentation in there 😆
|
||
func (hashTable *hashTableInt) grow(amount int) { | ||
hashTable.next = append(hashTable.next, make([]int, amount)...) | ||
hashTable.keys = append(hashTable.keys, make(intColumn, amount)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may cause multiple allocations. Depending on whether you want fine-grained control over how much the slice will grow, I think the right way to do this is to allocate a new slice if the old's slice capacity is too small to fit the new amount, and copy the old slice into the new slice...
I hear Go 1.11 is more optimal for this case (https://go-review.googlesource.com/c/go/+/109517) but unfortunately we're still on 1.10 for other reasons. If this is your bottleneck i'd consider changing it, otherwise I guess we can leave for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense, the implementation's a little different with the ColVec
stuff and different types, which we can discuss later.
|
||
// hashJoinerIntInt performs a hashJoin joining on two integer columns where the | ||
// left table represents the build relation. It does not work with N - N joins. | ||
type hashJoinerIntInt struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would stick this in the top of the file! since it's the first thing somebody will want to read. The rest of the stuff might even belong in a separate file.
|
||
hashJoiner.hashTable = makeHashTableInt(hashTableBucketSize, len(hashJoiner.leftCols)) | ||
|
||
hashJoiner.build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't belong in Init
, which is designed to run before execution starts at all - more of a setup phase than a do work phase. You should put this into Next
behind a conditional that will only run once. In distsql we do this with a little state machine infrastructure.
|
||
// build performs the build phase of the hash join using the left relation. | ||
// Different builders used different heuristics for the build phase allowing us | ||
// to evaluate cpu-memory trade-offs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea to have multiple builders! Perhaps we will be able to select a builder at plan time depending on the characteristics of the tables. Do you see any opportunities like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes definitely :D Found some cool literature on how to optimize the build phase for various tradeoffs... and we will also need different builders/probers when we expand to N-N joins
valCol := builder.hashTable.values[valIdx] | ||
|
||
for i:= 0; i < batchSize; i++ { | ||
valCol[i + builder.totalSize + 1] = outCol[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another + 1 - why? Seems like you could get rid of these everywhere, maybe, unless I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the lack of documentation :P The whole point of the +1
is because in our hash table, the index = 0
is reserved and represents the end of chain. So for every row index, we want to offset by 1 such that next[i + 1]
holds the next value in their corresponding bucket chains. Then for consistency, I had keys
and values
have that same offset by 1. This is how they implemented it in that paper, but now that I think about it, we can just have -1
be equal to end of chain.
break | ||
} | ||
|
||
builder.insertBatch(flow, eqColIdx, outCols, batchSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a slight complication here which is that you need to be examining the selection vector if it's set... since we don't have a standard way to do that yet, I suggest leaving it out for now, but add a TODO for us to make sure this is fixed later.
builder.hashTable.growNext(builder.totalSize) | ||
|
||
for i := 0; i < builder.totalSize; i++ { | ||
builder.hashTable.insertKey(builder.bucket[i], i + 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this loop is over the same bounds as the one above. Is there a reason you can't/shouldn't do both these steps in one loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. I was trying to figure out why the papers implementation split up the build process into a hash -> bucket -> insert loops, when it could easily have been combined into a single loop. Any idea why they might've done that?
hashTable.next = append(hashTable.next, make([]int, amount)...) | ||
} | ||
|
||
func (hashTable *hashTableInt) insertKey(hashKey int, id int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment on this - what does insertKey
do exactly?
hashTable.allocated += amount | ||
} | ||
|
||
func (hashTable *hashTableInt) insert(hashKey int, key int) (id int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be unused by the group implementation - why? Also, please add a comment on what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote this function before implementing the second builder and it does a bit too much since it inserts to keys
which we don't want to do if we are preloading everything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! I'll continue this in the cockroach repo
Toy hash joiner for joining on int-int columns where we build with the left relation (requires unique key on join column) and probe on right relation.
Results of
go test -bench=BenchmarkHashJoin
:Using the
hashJoinBuilder
:Using the
hashJoinGroupBuilder
: