feat: selectivity metrics (for Explain Analyze) in Hash Join #18488
Conversation
Thank you, I think the implementation is correct. The only consideration is performance: the Hash Join implementation is definitely on the performance-critical path, so we have to be careful not to introduce additional overhead. This PR should be good to go if we can verify it has no influence on performance. In this PR, the extra overhead is, for each batch, counting the distinct indices. I believe these metrics provide more insight than simply computing …
+1. Better to do some profiling for the classic join patterns to get a clear picture.
🤖 Kicked it off

🤖: Benchmark completed

🤖: Benchmark completed
Considering the results, I think we should move it to …
It's running in a noisy cloud environment, and tpch_mem takes quite a short time, so it might not be accurate. I've verified this with tpch_mem10 locally, and it actually slows down several queries. I tried to make this distinct-index counting faster (and sent a PR, feniljain#2), and after that I think the result looks good:
Very interesting
Thanks for confirming!
Very curious about this PR, because it seems you have written the same logic as mine, but using loops instead; am I missing some detail? Is it that the …
If we can make the loop body really simple, the compiler can figure out how to generate more efficient machine code (e.g. using SIMD instructions), and the hardware can execute it faster through several mechanisms (e.g. better memory prefetching); this can result in a several-times speedup between otherwise equivalent implementations. I'm not entirely sure under which circumstances the compiler might fail to optimize, so I try to keep the loop body as simple as possible, and that usually works well.
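As a rough illustration of the loop shape being discussed (a hypothetical sketch, not the PR's actual code), counting distinct values over a slice assumed to be non-decreasing can be written with a single branch-free statement per iteration, the kind of body compilers usually manage to auto-vectorize:

```rust
/// Hypothetical helper: count distinct values in a slice that is
/// assumed to be sorted in non-decreasing order (mirroring the PR's
/// sorted-indices assumption). Illustrative only.
fn count_distinct_sorted(indices: &[u64]) -> usize {
    if indices.is_empty() {
        return 0;
    }
    let mut distinct = 1usize;
    // Plain index loop with a branch-free body: `(a != b) as usize`
    // is 0 or 1, so each new run of equal values adds exactly one.
    for i in 1..indices.len() {
        distinct += (indices[i] != indices[i - 1]) as usize;
    }
    distinct
}

fn main() {
    // Runs: [0,0] [1,1,1] [2] [5,5] [9] -> 5 distinct values
    assert_eq!(count_distinct_sorted(&[0u64, 0, 1, 1, 1, 2, 5, 5, 9]), 5);
    assert_eq!(count_distinct_sorted(&[]), 0);
    println!("ok");
}
```

Whether this actually vectorizes depends on the target CPU and compiler version; inspecting the generated assembly (e.g. on Compiler Explorer) is the reliable way to confirm.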
make HJ indices distinct count faster
Yup, it's easier to auto-vectorize simple loops! But in Rust, IIUC, even iterators are lowered to plain loops. It's an interesting point to explore what didn't work out in this case. I have merged your PR, thanks for it!
Yes, that's true. I'm also curious about when auto-vectorization actually works and what's an easy way to verify it. Once we start adding complexity to a simple loop body (e.g., iterators, branches, or inlined function calls), it tends to stop working at some point. So my current practice is to keep the loop body as simple as possible on hot paths (avoiding iterators and inlined calls). I guess the situation is somewhat analogous to the optimization implementation inside …
Let's include some simple tests; otherwise it's good to go once tests pass. Thanks again!
Which issue does this PR close?

Closes selectivity metrics (for Explain Analyze) in Hash Join #18409.

What changes are included in this PR?

Added a distinct element calculator in the core hash join loop. It works on the assumption that indices will be returned in increasing order; I couldn't see a place where this assumption is broken, but if that's not the case, please do help me out.

Also, I am not 100% sure my implementation of `avg_fanout` is correct, so do let me know if it needs changes.

Are these changes tested?

No failures in `sqllogictest` or the tests in `datafusion/core/tests/sql/`. Should I add a test case for this?
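To make the intent of the metrics concrete, here is a hedged sketch of per-batch selectivity and fan-out bookkeeping under the same increasing-indices assumption. All names (`JoinSelectivityMetrics`, `update`, etc.) are hypothetical, and the exact definitions of selectivity and fan-out here are one plausible reading, not necessarily the PR's:

```rust
/// Hypothetical per-operator join metrics; the field names, method
/// names, and metric definitions are illustrative, not DataFusion's API.
#[derive(Default)]
struct JoinSelectivityMetrics {
    output_rows: usize,        // total joined rows emitted
    matched_probe_rows: usize, // distinct probe rows that found a match
    probe_rows_seen: usize,    // total probe-side input rows
}

impl JoinSelectivityMetrics {
    /// `probe_indices`: one probe-side row index per output row of a
    /// batch, assumed non-decreasing (the PR's ordering assumption).
    /// `batch_probe_rows`: number of probe-side rows in the batch.
    fn update(&mut self, probe_indices: &[u64], batch_probe_rows: usize) {
        self.output_rows += probe_indices.len();
        self.probe_rows_seen += batch_probe_rows;
        if !probe_indices.is_empty() {
            // Branch-free distinct count over the sorted indices.
            let mut distinct = 1usize;
            for i in 1..probe_indices.len() {
                distinct += (probe_indices[i] != probe_indices[i - 1]) as usize;
            }
            self.matched_probe_rows += distinct;
        }
    }

    /// Fraction of probe rows that matched at least one build row.
    fn selectivity(&self) -> f64 {
        if self.probe_rows_seen == 0 {
            0.0
        } else {
            self.matched_probe_rows as f64 / self.probe_rows_seen as f64
        }
    }

    /// Average number of output rows per matching probe row.
    fn avg_fanout(&self) -> f64 {
        if self.matched_probe_rows == 0 {
            0.0
        } else {
            self.output_rows as f64 / self.matched_probe_rows as f64
        }
    }
}

fn main() {
    let mut m = JoinSelectivityMetrics::default();
    // 3 distinct probe rows (0, 1, 2) produced 6 output rows, out of
    // 10 probe-side rows in the batch.
    m.update(&[0, 0, 1, 2, 2, 2], 10);
    assert!((m.selectivity() - 0.3).abs() < 1e-9);
    assert!((m.avg_fanout() - 2.0).abs() < 1e-9);
    println!("ok");
}
```

With this accounting, the extra per-batch cost is exactly the distinct-index pass discussed above, which is why keeping that loop auto-vectorizable matters.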