Parallel distinct hash aggregate #4881

Open. 1 commit to merge into base branch master.

benjaminwinger (Collaborator)

Extends the partitioning done in the hash aggregate operator to apply to distinct hash tables as well as the main hash table. The aggregation is then disabled in the thread-local hash tables for distinct keys, and computed from scratch when combining the data into the global tables.
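
To illustrate the approach described above, here is a minimal sketch in Python (not the project's C++ code). It assumes partitioning on the hash of the group key; `NUM_PARTITIONS`, `local_phase`, `merge_partition`, and `count_distinct` are hypothetical names, not identifiers from this PR. The thread-local step only deduplicates (group, value) pairs, since a per-thread distinct count cannot simply be summed when the same value shows up in more than one thread. The count is computed only after each partition has been merged.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; the real partition count is not stated in this PR


def partition_of(group):
    # Assumption: route by the hash of the group key, so every pair
    # belonging to one group ends up in the same partition.
    return hash(group) % NUM_PARTITIONS


def local_phase(rows):
    """One worker: deduplicate (group, value) pairs per partition, no aggregation."""
    partitions = [set() for _ in range(NUM_PARTITIONS)]
    for group, value in rows:
        partitions[partition_of(group)].add((group, value))
    return partitions


def merge_partition(local_parts):
    """Combine one partition from every worker, then aggregate from scratch."""
    distinct_pairs = set().union(*local_parts)
    counts = defaultdict(int)
    for group, _value in distinct_pairs:
        counts[group] += 1  # COUNT(DISTINCT value) for this group
    return dict(counts)


def count_distinct(worker_inputs):
    # In a real engine the workers and the per-partition merges run in parallel.
    local_tables = [local_phase(rows) for rows in worker_inputs]
    result = {}
    for p in range(NUM_PARTITIONS):
        result.update(merge_partition([parts[p] for parts in local_tables]))
    return result


workers = [
    [(1, "a"), (1, "b"), (2, "a")],
    [(1, "a"), (2, "c"), (2, "a"), (3, "z")],
]
print(count_distinct(workers))
```

Under this assumption each partition holds every pair for its groups, so partitions can be merged independently of one another.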

This still doesn't parallelize the simple distinct aggregate, which I'll do in a later PR.

This also fixes support for nested types in the hash aggregate generally (I realised that I had missed implementing the row data versions of the comparison functions for structs and lists).

benjaminwinger force-pushed the parallel-distinct branch 2 times, most recently from b177964 to 3413492, on February 10, 2025 19:12.

Benchmark Result

Master commit hash: 3b37b0845e07fa51e47f9dfecf4f2f87ee399e42
Branch commit hash: 848595776549315de36caa1bdae388ef7059fda4

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff (ms, %) |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 723.30 | 731.20 | -7.90 (-1.08%) |
| aggregation | q28 | 6414.60 | 6339.96 | 74.65 (1.18%) |
| filter | q14 | 126.88 | 118.57 | 8.31 (7.01%) |
| filter | q15 | 128.32 | 115.52 | 12.80 (11.08%) |
| filter | q16 | 304.40 | 309.00 | -4.60 (-1.49%) |
| filter | q17 | 445.84 | 444.76 | 1.08 (0.24%) |
| filter | q18 | 1923.49 | 1911.24 | 12.25 (0.64%) |
| filter | zonemap-node | 88.45 | 83.03 | 5.42 (6.53%) |
| filter | zonemap-node-lhs-cast | 90.38 | 81.52 | 8.86 (10.87%) |
| filter | zonemap-node-null | 88.38 | 80.69 | 7.69 (9.53%) |
| filter | zonemap-rel | 5490.86 | 5471.28 | 19.58 (0.36%) |
| fixed_size_expr_evaluator | q07 | 573.75 | 563.42 | 10.33 (1.83%) |
| fixed_size_expr_evaluator | q08 | 804.56 | 795.32 | 9.24 (1.16%) |
| fixed_size_expr_evaluator | q09 | 805.57 | 796.53 | 9.04 (1.14%) |
| fixed_size_expr_evaluator | q10 | 237.85 | 228.96 | 8.89 (3.88%) |
| fixed_size_expr_evaluator | q11 | 229.56 | 222.04 | 7.52 (3.39%) |
| fixed_size_expr_evaluator | q12 | 227.25 | 217.14 | 10.11 (4.66%) |
| fixed_size_expr_evaluator | q13 | 1460.15 | 1447.51 | 12.64 (0.87%) |
| fixed_size_seq_scan | q23 | 112.27 | 105.86 | 6.41 (6.05%) |
| join | q29 | 714.32 | 715.70 | -1.38 (-0.19%) |
| join | q30 | 10183.29 | 9894.64 | 288.66 (2.92%) |
| join | q31 | 4.32 | 7.14 | -2.82 (-39.47%) |
| join | SelectiveTwoHopJoin | 58.25 | 55.23 | 3.02 (5.47%) |
| ldbc_snb_ic | q35 | 2576.67 | 2689.46 | -112.79 (-4.19%) |
| ldbc_snb_ic | q36 | 496.84 | 474.01 | 22.82 (4.82%) |
| ldbc_snb_is | q32 | 6.83 | 5.73 | 1.10 (19.22%) |
| ldbc_snb_is | q33 | 14.35 | 15.37 | -1.02 (-6.62%) |
| ldbc_snb_is | q34 | 1.24 | 1.18 | 0.06 (5.14%) |
| multi-rel | multi-rel-large-scan | 1769.58 | 1397.44 | 372.14 (26.63%) |
| multi-rel | multi-rel-lookup | 22.13 | 31.59 | -9.46 (-29.95%) |
| multi-rel | multi-rel-small-scan | 99.27 | 55.71 | 43.56 (78.18%) |
| order_by | q25 | 130.60 | 123.06 | 7.54 (6.13%) |
| order_by | q26 | 446.54 | 440.62 | 5.92 (1.34%) |
| order_by | q27 | 1442.87 | 1437.25 | 5.62 (0.39%) |
| recursive_join | recursive-join-bidirection | 291.11 | 315.22 | -24.10 (-7.65%) |
| recursive_join | recursive-join-dense | 7410.99 | 7359.64 | 51.34 (0.70%) |
| recursive_join | recursive-join-path | 24152.40 | 24295.00 | -142.60 (-0.59%) |
| recursive_join | recursive-join-sparse | 1053.61 | 1047.62 | 6.00 (0.57%) |
| recursive_join | recursive-join-trail | 7395.64 | 7333.16 | 62.48 (0.85%) |
| scan_after_filter | q01 | 170.92 | 168.37 | 2.54 (1.51%) |
| scan_after_filter | q02 | 157.74 | 151.14 | 6.60 (4.37%) |
| shortest_path_ldbc100 | q37 | 80.06 | 100.48 | -20.42 (-20.32%) |
| shortest_path_ldbc100 | q38 | 259.42 | 387.24 | -127.82 (-33.01%) |
| shortest_path_ldbc100 | q39 | 65.73 | 64.98 | 0.75 (1.15%) |
| shortest_path_ldbc100 | q40 | 363.10 | 462.65 | -99.55 (-21.52%) |
| var_size_expr_evaluator | q03 | 2073.78 | 2093.01 | -19.23 (-0.92%) |
| var_size_expr_evaluator | q04 | 2224.38 | 2232.23 | -7.86 (-0.35%) |
| var_size_expr_evaluator | q05 | 2637.64 | 2663.57 | -25.92 (-0.97%) |
| var_size_expr_evaluator | q06 | 1330.54 | 1321.14 | 9.40 (0.71%) |
| var_size_seq_scan | q19 | 1453.13 | 1443.32 | 9.81 (0.68%) |
| var_size_seq_scan | q20 | 2410.84 | 2341.89 | 68.94 (2.94%) |
| var_size_seq_scan | q21 | 2302.59 | 2302.94 | -0.35 (-0.02%) |
| var_size_seq_scan | q22 | 126.15 | 126.92 | -0.77 (-0.61%) |
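
The Diff column appears to be the commit time minus the master time, with the percentage taken relative to master, so negative values mean the branch is faster. A short Python check of that reading, using the q24 and q38 rows above (the function name `format_diff` is made up for illustration):

```python
def format_diff(commit_ms, master_ms):
    """Absolute and relative difference of the branch against master."""
    diff = commit_ms - master_ms
    pct = diff / master_ms * 100
    return f"{diff:.2f} ({pct:.2f}%)"


print(format_diff(723.30, 731.20))  # aggregation q24
print(format_diff(259.42, 387.24))  # shortest_path_ldbc100 q38
```

Some rows, such as q28, differ from this recomputation in the last digit, presumably because the report works from unrounded means.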

benjaminwinger (Collaborator, Author)

Benchmarks (adapted from ClickBench, on 128 threads 2xEPYC 7551):

| Query | Before | After |
| --- | --- | --- |
| `MATCH (h:hits) RETURN h.RegionID, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;` | 36s | 2.5s |
| `MATCH (h:hits) RETURN h.RegionID, SUM(h.AdvEngineID), COUNT(*) AS c, AVG(h.ResolutionWidth), COUNT(DISTINCT h.UserID) ORDER BY c DESC LIMIT 10;` | 47s | 2.8s |
| `MATCH (h:hits) WHERE h.MobilePhoneModel <> '' RETURN h.MobilePhoneModel, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;` | 6.4s | 0.97s |
| `MATCH (h:hits) WHERE h.MobilePhoneModel <> '' RETURN h.MobilePhone, h.MobilePhoneModel, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;` | 7.2s | 0.96s |
| `MATCH (h:hits) WHERE h.SearchPhrase <> '' RETURN h.SearchPhrase, COUNT(DISTINCT h.UserID) AS u ORDER BY u DESC LIMIT 10;` | 20s | 1.4s |
| `MATCH (h:hits) WHERE contains(h.Title, 'Google') AND NOT contains(h.URL, '.google.') AND h.SearchPhrase <> '' RETURN h.SearchPhrase, MIN(h.URL), MIN(h.Title), COUNT(*) AS c, COUNT(DISTINCT h.UserID) ORDER BY c DESC LIMIT 10;`* | 33s | 2.5s |

*This last query is substantially different from the SQL version, since we don't have an exact equivalent of the LIKE operator. `contains` is probably faster than LIKE, while the regex-based `=~` is bottlenecked by our regex matching and will probably never be faster. The `=~` version takes ~17s with these changes, so regex matching might be an area we should look at optimizing, but that number doesn't show anything meaningful about the aggregation performance.
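
For reference, the speedups implied by the timings above can be worked out directly. This is plain arithmetic on the reported Before/After values, in the order the queries are listed; nothing here is a new measurement, and the variable names are only for illustration.

```python
# (before_s, after_s) pairs copied from the table, in the same order
timings = [(36, 2.5), (47, 2.8), (6.4, 0.97), (7.2, 0.96), (20, 1.4), (33, 2.5)]


def speedup(before_s, after_s):
    """How many times faster the query runs after the change."""
    return before_s / after_s


speedups = [round(speedup(before, after), 1) for before, after in timings]
for (before, after), factor in zip(timings, speedups):
    print(f"{before}s -> {after}s: {factor}x")
```

On these numbers the improvement ranges from roughly 6.6x to 16.8x.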

ray6080 (Contributor) left a comment

Looks great to me! Thanks!
I have a question and several very minor comments; please take a look.

Benchmark Result

Master commit hash: dfabf90eab17ec0dc0f87d18464152412e1fd8ee
Branch commit hash: 492a4ed8600a7314e84a97245c34024dfcae6b4c

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff (ms, %) |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 724.15 | 736.85 | -12.70 (-1.72%) |
| aggregation | q28 | 6385.26 | 6358.04 | 27.22 (0.43%) |
| filter | q14 | 126.75 | 128.29 | -1.54 (-1.20%) |
| filter | q15 | 123.86 | 126.50 | -2.64 (-2.09%) |
| filter | q16 | 304.11 | 306.18 | -2.07 (-0.68%) |
| filter | q17 | 444.65 | 446.79 | -2.15 (-0.48%) |
| filter | q18 | 1948.34 | 1922.93 | 25.41 (1.32%) |
| filter | zonemap-node | 88.99 | 88.87 | 0.12 (0.13%) |
| filter | zonemap-node-lhs-cast | 88.61 | 90.75 | -2.14 (-2.35%) |
| filter | zonemap-node-null | 88.39 | 90.66 | -2.26 (-2.50%) |
| filter | zonemap-rel | 5484.15 | 5394.15 | 90.00 (1.67%) |
| fixed_size_expr_evaluator | q07 | 570.60 | 581.95 | -11.35 (-1.95%) |
| fixed_size_expr_evaluator | q08 | 800.81 | 801.57 | -0.76 (-0.09%) |
| fixed_size_expr_evaluator | q09 | 801.89 | 803.44 | -1.55 (-0.19%) |
| fixed_size_expr_evaluator | q10 | 236.67 | 236.67 | -0.00 (-0.00%) |
| fixed_size_expr_evaluator | q11 | 228.95 | 229.64 | -0.69 (-0.30%) |
| fixed_size_expr_evaluator | q12 | 225.73 | 231.70 | -5.97 (-2.58%) |
| fixed_size_expr_evaluator | q13 | 1455.63 | 1465.25 | -9.61 (-0.66%) |
| fixed_size_seq_scan | q23 | 111.17 | 111.76 | -0.59 (-0.53%) |
| join | q29 | 743.22 | 703.37 | 39.85 (5.67%) |
| join | q30 | 10996.13 | 11083.57 | -87.45 (-0.79%) |
| join | q31 | 6.92 | 9.98 | -3.05 (-30.60%) |
| join | SelectiveTwoHopJoin | 52.41 | 59.99 | -7.58 (-12.63%) |
| ldbc_snb_ic | q35 | 2637.48 | 2607.02 | 30.47 (1.17%) |
| ldbc_snb_ic | q36 | 472.10 | 485.56 | -13.45 (-2.77%) |
| ldbc_snb_is | q32 | 4.81 | 4.47 | 0.34 (7.57%) |
| ldbc_snb_is | q33 | 16.21 | 14.83 | 1.37 (9.27%) |
| ldbc_snb_is | q34 | 1.45 | 1.25 | 0.20 (16.40%) |
| multi-rel | multi-rel-large-scan | 1675.49 | 1392.59 | 282.90 (20.32%) |
| multi-rel | multi-rel-lookup | 19.77 | 32.54 | -12.77 (-39.23%) |
| multi-rel | multi-rel-small-scan | 72.33 | 102.16 | -29.83 (-29.20%) |
| order_by | q25 | 131.85 | 131.92 | -0.07 (-0.05%) |
| order_by | q26 | 458.56 | 452.45 | 6.11 (1.35%) |
| order_by | q27 | 1437.75 | 1420.37 | 17.38 (1.22%) |
| recursive_join | recursive-join-bidirection | 282.79 | 296.22 | -13.43 (-4.53%) |
| recursive_join | recursive-join-dense | 7395.58 | 7444.01 | -48.44 (-0.65%) |
| recursive_join | recursive-join-path | 24183.58 | 24117.33 | 66.25 (0.27%) |
| recursive_join | recursive-join-sparse | 1051.37 | 1057.45 | -6.08 (-0.58%) |
| recursive_join | recursive-join-trail | 7379.74 | 7418.08 | -38.34 (-0.52%) |
| scan_after_filter | q01 | 173.52 | 175.01 | -1.49 (-0.85%) |
| scan_after_filter | q02 | 158.23 | 159.85 | -1.62 (-1.01%) |
| shortest_path_ldbc100 | q37 | 94.70 | 97.65 | -2.94 (-3.02%) |
| shortest_path_ldbc100 | q38 | 378.26 | 377.28 | 0.98 (0.26%) |
| shortest_path_ldbc100 | q39 | 62.03 | 64.85 | -2.82 (-4.35%) |
| shortest_path_ldbc100 | q40 | 451.36 | 464.15 | -12.79 (-2.76%) |
| var_size_expr_evaluator | q03 | 2156.60 | 2149.45 | 7.15 (0.33%) |
| var_size_expr_evaluator | q04 | 2277.84 | 2203.44 | 74.40 (3.38%) |
| var_size_expr_evaluator | q05 | 2705.36 | 2620.11 | 85.25 (3.25%) |
| var_size_expr_evaluator | q06 | 1363.42 | 1345.39 | 18.03 (1.34%) |
| var_size_seq_scan | q19 | 1515.28 | 1459.82 | 55.46 (3.80%) |
| var_size_seq_scan | q20 | 2434.15 | 2352.12 | 82.03 (3.49%) |
| var_size_seq_scan | q21 | 2362.03 | 2311.06 | 50.96 (2.21%) |
| var_size_seq_scan | q22 | 128.63 | 128.13 | 0.50 (0.39%) |

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 90.72581% with 23 lines in your changes missing coverage. Please review.

Project coverage is 86.53%. Comparing base (b3f88d7) to head (18b0903).
Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/processor/result/base_hash_table.cpp | 73.33% | 20 Missing ⚠️ |
| ...cessor/operator/aggregate/aggregate_hash_table.cpp | 97.67% | 2 Missing ⚠️ |
| ...rc/processor/operator/aggregate/hash_aggregate.cpp | 98.30% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4881      +/-   ##
==========================================
- Coverage   86.53%   86.53%   -0.01%     
==========================================
  Files        1403     1403              
  Lines       60536    60665     +129     
  Branches     7442     7460      +18     
==========================================
+ Hits        52385    52494     +109     
- Misses       7982     8002      +20     
  Partials      169      169              


Benchmark Result

Master commit hash: dfabf90eab17ec0dc0f87d18464152412e1fd8ee
Branch commit hash: fbaeb5abf4e321d987a1a2d37e01438dcac55e26

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff (ms, %) |
| --- | --- | --- | --- | --- |
| aggregation | q24 | 723.79 | 736.85 | -13.06 (-1.77%) |
| aggregation | q28 | 6386.04 | 6358.04 | 27.99 (0.44%) |
| filter | q14 | 130.13 | 128.29 | 1.84 (1.43%) |
| filter | q15 | 129.89 | 126.50 | 3.39 (2.68%) |
| filter | q16 | 307.59 | 306.18 | 1.41 (0.46%) |
| filter | q17 | 449.94 | 446.79 | 3.15 (0.70%) |
| filter | q18 | 1922.84 | 1922.93 | -0.09 (-0.00%) |
| filter | zonemap-node | 90.75 | 88.87 | 1.87 (2.11%) |
| filter | zonemap-node-lhs-cast | 89.02 | 90.75 | -1.73 (-1.91%) |
| filter | zonemap-node-null | 89.98 | 90.66 | -0.68 (-0.75%) |
| filter | zonemap-rel | 5542.24 | 5394.15 | 148.09 (2.75%) |
| fixed_size_expr_evaluator | q07 | 570.26 | 581.95 | -11.70 (-2.01%) |
| fixed_size_expr_evaluator | q08 | 804.93 | 801.57 | 3.36 (0.42%) |
| fixed_size_expr_evaluator | q09 | 816.82 | 803.44 | 13.38 (1.67%) |
| fixed_size_expr_evaluator | q10 | 238.35 | 236.67 | 1.68 (0.71%) |
| fixed_size_expr_evaluator | q11 | 235.44 | 229.64 | 5.80 (2.52%) |
| fixed_size_expr_evaluator | q12 | 228.35 | 231.70 | -3.35 (-1.44%) |
| fixed_size_expr_evaluator | q13 | 1463.77 | 1465.25 | -1.47 (-0.10%) |
| fixed_size_seq_scan | q23 | 109.84 | 111.76 | -1.92 (-1.72%) |
| join | q29 | 746.26 | 703.37 | 42.89 (6.10%) |
| join | q30 | 10734.21 | 11083.57 | -349.37 (-3.15%) |
| join | q31 | 5.54 | 9.98 | -4.44 (-44.50%) |
| join | SelectiveTwoHopJoin | 55.89 | 59.99 | -4.09 (-6.83%) |
| ldbc_snb_ic | q35 | 2556.99 | 2607.02 | -50.03 (-1.92%) |
| ldbc_snb_ic | q36 | 418.13 | 485.56 | -67.43 (-13.89%) |
| ldbc_snb_is | q32 | 3.96 | 4.47 | -0.51 (-11.49%) |
| ldbc_snb_is | q33 | 15.72 | 14.83 | 0.89 (6.01%) |
| ldbc_snb_is | q34 | 1.22 | 1.25 | -0.03 (-2.54%) |
| multi-rel | multi-rel-large-scan | 1337.03 | 1392.59 | -55.55 (-3.99%) |
| multi-rel | multi-rel-lookup | 30.83 | 32.54 | -1.71 (-5.24%) |
| multi-rel | multi-rel-small-scan | 52.66 | 102.16 | -49.50 (-48.46%) |
| order_by | q25 | 133.09 | 131.92 | 1.17 (0.89%) |
| order_by | q26 | 459.01 | 452.45 | 6.56 (1.45%) |
| order_by | q27 | 1452.07 | 1420.37 | 31.70 (2.23%) |
| recursive_join | recursive-join-bidirection | 282.12 | 296.22 | -14.11 (-4.76%) |
| recursive_join | recursive-join-dense | 7403.85 | 7444.01 | -40.17 (-0.54%) |
| recursive_join | recursive-join-path | 24250.06 | 24117.33 | 132.73 (0.55%) |
| recursive_join | recursive-join-sparse | 1068.19 | 1057.45 | 10.75 (1.02%) |
| recursive_join | recursive-join-trail | 7347.12 | 7418.08 | -70.96 (-0.96%) |
| scan_after_filter | q01 | 171.29 | 175.01 | -3.72 (-2.12%) |
| scan_after_filter | q02 | 156.82 | 159.85 | -3.03 (-1.89%) |
| shortest_path_ldbc100 | q37 | 88.95 | 97.65 | -8.70 (-8.91%) |
| shortest_path_ldbc100 | q38 | 371.34 | 377.28 | -5.94 (-1.57%) |
| shortest_path_ldbc100 | q39 | 63.65 | 64.85 | -1.20 (-1.85%) |
| shortest_path_ldbc100 | q40 | 455.76 | 464.15 | -8.39 (-1.81%) |
| var_size_expr_evaluator | q03 | 2143.93 | 2149.45 | -5.52 (-0.26%) |
| var_size_expr_evaluator | q04 | 2262.03 | 2203.44 | 58.59 (2.66%) |
| var_size_expr_evaluator | q05 | 2730.13 | 2620.11 | 110.02 (4.20%) |
| var_size_expr_evaluator | q06 | 1356.58 | 1345.39 | 11.19 (0.83%) |
| var_size_seq_scan | q19 | 1505.17 | 1459.82 | 45.35 (3.11%) |
| var_size_seq_scan | q20 | 2429.26 | 2352.12 | 77.14 (3.28%) |
| var_size_seq_scan | q21 | 2375.24 | 2311.06 | 64.17 (2.78%) |
| var_size_seq_scan | q22 | 128.83 | 128.13 | 0.70 (0.55%) |
