Add BatchCoalescer::push_filtered_batch and docs #7652

Merged
merged 1 commit into from
Jun 17, 2025

@alamb alamb commented Jun 12, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

Coalescing the result of applying a filter currently requires first copying the selected rows into an intermediate array (by calling filter).

My plan is to remove this extra copy by building up the final array directly and incrementally.

To do so, there needs to be an API that takes the original data and the filter.
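
To make the difference between the two paths concrete, here is a minimal, std-only sketch. It is not the arrow-rs implementation: `ToyCoalescer`, `push_row`, `finish`, and `coalesce_both_ways` are hypothetical names, and plain `Vec<i32>` / `&[bool]` stand in for a record batch and a boolean filter. It only illustrates the plan described above (filter into an intermediate buffer and then push, versus handing over the original rows together with the filter) and says nothing about how this PR implements the new method internally.

```rust
/// Toy stand-in for a batch coalescer: buffers i32 rows and emits
/// output batches of `target` rows. Hypothetical, not the arrow-rs type.
struct ToyCoalescer {
    target: usize,
    in_progress: Vec<i32>,
    completed: Vec<Vec<i32>>,
}

impl ToyCoalescer {
    fn new(target: usize) -> Self {
        assert!(target > 0, "target batch size must be non-zero");
        Self {
            target,
            in_progress: Vec::with_capacity(target),
            completed: Vec::new(),
        }
    }

    /// Append one row, emitting a completed batch once `target` rows are buffered.
    fn push_row(&mut self, value: i32) {
        self.in_progress.push(value);
        if self.in_progress.len() == self.target {
            let full = std::mem::replace(
                &mut self.in_progress,
                Vec::with_capacity(self.target),
            );
            self.completed.push(full);
        }
    }

    /// Two-step path: the caller has already materialized the filtered rows.
    fn push_batch(&mut self, filtered: Vec<i32>) {
        for value in filtered {
            self.push_row(value);
        }
    }

    /// Direct path: take the unfiltered rows plus the filter and copy the
    /// selected rows straight into the in-progress output buffer.
    fn push_batch_with_filter(&mut self, batch: &[i32], filter: &[bool]) {
        assert_eq!(batch.len(), filter.len(), "filter length must match batch");
        for (value, keep) in batch.iter().zip(filter) {
            if *keep {
                self.push_row(*value);
            }
        }
    }

    /// Flush any partially filled batch and return everything produced.
    fn finish(mut self) -> Vec<Vec<i32>> {
        if !self.in_progress.is_empty() {
            let rest = std::mem::take(&mut self.in_progress);
            self.completed.push(rest);
        }
        self.completed
    }
}

/// Run the same input through both paths and return (two_step, direct).
fn coalesce_both_ways(
    batches: &[Vec<i32>],
    filters: &[Vec<bool>],
    target: usize,
) -> (Vec<Vec<i32>>, Vec<Vec<i32>>) {
    let mut two_step = ToyCoalescer::new(target);
    let mut direct = ToyCoalescer::new(target);
    for (batch, filter) in batches.iter().zip(filters) {
        // Two-step: this intermediate Vec is the extra copy.
        let filtered: Vec<i32> = batch
            .iter()
            .zip(filter)
            .filter(|(_, keep)| **keep)
            .map(|(value, _)| *value)
            .collect();
        two_step.push_batch(filtered);
        // Direct: no intermediate Vec is allocated.
        direct.push_batch_with_filter(batch, filter);
    }
    (two_step.finish(), direct.finish())
}
```

Both paths should produce identical output batches; the direct path simply skips allocating `filtered`. Calling `coalesce_both_ways` with two four-row batches, matching filters, and a target size of 3 is enough to see that.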

What changes are included in this PR?

  1. Add BatchCoalescer::push_filtered_batch and docs
  2. Update benchmarks to use it

Are there any user-facing changes?

New API

@github-actions (bot) added the "arrow" label (Changes to the arrow crate) on Jun 12, 2025

alamb commented Jun 12, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/add_push_filter (9e2e88f) to e32f545 diff
BENCH_NAME=coalesce_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench coalesce_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_add_push_filter
Results will be posted here when complete

// Add the filtered batch to the coalescer
coalescer.push_batch(filtered_batch).unwrap();
coalescer
    .push_batch_with_filter(batch.clone(), filter)
@alamb alamb Jun 12, 2025

This is the point of the PR: add this API so that the coalescer can later build the output up incrementally.

@alamb alamb marked this pull request as ready for review June 12, 2025 20:24

alamb commented Jun 12, 2025

🤖: Benchmark completed

Details

group                                                                                alamb_add_push_filter                  main
-----                                                                                ---------------------                  ----
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.001                               1.00    259.6±3.81ms        ? ?/sec    1.22    315.8±1.80ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.01                                1.00      8.8±0.09ms        ? ?/sec    1.03      9.0±0.15ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.1                                 1.00      4.4±0.10ms        ? ?/sec    1.01      4.4±0.11ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0, selectivity: 0.8                                 1.00      3.5±0.02ms        ? ?/sec    1.01      3.6±0.03ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.001                             1.00    262.0±2.94ms        ? ?/sec    1.16    303.5±2.27ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.01                              1.00     10.3±0.09ms        ? ?/sec    1.00     10.4±0.07ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.1                               1.00      5.0±0.12ms        ? ?/sec    1.01      5.0±0.10ms        ? ?/sec
filter: mixed_dict, 8192, nulls: 0.1, selectivity: 0.8                               1.00      4.7±0.05ms        ? ?/sec    1.00      4.7±0.02ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.001                               1.00     69.6±1.36ms        ? ?/sec    1.01     70.4±0.86ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.01                                1.00     12.7±0.18ms        ? ?/sec    1.02     13.0±0.11ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.1                                 1.00      9.9±0.40ms        ? ?/sec    1.03     10.1±0.35ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0, selectivity: 0.8                                 1.00      8.2±0.15ms        ? ?/sec    1.13      9.3±0.21ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.001                             1.00     85.3±0.63ms        ? ?/sec    1.01     85.7±0.75ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.01                              1.00     14.8±0.16ms        ? ?/sec    1.02     15.1±0.15ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.1                               1.00     10.3±0.33ms        ? ?/sec    1.00     10.3±0.32ms        ? ?/sec
filter: mixed_utf8, 8192, nulls: 0.1, selectivity: 0.8                               1.00     10.2±0.45ms        ? ?/sec    1.06     10.8±0.19ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.001      1.01     69.9±0.55ms        ? ?/sec    1.00     69.3±0.43ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.01       1.00      8.8±0.06ms        ? ?/sec    1.01      8.9±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.1        1.00      5.1±0.21ms        ? ?/sec    1.04      5.3±0.26ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0, selectivity: 0.8        1.01      3.4±0.03ms        ? ?/sec    1.00      3.4±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.001    1.00     87.8±0.38ms        ? ?/sec    1.01     88.5±1.91ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.01     1.00     12.1±0.07ms        ? ?/sec    1.01     12.2±0.06ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.1      1.01      6.3±0.21ms        ? ?/sec    1.00      6.3±0.14ms        ? ?/sec
filter: mixed_utf8view (max_string_len=128), 8192, nulls: 0.1, selectivity: 0.8      1.00      3.8±0.02ms        ? ?/sec    1.00      3.8±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.001       1.00     60.8±0.29ms        ? ?/sec    1.00     60.9±0.43ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.01        1.00      7.1±0.02ms        ? ?/sec    1.00      7.1±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.1         1.00      3.1±0.19ms        ? ?/sec    1.01      3.1±0.21ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0, selectivity: 0.8         1.07      2.6±0.02ms        ? ?/sec    1.00      2.4±0.02ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.001     1.01     71.1±0.34ms        ? ?/sec    1.00     70.7±0.33ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.01      1.00     10.7±0.03ms        ? ?/sec    1.00     10.7±0.04ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.1       1.00      4.4±0.12ms        ? ?/sec    1.03      4.5±0.18ms        ? ?/sec
filter: mixed_utf8view (max_string_len=20), 8192, nulls: 0.1, selectivity: 0.8       1.01      4.8±0.02ms        ? ?/sec    1.00      4.8±0.06ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.001                          1.00     87.8±0.28ms        ? ?/sec    1.03     90.4±0.29ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.01                           1.00     11.7±0.06ms        ? ?/sec    1.01     11.9±0.05ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.1                            1.00      5.1±0.15ms        ? ?/sec    1.00      5.1±0.13ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0, selectivity: 0.8                            1.10      4.4±0.03ms        ? ?/sec    1.00      4.0±0.02ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.001                        1.00    115.8±0.42ms        ? ?/sec    1.07    123.7±0.62ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.01                         1.00     15.9±0.04ms        ? ?/sec    1.04     16.6±0.08ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.1                          1.02      7.0±0.14ms        ? ?/sec    1.00      6.9±0.19ms        ? ?/sec
filter: single_utf8view, 8192, nulls: 0.1, selectivity: 0.8                          1.05      7.0±0.02ms        ? ?/sec    1.00      6.7±0.03ms        ? ?/sec
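
For reading the table above: the leading number in each column appears to be that run's mean time relative to the faster of the two runs in the row, so 1.00 marks the faster side. The sketch below reproduces that arithmetic under this assumption. `relative_to_fastest` is a hypothetical helper, not part of the benchmark script, and because the table prints rounded times, recomputed ratios can differ from the printed ones in the last digit.

```rust
/// Ratio of each mean time to the fastest time in the row, rounded to
/// two decimals. Hypothetical helper for interpreting the table.
fn relative_to_fastest(times_ms: &[f64]) -> Vec<f64> {
    // Smallest time in the row (infinity if the row is empty).
    let fastest = times_ms.iter().copied().fold(f64::INFINITY, f64::min);
    times_ms
        .iter()
        .map(|t| (t / fastest * 100.0).round() / 100.0)
        .collect()
}
```

Applied to the first row (259.6 ms on the branch, 315.8 ms on main), this gives 1.00 and 1.22, matching the printed values: main is about 22% slower than the branch for mixed_dict at selectivity 0.001. Most other rows are within a few percent either way, and a few selectivity 0.8 rows (for example single_utf8view with no nulls, 4.4 ms against 4.0 ms) are slower on the branch.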

@Dandandan Dandandan left a comment

LGTM!


alamb commented Jun 17, 2025

Thank you for the review @Dandandan

@alamb alamb merged commit a19fc62 into apache:main Jun 17, 2025
29 of 30 checks passed