Change `BooleanBuffer::append_packed_range` to use `apply_bitwise_binary_op` #8812

alamb · 2025-11-09T13:06:43Z

Which issue does this PR close?

Related to Improvements to BooleanBufferBuilder / BooleanBuilder #8561

Rationale for this change

We added an optimized packed implementation in the following PR:

feat: add apply_unary_op and apply_binary_op bitwise operations #8619

Let's use it to make `append_packed_range` faster.

What changes are included in this PR?

Use apply_bitwise_binary_op
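
For readers who want a feel for why a packed path helps, here is a minimal, self-contained sketch of the general idea: copying a bit range one bit at a time versus moving it in chunks of up to 64 bits. This is not the arrow-rs implementation and does not call `apply_bitwise_binary_op`. The helper names (`append_bitwise`, `append_chunked`, `read_bits`, `write_bits`) are hypothetical and exist only for this illustration. It assumes LSB-first bit packing and that any unused trailing bits in the destination are zero.

```rust
/// Read bit `i` from an LSB-first packed byte slice.
fn get_bit(data: &[u8], i: usize) -> bool {
    (data[i / 8] >> (i % 8)) & 1 == 1
}

/// Baseline (hypothetical): append `len` bits of `src`, starting at bit
/// `src_offset`, to `dst` one bit at a time.
fn append_bitwise(dst: &mut Vec<u8>, dst_len: &mut usize, src: &[u8], src_offset: usize, len: usize) {
    dst.resize((*dst_len + len + 7) / 8, 0);
    for i in 0..len {
        if get_bit(src, src_offset + i) {
            let pos = *dst_len + i;
            dst[pos / 8] |= 1 << (pos % 8);
        }
    }
    *dst_len += len;
}

/// Gather `n` (1..=64) bits starting at bit `offset` into the low bits of a u64.
fn read_bits(src: &[u8], offset: usize, n: usize) -> u64 {
    let (first, shift) = (offset / 8, offset % 8);
    let last = (offset + n + 7) / 8; // exclusive; spans at most 9 bytes
    let mut acc: u128 = 0;
    for (k, &b) in src[first..last].iter().enumerate() {
        acc |= (b as u128) << (8 * k);
    }
    ((acc >> shift) & ((1u128 << n) - 1)) as u64
}

/// OR the low `n` bits of `value` into `dst` starting at bit `offset`.
/// Assumes the target bits are currently zero and `value` has no bits set above `n`.
fn write_bits(dst: &mut [u8], offset: usize, n: usize, value: u64) {
    let (first, shift) = (offset / 8, offset % 8);
    let last = (offset + n + 7) / 8;
    let shifted = (value as u128) << shift;
    for (k, b) in dst[first..last].iter_mut().enumerate() {
        *b |= (shifted >> (8 * k)) as u8;
    }
}

/// Chunked variant (hypothetical): same result as `append_bitwise`, but moves
/// up to 64 bits per loop iteration instead of one.
fn append_chunked(dst: &mut Vec<u8>, dst_len: &mut usize, src: &[u8], src_offset: usize, len: usize) {
    dst.resize((*dst_len + len + 7) / 8, 0);
    let mut done = 0;
    while done < len {
        let n = (len - done).min(64);
        let word = read_bits(src, src_offset + done, n);
        write_bits(dst, *dst_len + done, n, word);
        done += n;
    }
    *dst_len += len;
}
```

Both functions should produce identical buffers for any in-bounds range. The chunked one does far fewer loop iterations for long ranges, which is the kind of effect the benchmarks for this PR are meant to measure.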

Are these changes tested?

Functionally by CI
I will also run benchmarks for this PR

Are there any user-facing changes?

Faster performance

alamb · 2025-11-09T14:19:08Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_append (3727c20) to 43c7637 diff
BENCH_NAME=boolean_append_packed
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_append_packed
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_append
Results will be posted here when complete

alamb · 2025-11-09T14:21:20Z

🤖: Benchmark completed

Details

group                    alamb_faster_append                    main
-----                    -------------------                    ----
boolean_append_packed    1.00      6.1±0.01µs        ? ?/sec    2.15     13.0±0.02µs        ? ?/sec

alamb · 2025-11-09T14:21:24Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_append (3727c20) to 43c7637 diff
BENCH_NAME=concatenate_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench concatenate_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_append
Results will be posted here when complete

alamb · 2025-11-09T14:31:38Z

🤖: Benchmark completed

Details

group                                                          alamb_faster_append                    main
-----                                                          -------------------                    ----
concat 1024 arrays boolean 4                                   1.00     22.9±0.06µs        ? ?/sec    1.24     28.2±0.06µs        ? ?/sec
concat 1024 arrays i32 4                                       1.03     15.0±0.07µs        ? ?/sec    1.00     14.5±0.12µs        ? ?/sec
concat 1024 arrays str 4                                       1.13     38.7±0.19µs        ? ?/sec    1.00     34.3±0.50µs        ? ?/sec
concat boolean 1024                                            1.00    315.5±0.46ns        ? ?/sec    1.40    440.7±1.59ns        ? ?/sec
concat boolean 8192 over 100 arrays                            1.00      5.1±0.01µs        ? ?/sec    10.59    54.4±0.44µs        ? ?/sec
concat boolean nulls 1024                                      1.00    539.0±0.93ns        ? ?/sec    1.46    785.2±4.14ns        ? ?/sec
concat boolean nulls 8192 over 100 arrays                      1.00     18.2±0.08µs        ? ?/sec    6.39    116.5±0.16µs        ? ?/sec
concat fixed size lists                                        1.05   783.2±27.82µs        ? ?/sec    1.00   747.3±30.38µs        ? ?/sec
concat i32 1024                                                1.00    388.7±1.19ns        ? ?/sec    1.01    391.8±1.15ns        ? ?/sec
concat i32 8192 over 100 arrays                                1.00    211.0±7.70µs        ? ?/sec    1.02   214.7±12.32µs        ? ?/sec
concat i32 nulls 1024                                          1.00    596.6±0.69ns        ? ?/sec    1.21    722.9±2.23ns        ? ?/sec
concat i32 nulls 8192 over 100 arrays                          1.00    238.7±5.19µs        ? ?/sec    1.18    282.6±4.51µs        ? ?/sec
concat str 1024                                                1.04     13.6±1.27µs        ? ?/sec    1.00     13.1±1.14µs        ? ?/sec
concat str 8192 over 100 arrays                                1.03    105.1±0.94ms        ? ?/sec    1.00    102.4±1.16ms        ? ?/sec
concat str nulls 1024                                          1.03      6.0±0.50µs        ? ?/sec    1.00      5.8±0.56µs        ? ?/sec
concat str nulls 8192 over 100 arrays                          1.03     53.4±0.40ms        ? ?/sec    1.00     51.8±1.03ms        ? ?/sec
concat str_dict 1024                                           1.00      2.7±0.01µs        ? ?/sec    1.02      2.8±0.01µs        ? ?/sec
concat str_dict_sparse 1024                                    1.02      7.0±0.05µs        ? ?/sec    1.00      6.9±0.02µs        ? ?/sec
concat struct with int32 and dicts size=1024 count=2           1.01      6.8±0.28µs        ? ?/sec    1.00      6.7±0.03µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0               1.03     80.1±0.50µs        ? ?/sec    1.00     77.7±1.02µs        ? ?/sec
concat utf8_view  max_str_len=128 null_density=0.2             1.00     82.0±0.89µs        ? ?/sec    1.03     84.2±0.42µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0                1.00     77.3±0.34µs        ? ?/sec    1.15     88.7±1.11µs        ? ?/sec
concat utf8_view  max_str_len=20 null_density=0.2              1.00     79.0±0.34µs        ? ?/sec    1.21     95.2±0.45µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.03     47.7±4.07µs        ? ?/sec    1.00     46.4±3.04µs        ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00     48.6±3.27µs        ? ?/sec    1.11     53.8±2.81µs        ? ?/sec

alamb · 2025-11-09T14:31:41Z

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/faster_append (3727c20) to 43c7637 diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench filter_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_faster_append
Results will be posted here when complete

alamb · 2025-11-09T14:56:09Z

🤖: Benchmark completed

Details

group                                                                         alamb_faster_append                    main
-----                                                                         -------------------                    ----
filter context decimal128 (kept 1/2)                                          1.00     42.9±1.81µs        ? ?/sec    1.02     43.5±5.32µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.00     49.1±1.27µs        ? ?/sec    1.01     49.5±1.56µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    238.3±0.49ns        ? ?/sec    1.02    243.2±0.43ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     96.4±0.27µs        ? ?/sec    1.01     96.9±0.28µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00      9.8±0.34µs        ? ?/sec    1.34     13.2±0.55µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00    488.2±1.13ns        ? ?/sec    1.20    584.3±1.17ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     79.5±0.18µs        ? ?/sec    1.00     79.6±0.16µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     79.5±0.16µs        ? ?/sec    1.00     79.5±0.17µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     79.4±0.12µs        ? ?/sec    1.00     79.5±0.11µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     79.4±0.11µs        ? ?/sec    1.00     79.6±0.46µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     79.5±0.14µs        ? ?/sec    1.00     79.5±0.32µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     79.4±0.13µs        ? ?/sec    1.00     79.6±0.25µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     79.9±2.44µs        ? ?/sec    1.00     79.5±0.21µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     79.5±0.18µs        ? ?/sec    1.00     79.6±1.02µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     79.5±0.13µs        ? ?/sec    1.00     79.6±0.93µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.00     16.6±0.04µs        ? ?/sec    1.00     16.6±0.07µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      6.2±0.40µs        ? ?/sec    1.00      6.3±0.39µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.01    239.3±0.36ns        ? ?/sec    1.00    236.3±0.33ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     96.9±0.87µs        ? ?/sec    1.00     96.7±0.26µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00     10.0±0.28µs        ? ?/sec    1.28     12.8±0.43µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00    487.4±1.12ns        ? ?/sec    1.20    586.7±0.96ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.05    125.5±4.80µs        ? ?/sec    1.00    119.7±0.50µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     54.6±1.94µs        ? ?/sec    1.02     55.9±1.19µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01    696.8±0.95ns        ? ?/sec    1.00    689.3±1.02ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.02    123.3±3.92µs        ? ?/sec    1.00    121.0±4.58µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00     52.6±1.09µs        ? ?/sec    1.06     56.0±0.72µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    502.9±6.33ns        ? ?/sec    1.00    505.2±1.24ns        ? ?/sec
filter context string (kept 1/2)                                              1.02    579.8±8.26µs        ? ?/sec    1.00    566.8±4.58µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     17.3±0.04µs        ? ?/sec    1.02     17.6±0.07µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.02      7.2±0.43µs        ? ?/sec    1.00      7.1±0.27µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    823.9±4.00ns        ? ?/sec    1.01    832.4±2.34ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     98.0±1.86µs        ? ?/sec    1.00     98.5±1.97µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00     10.8±0.43µs        ? ?/sec    1.26     13.5±0.40µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00   1098.7±3.84ns        ? ?/sec    1.00   1093.3±1.69ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.09   677.0±22.64µs        ? ?/sec    1.00   620.5±13.28µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.08   1000.9±2.17ns        ? ?/sec    1.00    929.1±2.18ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     22.5±0.07µs        ? ?/sec    1.00     22.5±0.06µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.05      2.1±0.04µs        ? ?/sec    1.00  1949.6±13.23ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.00    243.9±0.36ns        ? ?/sec    1.00    243.8±0.48ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.00    102.3±0.30µs        ? ?/sec    1.00    102.3±0.17µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.00      5.3±0.02µs        ? ?/sec    1.52      8.0±0.03µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00    590.1±1.17ns        ? ?/sec    1.01    597.6±1.71ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.00     49.8±0.56µs        ? ?/sec    1.03     51.1±2.95µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.01     53.7±1.88µs        ? ?/sec    1.00     52.9±1.55µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00      2.9±0.01µs        ? ?/sec    1.01      3.0±0.01µs        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    117.0±0.20µs        ? ?/sec    1.00    117.1±0.23µs        ? ?/sec
filter fsb with value length 20 (kept 1/2)                                    1.00    144.4±0.29µs        ? ?/sec    1.00    144.5±0.52µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.00     69.2±1.39µs        ? ?/sec    1.03     71.1±1.82µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.00      2.7±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00    150.6±0.23µs        ? ?/sec    1.01    152.7±0.79µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00     11.2±0.47µs        ? ?/sec    1.02     11.4±0.44µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.00      2.6±0.01µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00    162.1±9.10µs        ? ?/sec    1.04   168.3±12.51µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.02   216.0±11.09µs        ? ?/sec    1.00    211.5±4.35µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.00      2.6±0.01µs        ? ?/sec    1.01      2.7±0.04µs        ? ?/sec
filter i32 (kept 1/2)                                                         1.00     45.4±0.09µs        ? ?/sec    1.00     45.5±0.11µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.04      8.8±0.40µs        ? ?/sec    1.00      8.5±0.22µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.01      2.9±0.01µs        ? ?/sec    1.00      2.9±0.04µs        ? ?/sec
filter optimize (kept 1/2)                                                    1.00     54.2±0.06µs        ? ?/sec    1.00     54.3±0.09µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.00      2.8±0.02µs        ? ?/sec    1.10      3.0±0.01µs        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00      2.8±0.01µs        ? ?/sec    1.01      2.8±0.01µs        ? ?/sec
filter run array (kept 1/2)                                                   1.00    371.1±0.94µs        ? ?/sec    1.00    371.5±0.93µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.00    395.8±1.56µs        ? ?/sec    1.00    396.2±1.70µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    282.1±0.86µs        ? ?/sec    1.00    282.3±1.89µs        ? ?/sec
filter single record batch                                                    1.00     46.1±0.08µs        ? ?/sec    1.00     46.2±0.12µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     45.4±0.08µs        ? ?/sec    1.00     45.4±0.10µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.00      3.9±0.01µs        ? ?/sec    1.05      4.1±0.01µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.00      3.0±0.01µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec

alamb · 2025-11-09T16:12:51Z

Thank you for the review @rluvaton -- the improvements to concat are pretty exciting (for all types, not just boolean)

Dandandan

Nice, I restarted the failing job

alamb · 2025-11-09T20:54:56Z

Thanks @Dandandan

Looks like the integration test is also failing on main:

"Archery test With other arrows" Integration test failing on main: #8813

alamb · 2025-11-13T12:16:33Z

Thanks @rluvaton and @Dandandan

Change `BooleanBuffer::append_packed_range to use bitwise_binary_op

3727c20

alamb added the performance label Nov 9, 2025

alamb changed the title ~~Change `BooleanBuffer::append_packed_range to use bitwise_binary_op~~ Change BooleanBuffer::append_packed_range to use bitwise_binary_op Nov 9, 2025

This was referenced Nov 9, 2025

TESTING: Change `BooleanBuffer::append_packed_range to use bitwise_binary_op #8744

Closed

Improvements to BooleanBufferBuilder / BooleanBuilder #8561

Open

alamb marked this pull request as ready for review November 9, 2025 13:13

rluvaton approved these changes Nov 9, 2025


alamb changed the title ~~Change BooleanBuffer::append_packed_range to use bitwise_binary_op~~ Change BooleanBuffer::append_packed_range to use apply_bitwise_binary_op Nov 9, 2025

Dandandan approved these changes Nov 9, 2025


Merge branch 'main' into alamb/faster_append

165f783

github-actions bot added the arrow Changes to the arrow crate label Nov 11, 2025

alamb merged commit f8d9572 into apache:main Nov 13, 2025
26 checks passed

alamb deleted the alamb/faster_append branch November 13, 2025 12:44
