Skip to content

Improve StringArray(Utf8) sort performance (~2-4x faster) #7860

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

zhuqi-lucas
Copy link
Contributor

Which issue does this PR close?

Improve StringArray(Utf8) sort performance

Rationale for this change

Support prefix compare, and i optimized it to u32 prefix, and u64 increment compare, it will have best performance when experimenting.

What changes are included in this PR?

Support prefix compare, and i optimized it to u32 prefix, and u64 increment compare, it will have best performance when experimenting.

Are these changes tested?

Yes

critcmp  issue_7847  main --filter "sort string\["
group                                         issue_7847                             main
-----                                         ----------                             ----
sort string[0-400] nulls to indices 2^12      1.00     51.4±0.56µs        ? ?/sec    1.19     61.0±1.02µs        ? ?/sec
sort string[0-400] to indices 2^12            1.00     96.5±1.63µs        ? ?/sec    1.23    118.3±0.91µs        ? ?/sec
sort string[10] dict nulls to indices 2^12    1.00     72.4±1.00µs        ? ?/sec    1.00     72.5±0.61µs        ? ?/sec
sort string[10] dict to indices 2^12          1.00    137.1±1.51µs        ? ?/sec    1.01    138.1±1.06µs        ? ?/sec
sort string[10] nulls to indices 2^12         1.00     47.5±0.69µs        ? ?/sec    1.18     56.3±0.56µs        ? ?/sec
sort string[10] to indices 2^12               1.00     86.4±1.37µs        ? ?/sec    1.20    103.5±1.13µs        ? ?/sec

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 3, 2025
@zhuqi-lucas
Copy link
Contributor Author

cc @Dandandan @alamb

Did some experiment for StringArray, we also can get some improvement.

@alamb
Copy link
Contributor

alamb commented Jul 3, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing issue_7847 (0c665fb) to 52ad7d7 diff
BENCH_NAME=sort_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench sort_kernel
BENCH_FILTER=
BENCH_BRANCH_NAME=issue_7847
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Jul 3, 2025

🤖: Benchmark completed

Details

group                                                   issue_7847                             main
-----                                                   ----------                             ----
lexsort (bool, bool) 2^12                               1.00    116.4±0.34µs        ? ?/sec    1.00    116.5±0.38µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.05    164.4±0.32µs        ? ?/sec    1.00    156.2±0.48µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.1±0.14µs        ? ?/sec    1.00     45.2±0.10µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    211.1±2.55µs        ? ?/sec    1.00    212.0±0.40µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.0±0.06µs        ? ?/sec    1.02     38.5±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.2±0.17µs        ? ?/sec    1.00     40.8±0.20µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.1±0.13µs        ? ?/sec    1.01     79.1±0.53µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    211.5±0.49µs        ? ?/sec    1.01    212.8±1.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.03     54.8±0.14µs        ? ?/sec    1.00     53.4±0.12µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    256.1±0.59µs        ? ?/sec    1.00    251.9±0.38µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.04     88.7±0.16µs        ? ?/sec    1.00     85.0±0.41µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.05     89.8±0.18µs        ? ?/sec    1.00     85.9±0.54µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.06    101.5±0.57µs        ? ?/sec    1.00     95.8±0.62µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    256.1±0.46µs        ? ?/sec    1.00    252.0±0.45µs        ? ?/sec
rank f32 2^12                                           1.02     69.4±0.16µs        ? ?/sec    1.00     68.2±0.12µs        ? ?/sec
rank f32 nulls 2^12                                     1.01     38.0±0.09µs        ? ?/sec    1.00     37.7±0.11µs        ? ?/sec
rank string[10] 2^12                                    1.00    232.8±0.56µs        ? ?/sec    1.00    233.3±0.56µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    116.7±0.22µs        ? ?/sec    1.00    117.0±0.19µs        ? ?/sec
sort f32 2^12                                           1.00     60.1±0.60µs        ? ?/sec    1.08     65.1±0.78µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     31.8±0.10µs        ? ?/sec    1.01     32.2±0.09µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     68.2±0.18µs        ? ?/sec    1.03     70.5±0.27µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.2±0.15µs        ? ?/sec    1.07     77.3±0.26µs        ? ?/sec
sort i32 2^10                                           1.00      7.5±0.02µs        ? ?/sec    1.14      8.6±0.03µs        ? ?/sec
sort i32 2^12                                           1.00     36.3±0.10µs        ? ?/sec    1.17     42.4±0.12µs        ? ?/sec
sort i32 nulls 2^10                                     1.07      5.8±0.01µs        ? ?/sec    1.00      5.4±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.08     24.7±0.07µs        ? ?/sec    1.00     22.9±0.07µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.01     12.5±0.03µs        ? ?/sec    1.00     12.4±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     53.1±0.11µs        ? ?/sec    1.00     53.0±0.09µs        ? ?/sec
sort i32 to indices 2^10                                1.01     11.6±0.02µs        ? ?/sec    1.00     11.5±0.04µs        ? ?/sec
sort i32 to indices 2^12                                1.00     55.1±0.25µs        ? ?/sec    1.00     55.1±0.25µs        ? ?/sec
sort primitive run 2^12                                 1.12      7.1±0.02µs        ? ?/sec    1.00      6.4±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.6±0.02µs        ? ?/sec    1.04      9.0±0.02µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    182.8±0.59µs        ? ?/sec  
sort string[0-400] to indices 2^12                      1.00    344.4±0.60µs        ? ?/sec  
sort string[10] dict nulls to indices 2^12              1.00    168.9±0.30µs        ? ?/sec    1.01    170.2±0.53µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    298.1±0.46µs        ? ?/sec    1.00    299.2±1.39µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.37    185.9±0.25µs        ? ?/sec    1.00    136.2±0.22µs        ? ?/sec
sort string[10] to indices 2^12                         1.46    336.0±0.50µs        ? ?/sec    1.00    229.8±0.49µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.02     83.5±0.18µs        ? ?/sec    1.00     82.3±0.20µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    122.6±0.26µs        ? ?/sec    1.00    122.1±0.24µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.02    110.8±0.34µs        ? ?/sec    1.00    109.1±0.25µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.01    174.9±0.27µs        ? ?/sec    1.00    173.1±0.45µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.02    105.4±0.66µs        ? ?/sec    1.00    103.7±0.75µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01    165.0±0.22µs        ? ?/sec    1.00    163.1±0.29µs        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhuqi-lucas -- this looks quite cool. I have a few questions but the results are quite impressive

FYI @Dandandan

_ => valids.len(),
};

// 3. comparator function for mixed byte views
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "byte view" mean in this context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mistake, thank you @alamb.

let min_len = a_len.min(b_len);

// 3. compare the prefix of the first 4 bytes
let pref = min_len.min(4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some comment about why checking the first 4 bytes specially is worthwhile? Is it an optimization for small strings?

Giveen the loop below uses read_unaligned for 8 byte pointers, maybe we should do the same thing for the 4 byte initial read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb , i added the comments in latest PR, and current implement will not have read_unaligned for 8 byte pointers.

return pa.cmp(&pb);
}

// 3.2 Use 8 bytes to compare one by one if the prefix is equal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite surprised the built in implementation of cmp doesn't already have this optimization -- this code is basically some sort of manual vectorization -- I woud have expected that a_bytes and b_bytes would already do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the performance without this non-prefix fallback @zhuqi-lucas ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @alamb @Dandandan , let me remove this logic, it may be no difference for performance, i will test it, but i am confused my local benchmark seems different from the benchmark shows here. I may find a linux machine to reproduce it also.

// 3.2 Use 8 bytes to compare one by one if the prefix is equal
let mut i = pref;
while i + 8 <= min_len {
let raw_a = unsafe { std::ptr::read_unaligned(a_bytes.as_ptr().add(i) as *const u64) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know that a_bytes is aligned to an 8 byte boundary? Can't be be any arbitrary offset in the data buffer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alamb for this good point, i will try to only keep one prefix compare and testing the result, the 8 byte boundary following should not improve too much and it's not clear for it.

@@ -113,6 +113,16 @@ fn add_benchmark(c: &mut Criterion) {
b.iter(|| bench_sort_to_indices(&arr, None))
});

let arr = create_string_array::<i32>(2usize.pow(12), 0.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you add this benchmark as a separate PR it would be easier for me to automate running benchmarks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @alamb , i agree, submit a PR to add rich sort string testing cases first:

#7867

.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
let slice: &[u8] = values.value(idx as usize).as_ref();
(idx, slice, slice.len())
Copy link
Contributor

@Dandandan Dandandan Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we compute/cache the prefix here? I think that would save some conversion & memory access!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan , very good point!

.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
let slice: &[u8] = values.value(idx as usize).as_ref();
(idx, slice, slice.len())
Copy link
Contributor

@Dandandan Dandandan Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why store slice.len() here? That shouldn't make a difference as the slice already has the len already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we don't need to store the len.

@Dandandan
Copy link
Contributor

Dandandan commented Jul 3, 2025

I think this idea is really interesting and clever! I think we should be able to push this even further a bit by having the prefix computed upfront once (and remove the length) to reduce accessing the string data (so we can mostly do sorting "in-place") and save some conversion cost per comparison.

@zhuqi-lucas
Copy link
Contributor Author

🤖: Benchmark completed

Details

group                                                   issue_7847                             main
-----                                                   ----------                             ----
lexsort (bool, bool) 2^12                               1.00    116.4±0.34µs        ? ?/sec    1.00    116.5±0.38µs        ? ?/sec
lexsort (bool, bool) nulls 2^12                         1.05    164.4±0.32µs        ? ?/sec    1.00    156.2±0.48µs        ? ?/sec
lexsort (f32, f32) 2^10                                 1.00     45.1±0.14µs        ? ?/sec    1.00     45.2±0.10µs        ? ?/sec
lexsort (f32, f32) 2^12                                 1.00    211.1±2.55µs        ? ?/sec    1.00    212.0±0.40µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 10                        1.00     38.0±0.06µs        ? ?/sec    1.02     38.5±0.07µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 100                       1.01     41.2±0.17µs        ? ?/sec    1.00     40.8±0.20µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 1000                      1.00     78.1±0.13µs        ? ?/sec    1.01     79.1±0.53µs        ? ?/sec
lexsort (f32, f32) 2^12 limit 2^12                      1.00    211.5±0.49µs        ? ?/sec    1.01    212.8±1.18µs        ? ?/sec
lexsort (f32, f32) nulls 2^10                           1.03     54.8±0.14µs        ? ?/sec    1.00     53.4±0.12µs        ? ?/sec
lexsort (f32, f32) nulls 2^12                           1.02    256.1±0.59µs        ? ?/sec    1.00    251.9±0.38µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 10                  1.04     88.7±0.16µs        ? ?/sec    1.00     85.0±0.41µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 100                 1.05     89.8±0.18µs        ? ?/sec    1.00     85.9±0.54µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 1000                1.06    101.5±0.57µs        ? ?/sec    1.00     95.8±0.62µs        ? ?/sec
lexsort (f32, f32) nulls 2^12 limit 2^12                1.02    256.1±0.46µs        ? ?/sec    1.00    252.0±0.45µs        ? ?/sec
rank f32 2^12                                           1.02     69.4±0.16µs        ? ?/sec    1.00     68.2±0.12µs        ? ?/sec
rank f32 nulls 2^12                                     1.01     38.0±0.09µs        ? ?/sec    1.00     37.7±0.11µs        ? ?/sec
rank string[10] 2^12                                    1.00    232.8±0.56µs        ? ?/sec    1.00    233.3±0.56µs        ? ?/sec
rank string[10] nulls 2^12                              1.00    116.7±0.22µs        ? ?/sec    1.00    117.0±0.19µs        ? ?/sec
sort f32 2^12                                           1.00     60.1±0.60µs        ? ?/sec    1.08     65.1±0.78µs        ? ?/sec
sort f32 nulls 2^12                                     1.00     31.8±0.10µs        ? ?/sec    1.01     32.2±0.09µs        ? ?/sec
sort f32 nulls to indices 2^12                          1.00     68.2±0.18µs        ? ?/sec    1.03     70.5±0.27µs        ? ?/sec
sort f32 to indices 2^12                                1.00     72.2±0.15µs        ? ?/sec    1.07     77.3±0.26µs        ? ?/sec
sort i32 2^10                                           1.00      7.5±0.02µs        ? ?/sec    1.14      8.6±0.03µs        ? ?/sec
sort i32 2^12                                           1.00     36.3±0.10µs        ? ?/sec    1.17     42.4±0.12µs        ? ?/sec
sort i32 nulls 2^10                                     1.07      5.8±0.01µs        ? ?/sec    1.00      5.4±0.01µs        ? ?/sec
sort i32 nulls 2^12                                     1.08     24.7±0.07µs        ? ?/sec    1.00     22.9±0.07µs        ? ?/sec
sort i32 nulls to indices 2^10                          1.01     12.5±0.03µs        ? ?/sec    1.00     12.4±0.04µs        ? ?/sec
sort i32 nulls to indices 2^12                          1.00     53.1±0.11µs        ? ?/sec    1.00     53.0±0.09µs        ? ?/sec
sort i32 to indices 2^10                                1.01     11.6±0.02µs        ? ?/sec    1.00     11.5±0.04µs        ? ?/sec
sort i32 to indices 2^12                                1.00     55.1±0.25µs        ? ?/sec    1.00     55.1±0.25µs        ? ?/sec
sort primitive run 2^12                                 1.12      7.1±0.02µs        ? ?/sec    1.00      6.4±0.01µs        ? ?/sec
sort primitive run to indices 2^12                      1.00      8.6±0.02µs        ? ?/sec    1.04      9.0±0.02µs        ? ?/sec
sort string[0-400] nulls to indices 2^12                1.00    182.8±0.59µs        ? ?/sec  
sort string[0-400] to indices 2^12                      1.00    344.4±0.60µs        ? ?/sec  
sort string[10] dict nulls to indices 2^12              1.00    168.9±0.30µs        ? ?/sec    1.01    170.2±0.53µs        ? ?/sec
sort string[10] dict to indices 2^12                    1.00    298.1±0.46µs        ? ?/sec    1.00    299.2±1.39µs        ? ?/sec
sort string[10] nulls to indices 2^12                   1.37    185.9±0.25µs        ? ?/sec    1.00    136.2±0.22µs        ? ?/sec
sort string[10] to indices 2^12                         1.46    336.0±0.50µs        ? ?/sec    1.00    229.8±0.49µs        ? ?/sec
sort string_view[0-400] nulls to indices 2^12           1.02     83.5±0.18µs        ? ?/sec    1.00     82.3±0.20µs        ? ?/sec
sort string_view[0-400] to indices 2^12                 1.00    122.6±0.26µs        ? ?/sec    1.00    122.1±0.24µs        ? ?/sec
sort string_view[10] nulls to indices 2^12              1.02    110.8±0.34µs        ? ?/sec    1.00    109.1±0.25µs        ? ?/sec
sort string_view[10] to indices 2^12                    1.01    174.9±0.27µs        ? ?/sec    1.00    173.1±0.45µs        ? ?/sec
sort string_view_inlined[0-12] nulls to indices 2^12    1.02    105.4±0.66µs        ? ?/sec    1.00    103.7±0.75µs        ? ?/sec
sort string_view_inlined[0-12] to indices 2^12          1.01    165.0±0.22µs        ? ?/sec    1.00    163.1±0.29µs        ? ?/sec

This PR shows slow for the benchmark. It seems different result from my local 🤔

> sort string[10] nulls to indices 2^12                   1.37    185.9±0.25µs        ? ?/sec    1.00    136.2±0.22µs        ? ?/sec
> sort string[10] to indices 2^12                         1.46    336.0±0.50µs        ? ?/sec    1.00    229.8±0.49µs        ? ?/sec

@zhuqi-lucas
Copy link
Contributor Author

I think this idea is really interesting and clever! I think we should be able to push this even further a bit by having the prefix computed upfront once (and remove the length) to reduce accessing the string data (so we can mostly do sorting "in-place") and save some conversion cost.

Thank you @Dandandan , great point, let me try to address this!

@zhuqi-lucas
Copy link
Contributor Author

Thanks @zhuqi-lucas -- this looks quite cool. I have a few questions but the results are quite impressive

FYI @Dandandan

Thank you @alamb for review, i will polish the PR to use one prefix first, we can optimize long string cases later if we find a good way!

@zhuqi-lucas
Copy link
Contributor Author

Thanks a lot @alamb @Dandandan , updated the latest benchmark from my local mac:

Amazing result, 2.3 faster ~ 4.5 faster 😺

group                                         issue_7847                             main
-----                                         ----------                             ----
sort string[0-100] nulls to indices 2^12      1.00     23.6±0.15µs        ? ?/sec    2.60     61.4±0.63µs        ? ?/sec
sort string[0-100] to indices 2^12            1.00     34.5±0.42µs        ? ?/sec    3.57    123.0±2.51µs        ? ?/sec
sort string[0-10] nulls to indices 2^12       1.00     26.0±2.89µs        ? ?/sec    2.92     76.1±2.45µs        ? ?/sec
sort string[0-10] to indices 2^12             1.00     37.5±0.41µs        ? ?/sec    4.43    166.1±4.20µs        ? ?/sec
sort string[0-400] nulls to indices 2^12      1.00     24.5±0.20µs        ? ?/sec    2.49     60.9±0.56µs        ? ?/sec
sort string[0-400] to indices 2^12            1.00     36.2±0.33µs        ? ?/sec    3.29    119.0±0.83µs        ? ?/sec
sort string[1000] nulls to indices 2^12       1.00     24.7±0.18µs        ? ?/sec    2.44     60.1±0.44µs        ? ?/sec
sort string[1000] to indices 2^12             1.00     35.0±0.25µs        ? ?/sec    3.24    113.3±0.75µs        ? ?/sec
sort string[100] nulls to indices 2^12        1.00     24.2±0.16µs        ? ?/sec    2.45     59.2±0.54µs        ? ?/sec
sort string[100] to indices 2^12              1.00     35.2±0.26µs        ? ?/sec    3.22    113.3±0.98µs        ? ?/sec
sort string[10] nulls to indices 2^12         1.00     24.2±0.20µs        ? ?/sec    2.36     57.0±0.44µs        ? ?/sec
sort string[10] to indices 2^12               1.00     33.5±0.21µs        ? ?/sec    3.12    104.5±0.82µs        ? ?/sec

But it may differs from the linux one, after #7867 merged, we can trigger the benchmark from github script, thanks!

@Dandandan Dandandan changed the title Improve StringArray(Utf8) sort performance (1.2x faster) Improve StringArray(Utf8) sort performance (~2-4x faster) Jul 4, 2025
v << (8 * (4 - len))
};
// Pack into a single u64: (prefix << 32) | length
let prefix64 = ((prefix_u32 as u64) << 32) | (len as u64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a single string could theoretically be larger than u32::MAX, in this case bitwise or result is wrong I believe.

What about using 64 bits so it can use u64 (first 8 bytes) for prefix and u64 for length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan for this good suggestion!

Copy link
Contributor Author

@zhuqi-lucas zhuqi-lucas Jul 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, i found change to 8 bytes prefix and u64 for length, it will cause about 50% slow than 4 bytes prefix for most of the cases, it seems due to the 4 bytes prefix better L1 cache matching and less compare overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan :

In my latest PR, I’ve added support for a u64 length field while still using a 4‑byte prefix, and benchmarks show no performance regression. Instead of packing prefix and length into one large key, we compare them separately:

  1. Compare the 4‑byte prefix

  2. If that’s equal (and only then), compare the u64 length(only any side is < 4 bytes which is possible to have padding)

  3. Finally, fall back to a full byte‑by‑byte compare

This keeps the hot path on the cheap 4‑byte prefix compare, and the additional length check is so infrequently needed that it doesn’t hurt throughput.

Dandandan pushed a commit that referenced this pull request Jul 4, 2025
# Which issue does this PR close?

#7860 (comment)

Add rich testing cases for sort string(utf8)  cc @alamb @Dandandan 

Preparation for experiment:

#7860

# Rationale for this change

Add rich testing cases for sort string(utf8)

# What changes are included in this PR?

Add rich testing cases for sort string(utf8)
# Are these changes tested?

Yes

# Are there any user-facing changes?

No
}
v << (8 * (4 - slice.len()))
};
(idx, prefix, len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think len it can store len < bool as well ("is small") as the actual length is not used? This will also avoid doing this la < 4 check in the sort, so might be slightly faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively or additionaly we could store the &[u8] instead of the index so it doesn't have to retrieve it via values.value again in the sort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Dandandan for this idea, i tried now, but it show 30% performance decrease:

diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 093c52d867..29800663a0 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -301,14 +301,15 @@ fn sort_bytes<T: ByteArrayType>(
     // comparing up to 4 bytes as a single u32 we avoid the overhead
     // of a full lexicographical compare for the vast majority of cases.
 
-    // 1. Build a vector of (index, prefix, length) tuples
-    let mut valids: Vec<(u32, u32, u64)> = value_indices
+    // 1. Build a vector of (idx, prefix, is_small, slice) tuples
+    let mut valids: Vec<(u32, u32, bool, &[u8])> = value_indices
         .into_iter()
         .map(|idx| {
             let slice: &[u8] = values.value(idx as usize).as_ref();
-            let len = slice.len() as u64;
-            // Compute the 4‑byte prefix in BE order, or left‑pad if shorter
-            let prefix = if slice.len() >= 4 {
+            // store the bool for whether the slice is smaller than 4 bytes
+            let is_small = slice.len() < 4;
+            // prefix: if the slice is smaller than 4 bytes, left-pad it with zeros,
+            let prefix = if !is_small {
                 let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
                 u32::from_be(raw)
             } else {
@@ -318,7 +319,7 @@ fn sort_bytes<T: ByteArrayType>(
                 }
                 v << (8 * (4 - slice.len()))
             };
-            (idx, prefix, len)
+            (idx, prefix, is_small, slice)
         })
         .collect();
 
@@ -328,27 +329,24 @@ fn sort_bytes<T: ByteArrayType>(
         _ => valids.len(),
     };
 
-    // 3. Comparator: compare prefix, then (when both slices shorter than 4) length, otherwise full slice
-    let cmp_bytes = |a: &(u32, u32, u64), b: &(u32, u32, u64)| {
-        let (ia, pa, la) = *a;
-        let (ib, pb, lb) = *b;
-        // 3.1 prefix (first 4 bytes)
-        let ord = pa.cmp(&pb);
-        if ord != Ordering::Equal {
-            return ord;
+    // 3. Comparator: compare prefix first, then for both “small” slices compare length, and finally full lexicographical compare
+    let cmp_bytes = |a: &(u32, u32, bool, &[u8]), b: &(u32, u32, bool, &[u8])| {
+        let (_ia, pa, sa_small, sa) = a;
+        let (_ib, pb, sb_small, sb) = b;
+        // 3.1 Compare the 4‑byte prefix
+        match pa.cmp(&pb) {
+            Ordering::Equal => (),
+            non_eq => return non_eq,
         }
-        // 3.2 only if both slices had length < 4 (so prefix was padded)
-        // length compare only when prefix was padded (i.e. original length < 4)
-        if la < 4 || lb < 4 {
-            let ord = la.cmp(&lb);
-            if ord != Ordering::Equal {
-                return ord;
+        // 3.2 If both slices were shorter than 4 bytes, compare their actual lengths
+        if *sa_small && *sb_small {
+            match sa.len().cmp(&sb.len()) {
+                Ordering::Equal => (),
+                non_eq => return non_eq,
             }
         }
-        // 3.3 full lexicographical compare
-        let a_bytes: &[u8] = values.value(ia as usize).as_ref();
-        let b_bytes: &[u8] = values.value(ib as usize).as_ref();
-        a_bytes.cmp(b_bytes)
+        // 3.3 Otherwise, do a full byte‑wise lexicographical comparison
+        sa.cmp(sb)
     };
 
     // 4. Partially sort according to ascending/descending
@@ -366,9 +364,9 @@ fn sort_bytes<T: ByteArrayType>(
     if options.nulls_first {
         out.extend_from_slice(&nulls[..nulls.len().min(out_limit)]);
         let rem = out_limit - out.len();
-        out.extend(valids.iter().map(|&(i, _, _)| i).take(rem));
+        out.extend(valids.iter().map(|&(i, _, _, _)| i).take(rem));
     } else {
-        out.extend(valids.iter().map(|&(i, _, _)| i).take(out_limit));
+        out.extend(valids.iter().map(|&(i, _, _, _)| i).take(out_limit));
         let rem = out_limit - out.len();
         out.extend_from_slice(&nulls[..rem]);
     }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve StringArray(Utf8) sort performance
3 participants