Improve StringArray(Utf8) sort performance (~2-4x faster) #7860
base: main
Conversation
cc @Dandandan @alamb I did some experiments for StringArray; we can also get some improvement here.
🤖: Benchmark completed
Thanks @zhuqi-lucas -- this looks quite cool. I have a few questions, but the results are quite impressive.
FYI @Dandandan
arrow-ord/src/sort.rs
Outdated
    _ => valids.len(),
};

// 3. comparator function for mixed byte views
what does "byte view" mean in this context?
This is a mistake, thank you @alamb.
arrow-ord/src/sort.rs
Outdated
let min_len = a_len.min(b_len);

// 3. compare the prefix of the first 4 bytes
let pref = min_len.min(4);
Could you add some comment about why checking the first 4 bytes specially is worthwhile? Is it an optimization for small strings?
Given that the loop below uses read_unaligned for 8-byte reads, maybe we should do the same thing for the 4-byte initial read?
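For context, the 4-byte prefix read under discussion can be sketched as follows. This is a minimal illustration with hypothetical names (`prefix4` is not the PR's function), using read_unaligned for the initial read as suggested:

```rust
// Hypothetical sketch: compute a big-endian u32 prefix of a byte slice with an
// unaligned read, left-padding with zeros when the slice is shorter than 4 bytes.
fn prefix4(slice: &[u8]) -> u32 {
    if slice.len() >= 4 {
        // Sound because we verified at least 4 readable bytes exist;
        // read_unaligned places no alignment requirement on the pointer.
        let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
        // Big-endian so integer order matches lexicographic byte order.
        u32::from_be(raw)
    } else {
        let mut v: u32 = 0;
        for &b in slice {
            v = (v << 8) | b as u32;
        }
        // Left-align the short prefix so it compares against full prefixes.
        v << (8 * (4 - slice.len()))
    }
}

fn main() {
    assert!(prefix4(b"apple") < prefix4(b"banana"));
    // Padding makes a short string tie with a zero-extended longer one,
    // which is why a length tiebreak is needed when either side is < 4 bytes.
    assert_eq!(prefix4(b"ab"), prefix4(b"ab\0\0"));
}
```

Because the prefix is built in big-endian order, a single integer comparison agrees with byte-wise lexicographic order for the first 4 bytes.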
Thank you @alamb, I added the comments in the latest PR, and the current implementation no longer uses read_unaligned for 8-byte reads.
arrow-ord/src/sort.rs
Outdated
    return pa.cmp(&pb);
}

// 3.2 Use 8 bytes to compare one by one if the prefix is equal
I am quite surprised the built-in implementation of cmp doesn't already have this optimization -- this code is basically a form of manual vectorization -- I would have expected that a_bytes and b_bytes would already do it.
What is the performance without this non-prefix fallback, @zhuqi-lucas?
Good point @alamb @Dandandan, let me remove this logic; it may make no difference for performance. I will test it, but I am confused that my local benchmark seems to differ from the benchmark shown here. I may find a Linux machine to reproduce it as well.
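The 8-byte chunked compare being debated can be sketched safely without raw pointers. This is an illustrative outline, not the PR's exact code; `chunked_cmp` is a hypothetical name:

```rust
use std::cmp::Ordering;

// Sketch of an 8-byte chunked lexicographic compare: walk both slices in u64
// steps, converting each chunk to big-endian so integer comparison matches
// byte order. from_be_bytes copies the bytes, so no alignment concerns and
// no unsafe are needed for this version.
fn chunked_cmp(a: &[u8], b: &[u8]) -> Ordering {
    let min_len = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= min_len {
        let ra = u64::from_be_bytes(a[i..i + 8].try_into().unwrap());
        let rb = u64::from_be_bytes(b[i..i + 8].try_into().unwrap());
        if ra != rb {
            return ra.cmp(&rb);
        }
        i += 8;
    }
    // The remaining tail compare also settles the length tiebreak.
    a[i..].cmp(&b[i..])
}

fn main() {
    assert_eq!(chunked_cmp(b"hello world!", b"hello world?"), Ordering::Less);
    assert_eq!(chunked_cmp(b"same", b"same"), Ordering::Equal);
}
```

Whether this beats the standard `<[u8]>::cmp` is exactly the open question in this thread: slice comparison already lowers to an optimized memcmp, so a manual chunk loop may add nothing.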
arrow-ord/src/sort.rs
Outdated
// 3.2 Use 8 bytes to compare one by one if the prefix is equal
let mut i = pref;
while i + 8 <= min_len {
    let raw_a = unsafe { std::ptr::read_unaligned(a_bytes.as_ptr().add(i) as *const u64) };
How do we know that a_bytes is aligned to an 8-byte boundary? Can't it be at any arbitrary offset in the data buffer?
Thank you @alamb for this good point. I will try to keep only the single prefix compare and test the result; the subsequent 8-byte loop should not improve things much, and its alignment story is unclear.
arrow/benches/sort_kernel.rs
Outdated
@@ -113,6 +113,16 @@ fn add_benchmark(c: &mut Criterion) {
        b.iter(|| bench_sort_to_indices(&arr, None))
    });

    let arr = create_string_array::<i32>(2usize.pow(12), 0.0);
if you add this benchmark as a separate PR it would be easier for me to automate running benchmarks
arrow-ord/src/sort.rs
Outdated
.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
    let slice: &[u8] = values.value(idx as usize).as_ref();
    (idx, slice, slice.len())
Can't we compute/cache the prefix here? I think that would save some conversion & memory access!
Thank you @Dandandan , very good point!
arrow-ord/src/sort.rs
Outdated
.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
    let slice: &[u8] = values.value(idx as usize).as_ref();
    (idx, slice, slice.len())
Why store slice.len() here? That shouldn't make a difference, as the slice already carries its length.
Good point, we don't need to store the len.
I think this idea is really interesting and clever! I think we should be able to push this a bit further by computing the prefix upfront once (and removing the length) to reduce accesses to the string data (so we can mostly sort "in place") and save some conversion cost per comparison.
This PR shows as slower in the benchmark. It seems to differ from my local result 🤔

> sort string[10] nulls to indices 2^12    1.37    185.9±0.25µs ? ?/sec    1.00    136.2±0.22µs ? ?/sec
> sort string[10] to indices 2^12          1.46    336.0±0.50µs ? ?/sec    1.00    229.8±0.49µs ? ?/sec
Thank you @Dandandan, great point, let me try to address this!
Thank you @alamb for the review. I will polish the PR to use one prefix first; we can optimize long-string cases later if we find a good way!
Thanks a lot @alamb @Dandandan, updated with the latest benchmark from my local Mac. Amazing result, 2.3x ~ 4.5x faster 😺

group                                       issue_7847                    main
-----                                       ----------                    ----
sort string[0-100] nulls to indices 2^12    1.00  23.6±0.15µs ? ?/sec     2.60  61.4±0.63µs ? ?/sec
sort string[0-100] to indices 2^12          1.00  34.5±0.42µs ? ?/sec     3.57  123.0±2.51µs ? ?/sec
sort string[0-10] nulls to indices 2^12     1.00  26.0±2.89µs ? ?/sec     2.92  76.1±2.45µs ? ?/sec
sort string[0-10] to indices 2^12           1.00  37.5±0.41µs ? ?/sec     4.43  166.1±4.20µs ? ?/sec
sort string[0-400] nulls to indices 2^12    1.00  24.5±0.20µs ? ?/sec     2.49  60.9±0.56µs ? ?/sec
sort string[0-400] to indices 2^12          1.00  36.2±0.33µs ? ?/sec     3.29  119.0±0.83µs ? ?/sec
sort string[1000] nulls to indices 2^12     1.00  24.7±0.18µs ? ?/sec     2.44  60.1±0.44µs ? ?/sec
sort string[1000] to indices 2^12           1.00  35.0±0.25µs ? ?/sec     3.24  113.3±0.75µs ? ?/sec
sort string[100] nulls to indices 2^12      1.00  24.2±0.16µs ? ?/sec     2.45  59.2±0.54µs ? ?/sec
sort string[100] to indices 2^12            1.00  35.2±0.26µs ? ?/sec     3.22  113.3±0.98µs ? ?/sec
sort string[10] nulls to indices 2^12       1.00  24.2±0.20µs ? ?/sec     2.36  57.0±0.44µs ? ?/sec
sort string[10] to indices 2^12             1.00  33.5±0.21µs ? ?/sec     3.12  104.5±0.82µs ? ?/sec

But it may differ from the Linux one; after #7867 is merged, we can trigger the benchmark from the GitHub script, thanks!
arrow-ord/src/sort.rs
Outdated
    v << (8 * (4 - len))
};
// Pack into a single u64: (prefix << 32) | length
let prefix64 = ((prefix_u32 as u64) << 32) | (len as u64);
I think a single string could theoretically be larger than u32::MAX; in that case the bitwise-or result is wrong, I believe. What about using 64 bits for each, i.e. a u64 for the prefix (first 8 bytes) and a u64 for the length?
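The overflow concern can be demonstrated concretely. A small sketch (illustrative `pack` helper, not the PR's code) showing how a length above u32::MAX bleeds into the prefix bits of the packed key:

```rust
// Pack a u32 prefix and a length into one u64, as in the code under review.
// If len exceeds u32::MAX, its high bits overlap the prefix bits and the
// resulting sort key is wrong.
fn pack(prefix: u32, len: u64) -> u64 {
    ((prefix as u64) << 32) | len
}

fn main() {
    let a = pack(1, 0);          // prefix 1, empty string
    let b = pack(0, 1u64 << 32); // prefix 0, string of length 2^32
    // The two keys collide even though the prefixes differ:
    assert_eq!(a, b);
}
```

Masking the length to 32 bits, or keeping prefix and length as separate fields (the route the PR eventually took), avoids the collision.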
Thank you @Dandandan for this good suggestion!
Interesting, I found that changing to an 8-byte prefix with a u64 length is about 50% slower than the 4-byte prefix for most cases; it seems the 4-byte prefix fits the L1 cache better and has less compare overhead.
Thank you @Dandandan:
In my latest PR, I've added support for a u64 length field while still using a 4-byte prefix, and benchmarks show no performance regression. Instead of packing the prefix and length into one large key, we compare them separately:
- Compare the 4-byte prefix.
- If that's equal (and only then), compare the u64 lengths (only needed when either side is shorter than 4 bytes, which is when padding can occur).
- Finally, fall back to a full byte-by-byte compare.
This keeps the hot path on the cheap 4-byte prefix compare, and the additional length check is needed so infrequently that it doesn't hurt throughput.
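The three steps above can be sketched as follows. This is a hedged outline rather than the PR's exact code; `prefix4` is an assumed helper that zero-pads slices shorter than 4 bytes:

```rust
use std::cmp::Ordering;

// Assumed helper: big-endian 4-byte prefix, zero-padded on the right.
fn prefix4(s: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    let n = s.len().min(4);
    buf[..n].copy_from_slice(&s[..n]);
    u32::from_be_bytes(buf)
}

// Sketch of the three-step comparator described above.
fn cmp_bytes(a: &[u8], b: &[u8]) -> Ordering {
    // 1. The cheap 4-byte prefix compare settles the vast majority of pairs.
    match prefix4(a).cmp(&prefix4(b)) {
        Ordering::Equal => {}
        non_eq => return non_eq,
    }
    // 2. Padded prefixes can tie spuriously only when a side is < 4 bytes;
    //    equal prefixes then mean the shorter slice is a zero-extended
    //    prefix of the longer one, so comparing lengths is enough.
    if a.len() < 4 || b.len() < 4 {
        return a.len().cmp(&b.len());
    }
    // 3. Otherwise fall back to a full byte-by-byte compare.
    a.cmp(b)
}

fn main() {
    assert_eq!(cmp_bytes(b"ab", b"abc"), Ordering::Less);
    assert_eq!(cmp_bytes(b"apple", b"apricot"), Ordering::Less);
    assert_eq!(cmp_bytes(b"abcd", b"abcdx"), Ordering::Less);
}
```

Step 2 is the subtle part: when both sides have at least 4 bytes, equal prefixes carry no length information, so the comparator must go straight to the full compare rather than trusting lengths.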
Which issue does this PR close?
#7860 (comment). Add rich testing cases for sort string(utf8). cc @alamb @Dandandan Preparation for experiment: #7860
Rationale for this change
Add rich testing cases for sort string(utf8)
What changes are included in this PR?
Add rich testing cases for sort string(utf8)
Are these changes tested?
Yes
Are there any user-facing changes?
No
    }
    v << (8 * (4 - slice.len()))
};
(idx, prefix, len)
I think len could instead store a bool ("is small", i.e. len < 4), as the actual length is not otherwise used? This would also avoid doing the la < 4 check in the sort, so it might be slightly faster.
Alternatively or additionally, we could store the &[u8] instead of the index so it doesn't have to retrieve it via values.value again in the sort.
Thank you @Dandandan for this idea. I tried it now, but it shows a 30% performance decrease:
diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 093c52d867..29800663a0 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -301,14 +301,15 @@ fn sort_bytes<T: ByteArrayType>(
// comparing up to 4 bytes as a single u32 we avoid the overhead
// of a full lexicographical compare for the vast majority of cases.
- // 1. Build a vector of (index, prefix, length) tuples
- let mut valids: Vec<(u32, u32, u64)> = value_indices
+ // 1. Build a vector of (idx, prefix, is_small, slice) tuples
+ let mut valids: Vec<(u32, u32, bool, &[u8])> = value_indices
.into_iter()
.map(|idx| {
let slice: &[u8] = values.value(idx as usize).as_ref();
- let len = slice.len() as u64;
- // Compute the 4‑byte prefix in BE order, or left‑pad if shorter
- let prefix = if slice.len() >= 4 {
+ // store the bool for whether the slice is smaller than 4 bytes
+ let is_small = slice.len() < 4;
+ // prefix: if the slice is smaller than 4 bytes, left-pad it with zeros,
+ let prefix = if !is_small {
let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
u32::from_be(raw)
} else {
@@ -318,7 +319,7 @@ fn sort_bytes<T: ByteArrayType>(
}
v << (8 * (4 - slice.len()))
};
- (idx, prefix, len)
+ (idx, prefix, is_small, slice)
})
.collect();
@@ -328,27 +329,24 @@ fn sort_bytes<T: ByteArrayType>(
_ => valids.len(),
};
- // 3. Comparator: compare prefix, then (when both slices shorter than 4) length, otherwise full slice
- let cmp_bytes = |a: &(u32, u32, u64), b: &(u32, u32, u64)| {
- let (ia, pa, la) = *a;
- let (ib, pb, lb) = *b;
- // 3.1 prefix (first 4 bytes)
- let ord = pa.cmp(&pb);
- if ord != Ordering::Equal {
- return ord;
+ // 3. Comparator: compare prefix first, then for both “small” slices compare length, and finally full lexicographical compare
+ let cmp_bytes = |a: &(u32, u32, bool, &[u8]), b: &(u32, u32, bool, &[u8])| {
+ let (_ia, pa, sa_small, sa) = a;
+ let (_ib, pb, sb_small, sb) = b;
+ // 3.1 Compare the 4‑byte prefix
+ match pa.cmp(&pb) {
+ Ordering::Equal => (),
+ non_eq => return non_eq,
}
- // 3.2 only if both slices had length < 4 (so prefix was padded)
- // length compare only when prefix was padded (i.e. original length < 4)
- if la < 4 || lb < 4 {
- let ord = la.cmp(&lb);
- if ord != Ordering::Equal {
- return ord;
+ // 3.2 If both slices were shorter than 4 bytes, compare their actual lengths
+ if *sa_small && *sb_small {
+ match sa.len().cmp(&sb.len()) {
+ Ordering::Equal => (),
+ non_eq => return non_eq,
}
}
- // 3.3 full lexicographical compare
- let a_bytes: &[u8] = values.value(ia as usize).as_ref();
- let b_bytes: &[u8] = values.value(ib as usize).as_ref();
- a_bytes.cmp(b_bytes)
+ // 3.3 Otherwise, do a full byte‑wise lexicographical comparison
+ sa.cmp(sb)
};
// 4. Partially sort according to ascending/descending
@@ -366,9 +364,9 @@ fn sort_bytes<T: ByteArrayType>(
if options.nulls_first {
out.extend_from_slice(&nulls[..nulls.len().min(out_limit)]);
let rem = out_limit - out.len();
- out.extend(valids.iter().map(|&(i, _, _)| i).take(rem));
+ out.extend(valids.iter().map(|&(i, _, _, _)| i).take(rem));
} else {
- out.extend(valids.iter().map(|&(i, _, _)| i).take(out_limit));
+ out.extend(valids.iter().map(|&(i, _, _, _)| i).take(out_limit));
let rem = out_limit - out.len();
out.extend_from_slice(&nulls[..rem]);
}
Which issue does this PR close?
Improve StringArray(Utf8) sort performance
Rationale for this change
Support prefix compare; I optimized it to a u32 prefix with incremental u64 compare, which gave the best performance in my experiments.
What changes are included in this PR?
Support prefix compare, optimized to a u32 prefix with incremental u64 compare.
Are these changes tested?
Yes
Are there any user-facing changes?
No