Improve StringArray(Utf8) sort performance (~2-4x faster) #7860
base: main
Conversation
cc @Dandandan @alamb I did some experiments for StringArray; we can also get some improvement here.
🤖: Benchmark completed
Thanks @zhuqi-lucas -- this looks quite cool. I have a few questions, but the results are quite impressive.
FYI @Dandandan
arrow-ord/src/sort.rs
Outdated
    _ => valids.len(),
};

// 3. comparator function for mixed byte views
what does "byte view" mean in this context?
This is a mistake, thank you @alamb.
arrow-ord/src/sort.rs
Outdated
let min_len = a_len.min(b_len);

// 3. compare the prefix of the first 4 bytes
let pref = min_len.min(4);
Could you add some comment about why checking the first 4 bytes specially is worthwhile? Is it an optimization for small strings?
Given that the loop below uses read_unaligned for 8-byte reads, maybe we should do the same thing for the 4-byte initial read?
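For context, the 4-byte prefix read under discussion can be sketched as follows. This is a minimal illustration with hypothetical names (`prefix4` is not the PR's function), using read_unaligned for the initial read as suggested:

```rust
// Hypothetical sketch: compute a big-endian u32 prefix of a byte slice with an
// unaligned read, left-padding with zeros when the slice is shorter than 4 bytes.
fn prefix4(slice: &[u8]) -> u32 {
    if slice.len() >= 4 {
        // Sound because we verified at least 4 readable bytes exist;
        // read_unaligned places no alignment requirement on the pointer.
        let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
        // Big-endian so integer order matches lexicographic byte order.
        u32::from_be(raw)
    } else {
        let mut v: u32 = 0;
        for &b in slice {
            v = (v << 8) | b as u32;
        }
        // Left-align the short prefix so it compares against full prefixes.
        v << (8 * (4 - slice.len()))
    }
}

fn main() {
    assert!(prefix4(b"apple") < prefix4(b"banana"));
    // Padding makes a short string tie with a zero-extended longer one,
    // which is why a length tiebreak is needed when either side is < 4 bytes.
    assert_eq!(prefix4(b"ab"), prefix4(b"ab\0\0"));
}
```

Because the prefix is built in big-endian order, a single integer comparison agrees with byte-wise lexicographic order for the first 4 bytes.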
Thank you @alamb, I added the comments in the latest PR, and the current implementation no longer uses read_unaligned for 8-byte reads.
arrow-ord/src/sort.rs
Outdated
    return pa.cmp(&pb);
}

// 3.2 Use 8 bytes to compare one by one if the prefix is equal
I am quite surprised the built-in implementation of cmp doesn't already have this optimization -- this code is basically a form of manual vectorization -- I would have expected that a_bytes and b_bytes would already do it.
What is the performance without this non-prefix fallback, @zhuqi-lucas?
Good point @alamb @Dandandan, let me remove this logic; it may make no difference for performance. I will test it, but I am confused that my local benchmark seems to differ from the benchmark shown here. I may find a Linux machine to reproduce it as well.
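The 8-byte chunked compare being debated can be sketched safely without raw pointers. This is an illustrative outline, not the PR's exact code; `chunked_cmp` is a hypothetical name:

```rust
use std::cmp::Ordering;

// Sketch of an 8-byte chunked lexicographic compare: walk both slices in u64
// steps, converting each chunk to big-endian so integer comparison matches
// byte order. from_be_bytes copies the bytes, so no alignment concerns and
// no unsafe are needed for this version.
fn chunked_cmp(a: &[u8], b: &[u8]) -> Ordering {
    let min_len = a.len().min(b.len());
    let mut i = 0;
    while i + 8 <= min_len {
        let ra = u64::from_be_bytes(a[i..i + 8].try_into().unwrap());
        let rb = u64::from_be_bytes(b[i..i + 8].try_into().unwrap());
        if ra != rb {
            return ra.cmp(&rb);
        }
        i += 8;
    }
    // The remaining tail compare also settles the length tiebreak.
    a[i..].cmp(&b[i..])
}

fn main() {
    assert_eq!(chunked_cmp(b"hello world!", b"hello world?"), Ordering::Less);
    assert_eq!(chunked_cmp(b"same", b"same"), Ordering::Equal);
}
```

Whether this beats the standard `<[u8]>::cmp` is exactly the open question in this thread: slice comparison already lowers to an optimized memcmp, so a manual chunk loop may add nothing.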
arrow-ord/src/sort.rs
Outdated
// 3.2 Use 8 bytes to compare one by one if the prefix is equal
let mut i = pref;
while i + 8 <= min_len {
    let raw_a = unsafe { std::ptr::read_unaligned(a_bytes.as_ptr().add(i) as *const u64) };
How do we know that a_bytes is aligned to an 8-byte boundary? Can't it be at any arbitrary offset in the data buffer?
Thank you @alamb for this good point. I will try to keep only the single prefix compare and test the result; the subsequent 8-byte loop should not improve things much, and its alignment story is unclear.
arrow/benches/sort_kernel.rs
Outdated
@@ -113,6 +113,16 @@ fn add_benchmark(c: &mut Criterion) {
        b.iter(|| bench_sort_to_indices(&arr, None))
    });

    let arr = create_string_array::<i32>(2usize.pow(12), 0.0);
if you add this benchmark as a separate PR it would be easier for me to automate running benchmarks
arrow-ord/src/sort.rs
Outdated
.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
    let slice: &[u8] = values.value(idx as usize).as_ref();
    (idx, slice, slice.len())
Can't we compute/cache the prefix here? I think that would save some conversion & memory access!
Thank you @Dandandan , very good point!
arrow-ord/src/sort.rs
Outdated
.collect::<Vec<(u32, &[u8])>>();
.map(|idx| {
    let slice: &[u8] = values.value(idx as usize).as_ref();
    (idx, slice, slice.len())
Why store slice.len() here? That shouldn't make a difference, as the slice already carries its length.
Good point, we don't need to store the len.
I think this idea is really interesting and clever! I think we should be able to push this a bit further by computing the prefix upfront once (and removing the length) to reduce accesses to the string data (so we can mostly sort "in place") and save some conversion cost per comparison.
This PR shows as slower in the benchmark. It seems to differ from my local result 🤔

> sort string[10] nulls to indices 2^12    1.37    185.9±0.25µs ? ?/sec    1.00    136.2±0.22µs ? ?/sec
> sort string[10] to indices 2^12          1.46    336.0±0.50µs ? ?/sec    1.00    229.8±0.49µs ? ?/sec
Thank you @Dandandan, great point, let me try to address this!
Thank you @alamb for the review. I will polish the PR to use one prefix first; we can optimize long-string cases later if we find a good way!
Thanks a lot @alamb @Dandandan, updated with the latest benchmark from my local Mac. Amazing result, 2.3x ~ 4.5x faster 😺

group                                       issue_7847                    main
-----                                       ----------                    ----
sort string[0-100] nulls to indices 2^12    1.00  23.6±0.15µs ? ?/sec     2.60  61.4±0.63µs ? ?/sec
sort string[0-100] to indices 2^12          1.00  34.5±0.42µs ? ?/sec     3.57  123.0±2.51µs ? ?/sec
sort string[0-10] nulls to indices 2^12     1.00  26.0±2.89µs ? ?/sec     2.92  76.1±2.45µs ? ?/sec
sort string[0-10] to indices 2^12           1.00  37.5±0.41µs ? ?/sec     4.43  166.1±4.20µs ? ?/sec
sort string[0-400] nulls to indices 2^12    1.00  24.5±0.20µs ? ?/sec     2.49  60.9±0.56µs ? ?/sec
sort string[0-400] to indices 2^12          1.00  36.2±0.33µs ? ?/sec     3.29  119.0±0.83µs ? ?/sec
sort string[1000] nulls to indices 2^12     1.00  24.7±0.18µs ? ?/sec     2.44  60.1±0.44µs ? ?/sec
sort string[1000] to indices 2^12           1.00  35.0±0.25µs ? ?/sec     3.24  113.3±0.75µs ? ?/sec
sort string[100] nulls to indices 2^12      1.00  24.2±0.16µs ? ?/sec     2.45  59.2±0.54µs ? ?/sec
sort string[100] to indices 2^12            1.00  35.2±0.26µs ? ?/sec     3.22  113.3±0.98µs ? ?/sec
sort string[10] nulls to indices 2^12       1.00  24.2±0.20µs ? ?/sec     2.36  57.0±0.44µs ? ?/sec
sort string[10] to indices 2^12             1.00  33.5±0.21µs ? ?/sec     3.12  104.5±0.82µs ? ?/sec

But it may differ from the Linux one; after #7867 is merged, we can trigger the benchmark from the GitHub script, thanks!
arrow-ord/src/sort.rs
Outdated
    v << (8 * (4 - len))
};
// Pack into a single u64: (prefix << 32) | length
let prefix64 = ((prefix_u32 as u64) << 32) | (len as u64);
I think a single string could theoretically be larger than u32::MAX; in that case the bitwise-or result is wrong, I believe. What about using 64 bits for each, i.e. a u64 for the prefix (first 8 bytes) and a u64 for the length?
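The overflow concern can be demonstrated concretely. A small sketch (illustrative `pack` helper, not the PR's code) showing how a length above u32::MAX bleeds into the prefix bits of the packed key:

```rust
// Pack a u32 prefix and a length into one u64, as in the code under review.
// If len exceeds u32::MAX, its high bits overlap the prefix bits and the
// resulting sort key is wrong.
fn pack(prefix: u32, len: u64) -> u64 {
    ((prefix as u64) << 32) | len
}

fn main() {
    let a = pack(1, 0);          // prefix 1, empty string
    let b = pack(0, 1u64 << 32); // prefix 0, string of length 2^32
    // The two keys collide even though the prefixes differ:
    assert_eq!(a, b);
}
```

Masking the length to 32 bits, or keeping prefix and length as separate fields (the route the PR eventually took), avoids the collision.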
Thank you @Dandandan for this good suggestion!
Interesting, I found that changing to an 8-byte prefix with a u64 length is about 50% slower than the 4-byte prefix for most cases; it seems the 4-byte prefix fits the L1 cache better and has less compare overhead.
Thank you @Dandandan:
In my latest PR, I've added support for a u64 length field while still using a 4-byte prefix, and benchmarks show no performance regression. Instead of packing the prefix and length into one large key, we compare them separately:
- Compare the 4-byte prefix.
- If that's equal (and only then), compare the u64 lengths (only needed when either side is shorter than 4 bytes, which is when padding can occur).
- Finally, fall back to a full byte-by-byte compare.
This keeps the hot path on the cheap 4-byte prefix compare, and the additional length check is needed so infrequently that it doesn't hurt throughput.
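The three steps above can be sketched as follows. This is a hedged outline rather than the PR's exact code; `prefix4` is an assumed helper that zero-pads slices shorter than 4 bytes:

```rust
use std::cmp::Ordering;

// Assumed helper: big-endian 4-byte prefix, zero-padded on the right.
fn prefix4(s: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    let n = s.len().min(4);
    buf[..n].copy_from_slice(&s[..n]);
    u32::from_be_bytes(buf)
}

// Sketch of the three-step comparator described above.
fn cmp_bytes(a: &[u8], b: &[u8]) -> Ordering {
    // 1. The cheap 4-byte prefix compare settles the vast majority of pairs.
    match prefix4(a).cmp(&prefix4(b)) {
        Ordering::Equal => {}
        non_eq => return non_eq,
    }
    // 2. Padded prefixes can tie spuriously only when a side is < 4 bytes;
    //    equal prefixes then mean the shorter slice is a zero-extended
    //    prefix of the longer one, so comparing lengths is enough.
    if a.len() < 4 || b.len() < 4 {
        return a.len().cmp(&b.len());
    }
    // 3. Otherwise fall back to a full byte-by-byte compare.
    a.cmp(b)
}

fn main() {
    assert_eq!(cmp_bytes(b"ab", b"abc"), Ordering::Less);
    assert_eq!(cmp_bytes(b"apple", b"apricot"), Ordering::Less);
    assert_eq!(cmp_bytes(b"abcd", b"abcdx"), Ordering::Less);
}
```

Step 2 is the subtle part: when both sides have at least 4 bytes, equal prefixes carry no length information, so the comparator must go straight to the full compare rather than trusting lengths.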
Which issue does this PR close?
#7860 (comment). Add rich testing cases for sort string(utf8). cc @alamb @Dandandan Preparation for experiment: #7860
Rationale for this change
Add rich testing cases for sort string(utf8)
What changes are included in this PR?
Add rich testing cases for sort string(utf8)
Are these changes tested?
Yes
Are there any user-facing changes?
No
    }
    v << (8 * (4 - slice.len()))
};
(idx, prefix, len)
I think len could instead store a bool ("is small", i.e. len < 4), as the actual length is not otherwise used? This would also avoid doing the la < 4 check in the sort, so it might be slightly faster.
Alternatively or additionally, we could store the &[u8] instead of the index so it doesn't have to retrieve it via values.value again in the sort.
Thank you @Dandandan for this idea. I tried it now, but it shows a 30% performance decrease:
diff --git a/arrow-ord/src/sort.rs b/arrow-ord/src/sort.rs
index 093c52d867..29800663a0 100644
--- a/arrow-ord/src/sort.rs
+++ b/arrow-ord/src/sort.rs
@@ -301,14 +301,15 @@ fn sort_bytes<T: ByteArrayType>(
// comparing up to 4 bytes as a single u32 we avoid the overhead
// of a full lexicographical compare for the vast majority of cases.
- // 1. Build a vector of (index, prefix, length) tuples
- let mut valids: Vec<(u32, u32, u64)> = value_indices
+ // 1. Build a vector of (idx, prefix, is_small, slice) tuples
+ let mut valids: Vec<(u32, u32, bool, &[u8])> = value_indices
.into_iter()
.map(|idx| {
let slice: &[u8] = values.value(idx as usize).as_ref();
- let len = slice.len() as u64;
- // Compute the 4‑byte prefix in BE order, or left‑pad if shorter
- let prefix = if slice.len() >= 4 {
+ // store the bool for whether the slice is smaller than 4 bytes
+ let is_small = slice.len() < 4;
+ // prefix: if the slice is smaller than 4 bytes, left-pad it with zeros,
+ let prefix = if !is_small {
let raw = unsafe { std::ptr::read_unaligned(slice.as_ptr() as *const u32) };
u32::from_be(raw)
} else {
@@ -318,7 +319,7 @@ fn sort_bytes<T: ByteArrayType>(
}
v << (8 * (4 - slice.len()))
};
- (idx, prefix, len)
+ (idx, prefix, is_small, slice)
})
.collect();
@@ -328,27 +329,24 @@ fn sort_bytes<T: ByteArrayType>(
_ => valids.len(),
};
- // 3. Comparator: compare prefix, then (when both slices shorter than 4) length, otherwise full slice
- let cmp_bytes = |a: &(u32, u32, u64), b: &(u32, u32, u64)| {
- let (ia, pa, la) = *a;
- let (ib, pb, lb) = *b;
- // 3.1 prefix (first 4 bytes)
- let ord = pa.cmp(&pb);
- if ord != Ordering::Equal {
- return ord;
+ // 3. Comparator: compare prefix first, then for both “small” slices compare length, and finally full lexicographical compare
+ let cmp_bytes = |a: &(u32, u32, bool, &[u8]), b: &(u32, u32, bool, &[u8])| {
+ let (_ia, pa, sa_small, sa) = a;
+ let (_ib, pb, sb_small, sb) = b;
+ // 3.1 Compare the 4‑byte prefix
+ match pa.cmp(&pb) {
+ Ordering::Equal => (),
+ non_eq => return non_eq,
}
- // 3.2 only if both slices had length < 4 (so prefix was padded)
- // length compare only when prefix was padded (i.e. original length < 4)
- if la < 4 || lb < 4 {
- let ord = la.cmp(&lb);
- if ord != Ordering::Equal {
- return ord;
+ // 3.2 If both slices were shorter than 4 bytes, compare their actual lengths
+ if *sa_small && *sb_small {
+ match sa.len().cmp(&sb.len()) {
+ Ordering::Equal => (),
+ non_eq => return non_eq,
}
}
- // 3.3 full lexicographical compare
- let a_bytes: &[u8] = values.value(ia as usize).as_ref();
- let b_bytes: &[u8] = values.value(ib as usize).as_ref();
- a_bytes.cmp(b_bytes)
+ // 3.3 Otherwise, do a full byte‑wise lexicographical comparison
+ sa.cmp(sb)
};
// 4. Partially sort according to ascending/descending
@@ -366,9 +364,9 @@ fn sort_bytes<T: ByteArrayType>(
if options.nulls_first {
out.extend_from_slice(&nulls[..nulls.len().min(out_limit)]);
let rem = out_limit - out.len();
- out.extend(valids.iter().map(|&(i, _, _)| i).take(rem));
+ out.extend(valids.iter().map(|&(i, _, _, _)| i).take(rem));
} else {
- out.extend(valids.iter().map(|&(i, _, _)| i).take(out_limit));
+ out.extend(valids.iter().map(|&(i, _, _, _)| i).take(out_limit));
let rem = out_limit - out.len();
out.extend_from_slice(&nulls[..rem]);
}
Which issue does this PR close?
Improve StringArray(Utf8) sort performance
Rationale for this change
Support prefix compare; I optimized it to a u32 prefix with incremental u64 compare, which gave the best performance in my experiments.
What changes are included in this PR?
Support prefix compare, optimized to a u32 prefix with incremental u64 compare.
Are these changes tested?
Yes
Are there any user-facing changes?
No