Improvements to UTF-8 statistics truncation #6870

etseidl · 2024-12-11T17:15:51Z

Which issue does this PR close?

Closes #6867.

Rationale for this change

See issue.

What changes are included in this PR?

For max statistics, this PR replaces truncate_utf8().and_then(increment_utf8) with a new function, truncate_and_increment_utf8(), which defers the creation of a new Vec<u8> until all processing is complete. It also changes increment_utf8 to operate on entire Unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, it modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type is String (or whose converted type is UTF8).
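
For readers following along, here is a minimal sketch of the truncate-then-increment idea described above. It is not the code merged in this PR: the name truncate_and_increment_sketch is hypothetical, and the rule of only accepting a successor code point with the same encoded width is an assumption made to keep the example short.

```rust
/// Hypothetical sketch (not the arrow-rs implementation): truncate `data` to at
/// most `length` bytes on a char boundary, then increment the last code point
/// that can be incremented, dropping everything after it.
fn truncate_and_increment_sketch(data: &str, length: usize) -> Option<Vec<u8>> {
    // Back up to the nearest char boundary at or before `length`.
    // Index 0 is always a boundary, so this loop terminates.
    let mut end = length.min(data.len());
    while !data.is_char_boundary(end) {
        end -= 1;
    }
    let prefix = &data[..end];

    // Walk the code points from the end, looking for one with a usable successor.
    for (idx, original) in prefix.char_indices().rev() {
        let width = original.len_utf8();
        // `char::from_u32` returns None for surrogates and values above U+10FFFF.
        if let Some(next) = char::from_u32(original as u32 + 1) {
            // Assumption for this sketch: never let the encoded width change.
            if next.len_utf8() == width {
                // The only allocation happens here, once the answer is known.
                let mut out = prefix.as_bytes()[..idx + width].to_vec();
                next.encode_utf8(&mut out[idx..]);
                return Some(out);
            }
        }
    }
    // No code point could be incremented, so no shorter upper bound exists.
    None
}
```

Because UTF-8 byte order matches code point order, the returned bytes compare greater than the original string, so they remain a valid upper bound. A caller would presumably keep the untruncated value when this returns None.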

The new increment logic is up to 30X faster for pathological cases of strings that cannot be truncated, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.

Are there any user-facing changes?

No API changes, but this may produce different truncated max statistics.

alamb

Thanks @etseidl -- the tests in this PR are amazing

I am not sure about the logic to increment the utf8 bytes. Let me know what you think

parquet/src/column/writer/mod.rs

alamb

Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.

It only copies the string values once (like the current code)
I am convinced it is correct ❤️ (hopefully those will not be famous last words)

So all in all, thank you and great job. Thank you

alamb · 2024-12-14T01:05:49Z

parquet/src/column/writer/mod.rs

-                Err(_) => Some(data[..l].to_vec()),
-            })
+            .and_then(|l|
+                // don't do extra work if this column isn't UTF-8


alamb · 2024-12-14T01:45:02Z

parquet/src/column/writer/mod.rs

+                if self.is_utf8() {
+                    match str::from_utf8(data) {
+                        Ok(str_data) => truncate_utf8(str_data, l),
+                        Err(_) => Some(data[..l].to_vec()),


It is a somewhat questionable move to truncate this on invalid data, but I see that is what the code used to do, so it seems good to me.

Hmm, good point. The old code simply tried utf first, and then fell back. Here we're actually expecting valid UTF8 so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.

I think we should leave it as is and maybe document that if non utf8 data is passed in it will be truncated with bytes

I've managed a test that exercises this path via the SerializedFileWriter API. It truncates as expected (i.e. as binary, not UTF-8), but now I worry that it's possible to create invalid data. Oddly both parquet-java and pyarrow seem ok with non-utf8 string data.

To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤

I've left the logic here as is, but added documentation and a test. We can revisit if this ever becomes an issue.
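
To make the behavior discussed in this thread concrete, here is a small sketch of min-value truncation that falls back to plain byte truncation when a column is marked UTF-8 but the data is not valid UTF-8. The names truncate_min_sketch and truncate_utf8_sketch, the is_utf8 flag passed as a parameter, and the choice to return None for an empty prefix are all assumptions for illustration, not the crate's API.

```rust
/// Hypothetical helper: keep at most `length` bytes, cutting on a char boundary.
/// Returns None if no non-empty prefix fits (a choice made for this sketch).
fn truncate_utf8_sketch(data: &str, length: usize) -> Option<Vec<u8>> {
    let mut end = length.min(data.len());
    while !data.is_char_boundary(end) {
        end -= 1;
    }
    if end == 0 {
        return None;
    }
    Some(data.as_bytes()[..end].to_vec())
}

/// Hypothetical sketch of min-statistics truncation with a binary fallback.
fn truncate_min_sketch(data: &[u8], length: usize, is_utf8: bool) -> Option<Vec<u8>> {
    // Nothing to do if the value already fits.
    if data.len() <= length {
        return Some(data.to_vec());
    }
    if is_utf8 {
        match std::str::from_utf8(data) {
            // Valid UTF-8: truncate on a code point boundary.
            Ok(s) => truncate_utf8_sketch(s, length),
            // Invalid UTF-8 in a string column: truncate as raw bytes.
            Err(_) => Some(data[..length].to_vec()),
        }
    } else {
        // Not a string column: skip UTF-8 handling entirely.
        Some(data[..length].to_vec())
    }
}
```

A truncated prefix is always less than or equal to the original value in byte order, so no increment step is needed for the min side.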

To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤

I solace myself with this quote from a former coworker:

"Legacy Code, n: code that is getting the job done, and pretty well at that"

Not that we can't / shouldn't improve it of course 🤣

thanks again for all the help here

adriangb · 2024-12-16T18:35:24Z

parquet/src/column/writer/mod.rs

+    /// Returns `true` if this column's logical type is a UTF-8 string.
+    fn is_utf8(&self) -> bool {
+        self.get_descriptor().logical_type() == Some(LogicalType::String)
+            || self.get_descriptor().converted_type() == ConvertedType::UTF8
+    }


I assume this works for dictionary encoded columns as well right?

Yes, regardless of the encoding, the statistics are for the data itself. You wouldn't see a dictionary key here.

* fix a few edge cases with utf-8 incrementing
* add todo
* simplify truncation
* add another test
* note case where string should render right to left
* rework entirely, also avoid UTF8 processing if not required by the schema
* more consistent naming
* modify some tests to truncate in the middle of a multibyte char
* add test and docstring
* document truncate_min_value too

etseidl and others added 5 commits December 10, 2024 15:09

fix a few edge cases with utf-8 incrementing

0f7af02

add todo

52706f9

simplify truncation

80fa0dd

add another test

f1726ab

Merge branch 'apache:main' into increment_utf8

7b88e91

github-actions bot added the parquet Changes to the parquet crate label Dec 11, 2024

etseidl mentioned this pull request Dec 11, 2024

Parquet UTF-8 max statistics are overly pessimistic #6867

Closed

alamb reviewed Dec 12, 2024

alamb mentioned this pull request Dec 12, 2024

Implement predicate pruning for like expressions (prefix matching) apache/datafusion#12978

Open

findepi reviewed Dec 12, 2024

etseidl added 2 commits December 12, 2024 10:03

note case where string should render right to left

c4d9474

rework entirely, also avoid UTF8 processing if not required by the sc…

7a7fd0e

…hema

etseidl changed the title ~~Fix some edge cases in UTF-8 incrementing~~ Improvements to UTF-8 statistics truncation Dec 13, 2024

alamb approved these changes Dec 14, 2024

etseidl added 2 commits December 14, 2024 09:04

more consistent naming

400f5f8

modify some tests to truncate in the middle of a multibyte char

006a388

alamb approved these changes Dec 15, 2024

etseidl added 2 commits December 16, 2024 09:51

add test and docstring

f251b00

document truncate_min_value too

e7d0af8

alamb merged commit 9ffa065 into apache:main Dec 16, 2024
16 checks passed

adriangb reviewed Dec 16, 2024

etseidl deleted the increment_utf8 branch December 16, 2024 20:00
