Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to UTF-8 statistics truncation #6870

Merged
merged 11 commits into from
Dec 16, 2024
Merged

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Dec 11, 2024

Which issue does this PR close?

Closes #6867.

Rationale for this change

See issue.

What changes are included in this PR?

For max statistics replaces truncate_utf8().and_then(increment_utf8) with a new function truncate_and_increment_utf8(). This defers the creation of a new Vec<u8> until all processing is complete. This also changes increment_utf8 to operate on entire unicode code points rather than doing arithmetic on the individual UTF-8 encoded bytes. Finally, this modifies the truncation logic so that UTF-8 handling is only done for columns whose logical type is String (or converted type UTF8).

The new increment logic is up to 30X faster for pathological cases of strings that cannot be truncated, and is no slower than the current code for simple cases where only the last byte of a string needs to be incremented.

Are there any user-facing changes?

No API changes, but will potentially produce different truncated max statistics.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 11, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @etseidl -- the tests in this PR are amazing

I am not sure about the logic to increment the utf8 bytes. Let me know what you think

parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
@etseidl etseidl changed the title Fix some edge cases in UTF-8 incrementing Improvements to UTF-8 statistics truncation Dec 13, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I went through this logic quite carefully and I think it looks great.

  1. It only copies the string values once (like the current code)
  2. I am convinced it is correct ❤️ (hopefully those will not be famous last words)

So all in all, thank you and great job. Thank you

Err(_) => Some(data[..l].to_vec()),
})
.and_then(|l|
// don't do extra work if this column isn't UTF-8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

if self.is_utf8() {
match str::from_utf8(data) {
Ok(str_data) => truncate_utf8(str_data, l),
Err(_) => Some(data[..l].to_vec()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a somewhat questionable move to truncate this on invalid data, but I see that is wht the code used to do so seems good to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. The old code simply tried utf first, and then fell back. Here we're actually expecting valid UTF8 so perhaps it's better to return an error. I'd hope some string validation was done before getting this far. I'll think on this some more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave it as is and maybe document that if non utf8 data is passed in it will be truncated with bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've managed a test that exercises this path via the SerializedFileWriter API. It truncates as expected (i.e. as binary, not UTF-8), but now I worry that it's possible to create invalid data. Oddly both parquet-java and pyarrow seem ok with non-utf8 string data.

To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left the logic here as is, but added documentation and a test. We can revisit if this ever becomes an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To paraphrase a wise man I know: Every day I wake up. And then I remember Parquet exists. 🫤

I solace myself with this quote from a former coworker:

"Legacy Code, n: code that is getting the job done, and pretty well at that"

Not that we can't / shouldn't improve it of course 🤣

thanks again for all the help here

parquet/src/column/writer/mod.rs Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
parquet/src/column/writer/mod.rs Outdated Show resolved Hide resolved
@alamb alamb merged commit 9ffa065 into apache:main Dec 16, 2024
16 checks passed
Comment on lines +881 to +885
/// Returns `true` if this column's logical type is a UTF-8 string.
fn is_utf8(&self) -> bool {
self.get_descriptor().logical_type() == Some(LogicalType::String)
|| self.get_descriptor().converted_type() == ConvertedType::UTF8
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this works for dictionary encoded columns as well right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, regardless of the encoding, the statistics are for the data itself. You wouldn't see a dictionary key here.

etseidl added a commit to etseidl/arrow-rs that referenced this pull request Dec 16, 2024
* fix a few edge cases with utf-8 incrementing

* add todo

* simplify truncation

* add another test

* note case where string should render right to left

* rework entirely, also avoid UTF8 processing if not required by the schema

* more consistent naming

* modify some tests to truncate in the middle of a multibyte char

* add test and docstring

* document truncate_min_value too
@etseidl etseidl deleted the increment_utf8 branch December 16, 2024 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet UTF-8 max statistics are overly pessimistic
4 participants