RecordReaderImpl.getValueRange() may cause incorrect results #1061
Comments
Thank you for reporting, @PengleiShi.
AFAIK, this doesn't happen between the Apache ORC writer and reader, right? @PengleiShi
Right, it doesn't. In the case I mentioned, the files were written by Trino (which has its own ORC writer) and read by Spark (which depends on the Apache ORC reader).
Most of the files written by Trino have proper statistics. I will try to regenerate some problematic ORC files that can be made public.
@dongjoon-hyun, Trino won't write string column statistics if a string value is longer than 64 bytes.
20220310_100444_03858_nbvwj_53625cc9-7183-4beb-be48-9d059d8fa560.zip
@PengleiShi, do you mind sharing the stats of the problematic case above?
@pgaref, I have uploaded a problematic file here for testing. Its metadata is shown below.
ORC version: 1.6.11, SQL:
select xxx from xxx where str is not null
Recently I found that some ORC files written by Trino don't have complete statistics in the file metadata (maybe a Presto bug). As a result, `OrcProto.ColumnStatistics` can't be deserialized to any specific `ColumnStatisticsImpl` such as `StringStatisticsImpl`. `RecordReaderImpl.getValueRange()` then returns a `ValueRange` with a null `lower`, and `RecordReaderImpl.pickRowGroups()` skips this row group, which should not be skipped. Apart from this case, everything works as expected. I also found that orc-1.5.x can handle this case via `RecordReaderImpl.UNKNOWN_VALUE`, which was removed in 1.6.x. Maybe we could add it back for better compatibility. @dongjoon-hyun @omalley
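To make the failure mode concrete, here is a minimal, self-contained sketch of the idea. It does not use the real ORC classes: `Stats`, `lowerBound`, `naiveCanSkipForIsNotNull`, and `safeCanSkipForIsNotNull` are hypothetical names, and the `UNKNOWN_VALUE` sentinel only imitates the concept mentioned above, not the actual 1.5.x implementation. The assumption is that a missing minimum with a non-zero value count means "statistics incomplete" rather than "all values are null".

```java
class RowGroupPruningSketch {
    // Hypothetical sentinel: values exist, but the writer recorded no min/max.
    static final Object UNKNOWN_VALUE = new Object();

    /** Simplified stand-in for column statistics; min and max may be null. */
    static final class Stats {
        final String min;
        final String max;
        final long valueCount; // number of non-null values recorded by the writer

        Stats(String min, String max, long valueCount) {
            this.min = min;
            this.max = max;
            this.valueCount = valueCount;
        }
    }

    /** Returns the lower bound, or UNKNOWN_VALUE when values exist but no min was written. */
    static Object lowerBound(Stats s) {
        if (s.valueCount > 0 && s.min == null) {
            return UNKNOWN_VALUE;
        }
        return s.min; // null here means the row group holds no non-null values
    }

    /** Naive check: treats any null lower bound as "every value is null". */
    static boolean naiveCanSkipForIsNotNull(Stats s) {
        return s.min == null;
    }

    /** Defensive check: skips only when the row group provably has no non-null values. */
    static boolean safeCanSkipForIsNotNull(Stats s) {
        Object lower = lowerBound(s);
        if (lower == UNKNOWN_VALUE) {
            return false; // nothing can be proven, so the row group must be read
        }
        return lower == null;
    }

    public static void main(String[] args) {
        // 100 non-null strings, but min/max were omitted by the writer.
        Stats incomplete = new Stats(null, null, 100);
        System.out.println("naive skip: " + naiveCanSkipForIsNotNull(incomplete));
        System.out.println("safe skip: " + safeCanSkipForIsNotNull(incomplete));
    }
}
```

With incomplete statistics, the naive check skips a row group that contains matching rows for `str is not null`, while the sentinel-based check falls back to reading it. The cost of the defensive version is only lost pruning on files whose statistics are incomplete.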