GH-44042: [C++][Parquet] Limit num-of row-groups when build parquet #44043
Is erroring out the right solution? The row group @ggershinsky What is the intent here? Should |
Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation: arrow/cpp/src/parquet/file_reader.cc Lines 270 to 278 in b4b22a4
This raises the question: why is the row group ordinal stored in the Thrift metadata if it's simply ignored when reading? |
And by the way, the same audit should be done for column and page ordinals. |
Isn't it a bug on the C++ side? It may produce a wrong AAD when reading a Parquet file created by ParquetRewriter, which binpacks several encrypted files. EDIT: My bad. ParquetRewriter does not support binpacking encrypted files yet. |
Hopefully it's not possible to concatenate encrypted files together? Otherwise one can create malicious data (replay attack). |
The number of row groups (columns, pages) should be limited for encrypted files only. I don't think this exception should be thrown for unencrypted files. |
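A minimal sketch of that rule, assuming the limit comes from the row group ordinal being a 16-bit signed field (`i16`) in the Thrift metadata. `RowGroupOrdinalFor` is a hypothetical helper, not part of the Arrow API, and returning an empty optional for unencrypted files (that is, leaving the optional Thrift field unset) is only one possible choice, not what this PR settles on:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Hypothetical helper (not the Arrow API): maps a physical row group index
// to the 16-bit ordinal stored in the metadata.
inline std::optional<int16_t> RowGroupOrdinalFor(int64_t physical_index,
                                                 bool is_encrypted) {
  assert(physical_index >= 0);
  constexpr int64_t kMaxOrdinal = std::numeric_limits<int16_t>::max();  // 32767
  if (physical_index <= kMaxOrdinal) {
    return static_cast<int16_t>(physical_index);
  }
  if (is_encrypted) {
    // The ordinal is part of the AAD, so an encrypted file cannot go past it.
    throw std::overflow_error("too many row groups for an encrypted file");
  }
  // Assumption: for unencrypted files the optional field is simply omitted.
  return std::nullopt;
}
```

With this shape, an unencrypted writer can keep appending row groups past index 32767, while an encrypted writer fails at the first row group whose ordinal would not fit.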
Yep, merging encrypted files is impossible, the AAD has a unique file id component. |
It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ? Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled. |
It's a different parameter,
The rg ordinals in thrift are a utility, which can be helpful for readers that split a single file into parallel reading threads. The readers can also run an rg loop/counter. The values will be the same, unless the file is tampered with. Note - the thrift footer is tamper-proof, so the rg ordinals are safe; but the row groups / pages can be shuffled by an attacker. Reading a page from a shuffled rg will throw an exception. |
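To illustrate why a page read from a shuffled row group fails, here is a rough sketch of how a page-level AAD suffix could be assembled. It assumes the layout as I read it in Encryption.md (file unique identifier, one module-type byte, then row group, column and page ordinals as 2-byte little-endian shorts). `AppendInt16LE` and `PageAadSuffix` are hypothetical names, not the Arrow implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: appends a 16-bit ordinal in little-endian byte order.
inline void AppendInt16LE(std::string& out, int16_t value) {
  const auto u = static_cast<uint16_t>(value);
  out.push_back(static_cast<char>(u & 0xFF));
  out.push_back(static_cast<char>((u >> 8) & 0xFF));
}

// Hypothetical helper: builds the AAD suffix for a page-level module.
// Layout is an assumption based on Encryption.md, not copied from Arrow.
inline std::string PageAadSuffix(const std::string& file_unique,
                                 int8_t module_type, int16_t row_group_ordinal,
                                 int16_t column_ordinal, int16_t page_ordinal) {
  assert(!file_unique.empty());
  std::string aad = file_unique;
  aad.push_back(static_cast<char>(module_type));
  AppendInt16LE(aad, row_group_ordinal);
  AppendInt16LE(aad, column_ordinal);
  AppendInt16LE(aad, page_ordinal);
  return aad;
}
```

Because the row group ordinal is part of the authenticated data, a page written with row group ordinal 0 but read with a counter value of 1 is checked against a different AAD, so authenticated decryption rejects it.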
Since we're talking about a security feature, it would be much better if the spec gave clear guidelines instead of letting implementers do the necessary guesswork (and potentially make errors). |
I do agree the spec often feels laconic, not only here but in other places too. Expanding it, with ample explanation and reasoning, would result in a different document (implementation guide), much longer than the spec. |
@ggershinsky Would you mind adding a patch to parquet-format? Then I can move forward here. |
Sure, here it goes apache/parquet-format#453 |
apache/parquet-format#453 is merged, and I understand this feature is for encryption. However, this value is still written to the metadata. Should I:
|
Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files? |
Rationale for this change
Limit the number of row groups when building a Parquet file.
What changes are included in this PR?
Limit the number of row groups when building a Parquet file.
Are these changes tested?
No
Are there any user-facing changes?
No