Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44042: [C++][Parquet] Limit num-of row-groups when build parquet #44043

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Sep 10, 2024

Rationale for this change

Limit num-of row-groups when build parquet

What changes are included in this PR?

Limit num-of row-groups when build parquet

Are these changes tested?

No

Are there any user-facing changes?

No

@mapleFU mapleFU requested a review from wgtmac as a code owner September 10, 2024 11:59
@mapleFU
Copy link
Member Author

mapleFU commented Sep 10, 2024

@wgtmac @pitrou would you mind take a look?

For testing, I don't know would it be too slow, if not I'd add one

Copy link

⚠️ GitHub issue #44042 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

Is erroring out the right solution? The row group ordinal is optional and was introduced for encryption. Furthermore, it's not obvious that it should be unique. https://github.com/apache/parquet-format/blob/master/Encryption.md gives no guidelines as to ordinal generation.

@ggershinsky What is the intent here? Should ordinal be unique? But then, should the Parquet reader check for uniqueness? Otherwise a malicious file could still be crafted.

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation:

constexpr auto kEncryptedRowGroupsLimit = 32767;
if (i > kEncryptedRowGroupsLimit) {
throw ParquetException("Encrypted files cannot contain more than 32767 row groups");
}
CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_,
static_cast<int16_t>(i), meta_decryptor, data_decryptor);
return PageReader::Open(stream, col->num_values(), col->compression(), properties_,
always_compressed, &ctx);

This begs the question: why is the row group ordinal stored in Thrift metadata if it's simply ignored when reading?

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

And by the way, the same audit should be done for column and page ordinals.

@wgtmac
Copy link
Member

wgtmac commented Sep 10, 2024

Ok, the C++ reader always uses the physical row group number as the row group ordinal for AAD computation:

Isn't it a bug on the C++ side? It may produces wrong AAD when reading a Parquet file created by ParquetRewriter which binpacks several encrypted files.

EDIT: My bad. ParquetRewriter does not support binpack encrypted files yet.

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

It may produces wrong AAD when reading a Parquet file created by ParquetRewriter which binpacks several encrypted files.

Hopefully it's not possible to concatenate encrypted files together? Otherwise one can create malicious data (replay attack).

@ggershinsky
Copy link
Contributor

The number of row groups (columns, pages) should be limited for encrypted files only. I don't think this exception should be thrown for unencrypted files.

@ggershinsky
Copy link
Contributor

ggershinsky commented Sep 10, 2024

It may produces wrong AAD when reading a Parquet file created by ParquetRewriter which binpacks several encrypted files.

Hopefully it's not possible to concatenate encrypted files together? Otherwise one can create malicious data (replay attack).

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ?

Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled.

@ggershinsky
Copy link
Contributor

Yep, merging encrypted files is impossible, the AAD has a unique file id component.

It seems to be optional and disabled by default according to https://github.com/apache/parquet-format/blob/master/Encryption.md#441-aad-prefix ?

It's a different parameter, aad_file_unique, a mandatory part of the AAD suffix,
https://github.com/apache/parquet-format/blob/master/Encryption.md#442-aad-suffix
https://github.com/apache/parquet-format/blob/master/Encryption.md#52-crypto-structures

Regardless, my question was about ordinal reuse or shuffling. Should readers verify that ordinals correspond to the physical row group numbers? What is the intent exactly? The spec does not say how these should be handled.

The rg ordinals in thrift are a utility, which can be helpful for readers that split a single file into parallel reading threads. The readers can also run an rg loop/counter. The values will be the same, unless the file is tampered with. Note - the thrift footer is tamper-proof, so the rg ordinals are safe; but the row groups / pages can be shuffled by an attacker. Reading a page from a shuffled rg will throw an exception.
The situation is similar for column and page ordinals. The spec says "The suffix part of a module AAD protects against module swapping inside a file.", but doesn't go into smaller details because they can derived directly from the provided information.

@pitrou
Copy link
Member

pitrou commented Sep 10, 2024

The spec says "The suffix part of a module AAD protects against module swapping inside a file.", but doesn't go into smaller details because they can derived directly from the provided information.

Since we're talking about a security feature, it would be much better if the spec gave clear guidelines instead of letting implementers do the necessary guesswork (and potentially make errors).

@ggershinsky
Copy link
Contributor

I do agree the spec often feels laconic, not only here but in other places too. Expanding it, with ample explanation and reasoning, would result in a different document (implementation guide), much longer than the spec.
However, the spec details are sufficient for a secure implementation. In this example, a reader can use either a row group counter (the term "ordinal" has a well-defined meaning) or the thrift value; either one will support the intended security model. Still, since other ordinals (column and page) don't have a thrift duplicate, I can add a clarification text that makes this point explicit. Will send a patch.

@mapleFU
Copy link
Member Author

mapleFU commented Sep 23, 2024

@ggershinsky Would you mind add a patch to parquet-format? Then I can move forward here

@ggershinsky
Copy link
Contributor

Sure, here it goes apache/parquet-format#453

@mapleFU
Copy link
Member Author

mapleFU commented Sep 29, 2024

apache/parquet-format#453 is merged and I understand this feature is for encryption. However, still this value is written to the metadata, should I:

  1. Just throw like current impl when row-group ordinal is greater than i16::max
  2. Not writing the ordinal when not encryption, so this check can be loosen

@pitrou
Copy link
Member

pitrou commented Sep 29, 2024

Perhaps a quick check that other Parquet readers only use the row group ordinal for encrypted files?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants