-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41719: [C++][Parquet] Cannot read encrypted parquet datasets via _metadata file #41821
base: main
Are you sure you want to change the base?
Conversation
|
7be8f5e
to
26cfd63
Compare
@AudriusButkevicius could you check if the read metadata contains expected properties? |
Hey, thanks for working on this, looks great, however seems that something is missing. You can read the metadata file, but seems no files are actually read when reading the dataset via the metadata file, so the table ends up empty (but with the right schema).
Code that reproduces the issue Test codeimport os
import tempfile
import pyarrow.parquet.encryption as pe
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow as pa
import base64
import polars as pl
class KmsClient(pe.KmsClient):
def unwrap_key(self, wrapped_key, master_key_identifier):
return base64.b64decode(wrapped_key)
def wrap_key(self, key_bytes, master_key_identifier):
return base64.b64encode(key_bytes)
def write(location):
cf = pe.CryptoFactory(lambda *a, **k: KmsClient())
df = pl.DataFrame({
"col1": [1, 2, 3],
"col2": [1, 2, 3],
"year": [2020, 2020, 2021]
})
ecfg = pe.EncryptionConfiguration(
footer_key="TEST",
column_keys={
"TEST": ["col2"]
},
double_wrapping=False,
plaintext_footer=False,
)
table = df.to_arrow()
print("Writing")
print(table)
parquet_encryption_cfg = ds.ParquetEncryptionConfig(
cf, pe.KmsConnectionConfig(), ecfg
)
metadata_collector = []
pq.write_to_dataset(
table,
location,
partitioning=ds.partitioning(
schema=pa.schema([
pa.field("year", pa.int16())
]),
flavor="hive"
),
encryption_config=parquet_encryption_cfg,
metadata_collector=metadata_collector
)
encryption_properties = cf.file_encryption_properties(pe.KmsConnectionConfig(), ecfg)
pq.write_metadata(
pa.schema(
field
for field in table.schema
if field.name != "year"
),
os.path.join(location, "_metadata"),
metadata_collector,
encryption_properties=encryption_properties,
)
print("write done")
def read(location):
decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
kms_connection_config = pe.KmsConnectionConfig()
cf = pe.CryptoFactory(lambda *a, **k: KmsClient())
parquet_decryption_cfg = ds.ParquetDecryptionConfig(
cf, kms_connection_config, decryption_config
)
decryption_properties = cf.file_decryption_properties(
kms_connection_config, decryption_config)
pq_scan_opts = ds.ParquetFragmentScanOptions(
decryption_config=parquet_decryption_cfg,
# If using build from master
decryption_properties=decryption_properties
)
pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)
dataset = ds.parquet_dataset(
os.path.join(location, "_metadata"),
format=pformat,
partitioning=ds.partitioning(
schema=pa.schema([
pa.field("year", pa.int16())
]),
flavor="hive"
)
)
print("Reading")
print(dataset.to_table())
if __name__ == '__main__':
location = tempfile.mkdtemp(suffix=None, prefix=None, dir=None)
print(location)
os.makedirs(location, exist_ok=True)
write(location)
print("\n")
read(location) |
Seems that dataset.get_fragments() doesn't return anything. |
I think c++ logic as-is doesn't store row groups, let me take a look and get back. |
I think the whole premise of _metadata files (not _common_metadata) is to store row group details as well as paths, so when you perform a read via the _metadata file, it knows exactly which files and row groups to read without having to open every file in the dataset. At least this is what happens when encryption is disabled. |
Although I understand the intention of this issue and the corresponding fix, I don't think the design of parquet encryption has included the |
BTW, |
26cfd63
to
e0c9418
Compare
Why would this be an issue? Because these files might be encrypted with different keys? |
Yes, these files may have different master keys, which are unable to be referenced by a single footer IMHO. |
But the files can have the same key and decryption would error if not? That sounds ok to me. |
Python error is thrown here, but I'm not sure why. |
My intended use of this is to reduce strain on the filesystem when reading large (many files) datasets from a network attached filesystem, by reading the metadata file instead of many separate files. I also have a hard requirement for encryption sadly as the data is sensitive. It would be amazing if this worked with encrypted datasets assuming the key is the same. I would also be ok storing the metadata in plaintext, perform fragment filtering based on row-group stats, and then re-read and decrypt footers of the chosen files. Obviously this is ok for my usecase but generally might not be ok. |
I didn't get any error when reading, seems that it just returns no data. |
For the record: the community is inclined to deprecate I'm not sure if we want to support and maintain this. cc @emkornfield @pitrou @mapleFU |
Left my 2c there. Explained why I would be sad if that happened, and would probably have to re-implement the same feature. |
@wgtmac I'm not sure I follow. We already have |
Yep, we haven't worked on supporting this (basically, there was no requirement; seemed heading towards deprecation). |
What is the theoretical limit, assuming a 256-bit AES key? Also, if column key encryption is used, wouldn't the limit basically become irrelevant? |
cd5bf41
to
a4d58e0
Compare
cpp/src/parquet/file_writer.cc
Outdated
if (file_encryption_properties->encrypted_footer()) { | ||
PARQUET_THROW_NOT_OK(sink->Write(kParquetEMagic, 4)); | ||
|
||
PARQUET_ASSIGN_OR_THROW(int64_t position, sink->Tell()); | ||
auto metadata_start = static_cast<uint64_t>(position); | ||
|
||
auto writer_props = parquet::WriterProperties::Builder() | ||
.encryption(file_encryption_properties) | ||
->build(); | ||
auto builder = FileMetaDataBuilder::Make(metadata.schema(), writer_props); | ||
|
||
auto footer_metadata = builder->Finish(metadata.key_value_metadata()); | ||
auto crypto_metadata = builder->GetCryptoMetaData(); | ||
WriteFileCryptoMetaData(*crypto_metadata, sink.get()); | ||
|
||
auto footer_encryptor = file_encryptor->GetFooterEncryptor(); | ||
WriteEncryptedFileMetadata(metadata, sink.get(), footer_encryptor, true); | ||
PARQUET_ASSIGN_OR_THROW(position, sink->Tell()); | ||
auto footer_and_crypto_len = static_cast<uint32_t>(position - metadata_start); | ||
PARQUET_THROW_NOT_OK( | ||
sink->Write(reinterpret_cast<uint8_t*>(&footer_and_crypto_len), 4)); | ||
PARQUET_THROW_NOT_OK(sink->Write(kParquetEMagic, 4)); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm doing metadata encryption similarly to approach here:
arrow/cpp/src/parquet/file_writer.cc
Lines 416 to 432 in a4d58e0
if (file_encryption_properties->encrypted_footer()) { | |
// encrypted footer | |
file_metadata_ = metadata_->Finish(key_value_metadata_); | |
PARQUET_ASSIGN_OR_THROW(int64_t position, sink_->Tell()); | |
uint64_t metadata_start = static_cast<uint64_t>(position); | |
auto crypto_metadata = metadata_->GetCryptoMetaData(); | |
WriteFileCryptoMetaData(*crypto_metadata, sink_.get()); | |
auto footer_encryptor = file_encryptor_->GetFooterEncryptor(); | |
WriteEncryptedFileMetadata(*file_metadata_, sink_.get(), footer_encryptor, true); | |
PARQUET_ASSIGN_OR_THROW(position, sink_->Tell()); | |
uint32_t footer_and_crypto_len = static_cast<uint32_t>(position - metadata_start); | |
PARQUET_THROW_NOT_OK( | |
sink_->Write(reinterpret_cast<uint8_t*>(&footer_and_crypto_len), 4)); | |
PARQUET_THROW_NOT_OK(sink_->Write(kParquetEMagic, 4)); | |
} else { // Encrypted file with plaintext footer |
However it seems decryption fails (see below) when using
RowGroup
metadata (after deserialization and decryption).arrow/cpp/src/arrow/dataset/file_parquet.cc
Lines 1084 to 1086 in 4d5041a
ARROW_ASSIGN_OR_RAISE(auto path, | |
FileFromRowGroup(filesystem.get(), base_path, *row_group, | |
options.validate_column_chunk_paths)); |
@wgtmac @pitrou does anything obviously wrong stands out here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mapleFU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error did you get? I suspect it has some issues with row group ordinal. Please search "row group ordinal" from https://github.com/apache/parquet-format/blob/master/Encryption.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual error is thrown here:
arrow/cpp/src/arrow/dataset/file_parquet.cc
Line 1009 in 4d5041a
auto path = row_group.ColumnChunk(0)->file_path(); |
in debugger I see
row_group.ColumnChunk(0)
as null and row_group.ColumnChunk(0)->file_path()
fails.
The error message I'm getting is:
unknown file: Failure
C++ exception with description "Failed decryption finalization" thrown in the test body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test FileMetadata
object this reads several files, extracts their metadata and merges them (metadata->AppendRowGroups(*metadata_vector[1])
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could solve this by storing file aad prefixes into key_value_metadata
so they can be used at read time to decode row group metadata. This seems to work (see cpp/src/parquet/metadata.cc), but reading actual data with scanner->ToTable()
is somewhat less straightforward. but seems doable. Before proceeding, I'd like to ask here:
- does this approach make sense?
- what would a good location be for injecting aad prefixes be for reading column data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The format has a field for aad prefixes, https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1126 .
It is file-wide, of course. So if parquet files had different aad prefixes, a new one has to be used for the merged metadata.
Per my comment above, I think the only way to make this work is to fully decrypt the footers of the parquet files (using their encryption params, inc keys and aad_prefixes), merge them, and then encrypt the result with a new key/aad_prefix
a4d58e0
to
fa448cf
Compare
fa448cf
to
4348123
Compare
4348123
to
f70a009
Compare
d784b1e
to
d764b57
Compare
7c4b5bb
to
ec833ef
Compare
ec833ef
to
35b22d6
Compare
ec79220
to
366ffc9
Compare
366ffc9
to
c015985
Compare
for (size_t i = 0; i < metadata_list.size(); i++) { | ||
const auto& file_metadata = metadata_list[i]; | ||
keys.push_back("row_group_aad_" + std::to_string(i)); | ||
values.push_back(file_metadata->file_aad()); | ||
if (i > 0) { | ||
metadata_list[0]->AppendRowGroups(*file_metadata); | ||
} | ||
} | ||
|
||
// Create a new FileMetadata object with the created AADs as key_value_metadata. | ||
auto fmd_builder = | ||
parquet::FileMetaDataBuilder::Make(metadata_list[0]->schema(), writer_props); | ||
const std::shared_ptr<const KeyValueMetadata> file_aad_metadata = | ||
::arrow::key_value_metadata(keys, values); | ||
auto metadata = fmd_builder->Finish(file_aad_metadata); | ||
metadata->AppendRowGroups(*metadata_list[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggershinsky As proposed I now decrypt all footers and then coalesce them into a single footer. As decrypting data files at read times requires AADs I also store those into key_value_metadata
with row_group_aad_{i}
keys. Does this seem reasonable design?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rok sorry for the delay, I've been away for a while.
store those into key_value_metadata with row_group_aad_{i} keys.
Do we need to store the aad_prefixes ? Once a footer of a parquet file is decrypted, the file key and aad_prefix can be dropped. Aad_prefix is a user-provided unique ID of a file. So we can(*) generate a new one for the new _metadata file that keeps the coalesced footer. Also, it'd be good to generate a new key. Then the coalesced footer can be encrypted in the _metadata file.
(*) it's possible to write/encrypt the _metadata file without a new aad_prefix, if the user app level doesn't check the file id. You can simply pass a null pointer.
if (key_value_metadata_ && key_value_metadata_->Contains("row_group_aad_0")) { | ||
PARQUET_ASSIGN_OR_THROW( | ||
auto aad, | ||
key_value_metadata_->Get("row_group_aad_" + std::to_string(row_groups[0]))); | ||
out->set_file_decryptor_aad(aad); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pitrou regarding your suggestion to avoid storing AADs as key_value_metadata
entries and instead read them from files at dataset scan time - do you have a suggestion on how/where to detect we'll no longer be reading _metadata
file but rather data files? If we can do that we can change the decryptor's AAD there and read the desired data file.
Rationale for this change
Metadata written into _metadata file appears to not be encrypted.
What changes are included in this PR?
This adds a code path to encrypt _metadata file and a test.
Are these changes tested?
Yes
Are there any user-facing changes?
This adds user facing
encryption_properties
parameter topyarrow.parquet.write_metadata
.