Add modular parquet de-/encryption #19858
Comments
I quite agree with most of what is being said in the arrow-rs thread:
If you want to take this up, I suggest you first make a PR with a rough draft.
Glad to see this!
I didn't know what to make of your comment.
I think this is out of scope for us; I don't understand the use case.
If columnar encryption support is added to Polars, it would be great if it were compatible with the "Key Management Tools" API used by Arrow and parquet-mr, which allows integrating with a KMS and stores key material in a standard JSON format. There's a design doc for this at https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk
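The Key Management Tools design linked above is built around an envelope-encryption pattern: a random data encryption key (DEK) encrypts the file or column data, the DEK is wrapped by a master key held in the KMS, and the wrapped key is stored alongside the data as JSON "key material". A minimal stdlib-only sketch of that round trip (the XOR "wrap" is an insecure toy standing in for a real KMS call, and the JSON field names are illustrative, not the exact parquet-mr schema):

```python
import json
import os

def xor_bytes(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real KMS wrap/unwrap operation (XOR is NOT secure).
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def wrap_dek(dek: bytes, master_key: bytes, master_key_id: str) -> str:
    # Produce JSON "key material" describing how to recover the DEK.
    # Field names here are illustrative, not the exact parquet-mr schema.
    return json.dumps({
        "masterKeyId": master_key_id,
        "wrappedDek": xor_bytes(dek, master_key).hex(),
    })

def unwrap_dek(key_material: str, master_key: bytes) -> bytes:
    material = json.loads(key_material)
    return xor_bytes(bytes.fromhex(material["wrappedDek"]), master_key)

master_key = os.urandom(16)  # held by the KMS, never stored with the file
dek = os.urandom(16)         # per-file/per-column data encryption key
material = wrap_dek(dek, master_key, "footer-key-1")
assert unwrap_dek(material, master_key) == dek
```

The point of the indirection is that rotating or revoking access happens at the KMS, while the key material travels with the data in a standard format any compatible reader can interpret.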
We often need to use PyArrow to read Parquet files and Datasets rather than Polars' native Parquet reading, because we write Parquet files that use columnar encryption. This works, but native Polars support would be nicer. We use columnar encryption because it lets us keep some columns unencrypted and encrypt only the sensitive ones. Compared to whole-file encryption or file-system permissions, this makes debugging easier: engineers can see most of the relevant data in a file without needing access to the more sensitive columns.
Speaking from work experience: I've had customers who required that data always be encrypted at rest, mostly when dealing with third-party customer data. When working in Python with small Parquet files, the approach @adamreeve mentioned works, but it becomes a problem with Parquet files that exceed available memory, or when working in Rust. Personally, I mostly see the benefit here for streaming LazyFrames. Another option from my side would be to take a shot at implementing the batch-wise (streaming) scan part for the AnonymousScan trait. Correct me if I'm wrong, but this is not implemented in the planner and pipe engine yet. The trait could then be used to plug in any reader, e.g. the arrow-rs parquet reader, as a streaming reader.
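The batch-wise scan idea above boils down to decrypting and yielding one bounded chunk at a time instead of materialising the whole file. A toy stdlib sketch of that shape (the XOR stream cipher is an insecure placeholder for real Parquet module decryption, and the chunking ignores Parquet's actual row-group structure):

```python
import io
import os

def xor_stream(data: bytes, key: bytes, offset: int = 0) -> bytes:
    # Toy stream cipher (NOT secure), keyed by absolute byte offset so that
    # chunk-wise decryption matches one-shot encryption.
    return bytes(b ^ key[(offset + i) % len(key)] for i, b in enumerate(data))

def scan_encrypted(source, key: bytes, batch_size: int = 1 << 16):
    """Yield decrypted batches without holding the whole file in memory."""
    offset = 0
    while chunk := source.read(batch_size):
        yield xor_stream(chunk, key, offset)
        offset += len(chunk)

key = os.urandom(16)
plaintext = os.urandom(200_000)
encrypted = io.BytesIO(xor_stream(plaintext, key))
recovered = b"".join(scan_encrypted(encrypted, key))
assert recovered == plaintext
```

In a real implementation the generator would yield decoded record batches per row group rather than raw byte chunks, which is exactly what a streaming engine needs to keep memory bounded.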
Description
Currently there are efforts to implement Parquet de-/encryption in arrow-rs:
apache/arrow-rs#6637
I suggest extending the existing scan, read, sink, and write interfaces to handle encrypted Parquet.
I offer to implement it.
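To make the proposal concrete, here is a hypothetical sketch of what such an extended interface could look like. Everything below is invented for illustration: the option class, the parameter name, and the return value are placeholders, not Polars' actual API, and a real implementation would forward the properties to the underlying (arrow-rs) reader and return a LazyFrame.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ParquetEncryptionConfig:
    # Hypothetical option object; the real Polars API may look entirely different.
    footer_key: bytes
    column_keys: Dict[str, bytes] = field(default_factory=dict)  # column name -> key

def scan_parquet(path: str, decryption: Optional[ParquetEncryptionConfig] = None):
    """Sketch of an extended scan entry point. A real implementation would
    pass the decryption properties down to the Parquet reader; here we just
    return a placeholder dict instead of a LazyFrame."""
    return {"path": path, "decryption": decryption}

# Per-column keys mirror the columnar-encryption use case discussed above.
cfg = ParquetEncryptionConfig(footer_key=b"0" * 16, column_keys={"ssn": b"1" * 16})
lazy = scan_parquet("data.parquet", decryption=cfg)
```

The same option object could be reused symmetrically on the write/sink side for encryption, keeping one configuration type across all four interfaces.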