Add modular parquet de-/encryption #19858

Open
brainslush opened this issue Nov 18, 2024 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@brainslush

Description

There is an ongoing effort to implement Parquet encryption and decryption in arrow-rs:
apache/arrow-rs#6637

I suggest extending the existing scan, read, sink, and write interfaces to handle encrypted Parquet files.

I offer to implement it.

@brainslush brainslush added the enhancement New feature or an improvement of an existing feature label Nov 18, 2024
@coastalwhite
Collaborator

I quite agree with most of what is being said in the arrow-rs thread:

  • We are interested, but unless someone steps up to drive this effort resource- or time-wise, I am not sure we have the bandwidth at the moment.
  • Any implementation of this should use one of the pure Rust cryptography efforts (ring or RustCrypto).
  • I am not 100% aware of the standard surrounding this, but I think this should be a relatively transparent step, maybe in PageReader or just above.

If you want to take this up, I suggest you first make a PR with a rough draft.
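
To make the "relatively transparent step" idea concrete, here is a minimal sketch of a decrypting wrapper that sits just above a page-level reader. Everything in it is hypothetical: `PageDecryptor`, `XorStandIn`, and `DecryptingPages` are illustration names, not Polars or arrow-rs APIs, and pages are modeled as plain byte buffers. The XOR routine is only a stand-in so the example runs with the standard library; a real implementation would call an AEAD cipher such as AES-GCM from ring or RustCrypto.

```rust
/// Hypothetical hook for decrypting one page buffer.
/// Not part of Polars or arrow-rs; the name is made up for this sketch.
trait PageDecryptor {
    fn decrypt(&self, page_index: usize, ciphertext: &[u8]) -> Result<Vec<u8>, String>;
}

/// Stand-in "cipher" that XORs with a repeating key.
/// This is NOT real cryptography; it only keeps the sketch dependency-free.
struct XorStandIn {
    key: Vec<u8>,
}

impl PageDecryptor for XorStandIn {
    fn decrypt(&self, _page_index: usize, ciphertext: &[u8]) -> Result<Vec<u8>, String> {
        if self.key.is_empty() {
            return Err("empty key".to_string());
        }
        // XOR is its own inverse, so the same routine "encrypts" and "decrypts".
        Ok(ciphertext
            .iter()
            .zip(self.key.iter().cycle())
            .map(|(byte, key_byte)| byte ^ key_byte)
            .collect())
    }
}

/// Wraps any iterator of raw page buffers and yields decrypted pages,
/// so downstream decoding code never sees ciphertext.
struct DecryptingPages<I, D> {
    inner: I,
    decryptor: D,
    next_index: usize,
}

impl<I, D> DecryptingPages<I, D> {
    fn new(inner: I, decryptor: D) -> Self {
        Self { inner, decryptor, next_index: 0 }
    }
}

impl<I, D> Iterator for DecryptingPages<I, D>
where
    I: Iterator<Item = Vec<u8>>,
    D: PageDecryptor,
{
    type Item = Result<Vec<u8>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        let page = self.inner.next()?;
        let index = self.next_index;
        self.next_index += 1;
        Some(self.decryptor.decrypt(index, &page))
    }
}
```

The point of the wrapper shape is that the page decoder stays unchanged: it consumes an iterator of pages either way, and encryption support becomes a matter of which iterator it is handed.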

@rok

rok commented Nov 21, 2024

Glad to see this!
apache/arrow-rs#6637 will most likely be ring based. I'd be happy to help with reviews etc.

@brainslush
Author

brainslush commented Nov 26, 2024

> I quite agree with most of what is being said in the arrow-rs thread:
>
> * We are interested, but unless someone steps up to drive this effort resource- or time-wise, I am not sure we have the bandwidth at the moment.
> * Any implementation of this should use one of the pure Rust cryptography efforts (ring or RustCrypto).
> * I am not 100% aware of the standard surrounding this, but I think this should be a relatively transparent step, maybe in PageReader or just above.
>
> If you want to take this up, I suggest you first make a PR with a rough draft.

I didn't know what to make of your comment.
To be honest, until now I didn't know that Polars had its own Parquet implementation. I assumed that it was based on the arrow-rs implementation, so I expected this to be less work. That doesn't mean I won't try to implement encryption and decryption.
I would raise the question of whether it is still necessary to have a Polars-specific implementation of the Parquet reader.

@ritchie46
Member

I think this is out of scope for us. I don't understand the use case for us.

@adamreeve
Contributor

If columnar encryption support is added to Polars, it would be great if it was compatible with the "Key Management Tools" API used by Arrow and parquet-mr, which allows integrating with a KMS and stores key material in a standard JSON format. There's a design doc for this at https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk

> I don't understand the use case for us.

We write Parquet files that use columnar encryption, so we often need to use PyArrow to read Parquet files and Datasets rather than Polars' native Parquet reader. This is OK, but it would be nice to have native Polars support.

We use columnar encryption because it lets us keep some columns unencrypted and only encrypt sensitive columns. This makes debugging issues easier compared to whole file encryption or file system permissions as engineers can see most of the relevant data in files without needing to get access to more sensitive data columns.

@brainslush
Author

> I think this is out of scope for us. I don't understand the use case for us.

Speaking from work experience: I have had customers who required that data always be encrypted at rest, mostly when dealing with third-party customer data. When working in Python with small Parquet files, the approach @adamreeve mentioned works, but it becomes an issue with Parquet files that exceed the memory limit or when working in Rust. Personally, I mostly see the benefit here with streaming LazyFrames.

Another solution, from my point of view, would be for me to take a shot at implementing the batch-wise (streaming) scan part for the AnonymousScan trait. Correct me if I'm wrong here, but this is not yet implemented inside the planning and pipe engine.

The trait could then be used to implement any reader, e.g. the arrow-rs Parquet reader, as a streaming reader.
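
As a rough illustration of what a batch-wise scan could look like, here is a minimal sketch using only the standard library. It does not reproduce Polars' AnonymousScan trait or its signatures: `BatchScan`, `Batch`, `ChunkedSource`, and `streaming_sum` are hypothetical names, and a batch is modeled as a plain `Vec<i64>` rather than a DataFrame. A reader backed by arrow-rs would, under these assumptions, implement the same `next_batch` method by pulling the next record batch from its own reader.

```rust
/// Hypothetical stand-in for one chunk of rows (a real reader would yield a DataFrame).
type Batch = Vec<i64>;

/// Hypothetical pull-based scan interface; this is NOT Polars' AnonymousScan trait.
trait BatchScan {
    /// Returns the next batch, or `Ok(None)` once the source is exhausted.
    fn next_batch(&mut self) -> Result<Option<Batch>, String>;
}

/// In-memory source that hands out its rows in fixed-size chunks.
struct ChunkedSource {
    rows: Vec<i64>,
    pos: usize,
    batch_size: usize,
}

impl BatchScan for ChunkedSource {
    fn next_batch(&mut self) -> Result<Option<Batch>, String> {
        if self.batch_size == 0 {
            return Err("batch_size must be greater than zero".to_string());
        }
        if self.pos >= self.rows.len() {
            return Ok(None);
        }
        // The final batch may be shorter than batch_size.
        let end = (self.pos + self.batch_size).min(self.rows.len());
        let batch = self.rows[self.pos..end].to_vec();
        self.pos = end;
        Ok(Some(batch))
    }
}

/// Streaming consumer: processes one batch at a time and never
/// materializes the whole input.
fn streaming_sum(source: &mut dyn BatchScan) -> Result<i64, String> {
    let mut total = 0i64;
    while let Some(batch) = source.next_batch()? {
        total += batch.iter().sum::<i64>();
    }
    Ok(total)
}
```

Because the consumer only ever holds one batch, peak memory is bounded by the batch size rather than the file size, which is the property that matters for Parquet files larger than memory.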


5 participants