Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API #34616

Merged
merged 45 commits into from
Oct 11, 2023

Conversation

tolleybot
Copy link
Contributor

@tolleybot tolleybot commented Mar 17, 2023

Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API. See https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for supporting document.

What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files, but now I've updated the code to allow for greater flexibility. In the Python code, I've added support for the changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

Are these changes tested?

Yes, unit tests are included. I have also included a python sample project.

Are there any user-facing changes?

Yes, as stated above the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration through setters and getters. The Dataset now has the option using this to set different file encryption properties per file

@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

1 similar comment
@github-actions
Copy link

⚠️ GitHub issue #29238 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Remember using clang-format to format cpp code

cpp/src/parquet/properties.h Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Mar 17, 2023
@tolleybot tolleybot marked this pull request as ready for review March 17, 2023 21:41
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this @tolleybot. I did a first pass looking through it, and it seems like we could use a bit more testing, both in C++ and in Python.

cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
python/examples/dataset/write_dataset_encrypted.py Outdated Show resolved Hide resolved
cpp/src/arrow/dataset/dataset_encryption_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 20, 2023
@wjones127
Copy link
Member

On the C++ formatting: If you install clang-tools 14 (must be that version) and set the path the binaries with CLANG_TOOLS_PATH , then when you reconfigure CMake it will create a format target and a lint target that can be used to check style. (You'll know it's configured right when the CMake configure output says Found ClangTools: /some/path/bin/clang-format-14)

For Python formatting, see these instructions: https://arrow.apache.org/docs/developers/python.html#coding-style

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 22, 2023
@github-actions github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Mar 23, 2023
@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit -g python

@jorisvandenbossche

This comment has been minimized.

@github-actions
Copy link

Revision: d7a6b55

Submitted crossbow builds: ursacomputing/crossbow @ actions-f652f7cf6d

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.4.1 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.4.1 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@jorisvandenbossche
Copy link
Member

@github-actions crossbow submit test-conda-python-3.10-pandas-latest

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 10, 2023
@github-actions
Copy link

Revision: b323457

Submitted crossbow builds: ursacomputing/crossbow @ actions-6e2d0464d5

Task Status
test-conda-python-3.10-pandas-latest Github Actions

@anjakefala
Copy link
Collaborator

I updated the branch with the latest main, so we can see if that helps with the unrelated failures.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 10, 2023
@ianmcook
Copy link
Member

@github-actions crossbow submit -g python

@github-actions
Copy link

Revision: ced2ed2

Submitted crossbow builds: ursacomputing/crossbow @ actions-6cee43f1c5

Task Status
test-conda-python-3.10 Github Actions
test-conda-python-3.10-cython2 Github Actions
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-v3.5.0 Github Actions
test-conda-python-3.10-substrait Github Actions
test-conda-python-3.11 Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-hypothesis Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.11-spark-master Github Actions
test-conda-python-3.8 Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.5.0 Github Actions
test-conda-python-3.9 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-cuda-python Github Actions
test-debian-11-python-3 Azure
test-fedora-35-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 Github Actions

@pitrou
Copy link
Member

pitrou commented Oct 10, 2023

@github-actions crossbow submit verifypython*

@github-actions
Copy link

Revision: ced2ed2

Submitted crossbow builds: ursacomputing/crossbow @ actions-379629a2a0

Task Status
verify-rc-source-python-linux-almalinux-8-amd64 Github Actions
verify-rc-source-python-linux-conda-latest-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-20.04-amd64 Github Actions
verify-rc-source-python-linux-ubuntu-22.04-amd64 Github Actions
verify-rc-source-python-macos-amd64 Github Actions
verify-rc-source-python-macos-arm64 Github Actions
verify-rc-source-python-macos-conda-amd64 Github Actions

@apache apache deleted a comment from github-actions bot Oct 10, 2023
@jorisvandenbossche
Copy link
Member

It seems that all triggered builds are green!

Thanks @tolleybot and all others involved to get this over the finish line

@jorisvandenbossche jorisvandenbossche merged commit 0793432 into apache:main Oct 11, 2023
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Oct 11, 2023
@tolleybot
Copy link
Contributor Author

Thanks, everyone for all the reviews and work on this!

@pitrou
Copy link
Member

pitrou commented Oct 11, 2023

Thank you @tolleybot for contributing this in the first place :-)

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0793432.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

@jorisvandenbossche
Copy link
Member

FYI I checked the reports above and they are all flakes (the benchmarks reports are stable beyond this commit)

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…tion in the new Dataset API (apache#34616)

### Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API.  See [https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#](url) for supporting document.

### What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files, but now I've updated the code to allow for greater flexibility. In the Python code, I've added support for the changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

### Are these changes tested?

Yes, unit tests are included. I have also included a python sample project.

### Are there any user-facing changes?

Yes,  as stated above the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration through setters and getters.  The Dataset now has the option using this to set different file encryption properties per file

* Closes: apache#29238

Lead-authored-by: Don <[email protected]>
Co-authored-by: Donald Tolley <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: scoder <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…tion in the new Dataset API (apache#34616)

### Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API.  See [https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#](url) for supporting document.

### What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files, but now I've updated the code to allow for greater flexibility. In the Python code, I've added support for the changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

### Are these changes tested?

Yes, unit tests are included. I have also included a python sample project.

### Are there any user-facing changes?

Yes,  as stated above the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration through setters and getters.  The Dataset now has the option using this to set different file encryption properties per file

* Closes: apache#29238

Lead-authored-by: Don <[email protected]>
Co-authored-by: Donald Tolley <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: scoder <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…tion in the new Dataset API (apache#34616)

### Rationale for this change

The purpose of this pull request is to support modular encryption in the new Dataset API.  See [https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#](url) for supporting document.

### What changes are included in this PR?

I made improvements to the C++ and Python code to enable the Dataset API to have per-file settings for each file saved. Previously, the Dataset API applied the same encryption properties to all saved files, but now I've updated the code to allow for greater flexibility. In the Python code, I've added support for the changes by updating the ParquetFormat class to accept DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures. With these changes, you can pass the format object to the write_dataset function, giving you the ability to set unique encryption properties for each file in your Dataset.

### Are these changes tested?

Yes, unit tests are included. I have also included a python sample project.

### Are there any user-facing changes?

Yes,  as stated above the ParquetFormat class has optional parameters for DatasetEncryptionConfiguration and DatasetDecryptionConfiguration through setters and getters.  The Dataset now has the option using this to set different file encryption properties per file

* Closes: apache#29238

Lead-authored-by: Don <[email protected]>
Co-authored-by: Donald Tolley <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: anjakefala <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Co-authored-by: scoder <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API