-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Encryption integration and test #13066
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| private String name; | ||
| private Configuration conf; | ||
| private FileIO fileIO; | ||
| private KeyManagementClient keyManagementClient; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if some this logic should live in the parent BaseMetastoreCatalog class, I assume it will be relevant for most of the other sub classes as well as other implementers? I'm not sure though, and it can always get moved later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed wrt relevance, but there are a couple of issues:
- Security. If table properties are fetched by a catalog from the storage backend (eg the metadata file), then a "key removal attack" becomes possible. In this attack, the key id is removed from the table properties by tampering with the metadata file. This makes the writers produce unencrypted data files.. Therefore, it is important to make sure each catalog client gets the table key id directly from its catalog service, and not from the storage. Also, one of the classes that extend BaseMetastoreCatalog is the HadoopCatalog, which is not safe for encryption because it does not have a catalog service independent of the storage backend.
- Code re-use is thin. Most of the PR logic is in the "table operations". But the "newTableOps" is an abstract method in BaseMetastoreCatalog; which makes sense since many things (inc table key retrieval) are catalog-specific.
Maybe we should start with the Hive catalog? (Plus the REST catalog in another PR). Then, as more catalogs are added, we can check if/how we can safely move a common encryption code into a parent or util?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After the last update, the reusable code grew; but still, we have the security challenge. Also, the Hive and REST catalogs don't have a common parent. I'd still take the option of handling this later, as more catalogs are added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to handle this later after more catalogs are added.
| if (encryption() instanceof StandardEncryptionManager) { | ||
| TableMetadata.Builder builder = TableMetadata.buildFrom(metadata); | ||
| for (Map.Entry<String, EncryptedKey> entry : | ||
| EncryptionUtil.encryptionKeys(encryption()).entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EncryptionUtil.encryptionKeys seems like a missing part of the PR, could you check please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, this PR is still not functional - it depends on #7770 , and also misses a few parts. Once the 7770 is merged, I'll update this one.
smaheshwar-pltr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for driving this work, @ggershinsky. Scattered small comments / questions (feel free to ignore implementation-y ones, realise this work is incomplete without #7770 😄).
I wanted to try out REST integration on top of new changes, so I put up #13225, inspired by this PR.
| List<Object[]> expected = | ||
| ImmutableList.of(row(1L, "a", 1.0F), row(2L, "b", 2.0F), row(3L, "c", Float.NaN)); | ||
|
|
||
| assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s", tableName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s", tableName)); | |
| assertEquals( | |
| "Should return all expected rows", | |
| expected, | |
| sql("SELECT * FROM %s ORDER BY id", tableName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spotless check is ok with the current form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to adding ORDER BY id here as is generally done in tests like these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is just a copy from the regular TestSelect, https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSelect.java#L110
| EncryptionUtil.createEncryptionManager( | ||
| encryptedKeysFromMetadata, tableProperties, keyManagementClient); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we need something like https://github.com/apache/iceberg/pull/13224/files#diff-f8f95b203a96a9ba193bcafdff1813f9f5be1da4d596bedc8db5178dd33d4900R44-R104 for this but can update once #7770 merges 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full version of the patch is pushed, please let me know if things are still missing.
|
|
||
| if (encryptionKeyIdFromHMS != null) { | ||
| checkEncryptionProperties(encryptionKeyIdFromHMS, dekLengthFromHMS); | ||
| encryptedKeysFromMetadata = current().encryptionKeys(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just assigning this field here isn't sufficient right? I think we need to add the loaded keys to the current encryption manager's state - perhaps via a new API, or maybe there are ideas like https://github.com/apache/iceberg/pull/13225/files#diff-8b93b61734b5ccae516e7dbeb26dd1b2329707b87935f32b545638594cb8e2e9R269-R271 to force re-initializaton in io and encryption methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This started as a partial/sketch PR. Now I've pushed the full version, please let me know if things are still missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hasn't been addressed - perhaps it's controversial, but I'd have thought refreshing should allow new snapshots to still be read by updating the encryption manager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6af950f is a short test that throws - let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smaheshwar-pltr for this usecase and the question. The high level API works ok, but indeed the lower level API doesn't get the kms name and the updated key list.
I'm in favor of supporting this at all API levels. Will appreciate if you'd send the unitest and the fix (as an addition to this PR, or as a part of 13225, or as a separate PR).
(btw, the kms part can be handled by adding catalog.initialize(catalogName, catalogConfig) before catalog.loadTable; then no need in the TestBase change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me EncryptionManager caches the key map at construction and isn’t rebuilt/updated when current() changes. doRefresh() assigns encryptedKeysFromMetadata = current().encryptionKeys(), but the active manager never sees those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you'd send the unitest and the fix (as an addition to this PR, or as a part of 13225, or as a separate PR).
Hi @smaheshwar-pltr are you ok with this?
(alternatively, I can send a fix commit with you as co-author)
| throw new RuntimeException("No metadata folder found for table " + tableName); | ||
| } | ||
|
|
||
| // Find manifest list and metadata files; check their encryption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggershinsky, curious what came of the metadata JSON encryption discussion. Do folks think that information contained within metadata JSONs are not sufficiently sensitive to be encrypted in storage?
Also, I recall you mentioning that for REST catalogs, this discussion isn't significant because they can make either decision themselves. This makes sense to me - but I do see parts of the codebase (e.g. below) where clients seem to attempt to read directly from the metadata JSON in storage, which I guess wouldn't work if it's been silently encrypted by the REST catalog.
iceberg/core/src/main/java/org/apache/iceberg/ManifestsTable.java
Lines 84 to 85 in 931865e
| io.newInputFile( | |
| location != null ? location : table().operations().current().metadataFileLocation()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious what came of the metadata JSON encryption discussion. Do folks think that information contained within metadata JSONs are not sufficiently sensitive to be encrypted in storage?
The discussion resulted in an agreement to skip the metadata.json encryption. The information there is not sensitive; but losing a metadata.json key (due to a catalog glitch) would lead to losing the table.
The main reason for encrypting the json was to protect the file integrity (against keyId removal and other attacks). These attacks can be addressed by alternative means - using the REST catalog, or checking the keyId in other catalogs, etc.
if it's been silently encrypted by the REST catalog.
Nope, we don't presume REST should/will encrypt the metadata file. Instead, the REST client implementation should make sure the table key ID is fetched directly from the REST catalog backend, and not from the file / storage backend.
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
wip |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
wip |
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
wip |
56cc3b3 to
bb57377
Compare
|
cc @huaxingao |
| encryptionPropsFromMetadata(metadata.properties()); | ||
|
|
||
| String newMetadataLocation; | ||
| if (encryption() instanceof StandardEncryptionManager) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: encryption() is called again at L219. Shall we assign the result to a local variable?
Co-Authored-By: Sreesh Maheshwar <[email protected]>
2316c8e to
335d777
Compare
|
The comments are addressed. |
huaxingao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
I will leave this PR open for a few more days in case @smaheshwar-pltr or others have more comments. |
|
Thanks @ggershinsky for the PR! Thanks @RussellSpitzer @szlta @smaheshwar-pltr for the review! |
This reverts commit 30ca573.
No description provided.