-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hive: Optimize tableExists API in hive catalog #11597
Conversation
Skip creation of hive table operation when check existence of iceberg table in hive catalog
FYI @szehon-ho and @haizhou-zhao if you are interested |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @dramaticlly , nice work
Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed. How do we handle corrupt metadata in other catalog implementations? |
Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted. Personally I think we can focus on former only and here's my thought process There are roughly 3 places where
|
Thanks for the check @dramaticlly! Thanks, |
Do we want to do the same opimization for the |
Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass. |
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
@@ -271,7 +271,7 @@ default Transaction newReplaceTableTransaction( | |||
} | |||
|
|||
/** | |||
* Check whether table exists. | |||
* Check whether table exists, including metadata table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
* Check whether table exists, including metadata table. | |
* Check whether table or metadata table exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! I just went through for educating myself and left some nits. Let me know if they doesn't make sense!
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java
Outdated
Show resolved
Hide resolved
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveTableTest.java
Outdated
Show resolved
Hide resolved
@@ -271,7 +271,7 @@ default Transaction newReplaceTableTransaction( | |||
} | |||
|
|||
/** | |||
* Check whether table exists. | |||
* Check whether table or metadata table exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is checking whether the metadata table exists a new behavior? Since we're changing the comment for the Catalog interface here, it might affect catalogs that have not specifically overridden this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's the existing (undocumented) behaviour
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I didn't know about this behavior.
Changing this doc string in the Catalog
interface will also change the doc string for functions that overrides tableExists
. For example, the EcsCatalog
overrides the tableExists
and inherits the doc string in its javadocs https://iceberg.apache.org/javadoc/1.7.0/org/apache/iceberg/dell/ecs/EcsCatalog.html
but it doesn't look like this behavior is supported in the EcsCatalog
.
There are a couple more places where this happens https://grep.app/search?q=boolean%20tableExists%28&filter[repo.pattern][0]=apache/iceberg
Perhaps it is better to move this into HiveCatalog
or any other catalogs that support this behavior to avoid confusion in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea it did cross my mind to be a bit hesitant to change the top level javadoc, I am not sure if metadata tables is a catalog wide concept. Kevin's suggestion makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's otherwise, all other catalog reuse the default implementation where they tried to load table first as a way to determine whether table exists. The ECSCatalog is the only concrete implementation override this method which might not handle the metadata tables. Partly it was because such behaviour was implicit without unit test or documentation, until @szehon-ho mentioned it in #11597 (comment)
My vote is that we can override the javadoc in ECSCatalog instead, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree that we don't want to expose the metadata Table concept here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @danielcweeks , revert as suggested. My main concern is that if we dont document this explicitly, the future override of this method in catalog might not even take metadata table into consideration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the metadata table concept only applied to loadTable
right now
https://github.com/search?q=repo%3Aapache%2Ficeberg%20isValidMetadataIdentifier&type=code
so its a side effect of tableExists
using the underlying loadTable
function
on a meta level, its a bit weird to check if a metadata table exists. if the underlying iceberg table exists, I would assume its metadata table also exists.
that said, I'm in favor of documenting this hidden behavior of tableExists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think the main concern with putting on top level javadoc is that , there's a lot of Catalog in the wild that may not (and may not want to) support this. As @kevinjqliu points out it seems its a bit of an outlier that it happened here (that we should keep for backward compatbiility) and doc can probably be added in HiveCatalog specifically for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the optimization!
there seem to be implementations of HiveCatalog in other projects, maybe we can raise this optimization there as well
https://grep.app/search?q=HiveCatalog&filter[path.pattern][0]=HiveCatalog.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! Just some really small nits, otherwise I'm fine with the patch.
hive-metastore/src/test/java/org/apache/iceberg/hive/HiveTableTest.java
Outdated
Show resolved
Hide resolved
@danielcweeks: If you don't have more comments here, then I think we could merge this PR |
@rdblue @danielcweeks can you help take anther look to see if we can move forward on this PR? |
@dramaticlly minor comment, but other than that LGTM. |
Thanks @dramaticlly !! |
Thank you @danielcweeks @szehon-ho @pvary @kevinjqliu @gaborkaszab @haizhou-zhao for the review! I will look into similar change for hive view existence check |
Skip creation of hive table operation when check existence of iceberg table in hive catalog.
Today the table existence rely on load the table first and return true if table can be loaded
iceberg/api/src/main/java/org/apache/iceberg/catalog/Catalog.java
Lines 279 to 286 in 3badfe0
I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.
This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.
Why this is safe?
HiveOperationsBase.validateTableIsIceberg
is also used in catalog listTables API to differentiate the iceberg table from hive table