Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hive: Optimize tableExists API in hive catalog #11597

Merged
merged 11 commits into from
Dec 12, 2024

Conversation

dramaticlly
Copy link
Contributor

Skip creation of hive table operation when check existence of iceberg table in hive catalog.

Today the table existence rely on load the table first and return true if table can be loaded

default boolean tableExists(TableIdentifier identifier) {
try {
loadTable(identifier);
return true;
} catch (NoSuchTableException e) {
return false;
}
}

I found there's opportunity for improvement on hive catalog where we can skip the instantiate of HiveTableOperations, avoid reading the iceberg metadata.json file by only rely on record within hive catalog.

This is important for REST based catalog which delegate work to hiveCatalog as API call volume can be high and this optimization can reduce API overhead and latency.

Why this is safe? HiveOperationsBase.validateTableIsIceberg is also used in catalog listTables API to differentiate the iceberg table from hive table

Skip creation of hive table operation when check existence of iceberg table in hive catalog
@github-actions github-actions bot added the hive label Nov 20, 2024
@dramaticlly
Copy link
Contributor Author

FYI @szehon-ho and @haizhou-zhao if you are interested

Copy link
Contributor

@haizhou-zhao haizhou-zhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx @dramaticlly , nice work

@pvary
Copy link
Contributor

pvary commented Nov 20, 2024

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

@dramaticlly
Copy link
Contributor Author

dramaticlly commented Nov 20, 2024

Quick question: Is this a behavioral change? Previously we failed when the metadata was corrupt. After this, we succeed.

How do we handle corrupt metadata in other catalog implementations?

Thank you @pvary I think this indeed introduce a behavioral change. Majority of existing catalogs (except ECSCatalog) rely on this default implementation in Catalog interface where we tried to load the table first and return true if load is successful. I believe table exists here imply 2 things where both table entry exist in catalog as well as latest table metadata.json is not corrupted.

Personally I think we can focus on former only and here's my thought process

There are roughly 3 places where catalog.tableExists was used in iceberg code base

  1. Check before table can be registered in registerTable API
  • I believe behaviour change is allowed here as long as entry exist in catalog, register shall fail regardless files is corrupted
  1. Check before table stage creation, this is only used in REST catalog handler
  • I believe behaviour change is also allowed here as long as entry exist in catalog, stage creation shall fail regardless version files is corrupted
  1. REST API to check for table existence:
    head:
    tags:
    - Catalog API
    summary: Check if a table exists
    operationId: tableExists
  • I think this is what I originally hoped for to optimize on, to speed up on the existence check without rely on reading metadata first. The reason is that sometimes existence check is all we need without subsequent load table call

@pvary
Copy link
Contributor

pvary commented Nov 21, 2024

Thanks for the check @dramaticlly!
I agree that this behavioural change is small, but I would like to raise awareness around the community about this. Could you please write a letter to the dev list describing what is planned here? So if there is someone who is against the change, they could raise their voices.

Thanks,
Peter

@pvary
Copy link
Contributor

pvary commented Nov 21, 2024

Do we want to do the same opimization for the viewExists method too?

@karuppayya
Copy link
Contributor

Another minor behavioral change is , earlier if the user had access to both HMS table and storage, the table exists would pass.
With the change, tableExists would pass with only access to HMS table?

@@ -271,7 +271,7 @@ default Transaction newReplaceTableTransaction(
}

/**
* Check whether table exists.
* Check whether table exists, including metadata table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
* Check whether table exists, including metadata table.
* Check whether table or metadata table exists.

Copy link
Collaborator

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! I just went through for educating myself and left some nits. Let me know if they doesn't make sense!

@@ -271,7 +271,7 @@ default Transaction newReplaceTableTransaction(
}

/**
* Check whether table exists.
* Check whether table or metadata table exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checking whether the metadata table exists a new behavior? Since we're changing the comment for the Catalog interface here, it might affect catalogs that have not specifically overridden this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's the existing (undocumented) behaviour

Copy link
Contributor

@kevinjqliu kevinjqliu Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I didn't know about this behavior.

Changing this doc string in the Catalog interface will also change the doc string for functions that overrides tableExists. For example, the EcsCatalog overrides the tableExists and inherits the doc string in its javadocs https://iceberg.apache.org/javadoc/1.7.0/org/apache/iceberg/dell/ecs/EcsCatalog.html
but it doesn't look like this behavior is supported in the EcsCatalog.

There are a couple more places where this happens https://grep.app/search?q=boolean%20tableExists%28&filter[repo.pattern][0]=apache/iceberg

Perhaps it is better to move this into HiveCatalog or any other catalogs that support this behavior to avoid confusion in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea it did cross my mind to be a bit hesitant to change the top level javadoc, I am not sure if metadata tables is a catalog wide concept. Kevin's suggestion makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's otherwise, all other catalog reuse the default implementation where they tried to load table first as a way to determine whether table exists. The ECSCatalog is the only concrete implementation override this method which might not handle the metadata tables. Partly it was because such behaviour was implicit without unit test or documentation, until @szehon-ho mentioned it in #11597 (comment)

My vote is that we can override the javadoc in ECSCatalog instead, what do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree that we don't want to expose the metadata Table concept here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @danielcweeks , revert as suggested. My main concern is that if we dont document this explicitly, the future override of this method in catalog might not even take metadata table into consideration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metadata table concept only applied to loadTable right now
https://github.com/search?q=repo%3Aapache%2Ficeberg%20isValidMetadataIdentifier&type=code

so its a side effect of tableExists using the underlying loadTable function

on a meta level, its a bit weird to check if a metadata table exists. if the underlying iceberg table exists, I would assume its metadata table also exists.

that said, I'm in favor of documenting this hidden behavior of tableExists

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think the main concern with putting on top level javadoc is that , there's a lot of Catalog in the wild that may not (and may not want to) support this. As @kevinjqliu points out it seems its a bit of an outlier that it happened here (that we should keep for backward compatbiility) and doc can probably be added in HiveCatalog specifically for this PR.

@github-actions github-actions bot added the DELL label Dec 4, 2024
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the optimization!

there seem to be implementations of HiveCatalog in other projects, maybe we can raise this optimization there as well
https://grep.app/search?q=HiveCatalog&filter[path.pattern][0]=HiveCatalog.java

Copy link
Collaborator

@gaborkaszab gaborkaszab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Just some really small nits, otherwise I'm fine with the patch.

@pvary
Copy link
Contributor

pvary commented Dec 9, 2024

@danielcweeks: If you don't have more comments here, then I think we could merge this PR

@dramaticlly
Copy link
Contributor Author

@rdblue @danielcweeks can you help take anther look to see if we can move forward on this PR?

@danielcweeks
Copy link
Contributor

@dramaticlly minor comment, but other than that LGTM.

@danielcweeks danielcweeks merged commit a3dcfd1 into apache:main Dec 12, 2024
49 checks passed
@danielcweeks
Copy link
Contributor

Thanks @dramaticlly !!

@dramaticlly
Copy link
Contributor Author

Thank you @danielcweeks @szehon-ho @pvary @kevinjqliu @gaborkaszab @haizhou-zhao for the review! I will look into similar change for hive view existence check

@dramaticlly dramaticlly deleted the hiveTableExists branch December 12, 2024 18:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants