[KYUUBI #6018] Speed up GetTables operation for Spark session catalog
# 🔍 Description
## Issue References 🔗

This pull request aims to speed up the GetTables operation for the Spark session catalog.
As reported in #4956 and #5949, the GetTables operation is quite slow in some cases. #4444 introduced `kyuubi.operation.getTables.ignoreTableProperties` to speed it up for V2 catalogs, but that setting does not cover the session catalog.

## Describe Your Solution 🔧

Extend the scope of `kyuubi.operation.getTables.ignoreTableProperties` to cover the GetTables operation for the Spark session catalog.

Currently, the basic steps of GetTables in the Spark engine are:
```
val catalog: CatalogPlugin = getCatalog(spark, catalogName)
val sessionCatalog: SessionCatalog = spark.sessionState.catalog
val databases: Seq[String] = sessionCatalog.listDatabases(schemaPattern)
// for each `db` in `databases`:
val identifiers: Seq[TableIdentifier] = sessionCatalog.listTables(db, tablePattern, includeLocalTempViews = false)
val tableObjects: Seq[CatalogTable] = sessionCatalog.getTablesByName(identifiers)
```
The resulting `tableObjects` are then filtered by `tableTypes: Set[String]`.

The cost of `sessionCatalog.getTablesByName(identifiers)` is quite high when the number of tables is large, e.g. tens of thousands.

In some cases, tables are listed only to display their names. For those cases it is worth speeding up the operation at the cost of ignoring some properties (e.g. table comments) and query criteria. Specifically, when `kyuubi.operation.getTables.ignoreTableProperties=true`, the `tableTypes` criterion is ignored and all tables and views are returned as TABLE.
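
The trade-off can be illustrated with a minimal, Spark-free sketch. `TableId`, `TableMeta`, `TableRow`, `GetTablesSketch`, and `loadMeta` are hypothetical stand-ins introduced only for this illustration (they loosely mirror `TableIdentifier`, `CatalogTable`, the result `Row`, and `getTablesByName`); this is not the code from the patch.

```scala
// Hypothetical stand-ins for Spark's TableIdentifier and CatalogTable,
// reduced to the fields this sketch needs.
final case class TableId(table: String, database: Option[String])
final case class TableMeta(id: TableId, tableType: String, comment: Option[String])

// One GetTables result row, reduced to its first five columns.
final case class TableRow(catalog: String, db: String, name: String, typ: String, comment: String)

object GetTablesSketch {
  val TABLE = "TABLE"
  val VIEW = "VIEW"

  def getTables(
      catalogName: String,
      identifiers: Seq[TableId],
      loadMeta: Seq[TableId] => Seq[TableMeta], // stands in for the expensive metadata lookup
      tableTypes: Set[String],
      ignoreTableProperties: Boolean): Seq[TableRow] = {
    if (ignoreTableProperties) {
      // Fast path: no metadata lookup, so the real type and comment are unknown.
      // Every entry is reported as TABLE and tableTypes is not consulted.
      identifiers.map { ti =>
        TableRow(catalogName, ti.database.getOrElse("default"), ti.table, TABLE, "")
      }
    } else {
      // Slow path: load full metadata, then keep only the requested table types.
      loadMeta(identifiers)
        .map(t => (t, if (t.tableType.equalsIgnoreCase(VIEW)) VIEW else TABLE))
        .filter { case (_, typ) => tableTypes.exists(typ.equalsIgnoreCase) }
        .map { case (t, typ) =>
          TableRow(
            catalogName,
            t.id.database.getOrElse("default"),
            t.id.table,
            typ,
            t.comment.getOrElse(""))
        }
    }
  }
}
```

With the flag enabled, a client that asks only for views still receives every table and view, each labelled TABLE, so the setting suits name-listing tools rather than callers that depend on `tableTypes` filtering.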

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Pass GA

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6018 from pan3793/fast-get-table.

Closes #6018

058001c [Cheng Pan] fix
405b124 [Cheng Pan] fix
615b747 [Cheng Pan] Speed up GetTables operation

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d474768)
Signed-off-by: Cheng Pan <[email protected]>
pan3793 committed Jan 29, 2024
1 parent ff2d15f commit f86628f
Showing 3 changed files with 30 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/configuration/settings.md
@@ -376,7 +376,7 @@ You can configure the Kyuubi properties in `$KYUUBI_HOME/conf/kyuubi-defaults.co

| Key | Default | Meaning | Type | Since |
|--------------------------------------------------|---------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-------|
-| kyuubi.operation.getTables.ignoreTableProperties | false | Speed up the `GetTables` operation by returning table identities only. | boolean | 1.8.0 |
+| kyuubi.operation.getTables.ignoreTableProperties | false | Speed up the `GetTables` operation by ignoring `tableTypes` query criteria, and returning table identities only. | boolean | 1.8.0 |
| kyuubi.operation.idle.timeout | PT3H | Operation will be closed when it's not accessed for this duration of time | duration | 1.0.0 |
| kyuubi.operation.interrupt.on.cancel | true | When true, all running tasks will be interrupted if one cancels a query. When false, all running tasks will remain until finished. | boolean | 1.2.0 |
| kyuubi.operation.language | SQL | Choose a programing language for the following inputs<ul><li>SQL: (Default) Run all following statements as SQL queries.</li><li>SCALA: Run all following input as scala codes</li><li>PYTHON: (Experimental) Run all following input as Python codes with Spark engine</li></ul> | string | 1.5.0 |
@@ -163,31 +163,48 @@ object SparkCatalogUtils extends Logging {
     val namespaces = listNamespacesWithPattern(catalog, schemaPattern)
     catalog match {
       case builtin if builtin.name() == SESSION_CATALOG =>
-        val catalog = spark.sessionState.catalog
-        val databases = catalog.listDatabases(schemaPattern)
+        val sessionCatalog = spark.sessionState.catalog
+        val databases = sessionCatalog.listDatabases(schemaPattern)

         def isMatchedTableType(tableTypes: Set[String], tableType: String): Boolean = {
           val typ = if (tableType.equalsIgnoreCase(VIEW)) VIEW else TABLE
           tableTypes.exists(typ.equalsIgnoreCase)
         }

         databases.flatMap { db =>
-          val identifiers = catalog.listTables(db, tablePattern, includeLocalTempViews = false)
-          catalog.getTablesByName(identifiers)
-            .filter(t => isMatchedTableType(tableTypes, t.tableType.name)).map { t =>
-              val typ = if (t.tableType.name == VIEW) VIEW else TABLE
+          val identifiers =
+            sessionCatalog.listTables(db, tablePattern, includeLocalTempViews = false)
+          if (ignoreTableProperties) {
+            identifiers.map { ti: TableIdentifier =>
               Row(
                 catalogName,
-                t.database,
-                t.identifier.table,
-                typ,
-                t.comment.getOrElse(""),
+                ti.database.getOrElse("default"),
+                ti.table,
+                TABLE, // ignore tableTypes criteria and simply treat all table type as TABLE
+                "",
                 null,
                 null,
                 null,
                 null,
                 null)
             }
+          } else {
+            sessionCatalog.getTablesByName(identifiers)
+              .filter(t => isMatchedTableType(tableTypes, t.tableType.name)).map { t =>
+                val typ = if (t.tableType.name == VIEW) VIEW else TABLE
+                Row(
+                  catalogName,
+                  t.database,
+                  t.identifier.table,
+                  typ,
+                  t.comment.getOrElse(""),
+                  null,
+                  null,
+                  null,
+                  null,
+                  null)
+              }
+          }
         }
       case tc: TableCatalog =>
         val tp = tablePattern.r.pattern
@@ -3226,7 +3226,8 @@ object KyuubiConf {

   val OPERATION_GET_TABLES_IGNORE_TABLE_PROPERTIES: ConfigEntry[Boolean] =
     buildConf("kyuubi.operation.getTables.ignoreTableProperties")
-      .doc("Speed up the `GetTables` operation by returning table identities only.")
+      .doc("Speed up the `GetTables` operation by ignoring `tableTypes` query criteria, " +
+        "and returning table identities only.")
       .version("1.8.0")
       .booleanConf
       .createWithDefault(false)