Commit 4b9b246

pan3793 authored and dongjoon-hyun committed
[SPARK-51466][SQL][HIVE] Eliminate Hive built-in UDFs initialization on Hive UDF evaluation
### What changes were proposed in this pull request?

Fork a few methods from Hive to eliminate calls of `org.apache.hadoop.hive.ql.exec.FunctionRegistry`, avoiding initialization of the Hive built-in UDFs.

### Why are the changes needed?

Currently, when the user runs a query that contains a Hive UDF, it triggers `o.a.h.hive.ql.exec.FunctionRegistry` initialization, which also initializes the [Hive built-in UDFs, UDAFs and UDTFs](https://github.com/apache/hive/blob/rel/release-2.3.10/ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java#L500). Since [SPARK-51029](https://issues.apache.org/jira/browse/SPARK-51029) (apache#49725) removed hive-llap-common from the Spark binary distributions, a `NoClassDefFoundError` occurs:

```
org.apache.spark.sql.execution.QueryExecutionException: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/llap/security/LlapSigner$Signable
	at java.base/java.lang.Class.getDeclaredConstructors0(Native Method)
	at java.base/java.lang.Class.privateGetDeclaredConstructors(Class.java:3373)
	at java.base/java.lang.Class.getConstructor0(Class.java:3578)
	at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2754)
	at org.apache.hive.common.util.ReflectionUtil.newInstance(ReflectionUtil.java:79)
	at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:208)
	at org.apache.hadoop.hive.ql.exec.Registry.registerGenericUDTF(Registry.java:201)
	at org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>(FunctionRegistry.java:500)
	at org.apache.hadoop.hive.ql.udf.generic.GenericUDF.initializeAndFoldConstants(GenericUDF.java:160)
	at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector$lzycompute(hiveUDFEvaluators.scala:118)
	at org.apache.spark.sql.hive.HiveGenericUDFEvaluator.returnInspector(hiveUDFEvaluators.scala:117)
	at org.apache.spark.sql.hive.HiveGenericUDF.dataType$lzycompute(hiveUDFs.scala:132)
	at org.apache.spark.sql.hive.HiveGenericUDF.dataType(hiveUDFs.scala:132)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeHiveFunctionExpression(HiveSessionStateBuilder.scala:197)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.$anonfun$makeExpression$1(HiveSessionStateBuilder.scala:177)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:187)
	at org.apache.spark.sql.hive.HiveUDFExpressionBuilder$.makeExpression(HiveSessionStateBuilder.scala:171)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1689)
	...
```

Spark does not actually use those Hive built-in functions, but still has to pull in their transitive dependencies to keep Hive happy. By eliminating the Hive built-in UDF initialization, Spark can drop those transitive dependencies and gains a small performance improvement on the first call of a Hive UDF.

### Does this PR introduce _any_ user-facing change?

No, except for a small performance improvement on the first call of a Hive UDF.

### How was this patch tested?

Passed GHA to ensure the ported code is correct.

Manually tested that calling a Hive UDF, UDAF, or UDTF does not trigger `org.apache.hadoop.hive.ql.exec.FunctionRegistry.<clinit>`:

```
$ bin/spark-sql
// UDF
spark-sql (default)> create temporary function hive_uuid as 'org.apache.hadoop.hive.ql.udf.UDFUUID';
Time taken: 0.878 seconds
spark-sql (default)> select hive_uuid();
840356e5-ce2a-4d6c-9383-294d620ec32b
Time taken: 2.264 seconds, Fetched 1 row(s)
// GenericUDF
spark-sql (default)> create temporary function hive_sha2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFSha2';
Time taken: 0.023 seconds
spark-sql (default)> select hive_sha2('ABC', 256);
b5d4045c3f466fa91fe2cc6abe79232a1a57cdf104f7a26e716e0a1e2789df78
Time taken: 0.157 seconds, Fetched 1 row(s)
// UDAF
spark-sql (default)> create temporary function hive_percentile as 'org.apache.hadoop.hive.ql.udf.UDAFPercentile';
Time taken: 0.032 seconds
spark-sql (default)> select hive_percentile(id, 0.5) from range(100);
49.5
Time taken: 0.474 seconds, Fetched 1 row(s)
// GenericUDAF
spark-sql (default)> create temporary function hive_sum as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
Time taken: 0.017 seconds
spark-sql (default)> select hive_sum(*) from range(100);
4950
Time taken: 1.25 seconds, Fetched 1 row(s)
// GenericUDTF
spark-sql (default)> create temporary function hive_replicate_rows as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDTFReplicateRows';
Time taken: 0.012 seconds
spark-sql (default)> select hive_replicate_rows(3L, id) from range(3);
3	0
3	0
3	0
3	1
3	1
3	1
3	2
3	2
3	2
Time taken: 0.19 seconds, Fetched 9 row(s)
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50232 from pan3793/eliminate-hive-udf-init.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
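The root cause above is plain JVM class-initialization semantics: the first active use of a class runs its static initializer (`<clinit>`), and any `NoClassDefFoundError` raised there surfaces at the call site. A minimal, self-contained sketch of the mechanism, using hypothetical stand-in classes rather than Hive's actual code:

```java
// Illustration only (hypothetical stand-ins, not Hive's classes): why a method
// that merely references a registry class can trigger its heavy <clinit>,
// and why "forking" the method to drop that reference avoids it entirely.
public class ClinitDemo {
    static final StringBuilder LOG = new StringBuilder();

    // Stand-in for FunctionRegistry: the static block registers "built-ins".
    static class Registry {
        static {
            LOG.append("registry-initialized;");
            // If a built-in here referenced a class missing from the classpath,
            // this is where NoClassDefFoundError would be thrown.
        }
        static void registerBuiltins() {
            LOG.append("builtins-registered;");
        }
    }

    // Stand-in for GenericUDF: one path touches the registry, one does not.
    static class Udf {
        static String evaluateWithRegistry() {
            Registry.registerBuiltins(); // first use -> triggers Registry.<clinit>
            return "evaluated";
        }
        static String evaluateStandalone() {
            return "evaluated";          // forked logic: no Registry reference
        }
    }

    public static void main(String[] args) {
        // The forked path never loads or initializes Registry at all.
        System.out.println(Udf.evaluateStandalone() + " log=[" + LOG + "]");
        // The original path initializes Registry as a side effect of first use.
        System.out.println(Udf.evaluateWithRegistry() + " log=[" + LOG + "]");
    }
}
```

Merely *loading* `Udf` is harmless; only executing the code path that actively uses `Registry` runs its static initializer, which is why forking the few methods that reference `FunctionRegistry` is sufficient.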
1 parent d965217 commit 4b9b246

File tree: 10 files changed (+669, −20 lines)


project/SparkBuild.scala (+10)

```diff
@@ -412,6 +412,8 @@ object SparkBuild extends PomBuild {
   /* Hive console settings */
   enable(Hive.settings)(hive)
 
+  enable(HiveThriftServer.settings)(hiveThriftServer)
+
   enable(SparkConnectCommon.settings)(connectCommon)
   enable(SparkConnect.settings)(connect)
   enable(SparkConnectClient.settings)(connectClient)
@@ -1203,6 +1205,14 @@ object Hive {
   )
 }
 
+object HiveThriftServer {
+  lazy val settings = Seq(
+    excludeDependencies ++= Seq(
+      ExclusionRule("org.apache.hive", "hive-llap-common"),
+      ExclusionRule("org.apache.hive", "hive-llap-client"))
+  )
+}
+
 object YARN {
   val genConfigProperties = TaskKey[Unit]("gen-config-properties",
     "Generate config.properties which contains a setting whether Hadoop is provided or not")
```

sql/hive-thriftserver/pom.xml (−10)

```diff
@@ -148,16 +148,6 @@
       <artifactId>byte-buddy-agent</artifactId>
       <scope>test</scope>
     </dependency>
-    <dependency>
-      <groupId>${hive.group}</groupId>
-      <artifactId>hive-llap-common</artifactId>
-      <scope>${hive.llap.scope}</scope>
-    </dependency>
-    <dependency>
-      <groupId>${hive.group}</groupId>
-      <artifactId>hive-llap-client</artifactId>
-      <scope>${hive.llap.scope}</scope>
-    </dependency>
     <dependency>
       <groupId>net.sf.jpam</groupId>
       <artifactId>jpam</artifactId>
```

sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java (+9, −1)

```diff
@@ -673,15 +673,23 @@ public void close() throws HiveSQLException {
         hiveHist.closeStream();
       }
       try {
+        // Forcibly initialize thread local Hive so that
+        // SessionState#unCacheDataNucleusClassLoaders won't trigger
+        // Hive built-in UDFs initialization.
+        Hive.getWithoutRegisterFns(sessionState.getConf());
         sessionState.close();
       } finally {
         sessionState = null;
       }
-    } catch (IOException ioe) {
+    } catch (IOException | HiveException ioe) {
       throw new HiveSQLException("Failure to close", ioe);
     } finally {
       if (sessionState != null) {
         try {
+          // Forcibly initialize thread local Hive so that
+          // SessionState#unCacheDataNucleusClassLoaders won't trigger
+          // Hive built-in UDFs initialization.
+          Hive.getWithoutRegisterFns(sessionState.getConf());
           sessionState.close();
         } catch (Throwable t) {
           LOG.warn("Error closing session", t);
```
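The idea behind the `HiveSessionImpl` change is that the close path itself can lazily construct the thread-local `Hive` object, and that lazy construction is what would register the built-in functions; pre-populating the thread local with a "no register functions" instance means the cleanup finds a cached object and never performs the heavy initialization. A self-contained sketch of that pattern, with hypothetical stand-in classes (not Hive's actual `Hive`/`SessionState`):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch (hypothetical stand-ins): pre-initialize a lazily-created thread-local
// resource with a lightweight variant so that a later cleanup step, which would
// otherwise construct it on demand, never triggers the expensive initialization.
public class CloseDemo {
    static final AtomicInteger HEAVY_INITS = new AtomicInteger();

    static class Resource {
        Resource(boolean registerFns) {
            if (registerFns) {
                HEAVY_INITS.incrementAndGet(); // stand-in for built-in UDF registration
            }
        }
    }

    static final ThreadLocal<Resource> CURRENT = new ThreadLocal<>();

    // Lazily gets the thread-local resource, optionally with heavy registration.
    static Resource get(boolean registerFns) {
        Resource r = CURRENT.get();
        if (r == null) {
            r = new Resource(registerFns);
            CURRENT.set(r);
        }
        return r;
    }

    // Mirrors the patched close(): populate the thread local with the light
    // variant first (analogous to Hive.getWithoutRegisterFns(conf)), so the
    // cleanup path finds a cached instance and does no heavy work.
    static void close() {
        get(false);       // light pre-initialization
        get(true);        // cleanup path: cached instance is reused, no heavy init
        CURRENT.remove();
    }

    public static void main(String[] args) {
        close();
        System.out.println("heavy inits = " + HEAVY_INITS.get()); // prints 0
    }
}
```

Without the `get(false)` pre-initialization, the cleanup's `get(true)` would construct the resource and pay the registration cost, which is exactly the situation the forked `Hive.getWithoutRegisterFns` call avoids in `SessionState#close`.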

sql/hive-thriftserver/src/test/resources/log4j2.properties (+3)

```diff
@@ -92,3 +92,6 @@ logger.parquet2.level = error
 
 logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation
 logger.thriftserver.level = off
+
+logger.dagscheduler.name = org.apache.spark.scheduler.DAGScheduler
+logger.dagscheduler.level = error
```
