
Refactor SparkSession to fix Spark-4.0 build #12227

Open
wants to merge 3 commits into
base: branch-25.04

Conversation


@nartal1 nartal1 commented Feb 25, 2025

This contributes to build issue #12062 and is a follow-up of #12198.

Spark 4.0 changed the package name of SparkSession to support Spark Connect.
In this PR, we add support to pick the correct SparkSession for different Spark versions:
For Spark version < 4.0 - it is org.apache.spark.sql.SparkSession
For Spark version 4.0 - it is org.apache.spark.sql.classic.SparkSession
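
A minimal sketch of the shim approach, assuming one TrampolineConnectShims object per shim source tree (the exact contents of the shim object in this PR may differ):

```scala
// Spark 4.0 shim source tree: the non-Connect ("classic") session moved packages.
object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.classic.SparkSession
}
```

```scala
// Pre-4.0 shim source trees: the original package still applies.
object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.SparkSession
}
```

Common code can then refer to TrampolineConnectShims.SparkSession and pick up the right type for whichever Spark version is being built against.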

In addition to the package name change, sqlContext has been removed from DataFrame. InternalColumnarRddConverter.scala has been updated accordingly.
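
The call-site pattern is roughly the following (a hedged sketch; sparkContextOf is a hypothetical helper, and the actual code in InternalColumnarRddConverter.scala may differ):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// Before (Spark < 4.0 only): the context was reached through the now-removed sqlContext.
// def sparkContextOf(df: DataFrame): SparkContext = df.sqlContext.sparkContext

// After: go through the owning SparkSession, which DataFrame exposes on all supported versions.
def sparkContextOf(df: DataFrame): SparkContext = df.sparkSession.sparkContext
```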

Before this change:

[INFO] compiling 480 Scala sources and 58 Java sources to /home/nartal/spark-rapids-2504/spark-rapids/scala2.13/sql-plugin/target/spark400/classes ...
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuInsertIntoHadoopFsRelationCommand.scala:209: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:664: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:668: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:718: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/TrampolineUtil.scala:104: value cleanupAnyExistingSession is not a member of object org.apache.spark.sql.SparkSession
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/shims/Spark32XShimsUtils.scala:55: value leafNodeDefaultParallelism is not a member of org.apache.spark.sql.SparkSession
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/com/nvidia/spark/rapids/shims/GpuInsertIntoHiveTable.scala:143: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/GpuFileFormatWriter.scala:179: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/shims/GpuCreateDataSourceTableAsSelectCommandShims.scala:113: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/shims/GpuDataSource.scala:90: type mismatch;
 found   : org.apache.spark.sql.catalyst.TableIdentifier
 required: String
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark350db143/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala:38: not enough arguments for method splitFiles: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, isSplitable: Boolean, maxSplitBytes: Long, partitionValues: org.apache.spark.sql.catalyst.InternalRow): Seq[org.apache.spark.sql.execution.datasources.PartitionedFile].
Unspecified value parameter partitionValues.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/execution/rapids/shims/FilePartitionShims.scala:29: not enough arguments for method getPartitionedFile: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, partitionValues: org.apache.spark.sql.catalyst.InternalRow, start: Long, length: Long): org.apache.spark.sql.execution.datasources.PartitionedFile.
Unspecified value parameter length.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/hive/rapids/shims/CommandUtilsShim.scala:30: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:24: object ExpressionUtils is not a member of package org.apache.spark.sql.internal
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:27: not found: value expression
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:28: not found: value column
[ERROR] 16 errors found

After this change:

[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark350db143/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala:38: not enough arguments for method splitFiles: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, isSplitable: Boolean, maxSplitBytes: Long, partitionValues: org.apache.spark.sql.catalyst.InternalRow): Seq[org.apache.spark.sql.execution.datasources.PartitionedFile].
Unspecified value parameter partitionValues.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/execution/rapids/shims/FilePartitionShims.scala:29: not enough arguments for method getPartitionedFile: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, partitionValues: org.apache.spark.sql.catalyst.InternalRow, start: Long, length: Long): org.apache.spark.sql.execution.datasources.PartitionedFile.
Unspecified value parameter length.
[ERROR] two errors found

@nartal1 nartal1 self-assigned this Feb 25, 2025
@nartal1 nartal1 added the bug (Something isn't working), build (Related to CI / CD or cleanly building), and Spark 4.0+ (Spark 4.0+ issues) labels Feb 25, 2025
Collaborator


nit: So far this file looks more like SparkSessionShims

Collaborator Author


I agree; for now it looks like SparkSessionShims, but I kept a generic name so that we can add any new methods/type aliases if required.


object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.classic.SparkSession
Collaborator


This means that we should drop preview2 as an accepted Spark 4 version in spark400.SparkShimServiceProvider
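
Roughly what that could look like (an illustrative sketch only; the object and method names below are placeholders rather than the plugin's actual shim-provider API):

```scala
// Hypothetical version gate for the spark400 shim service provider.
object Spark400VersionGate {
  // Accept only the GA version string; "4.0.0-preview2" would no longer be listed here.
  private val acceptedVersions = Seq("4.0.0")

  def matchesVersion(sparkVersion: String): Boolean =
    acceptedVersions.contains(sparkVersion)
}
```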

Collaborator Author


Done. PTAL.

@nartal1
Collaborator Author

nartal1 commented Feb 26, 2025

build
