
Refactor SparkSession to fix Spark-4.0 build #12227

Open
wants to merge 3 commits into
base: branch-25.04

Conversation


@nartal1 nartal1 commented Feb 25, 2025

This contributes to build issue #12062 and is a follow-up of #12198.

Spark 4.0 changed the package name of SparkSession to support Spark Connect.
In this PR, we add support to pick the correct SparkSession for different Spark versions:
For Spark version < 4.0 - it is org.apache.spark.sql.SparkSession
For Spark version 4.0 - it is org.apache.spark.sql.classic.SparkSession
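
A minimal sketch of the shim approach, assuming one TrampolineConnectShims object per shim source tree (the exact contents of the shim object in this PR may differ):

```scala
// Spark 4.0 shim source tree: the non-Connect ("classic") session moved packages.
object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.classic.SparkSession
}
```

```scala
// Pre-4.0 shim source trees: the original package still applies.
object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.SparkSession
}
```

Common code can then refer to TrampolineConnectShims.SparkSession and pick up the right type for whichever Spark version is being built against.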

In addition to the package name change, sqlContext has been removed from DataFrame. InternalColumnarRddConverter.scala has been updated accordingly.
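
The call-site pattern is roughly the following (a hedged sketch; sparkContextOf is a hypothetical helper, and the actual code in InternalColumnarRddConverter.scala may differ):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame

// Before (Spark < 4.0 only): the context was reached through the now-removed sqlContext.
// def sparkContextOf(df: DataFrame): SparkContext = df.sqlContext.sparkContext

// After: go through the owning SparkSession, which DataFrame exposes on all supported versions.
def sparkContextOf(df: DataFrame): SparkContext = df.sparkSession.sparkContext
```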

Before this change:

[INFO] compiling 480 Scala sources and 58 Java sources to /home/nartal/spark-rapids-2504/spark-rapids/scala2.13/sql-plugin/target/spark400/classes ...
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuInsertIntoHadoopFsRelationCommand.scala:209: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:664: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:668: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/InternalColumnarRddConverter.scala:718: value sqlContext is not a member of org.apache.spark.sql.DataFrame
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/TrampolineUtil.scala:104: value cleanupAnyExistingSession is not a member of object org.apache.spark.sql.SparkSession
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark320/scala/org/apache/spark/sql/rapids/shims/Spark32XShimsUtils.scala:55: value leafNodeDefaultParallelism is not a member of org.apache.spark.sql.SparkSession
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/com/nvidia/spark/rapids/shims/GpuInsertIntoHiveTable.scala:143: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/GpuFileFormatWriter.scala:179: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/shims/GpuCreateDataSourceTableAsSelectCommandShims.scala:113: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark332db/scala/org/apache/spark/sql/rapids/shims/GpuDataSource.scala:90: type mismatch;
 found   : org.apache.spark.sql.catalyst.TableIdentifier
 required: String
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark350db143/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala:38: not enough arguments for method splitFiles: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, isSplitable: Boolean, maxSplitBytes: Long, partitionValues: org.apache.spark.sql.catalyst.InternalRow): Seq[org.apache.spark.sql.execution.datasources.PartitionedFile].
Unspecified value parameter partitionValues.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/execution/rapids/shims/FilePartitionShims.scala:29: not enough arguments for method getPartitionedFile: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, partitionValues: org.apache.spark.sql.catalyst.InternalRow, start: Long, length: Long): org.apache.spark.sql.execution.datasources.PartitionedFile.
Unspecified value parameter length.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/hive/rapids/shims/CommandUtilsShim.scala:30: type mismatch;
 found   : SparkSession (in org.apache.spark.sql) 
 required: SparkSession (in org.apache.spark.sql.classic) 
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:24: object ExpressionUtils is not a member of package org.apache.spark.sql.internal
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:27: not found: value expression
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/nvidia/DFUDFShims.scala:28: not found: value column
[ERROR] 16 errors found

After this change:

[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark350db143/scala/com/nvidia/spark/rapids/shims/PartitionedFileUtilsShim.scala:38: not enough arguments for method splitFiles: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, isSplitable: Boolean, maxSplitBytes: Long, partitionValues: org.apache.spark.sql.catalyst.InternalRow): Seq[org.apache.spark.sql.execution.datasources.PartitionedFile].
Unspecified value parameter partitionValues.
[ERROR] [Error] /home/nartal/spark-rapids-2504/spark-rapids/sql-plugin/src/main/spark400/scala/org/apache/spark/sql/execution/rapids/shims/FilePartitionShims.scala:29: not enough arguments for method getPartitionedFile: (file: org.apache.spark.sql.execution.datasources.FileStatusWithMetadata, filePath: org.apache.hadoop.fs.Path, partitionValues: org.apache.spark.sql.catalyst.InternalRow, start: Long, length: Long): org.apache.spark.sql.execution.datasources.PartitionedFile.
Unspecified value parameter length.
[ERROR] two errors found

@nartal1 nartal1 self-assigned this Feb 25, 2025
@nartal1 nartal1 added the bug (Something isn't working), build (Related to CI / CD or cleanly building), and Spark 4.0+ (Spark 4.0+ issues) labels Feb 25, 2025
Collaborator


nit: So far this file looks more like SparkSessionShims

Collaborator Author


I agree; for now it looks like SparkSessionShims, but I kept a generic name so that we can add any new methods/type aliases if required.


object TrampolineConnectShims {
  type SparkSession = org.apache.spark.sql.classic.SparkSession
Collaborator


This means that we should drop preview2 as an accepted Spark 4 version in spark400.SparkShimServiceProvider
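
Roughly what that could look like (an illustrative sketch only; the object and method names below are placeholders rather than the plugin's actual shim-provider API):

```scala
// Hypothetical version gate for the spark400 shim service provider.
object Spark400VersionGate {
  // Accept only the GA version string; "4.0.0-preview2" would no longer be listed here.
  private val acceptedVersions = Seq("4.0.0")

  def matchesVersion(sparkVersion: String): Boolean =
    acceptedVersions.contains(sparkVersion)
}
```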

Collaborator Author


Done. PTAL.

@nartal1
Collaborator Author

nartal1 commented Feb 26, 2025

build
