
[SPARK-51537][CONNECT][CORE] construct the session-specific classloader based on the default session classloader on executor #50334

Open · wants to merge 3 commits into base: master
Conversation


@wbo4958 wbo4958 commented Mar 20, 2025

What changes were proposed in this pull request?

This PR constructs the session-specific classloader on the executor side in Connect mode on top of the default session classloader, which has already added the global jars (e.g., those added via --jars) to its classpath.

Why are the changes needed?

In Spark Connect mode, when connecting to a non-local (e.g., standalone) cluster, the executor creates an isolated session state that includes a session-specific classloader for each task. However, a notable issue arises: this session-specific classloader does not include the global JARs specified by the --jars option in the classpath. This oversight can lead to deserialization exceptions. For example:

Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)
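The delegation failure can be illustrated outside Spark with plain JDK classloaders. This is a minimal sketch, not Spark's actual executor code: the bootstrap-parented loader stands in for a session classloader built without the default session's classpath, and the system classloader stands in for the default session classloader that already holds the global jars.

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderDemo {

    // Returns true if `loader` can resolve the named class.
    static boolean canSee(ClassLoader loader, String className) {
        try {
            loader.loadClass(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A loader parented on the bootstrap loader: application classes
        // (such as this one) are invisible to it, just as --jars classes are
        // invisible to a session classloader that skips the default session
        // classloader.
        ClassLoader isolated = new URLClassLoader(new URL[0], null);
        System.out.println("isolated: " + canSee(isolated, "ClassLoaderDemo"));

        // Parenting the child on the loader that already holds the needed
        // classes restores visibility through normal parent delegation,
        // without copying any URLs into the child.
        ClassLoader chained =
            new URLClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println("chained: " + canSee(chained, "ClassLoaderDemo"));
    }
}
```

When deserialization of a task's lambda runs under the first kind of loader, the capturing class cannot be resolved, which surfaces as the SerializedLambda ClassCastException above.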

Does this PR introduce any user-facing change?

No

How was this patch tested?

The newly added test passes, and the following manual test passes as well:

  1. Clone the minimal project that reproduces this issue:
git clone [email protected]:wbo4958/ConnectMLIssue.git
  2. Compile the project:
mvn clean package
  3. Start a standalone cluster:
$SPARK_HOME/sbin/start-master.sh -h localhost
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
  4. Start a Connect server that connects to the standalone cluster:
./standalone.sh
  5. Play around with the demo

Run the repro script in a pyspark client environment:

python repro-issue.py

Without this PR, you will see the exception below:

Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
	at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)
	at java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2060)
	at java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1347)
	at java.io.ObjectInputStream$FieldValues.defaultCheckFieldValues(ObjectInputStream.java:2679)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2486)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2257)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1733)
	at java.io.ObjectInputStream$FieldValues.<init>(ObjectInputStream.java:2606)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2457)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2257)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1733)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:509)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:467)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:88)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:136)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:86)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:645)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:648)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:840)

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Mar 20, 2025
@wbo4958 wbo4958 changed the title [SPARK-51537][CONNECT] [constructed classpath using both global jars and session specific jars in executor [SPARK-51537][CONNECT][CORE] [constructed classpath using both global jars and session specific jars in executor Mar 20, 2025

@wbo4958 wbo4958 force-pushed the connect-executor-classpath branch from 7b963dc to bbe2a94 Compare March 21, 2025 02:17
@wbo4958 wbo4958 changed the title [SPARK-51537][CONNECT][CORE] [constructed classpath using both global jars and session specific jars in executor [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars in executor Mar 21, 2025
@wbo4958 wbo4958 marked this pull request as ready for review March 21, 2025 02:43
@wbo4958 wbo4958 changed the title [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars in executor [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars on executor Mar 21, 2025
@wbo4958
Copy link
Contributor Author

wbo4958 commented Mar 21, 2025

Hi @hvanhovell @zhenlineo @HyukjinKwon @vicennial, could you help review this PR? Thanks very much.

@wbo4958 wbo4958 marked this pull request as ready for review March 25, 2025 08:01
@wbo4958 wbo4958 changed the title [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars on executor [SPARK-51537][CONNECT][CORE] construct the session-specific classloader based on the default session classloader on executor Mar 25, 2025
@vicennial
Contributor

Thanks for identifying this issue, @wbo4958! While your PR resolves the executor-side problem, I believe we have a chance to refine our approach to cover both executor operations (e.g., typical UDFs) and driver operations (e.g., custom data sources) in one unified solution.

The high-level proposal: In the ArtifactManager, add an initialisation step that would copy JARs from the underlying session.sparkContext.addedJars(DEFAULT_SESSION_ID) into session.sparkContext.addedJars(session.sessionUUID).
Advantages:

  • Enhanced session isolation
    • Global JARs are copied during initialization, so any subsequent changes to the default session jars do not affect the session-specific context.
    • This isolation is particularly beneficial in standalone clusters where Spark Connect sessions coexist with traditional sessions (i.e., those interacting directly with SparkContext).
  • Since the copied global JARs behave as session-scoped JARs, no extra modifications to the executor’s code or classloader are required.

The downside is that duplicating the global JARs for each new Spark Connect session will naturally consume more resources. We could mitigate this by adding a Spark configuration option to toggle whether global JARs are inherited into a Spark Connect session.
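The copy-on-initialisation semantics of this proposal can be sketched in isolation. This is a hypothetical model, not the real ArtifactManager: `addedJars` stands in for SparkContext's per-session jar registry as a plain map keyed by session id, and all names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class JarInheritanceSketch {
    // Hypothetical stand-in for the per-session jar registry:
    // session id -> (jar path -> added-at timestamp).
    static final Map<String, Map<String, Long>> addedJars = new ConcurrentHashMap<>();
    static final String DEFAULT_SESSION_ID = "default";

    // Sketch of the proposed initialisation step: snapshot the global
    // (default-session) jars into the new session's own entry, so later
    // changes to the default session do not leak into it.
    static void initSession(String sessionUUID) {
        Map<String, Long> globals =
            addedJars.getOrDefault(DEFAULT_SESSION_ID, Map.of());
        addedJars.computeIfAbsent(sessionUUID, k -> new ConcurrentHashMap<>())
                 .putAll(globals);
    }

    public static void main(String[] args) {
        addedJars.put(DEFAULT_SESSION_ID,
            new ConcurrentHashMap<>(Map.of("spark://host/jars/global.jar", 1L)));
        initSession("session-1");
        // A jar added globally after initialisation stays invisible to
        // session-1, giving the isolation described above.
        addedJars.get(DEFAULT_SESSION_ID).put("spark://host/jars/late.jar", 2L);
        System.out.println(addedJars.get("session-1").keySet());
    }
}
```

Because the copied jars are indistinguishable from session-scoped jars, the executor's existing per-session classloader construction would pick them up with no further changes, which is the advantage the comment above describes.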

WDYT?
