
[SPARK-51118][PYTHON] Fix ExtractPythonUDFs to check the chained UDF input types for fallback #50341

Closed

Conversation

ueshin
Member

@ueshin ueshin commented Mar 20, 2025

What changes were proposed in this pull request?

Fixes `ExtractPythonUDFs` to check the input types of chained UDFs when deciding whether to fall back to non-Arrow execution.

Why are the changes needed?

Currently, the fallback from an Arrow-optimized Python UDF to a non-Arrow UDF when the UDF has a UDT input or output only works for non-chained UDFs, because only the last UDF in the chain is checked.

For example:

```py
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

row = Row(
    label=1.0,
    point=ExamplePoint(1.0, 2.0),
)

df = spark.createDataFrame([row])

@udf(returnType=DoubleType(), useArrow=True)
def udtInDoubleOut(e):
    return e.y

@udf(returnType=DoubleType(), useArrow=True)
def doubleInDoubleOut(d):
    return d * 100.0

df.select(doubleInDoubleOut(udtInDoubleOut(df.point))).show()
```

This doesn't fall back to non-Arrow because `doubleInDoubleOut` appears to have no UDT input or output, and it fails with:

```
pyspark.errors.exceptions.captured.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'y'
```
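
For contrast, the non-chained call is handled correctly: the single UDF's UDT input triggers the fallback, so it runs on the non-Arrow path. A minimal sketch, reusing the DataFrame and UDF defined above:

```py
# Non-chained: the UDT input of udtInDoubleOut is detected, so this call
# falls back to a regular (non-Arrow) Python UDF and succeeds.
df.select(udtInDoubleOut(df.point)).show()
# expected value: 2.0, the y coordinate of ExamplePoint(1.0, 2.0)
```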

Does this PR introduce any user-facing change?

Yes, the fallback will work with chained UDFs, too.
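
With the fix, the chained call from the example above should hit the same fallback and succeed. A sketch, again reusing the earlier DataFrame and UDFs:

```py
# After the fix, both chained UDFs fall back together to the non-Arrow path.
df.select(doubleInDoubleOut(udtInDoubleOut(df.point)).alias("result")).show()
# expected value: 200.0 (point.y = 2.0, then multiplied by 100.0)
```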

How was this patch tested?

Added the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@ueshin ueshin marked this pull request as draft March 21, 2025 00:24
@ueshin
Member Author

ueshin commented Mar 21, 2025

I'll change the implementation.

@ueshin ueshin requested a review from zhengruifeng March 21, 2025 02:30
@ueshin ueshin marked this pull request as ready for review March 21, 2025 03:11
```diff
@@ -173,7 +173,7 @@ object ExtractPythonUDFs extends Rule[LogicalPlan] with Logging {
   private def canEvaluateInPython(e: PythonUDF): Boolean = {
     e.children match {
       // single PythonUDF child could be chained and evaluated in Python
-      case Seq(u: PythonUDF) => e.evalType == u.evalType && canEvaluateInPython(u)
+      case Seq(u: PythonUDF) => correctEvalType(e) == correctEvalType(u) && canEvaluateInPython(u)
```
Contributor

I am wondering if it is possible to add a rewrite-like rule for SQL_ARROW_BATCHED_UDF <-> SQL_BATCHED_UDF conversion?

Member Author

I guess we can, but I feel it's too much, as this fallback should be temporary, and we should support UDTs with Arrow-optimized Python UDFs soon.
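
In the meantime, a possible user-side workaround (not part of this PR; the `*NoArrow` names below are hypothetical) is to opt out of Arrow optimization for UDFs whose inputs or outputs involve a UDT:

```py
# Hypothetical workaround, not from this PR: declare the UDFs with
# useArrow=False so the UDT-consuming chain runs as plain pickled Python UDFs.
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType(), useArrow=False)
def udtInDoubleOutNoArrow(e):
    return e.y

@udf(returnType=DoubleType(), useArrow=False)
def doubleInDoubleOutNoArrow(d):
    return d * 100.0

# df is the DataFrame built in the PR description example above.
df.select(doubleInDoubleOutNoArrow(udtInDoubleOutNoArrow(df.point))).show()
```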

Contributor

SG

@HyukjinKwon
Member

HyukjinKwon commented Mar 24, 2025

Merged to master and branch-4.0.

HyukjinKwon pushed a commit that referenced this pull request Mar 24, 2025
[SPARK-51118][PYTHON] Fix ExtractPythonUDFs to check the chained UDF input types for fallback

### What changes were proposed in this pull request?

Fixes `ExtractPythonUDFs` to check the chained UDF input types for fallback.

### Why are the changes needed?

Currently, the fallback from an Arrow-optimized Python UDF to a non-Arrow UDF when the UDF has a UDT input or output only works for non-chained UDFs, because only the last UDF in the chain is checked.

For example:

```py
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.testing.sqlutils import ExamplePoint, ExamplePointUDT

row = Row(
    label=1.0,
    point=ExamplePoint(1.0, 2.0),
)

df = spark.createDataFrame([row])

@udf(returnType=DoubleType(), useArrow=True)
def udtInDoubleOut(e):
    return e.y

@udf(returnType=DoubleType(), useArrow=True)
def doubleInDoubleOut(d):
    return d * 100.0

df.select(doubleInDoubleOut(udtInDoubleOut(df.point))).show()
```

This doesn't fall back to non-Arrow because `doubleInDoubleOut` appears to have no UDT input or output, and it fails with:

```
pyspark.errors.exceptions.captured.PythonException:
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'y'
```

### Does this PR introduce _any_ user-facing change?

Yes, the fallback will work with chained UDFs, too.

### How was this patch tested?

Added the related tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50341 from ueshin/issues/SPARK-51118/chained_udf_with_udt.

Authored-by: Takuya Ueshin <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 4e30f96)
Signed-off-by: Hyukjin Kwon <[email protected]>
SauronShepherd pushed a commit to SauronShepherd/spark that referenced this pull request Mar 25, 2025
[SPARK-51118][PYTHON] Fix ExtractPythonUDFs to check the chained UDF input types for fallback
