Commit 489ba0d

itholic authored and HyukjinKwon committed
[SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
### What changes were proposed in this pull request?

This PR proposes to improve Column performance when DQC (DataFrameQueryContext) is disabled, by delaying the call to `getActiveSession`, which is fairly expensive.

### Why are the changes needed?

To improve the performance of Column operations.

### Does this PR introduce _any_ user-facing change?

No API changes; it only improves performance.

### How was this patch tested?

Manually tested, and the existing CI should pass.

```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'false'
```

**Before fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
2.061354875564575
```

**After fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
0.8050589561462402
```

And there is no difference when the flag is on:

```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'true'
```

**Before fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
3.755108118057251
```

**After fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
3.6577670574188232
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49982 from itholic/DQC_improvement.

Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
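The core idea of the fix can be sketched in isolation. The helper names below (`expensive_lookup`, `cheap_flag_check`) are hypothetical stand-ins for illustration, not the actual PySpark internals: `expensive_lookup` plays the role of `SparkSession.getActiveSession()`, and `cheap_flag_check` plays the role of `is_debugging_enabled()` returning `False` when DQC is disabled.

```python
import time


def expensive_lookup():
    # Stand-in for SparkSession.getActiveSession(); simulate its cost.
    time.sleep(0.00005)
    return object()


def cheap_flag_check():
    # Stand-in for is_debugging_enabled(); DQC disabled here.
    return False


def wrapper_before(func, *args, **kwargs):
    # Before the fix: the expensive lookup runs on every call,
    # even when debugging is disabled.
    session = expensive_lookup()
    if session is not None and cheap_flag_check():
        pass  # capture query context here
    return func(*args, **kwargs)


def wrapper_after(func, *args, **kwargs):
    # After the fix: the cheap flag check runs first, so the expensive
    # lookup is skipped entirely when debugging is disabled.
    if cheap_flag_check():
        session = expensive_lookup()  # only paid when actually needed
    return func(*args, **kwargs)
```

Both wrappers return the same result; the second simply avoids paying for the lookup on the common fast path where the flag is off.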
1 parent 4134e9f commit 489ba0d

File tree

1 file changed

+4
-3
lines changed


python/pyspark/errors/utils.py

+4-3
```diff
@@ -255,9 +255,7 @@ def wrapper(*args: Any, **kwargs: Any) -> Any:
         from pyspark.sql import SparkSession
         from pyspark.sql.utils import is_remote
 
-        spark = SparkSession.getActiveSession()
-
-        if spark is not None and hasattr(func, "__name__") and is_debugging_enabled():
+        if hasattr(func, "__name__") and is_debugging_enabled():
             if is_remote():
                 # Getting the configuration requires RPC call. Uses the default value for now.
                 depth = 1
@@ -268,6 +266,9 @@ def wrapper(*args: Any, **kwargs: Any) -> Any:
                 finally:
                     set_current_origin(None, None)
             else:
+                spark = SparkSession.getActiveSession()
+                if spark is None:
+                    return func(*args, **kwargs)
                 assert spark._jvm is not None
                 jvm_pyspark_origin = getattr(
                     spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
```
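The timings in the description can be reproduced without a Spark cluster by wrapping the manual loop in a small `timeit`-based helper. The `Column` class below is a trivial stand-in for `pyspark.sql.Column` (an assumption for illustration, not PySpark's class), so only the call-pattern overhead is measured:

```python
import timeit


class Column:
    """Trivial stand-in for pyspark.sql.Column, used only to time the call pattern."""

    def alias(self, name):
        return (self, name)


def bench(fn, n=10_000):
    """Return total seconds for n calls of fn(), mirroring the PR's manual loop."""
    return timeit.timeit(fn, number=n)


c = Column()
elapsed = bench(lambda: c.alias("a"))
print(f"{elapsed:.4f}s for 10,000 alias() calls")
```

Running this with the real `pyspark.sql.functions.col("name")` before and after the patch is how the before/after numbers above were obtained.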
