Commit 489ba0d

itholic authored and HyukjinKwon committed
[SPARK-51242][CONNECT][PYTHON] Improve Column performance when DQC is disabled
### What changes were proposed in this pull request?

This PR proposes to improve Column performance when DQC (DataFrameQueryContext) is disabled, by delaying the call to `getActiveSession`, which is fairly expensive.

### Why are the changes needed?

To improve the performance of Column operations.

### Does this PR introduce _any_ user-facing change?

No API changes; it only improves performance.

### How was this patch tested?

Manually tested, and the existing CI should pass.

```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'false'
```

**Before fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
2.061354875564575
```

**After fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
0.8050589561462402
```

And there is no difference when the flag is on:

```python
>>> spark.conf.get("spark.python.sql.dataFrameDebugging.enabled")
'true'
```

**Before fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
3.755108118057251
```

**After fix**

```python
>>> import time
>>> import pyspark.sql.functions as F
>>>
>>> c = F.col("name")
>>> start = time.time()
>>> for i in range(10000):
...     _ = c.alias("a")
...
>>> print(time.time() - start)
3.6577670574188232
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#49982 from itholic/DQC_improvement.

Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
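The core idea of the fix can be sketched in isolation. The helper names below (`expensive_lookup`, `cheap_flag_check`) are hypothetical stand-ins for illustration, not the actual PySpark internals: `expensive_lookup` plays the role of `SparkSession.getActiveSession()`, and `cheap_flag_check` plays the role of `is_debugging_enabled()` returning `False` when DQC is disabled.

```python
import time


def expensive_lookup():
    # Stand-in for SparkSession.getActiveSession(); simulate its cost.
    time.sleep(0.00005)
    return object()


def cheap_flag_check():
    # Stand-in for is_debugging_enabled(); DQC disabled here.
    return False


def wrapper_before(func, *args, **kwargs):
    # Before the fix: the expensive lookup runs on every call,
    # even when debugging is disabled.
    session = expensive_lookup()
    if session is not None and cheap_flag_check():
        pass  # capture query context here
    return func(*args, **kwargs)


def wrapper_after(func, *args, **kwargs):
    # After the fix: the cheap flag check runs first, so the expensive
    # lookup is skipped entirely when debugging is disabled.
    if cheap_flag_check():
        session = expensive_lookup()  # only paid when actually needed
    return func(*args, **kwargs)
```

Both wrappers return the same result; the second simply avoids paying for the lookup on the common fast path where the flag is off.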
1 parent 4134e9f commit 489ba0d

File tree

1 file changed

+4
-3
lines changed


python/pyspark/errors/utils.py

+4-3
```diff
@@ -255,9 +255,7 @@ def wrapper(*args: Any, **kwargs: Any) -> Any:
         from pyspark.sql import SparkSession
         from pyspark.sql.utils import is_remote
 
-        spark = SparkSession.getActiveSession()
-
-        if spark is not None and hasattr(func, "__name__") and is_debugging_enabled():
+        if hasattr(func, "__name__") and is_debugging_enabled():
             if is_remote():
                 # Getting the configuration requires RPC call. Uses the default value for now.
                 depth = 1
@@ -268,6 +266,9 @@ def wrapper(*args: Any, **kwargs: Any) -> Any:
                 finally:
                     set_current_origin(None, None)
             else:
+                spark = SparkSession.getActiveSession()
+                if spark is None:
+                    return func(*args, **kwargs)
                 assert spark._jvm is not None
                 jvm_pyspark_origin = getattr(
                     spark._jvm, "org.apache.spark.sql.catalyst.trees.PySparkCurrentOrigin"
```
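The timings in the description can be reproduced without a Spark cluster by wrapping the manual loop in a small `timeit`-based helper. The `Column` class below is a trivial stand-in for `pyspark.sql.Column` (an assumption for illustration, not PySpark's class), so only the call-pattern overhead is measured:

```python
import timeit


class Column:
    """Trivial stand-in for pyspark.sql.Column, used only to time the call pattern."""

    def alias(self, name):
        return (self, name)


def bench(fn, n=10_000):
    """Return total seconds for n calls of fn(), mirroring the PR's manual loop."""
    return timeit.timeit(fn, number=n)


c = Column()
elapsed = bench(lambda: c.alias("a"))
print(f"{elapsed:.4f}s for 10,000 alias() calls")
```

Running this with the real `pyspark.sql.functions.col("name")` before and after the patch is how the before/after numbers above were obtained.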
