[SPARK-51551] [ML] [PYTHON] [CONNECT] For tuning algorithm, allow using save / load to replace cache #50324

WeichenXu123 · 2025-03-19T10:25:47Z

What changes were proposed in this pull request?

For the tuning algorithms, allow saving a DataFrame and loading it back as a replacement for caching it.

Why are the changes needed?

DataFrame persisting is not well supported in certain cases, so we need a replacement.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <[email protected]>

zhengruifeng

We need tests for this change.

zhengruifeng · 2025-03-19T10:38:42Z

python/pyspark/ml/tuning.py

@@ -75,6 +76,15 @@
 ]


+_SPARKML_TUNING_TEMP_DFS_PATH = "SPARKML_TUNING_TEMP_DFS_PATH"


What about a new parameter instead of this env variable?

Spark config is not available in some circumstances, so I suggest using an environment variable for configuration.

zhengruifeng · 2025-03-21T08:27:07Z

python/pyspark/ml/tuning.py

+            validation = datasets[i][1]
+            train = datasets[i][0]
+
+            if tmp_dfs_path:


Let's define a helper function:

    def _cache(df):
        if ...:
            df.cache()
        else:
            spark = df._session
            df.save
            spark.read...

Then we need to handle the uncache step together in the helper function, i.e., it should be a context manager. Is this what you want?

update

f81d8a4

Signed-off-by: Weichen Xu <[email protected]>

github-actions bot added ML PYTHON labels Mar 19, 2025

WeichenXu123 marked this pull request as draft March 19, 2025 10:32

zhengruifeng reviewed Mar 19, 2025

WeichenXu123 added 3 commits March 19, 2025 20:44

update

dafd60e

Signed-off-by: Weichen Xu <[email protected]>

update

922c7ac

Signed-off-by: Weichen Xu <[email protected]>

update

ed68f98

Signed-off-by: Weichen Xu <[email protected]>

WeichenXu123 marked this pull request as ready for review March 20, 2025 10:01

WeichenXu123 requested a review from zhengruifeng March 20, 2025 10:01

WeichenXu123 added 2 commits March 20, 2025 20:27

update

d37e90c

Signed-off-by: Weichen Xu <[email protected]>

format

9354ffd

Signed-off-by: Weichen Xu <[email protected]>

zhengruifeng approved these changes Mar 21, 2025

zhengruifeng reviewed Mar 21, 2025

WeichenXu123 added 2 commits March 21, 2025 18:29

update

66f3785

Signed-off-by: Weichen Xu <[email protected]>

update

1c1152d

Signed-off-by: Weichen Xu <[email protected]>
