
[SPARK-52470][ML][CONNECT] Support model summary offloading #51187


Closed
wants to merge 24 commits

Conversation

@WeichenXu123 (Contributor) commented Jun 16, 2025

What changes were proposed in this pull request?

This PR makes Spark Connect ML support model summary offloading.

Model summary offloading is hard to support because the summary holds a reference to a Spark dataset, which can't be easily serialized on the Spark driver. (NOTE: we can't use the Java serializer to serialize the Spark dataset's logical plan, otherwise it would open an RCE vulnerability.)

To address the issue, when saving the summary to disk, only the necessary data fields are saved; when loading the summary back, the client sends the dataset to the Spark driver again. To achieve this, 2 new proto messages are introduced (a client-side sketch of the restore flow follows the list):

  1. CreateSummary in MlCommand
  // This is for re-creating the model summary when the model summary is lost
  // (model summary is lost when the model is offloaded and then loaded back)
  message CreateSummary {
    ObjectRef model_ref = 1;
    Relation dataset = 2;
  }

  2. model_summary_dataset in MlRelation

  // (Optional) the dataset for restoring the model summary
  optional Relation model_summary_dataset = 3;
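
The restore flow enabled by these messages looks roughly like the sketch below. This is a minimal sketch, not the actual implementation: SummaryLostError, fetch_summary, and create_summary are hypothetical stand-ins for the real Spark Connect ML client internals.

  # Minimal sketch of the client-side restore flow; every name here is a
  # hypothetical stand-in for the real Spark Connect ML client internals.

  class SummaryLostError(Exception):
      """Raised when the server reports that the model summary is gone."""

  def fetch_summary_with_restore(client, model_ref, training_relation):
      try:
          # Normal path: the summary is still cached on the Spark driver.
          return client.fetch_summary(model_ref)
      except SummaryLostError:
          # The model was offloaded and loaded back without its summary.
          # Re-send the training dataset so the driver can rebuild it;
          # this is what the new CreateSummary MlCommand carries.
          client.create_summary(model_ref=model_ref, dataset=training_relation)
          return client.fetch_summary(model_ref)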

Why are the changes needed?

Support model summary offloading.
Without this, the model summary is evicted from Spark driver memory after the default 15-minute timeout, which makes the model.summary API unavailable.
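
From a user's perspective, the failure mode is roughly the following. This is a hypothetical repro sketch; train_df is an assumed pre-existing training DataFrame.

  # Hypothetical repro of the failure mode this PR addresses.
  from pyspark.ml.classification import LogisticRegression

  lr = LogisticRegression(maxIter=5)
  model = lr.fit(train_df)  # model and its summary live on the Spark driver

  # ... more than the default 15-minute timeout passes; the driver offloads
  # the model and drops its in-memory summary ...

  # Before this PR: fails because the summary no longer exists on the driver.
  # After this PR: the client re-sends the training dataset so the driver
  # can rebuild the summary.
  print(model.summary.accuracy)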

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <[email protected]>
@WeichenXu123 (Contributor Author) commented

Merged to master.

@LuciferYang (Contributor) commented Jun 18, 2025

@WeichenXu123 The test case "pyspark.ml.tests.connect.test_parity_pipeline" has hit timeout issues twice in the recent commit pipelines, and I don't think I've encountered this issue before. Could you please help confirm whether this is related to the current PR? Thanks ~


also cc @zhengruifeng

@zhengruifeng (Contributor) commented

It may also hang in other ML Connect tests:
pyspark.ml.tests.connect.test_parity_clustering in https://github.com/apache/spark/actions/runs/15742416356/job/44370943991

pyspark.ml.tests.connect.test_parity_classification in https://github.com/apache/spark/actions/runs/15741346719/job/44367299872
