
[SPARK-52470][ML][CONNECT] Support model summary offloading #51187


Closed
wants to merge 24 commits

Conversation

@WeichenXu123 (Contributor) commented Jun 16, 2025

What changes were proposed in this pull request?

This PR makes Spark Connect ML support model summary offloading.

Model summary offloading is hard to support because the summary holds a reference to a Spark dataset, which can't be easily serialized on the Spark driver. (NOTE: we can't use the Java serializer to serialize the Spark dataset's logical plan, otherwise it would open an RCE vulnerability.)

To address the issue, when saving the summary to disk, only the necessary data fields are saved; when loading the summary back, the client sends the dataset to the Spark driver again. To achieve this, 2 new proto messages are introduced (a client-side sketch of the restore flow follows the list):

  1. CreateSummary in MlCommand
  // This is for re-creating the model summary when the model summary is lost
  // (model summary is lost when the model is offloaded and then loaded back)
  message CreateSummary {
    ObjectRef model_ref = 1;
    Relation dataset = 2;
  }

  2. model_summary_dataset in MlRelation

  // (Optional) the dataset for restoring the model summary
  optional Relation model_summary_dataset = 3;
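
The restore flow enabled by these messages looks roughly like the sketch below. This is a minimal sketch, not the actual implementation: SummaryLostError, fetch_summary, and create_summary are hypothetical stand-ins for the real Spark Connect ML client internals.

  # Minimal sketch of the client-side restore flow; every name here is a
  # hypothetical stand-in for the real Spark Connect ML client internals.

  class SummaryLostError(Exception):
      """Raised when the server reports that the model summary is gone."""

  def fetch_summary_with_restore(client, model_ref, training_relation):
      try:
          # Normal path: the summary is still cached on the Spark driver.
          return client.fetch_summary(model_ref)
      except SummaryLostError:
          # The model was offloaded and loaded back without its summary.
          # Re-send the training dataset so the driver can rebuild it;
          # this is what the new CreateSummary MlCommand carries.
          client.create_summary(model_ref=model_ref, dataset=training_relation)
          return client.fetch_summary(model_ref)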

Why are the changes needed?

Support model summary offloading.
Without this, the model summary is evicted from Spark driver memory after the default 15-minute timeout, which makes the model.summary API unavailable.
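
From a user's perspective, the failure mode is roughly the following. This is a hypothetical repro sketch; train_df is an assumed pre-existing training DataFrame.

  # Hypothetical repro of the failure mode this PR addresses.
  from pyspark.ml.classification import LogisticRegression

  lr = LogisticRegression(maxIter=5)
  model = lr.fit(train_df)  # model and its summary live on the Spark driver

  # ... more than the default 15-minute timeout passes; the driver offloads
  # the model and drops its in-memory summary ...

  # Before this PR: fails because the summary no longer exists on the driver.
  # After this PR: the client re-sends the training dataset so the driver
  # can rebuild the summary.
  print(model.summary.accuracy)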

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Signed-off-by: Weichen Xu <[email protected]>
@WeichenXu123 (Contributor Author) commented

Merged to master.

@LuciferYang (Contributor) commented Jun 18, 2025

@WeichenXu123 The test case "pyspark.ml.tests.connect.test_parity_pipeline" has hit timeout issues twice in the recent commit pipelines, and I don't think I've encountered this issue before. Could you please help confirm whether this is related to the current PR? Thanks ~


also cc @zhengruifeng

@zhengruifeng (Contributor) commented

It may also hang in other ML Connect tests:
pyspark.ml.tests.connect.test_parity_clustering in https://github.com/apache/spark/actions/runs/15742416356/job/44370943991

pyspark.ml.tests.connect.test_parity_classification in https://github.com/apache/spark/actions/runs/15741346719/job/44367299872
