[SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options #51507
Conversation
Fleshing out some thoughts! I haven't looked at the tests yet.
python/pyspark/pipelines/cli.py (Outdated)
if full_refresh_all:
    if full_refresh:
        raise PySparkException(
            errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS", messageParameters={}
Thoughts on having sub-error classes for mismatched combinations? Or maybe just pass along which two configs are conflicting as a message parameter?
Added logic to pass along the conflicting option
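For illustration, a minimal sketch of that validation, assuming a hypothetical conflicting_option message parameter (the real error class in this PR may use a different parameter name):

from pyspark.errors import PySparkException

def validate_refresh_args(full_refresh, full_refresh_all, refresh):
    # Hypothetical helper: --full-refresh-all conflicts with the targeted options.
    if full_refresh_all:
        for flag, value in [("--full-refresh", full_refresh), ("--refresh", refresh)]:
            if value:
                raise PySparkException(
                    errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS",
                    messageParameters={"conflicting_option": flag},  # assumed key
                )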
python/pyspark/pipelines/cli.py (Outdated)
result = []
for table_list in table_lists:
    result.extend(table_list)
return result if result else None
If result is an empty list, do we still want to return None? Or should we just return the empty list? What is the implication of either here?
Removed this by using the `extend` option in the arg parser to avoid creating a nested list.
"--full-refresh", | ||
type=parse_table_list, | ||
action="append", | ||
help="List of datasets to reset and recompute (comma-separated).", |
Here and below, should we document the default behavior if this arg is not specified at all?
Will `extend` split using commas?
python/pyspark/pipelines/cli.py (Outdated)
run(spec_path=spec_path)
run(
    spec_path=spec_path,
    full_refresh=flatten_table_lists(args.full_refresh),
Why do we need to flatten `args.full_refresh` and `args.refresh`? I thought we defined their types with the `parse_table_list` function, which returns `List[str]`.
This is for the case where the user provides the same arg multiple times, e.g. `--full-refresh "a,b" --full-refresh "c,d"`. Then we will receive a nested list `[["a","b"],["c","d"]]`, so we need to perform a flattening to transform it into a 1D list.
Ah got it, makes sense
If we were to mark this argument field as `extend` rather than `append`, would we still need to do any manual flattening?
Very good point, `extend` creates a 1D list directly.
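As a quick illustration of the difference (assuming `parse_table_list` simply splits on commas, which may not exactly match the PR's implementation):

import argparse

def parse_table_list(value: str):
    # Assumed behavior: split a comma-separated string into a list of names.
    return [name.strip() for name in value.split(",") if name.strip()]

parser = argparse.ArgumentParser()
# append keeps each parsed list as its own element; extend flattens as it accumulates.
parser.add_argument("--full-refresh", type=parse_table_list, action="append")
parser.add_argument("--refresh", type=parse_table_list, action="extend", default=[])

args = parser.parse_args(
    ["--full-refresh", "a,b", "--full-refresh", "c,d", "--refresh", "a,b", "--refresh", "c,d"]
)
print(args.full_refresh)  # [['a', 'b'], ['c', 'd']]
print(args.refresh)       # ['a', 'b', 'c', 'd']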
sql/connect/common/src/main/protobuf/spark/connect/pipelines.proto (Outdated; resolved)
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/pipelines/PipelinesHandler.scala (Outdated; resolved)
@@ -224,6 +225,64 @@ private[connect] object PipelinesHandler extends Logging {
      sessionHolder: SessionHolder): Unit = {
    val dataflowGraphId = cmd.getDataflowGraphId
    val graphElementRegistry = DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId)
Can we extract all this added logic for deducing the full-refresh and regular-refresh table filters into its own function? And then, as part of the Scaladoc, map the expected filter results depending on which combination of full refresh and partial refresh is selected?
Extracted a `createTableFilters` function.
...ipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/PipelineUpdateContextImpl.scala (Outdated; resolved)
if (refreshTables.nonEmpty && fullRefreshTables.nonEmpty) {
  // check if there is an intersection between the subset
  val intersection = refreshTableNames.intersect(fullRefreshTableNames)
  if (intersection.nonEmpty) {
    throw new IllegalArgumentException(
      "Datasets specified for refresh and full refresh cannot overlap: " +
        s"${intersection.mkString(", ")}")
  }
}

val fullRefreshTablesFilter: TableFilter = if (fullRefreshAll) {
  AllTables
} else if (fullRefreshTables.nonEmpty) {
  SomeTables(fullRefreshTableNames)
} else {
  NoTables
}

val refreshTablesFilter: TableFilter =
  if (refreshTables.nonEmpty) {
    SomeTables(refreshTableNames)
  } else if (fullRefreshTablesFilter != NoTables) {
    NoTables
  } else {
    AllTables
  }
Just an optional nit, but as a code reader it's difficult for me to reason about the combinations of `fullRefreshTables` and `refreshTables` when reading them as sequential but related validation here.
My suggestion would be to restructure this as a match statement that explicitly handles each combination, e.g.:
(fullRefreshTables, refreshTableNames) match {
  case (Nil, Nil) => ...
  case (fullRefreshTables, Nil) => ...
  case ...
}
Extracted a `createTableFilters` function.
python/pyspark/pipelines/cli.py (Outdated)
@@ -28,7 +28,7 @@
import yaml
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Generator, Mapping, Optional, Sequence
from typing import Any, Generator, Mapping, Optional, Sequence, List
Out of alphabetical order: you may need to run `dev/reformat-python` to format this.
Actually, it didn't reformat this, but I manually reordered it.
@@ -217,8 +217,30 @@ def change_dir(path: Path) -> Generator[None, None, None]:
        os.chdir(prev)


def run(spec_path: Path) -> None:
    """Run the pipeline defined with the given spec."""
def run(
If we expect it to vary across runs for the same pipeline, it should be a CLI arg. If we expect it to be static for a pipeline, it should live in the spec. I would expect selections to vary across runs.
    not should_test_connect or not have_yaml,
    connect_requirement_message or yaml_requirement_message,
)
class CLIValidationTests(unittest.TestCase):
Is there a meaningful difference between the kinds of tests that are included in this class and the kinds of tests that are included in the other class in this file?
Yeah, I think they can be combined into one.
.../test/scala/org/apache/spark/sql/connect/pipelines/SparkDeclarativePipelinesServerTest.scala (Outdated; resolved)
...r/src/test/scala/org/apache/spark/sql/connect/pipelines/PipelineRefreshFunctionalSuite.scala (Outdated; resolved)
...r/src/test/scala/org/apache/spark/sql/connect/pipelines/PipelineRefreshFunctionalSuite.scala (Outdated; resolved)
LGTM!
Merged to master
What changes were proposed in this pull request?
We want to give users the ability to choose a subset of datasets (e.g. tables, materialized views) to include in a run, and the ability to specify whether they should be run as a regular refresh or a full refresh.
The following arguments are being added to the `spark-pipelines` CLI to achieve this:
- `--full-refresh`: List of datasets to reset and recompute.
- `--full-refresh-all`: Boolean; whether to perform a full graph reset and recompute.
- `--refresh`: List of datasets to update.

If no options are specified, the default is to perform a refresh for all datasets in the pipeline.

To enable the above, `TableFilter` is used to control the graph refresh.
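For illustration, invocations with the new flags might look like the following (the flag names come from this PR; the `run` subcommand and spec handling are assumed to follow the existing CLI behavior):

# Refresh only selected datasets
spark-pipelines run --refresh "sales,users"

# Fully reset and recompute one dataset while refreshing another
spark-pipelines run --full-refresh "sales" --refresh "users"

# Reset and recompute the entire graph
spark-pipelines run --full-refresh-all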
Why are the changes needed?
These changes are needed because we want to give users the option to control what to run, and how to run it, for their pipelines.
Does this PR introduce any user-facing change?
Yes, new CLI options are being added. However, SDP hasn't been released yet, so no users should be impacted.
How was this patch tested?
Added a new test suite for the Python CLI to verify argument parsing.
Added a new test suite in the Scala codebase that uses the newly added CLI options to run a full pipeline and verify the behavior.
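Not the PR's actual test suite, but a self-contained sketch of the kind of argument-parsing check described above, using a stand-in parser rather than the real CLI:

import argparse
import unittest

def build_parser():
    # Stand-in parser whose options mirror the ones added in this PR.
    split = lambda s: [t for t in s.split(",") if t]
    parser = argparse.ArgumentParser()
    parser.add_argument("--refresh", type=split, action="extend", default=[])
    parser.add_argument("--full-refresh", type=split, action="extend", default=[])
    parser.add_argument("--full-refresh-all", action="store_true")
    return parser

class RefreshArgsParsingTest(unittest.TestCase):
    def test_repeated_comma_separated_flags_flatten(self):
        args = build_parser().parse_args(["--refresh", "a,b", "--refresh", "c"])
        self.assertEqual(args.refresh, ["a", "b", "c"])

    def test_full_refresh_all_defaults_to_false(self):
        args = build_parser().parse_args([])
        self.assertFalse(args.full_refresh_all)

if __name__ == "__main__":
    unittest.main()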
Was this patch authored or co-authored using generative AI tooling?
No