[SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options #51507


Conversation

JiaqiWang18
Contributor

@JiaqiWang18 JiaqiWang18 commented Jul 16, 2025

What changes were proposed in this pull request?

We want to give users the ability to choose a subset of datasets (e.g. tables, materialized views) to include in a run, and to specify whether they should run as a regular refresh or a full refresh.
The following arguments are added to the spark-pipelines CLI to achieve this:

--full-refresh: List of datasets to reset and recompute.

--full-refresh-all: Boolean, whether to perform a full graph reset and recompute.

--refresh: List of datasets to update.

If no options are specified, the default is to perform a refresh for all datasets in the pipeline.

To enable the above:

  • new CLI options are added to the Python CLI
  • proto changes are made to allow passing them to Spark
  • changes in the Spark Pipelines codebase use TableFilter to control graph refresh
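As a rough illustration, the three options could be wired up with argparse along these lines. This is a minimal sketch, not the PR's exact code; `parse_table_list` mirrors the comma-splitting helper visible in the diff below:

```python
import argparse
from typing import List


def parse_table_list(value: str) -> List[str]:
    # Split a comma-separated list of dataset names, dropping empty entries.
    return [name.strip() for name in value.split(",") if name.strip()]


parser = argparse.ArgumentParser(prog="spark-pipelines run")
parser.add_argument(
    "--full-refresh",
    type=parse_table_list,
    action="extend",
    default=[],
    help="List of datasets to reset and recompute (comma-separated).",
)
parser.add_argument(
    "--full-refresh-all",
    action="store_true",
    help="Perform a full graph reset and recompute.",
)
parser.add_argument(
    "--refresh",
    type=parse_table_list,
    action="extend",
    default=[],
    help="List of datasets to update (comma-separated).",
)

args = parser.parse_args(["--refresh", "a,b", "--refresh", "c"])
print(args.refresh)  # ['a', 'b', 'c']
```

With no options given, all three attributes stay at their defaults, which matches the documented fallback of refreshing every dataset in the pipeline.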

Why are the changes needed?

These changes are needed because we want to give users the option to control what to run, and how to run it, for their pipelines.

Does this PR introduce any user-facing change?

Yes, new CLI options are being added. However, SDP hasn't been released yet, so no users should be impacted.

How was this patch tested?

Added a new test suite to the Python CLI to verify argument parsing.
Added a new test suite in the Scala codebase that uses the newly added CLI options to run a full pipeline and verify behavior.

Was this patch authored or co-authored using generative AI tooling?

No

@JiaqiWang18 JiaqiWang18 changed the title [WIP][SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options [SPARK-52810][SDP][SQL] Spark Pipelines CLI Selection Options Jul 16, 2025
@JiaqiWang18
Contributor Author

@AnishMahto

@JiaqiWang18
Contributor Author

@sryza

Contributor

@AnishMahto AnishMahto left a comment


Flushing out some thoughts! Haven't looked at tests yet.

if full_refresh_all:
    if full_refresh:
        raise PySparkException(
            errorClass="CONFLICTING_PIPELINE_REFRESH_OPTIONS", messageParameters={}
        )
Contributor

Thoughts on having sub error classes for mismatched combinations? Or maybe just pass along which two configs are conflicting as a message parameter?

Contributor Author

Added logic to pass along the conflicting option
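A sketch of what that validation could look like. The helper `validate_refresh_options` and its messages are hypothetical stand-ins, and a plain ValueError is used here rather than the PySparkException from the diff; the point is only that the conflicting flag is named in the error:

```python
from typing import List


def validate_refresh_options(
    full_refresh_all: bool, full_refresh: List[str], refresh: List[str]
) -> None:
    # Reject conflicting combinations, naming the offending flag in the message.
    if full_refresh_all:
        if full_refresh:
            raise ValueError("--full-refresh-all cannot be combined with --full-refresh")
        if refresh:
            raise ValueError("--full-refresh-all cannot be combined with --refresh")


# Valid: selecting subsets without --full-refresh-all raises nothing.
validate_refresh_options(False, ["a"], ["b"])
```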

result = []
for table_list in table_lists:
    result.extend(table_list)
return result if result else None
Contributor

If result is an empty list, do we still want to return None? Or should we just return the empty list? What is the implication of either here?

Contributor Author

Removed this by using the extend action in the argument parser to avoid creating a nested list.

"--full-refresh",
type=parse_table_list,
action="append",
help="List of datasets to reset and recompute (comma-separated).",
Contributor

Here and below, should we document default behavior if this arg is not specified at all?

Contributor

Will extend split using commas?

run(spec_path=spec_path)
run(
spec_path=spec_path,
full_refresh=flatten_table_lists(args.full_refresh),
Contributor

Why do we need to flatten args.full_refresh and args.refresh? I thought we defined their types with the parse_table_list function, which returns List[str]

Contributor Author

This is for the case where the user provides the same arg multiple times, e.g. --full-refresh "a,b" --full-refresh "c,d". We then receive a nested list [["a","b"],["c","d"]] and need to flatten it into a 1D list.

Contributor

Ah got it, makes sense

Contributor

If we were to mark this argument field as extend rather than append, would we still need to do any manual flattening?

Contributor Author

Very good point, extend creates a 1D list directly.
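The difference being discussed can be seen in a small sketch: with a type function that already returns a list, append collects one parsed list per flag occurrence, while extend flattens them as it goes:

```python
import argparse


def parse_table_list(value: str):
    # Split a comma-separated list of dataset names.
    return [name.strip() for name in value.split(",") if name.strip()]


append_parser = argparse.ArgumentParser()
append_parser.add_argument("--full-refresh", type=parse_table_list, action="append")

extend_parser = argparse.ArgumentParser()
extend_parser.add_argument("--full-refresh", type=parse_table_list, action="extend")

argv = ["--full-refresh", "a,b", "--full-refresh", "c,d"]
print(append_parser.parse_args(argv).full_refresh)  # [['a', 'b'], ['c', 'd']]
print(extend_parser.parse_args(argv).full_refresh)  # ['a', 'b', 'c', 'd']
```

Note that the comma splitting itself is still done by the type function; extend only changes how the per-flag results are accumulated.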

@@ -224,6 +225,64 @@ private[connect] object PipelinesHandler extends Logging {
sessionHolder: SessionHolder): Unit = {
val dataflowGraphId = cmd.getDataflowGraphId
val graphElementRegistry = DataflowGraphRegistry.getDataflowGraphOrThrow(dataflowGraphId)

Contributor

Can we extract all this added logic to deduce the full refresh and regular refresh table filters into its own function? And then, as part of the Scaladoc, map the expected filter results depending on which combination of full refresh and partial refresh is selected.

Contributor Author

@JiaqiWang18 JiaqiWang18 Jul 17, 2025

extracted a createTableFilters function

Comment on lines 259 to 284
if (refreshTables.nonEmpty && fullRefreshTables.nonEmpty) {
  // check if there is an intersection between the subsets
  val intersection = refreshTableNames.intersect(fullRefreshTableNames)
  if (intersection.nonEmpty) {
    throw new IllegalArgumentException(
      "Datasets specified for refresh and full refresh cannot overlap: " +
        s"${intersection.mkString(", ")}")
  }
}

val fullRefreshTablesFilter: TableFilter = if (fullRefreshAll) {
  AllTables
} else if (fullRefreshTables.nonEmpty) {
  SomeTables(fullRefreshTableNames)
} else {
  NoTables
}

val refreshTablesFilter: TableFilter =
  if (refreshTables.nonEmpty) {
    SomeTables(refreshTableNames)
  } else if (fullRefreshTablesFilter != NoTables) {
    NoTables
  } else {
    AllTables
  }
Contributor

just an optional nit, but as a code reader it's difficult for me to reason about the combinations of fullRefreshTables and refreshTables when reading them as sequential but related validation here.

My suggestion would be to restructure this as a match statement, that explicitly handles each combination. Ex.

(fullRefreshTables, refreshTableNames) match {
  case (Nil, Nil) => ...
  case (fullRefreshTables, Nil) => ...
  case ...
}

Contributor Author

@JiaqiWang18 JiaqiWang18 Jul 17, 2025

extracted a createTableFilters function
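For readers following along, the filter combinations the Scala snippet encodes could be sketched in Python like this. The `TableFilter` classes and `create_table_filters` below are hypothetical stand-ins for the Scala types, not actual Spark APIs:

```python
from dataclasses import dataclass
from typing import List, Tuple


class TableFilter:
    """Stand-in for the Scala TableFilter hierarchy."""


class AllTables(TableFilter):
    pass


class NoTables(TableFilter):
    pass


@dataclass
class SomeTables(TableFilter):
    tables: List[str]


def create_table_filters(
    full_refresh_all: bool, full_refresh: List[str], refresh: List[str]
) -> Tuple[TableFilter, TableFilter]:
    # Overlapping subsets are ambiguous: a table cannot be both kinds of refresh.
    overlap = sorted(set(full_refresh) & set(refresh))
    if overlap:
        raise ValueError(
            "Datasets specified for refresh and full refresh cannot overlap: "
            + ", ".join(overlap)
        )

    if full_refresh_all:
        full_filter: TableFilter = AllTables()
    elif full_refresh:
        full_filter = SomeTables(full_refresh)
    else:
        full_filter = NoTables()

    if refresh:
        refresh_filter: TableFilter = SomeTables(refresh)
    elif not isinstance(full_filter, NoTables):
        # A full-refresh selection was made, so nothing else is refreshed.
        refresh_filter = NoTables()
    else:
        # No selection at all: default to refreshing everything.
        refresh_filter = AllTables()

    return full_filter, refresh_filter
```

Writing the combinations out this way makes the default (no selection → refresh all) and the exclusions easy to check case by case, which is essentially what the reviewer's match-statement suggestion asks for.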

@@ -28,7 +28,7 @@
import yaml
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Generator, Mapping, Optional, Sequence
from typing import Any, Generator, Mapping, Optional, Sequence, List
Contributor

Out of alphabetical order: you may need to run dev/reformat-python to format this.

Contributor Author

actually it didn't reformat this but I manually reordered it

@@ -217,8 +217,30 @@ def change_dir(path: Path) -> Generator[None, None, None]:
os.chdir(prev)


def run(spec_path: Path) -> None:
"""Run the pipeline defined with the given spec."""
def run(
Contributor

If we expect it to vary across run for the same pipeline, it should be a CLI arg. If we expect it to be static for a pipeline, it should live in the spec. I would expect selections to vary across runs.

not should_test_connect or not have_yaml,
connect_requirement_message or yaml_requirement_message,
)
class CLIValidationTests(unittest.TestCase):
Contributor

Is there a meaningful difference between the kinds of tests that are included in this class and the kinds of tests that are included in the other class in this file?

Contributor Author

yeah I think they can be combined into one.

@JiaqiWang18 JiaqiWang18 requested review from sryza and AnishMahto July 17, 2025 18:26
Contributor

@sryza sryza left a comment

LGTM!

@sryza sryza closed this in 9204b05 Jul 17, 2025
@sryza
Contributor

sryza commented Jul 17, 2025

Merged to master
