
[SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution #51003


Open
aakash-db wants to merge 6 commits into master

Conversation


@aakash-db aakash-db commented May 23, 2025

What changes were proposed in this pull request?

This PR introduces the DataflowGraph, a container for Declarative Pipelines datasets and flows, as described in the Declarative Pipelines SPIP. It also adds functionality for

  • Constructing a graph by registering a set of graph elements in succession (GraphRegistrationContext)
  • "Resolving" a graph, which means resolving each of the flows within a graph. Resolving a flow means:
    • Validating that its plan can be successfully analyzed
    • Determining the schema of the data it will produce
    • Determining what upstream datasets within the graph it depends on
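
For readers skimming the description, here is a rough, self-contained sketch of the construction-then-resolution idea described above. The class and method names are made up for illustration (a toy model, not the API this PR adds):

```scala
// Toy model of the concepts above: a registration context collects graph elements,
// and "resolving" each flow checks that its inputs are defined in the graph,
// records its upstream dependencies, and fixes its output schema.
object DataflowGraphSketch {
  final case class Flow(name: String, inputs: Seq[String], schemaDDL: String)
  final case class ResolvedFlow(name: String, upstream: Seq[String], schemaDDL: String)

  // Stand-in for GraphRegistrationContext: elements are registered in succession.
  final class RegistrationContextSketch {
    private val flows = scala.collection.mutable.ArrayBuffer.empty[Flow]
    def registerFlow(flow: Flow): Unit = flows += flow
    def toGraph: Seq[Flow] = flows.toSeq
  }

  // Stand-in for graph resolution: fail if any input is not a dataset defined in the graph.
  def resolve(graph: Seq[Flow]): Either[String, Seq[ResolvedFlow]] = {
    val defined = graph.map(_.name).toSet
    val missing = graph.flatMap(f => f.inputs.filterNot(defined).map(in => s"${f.name} -> $in"))
    if (missing.nonEmpty) Left(s"unresolved references: ${missing.mkString(", ")}")
    else Right(graph.map(f => ResolvedFlow(f.name, f.inputs, f.schemaDDL)))
  }

  def main(args: Array[String]): Unit = {
    val ctx = new RegistrationContextSketch
    ctx.registerFlow(Flow("raw_events", Nil, "id INT, ts TIMESTAMP"))
    ctx.registerFlow(Flow("clean_events", Seq("raw_events"), "id INT"))
    println(resolve(ctx.toGraph)) // Right(...): every input is a dataset defined in the graph
  }
}
```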

It also introduces various secondary changes:

  • Changes to SparkBuild to support declarative pipelines.
  • Updates to the pom.xml for the module.
  • New error conditions.

Why are the changes needed?

In order to implement Declarative Pipelines.

Does this PR introduce any user-facing change?

No changes to existing behavior.

How was this patch tested?

New test suites:

  • ConnectValidPipelineSuite – test cases where the graph can be successfully resolved
  • ConnectInvalidPipelineSuite – test cases where the graph fails to be resolved

Was this patch authored or co-authored using generative AI tooling?

No

@sryza sryza changed the title [SPARK-52283][CONNECT] SDP DataflowGraph creation and resolution [SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution May 23, 2025
@sryza sryza self-requested a review May 23, 2025 21:01
@sryza sryza self-assigned this May 23, 2025
)
},

(assembly / test) := { },
Contributor

Do we need this assembly stuff?

Author

apparently not, removed.

import org.apache.spark.sql.util.CaseInsensitiveStringMap

/**
* Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
Contributor

Suggested change
* Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
* Test suite resolving the flows in a [[DataflowGraph]]. These

import org.apache.spark.sql.types.{IntegerType, StructType}

/**
* Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
Contributor

Suggested change
* Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
* Test suite for resolving the flows in a [[DataflowGraph]]. These

@jonmio jonmio left a comment

flushing some comments

@@ -2025,6 +2031,18 @@
],
"sqlState" : "42613"
},
"INCOMPATIBLE_BATCH_VIEW_READ": {
"message": [
"View <datasetIdentifier> is not a streaming view and must be referenced using read. This check can be disabled by setting Spark conf pipelines.incompatibleViewCheck.enabled = false."

What is the purpose of this conf and do we really need it?
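
For reference, the quoted error message above names the escape hatch itself. Assuming the check reads that key through the session conf (an assumption; the key is copied verbatim from the message), disabling it from an active SparkSession bound to `spark` would look like:

```scala
// Assumption: the check honors this key via the session conf; key copied from the error message above.
spark.conf.set("pipelines.incompatibleViewCheck.enabled", "false")
```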

* limitations under the License.
*/

package org.apache.spark.sql.pipelines

Nit: Should this be in some other package/directory?

* limitations under the License.
*/

package org.apache.spark.sql.pipelines

Same here

}

/**
* Core processor that is responsible for analyzing each flow and sorting the nodes in

What does core mean here?


/**
* Processes the nodes of the graph, re-arranging them if they are not topologically sorted.
* Takes care of resolving the flows and virtualization if needed for the nodes.

Can you elaborate on what virtualization entails?
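
For intuition about the topological re-arrangement mentioned in the quoted doc comment above, here is a standalone sketch (Kahn's algorithm over a name-to-upstream-names map; toy code, not this PR's implementation):

```scala
// Orders graph element names so that every element appears after its upstream inputs.
// Toy sketch only; the PR's processor works on real GraphElement objects.
def topologicallySorted(upstream: Map[String, Seq[String]]): Seq[String] = {
  val indegree = scala.collection.mutable.Map(upstream.map { case (n, ups) => n -> ups.size }.toSeq: _*)
  val downstream = upstream.toSeq
    .flatMap { case (node, ups) => ups.map(_ -> node) }
    .groupMap(_._1)(_._2)
  val ready = scala.collection.mutable.Queue(indegree.collect { case (n, 0) => n }.toSeq: _*)
  val order = scala.collection.mutable.ArrayBuffer.empty[String]
  while (ready.nonEmpty) {
    val node = ready.dequeue()
    order += node
    downstream.getOrElse(node, Nil).foreach { d =>
      indegree(d) -= 1
      if (indegree(d) == 0) ready.enqueue(d)
    }
  }
  require(order.size == upstream.size, "cycle detected among graph elements")
  order.toSeq
}

// Example: "silver" reads from "bronze", so "bronze" is emitted first.
// topologicallySorted(Map("bronze" -> Nil, "silver" -> Seq("bronze")))  // Seq(bronze, silver)
```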

* @param upstreamNodes Upstream nodes for the node
* @return
*/
def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] = {

Nit: Document return. I'm especially curious why this is a Seq and when processNode would return more than one element

Comment on lines +89 to +92
// Table will be virtual in either of the following scenarios:
// 1. If table is present in context.fullRefreshTables
// 2. If table has any virtual inputs (flows or tables)
// 3. If the table pre-existing metadata is different from current metadata

Not sure I follow this comment - it seems like resolvedInputs will always contain a pointer to a VirtualTableInput for any table being resolved?

val result =
flowFunctionResult match {
case f if f.dataFrame.isSuccess =>
// Merge the flow's inputs' confs into confs for this flow, throwing if any conflict

Can you provide an example in the docs of why conflicting confs can't be supported?
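
To illustrate the merge behavior described in the quoted code comment (inputs' confs are combined, and conflicting values are rejected rather than silently overridden), a standalone toy version might look like this; the names are invented for the sketch:

```scala
// Merge per-input conf maps, failing fast if two inputs set the same key to different values.
// Toy illustration only, not the PR's implementation.
def mergeConfs(confs: Seq[Map[String, String]]): Map[String, String] =
  confs.foldLeft(Map.empty[String, String]) { (acc, conf) =>
    conf.foldLeft(acc) { case (merged, (key, value)) =>
      merged.get(key) match {
        case Some(existing) if existing != value =>
          throw new IllegalArgumentException(
            s"Conflicting values for conf '$key': '$existing' vs '$value'")
        case _ => merged + (key -> value)
      }
    }
  }

// Example: the same key set to different values by two inputs cannot be merged.
// mergeConfs(Seq(Map("k" -> "1"), Map("k" -> "2")))  // throws IllegalArgumentException
```

Keeping the merge strict matches the quoted comment's "throwing if any conflict" wording.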
