
[SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution #51003


Open
aakash-db wants to merge 6 commits into master

Conversation


@aakash-db aakash-db commented May 23, 2025

What changes were proposed in this pull request?

This PR introduces the DataflowGraph, a container for Declarative Pipelines datasets and flows, as described in the Declarative Pipelines SPIP. It also adds functionality for

  • Constructing a graph by registering a set of graph elements in succession (GraphRegistrationContext)
  • "Resolving" a graph, which means resolving each of the flows within a graph. Resolving a flow means:
    • Validating that its plan can be successfully analyzed
    • Determining the schema of the data it will produce
    • Determining what upstream datasets within the graph it depends on
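
For readers skimming the description, here is a rough, self-contained sketch of the construction-then-resolution idea described above. The class and method names are made up for illustration (a toy model, not the API this PR adds):

```scala
// Toy model of the concepts above: a registration context collects graph elements,
// and "resolving" each flow checks that its inputs are defined in the graph,
// records its upstream dependencies, and fixes its output schema.
object DataflowGraphSketch {
  final case class Flow(name: String, inputs: Seq[String], schemaDDL: String)
  final case class ResolvedFlow(name: String, upstream: Seq[String], schemaDDL: String)

  // Stand-in for GraphRegistrationContext: elements are registered in succession.
  final class RegistrationContextSketch {
    private val flows = scala.collection.mutable.ArrayBuffer.empty[Flow]
    def registerFlow(flow: Flow): Unit = flows += flow
    def toGraph: Seq[Flow] = flows.toSeq
  }

  // Stand-in for graph resolution: fail if any input is not a dataset defined in the graph.
  def resolve(graph: Seq[Flow]): Either[String, Seq[ResolvedFlow]] = {
    val defined = graph.map(_.name).toSet
    val missing = graph.flatMap(f => f.inputs.filterNot(defined).map(in => s"${f.name} -> $in"))
    if (missing.nonEmpty) Left(s"unresolved references: ${missing.mkString(", ")}")
    else Right(graph.map(f => ResolvedFlow(f.name, f.inputs, f.schemaDDL)))
  }

  def main(args: Array[String]): Unit = {
    val ctx = new RegistrationContextSketch
    ctx.registerFlow(Flow("raw_events", Nil, "id INT, ts TIMESTAMP"))
    ctx.registerFlow(Flow("clean_events", Seq("raw_events"), "id INT"))
    println(resolve(ctx.toGraph)) // Right(...): every input is a dataset defined in the graph
  }
}
```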

It also introduces various secondary changes:

  • Changes to SparkBuild to support declarative pipelines.
  • Updates to the pom.xml for the module.
  • New error conditions.

Why are the changes needed?

In order to implement Declarative Pipelines.

Does this PR introduce any user-facing change?

No changes to existing behavior.

How was this patch tested?

New test suites:

  • ConnectValidPipelineSuite – test cases where the graph can be successfully resolved
  • ConnectInvalidPipelineSuite – test cases where the graph fails to be resolved

Was this patch authored or co-authored using generative AI tooling?

No

@sryza sryza changed the title [SPARK-52283][CONNECT] SDP DataflowGraph creation and resolution [SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution May 23, 2025
@sryza sryza self-requested a review May 23, 2025 21:01
@sryza sryza self-assigned this May 23, 2025
)
},

(assembly / test) := { },
Contributor

Do we need this assembly stuff?

Author

apparently not, removed.

import org.apache.spark.sql.util.CaseInsensitiveStringMap

/**
* Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
Contributor

Suggested change
* Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
* Test suite resolving the flows in a [[DataflowGraph]]. These

import org.apache.spark.sql.types.{IntegerType, StructType}

/**
* Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
Contributor

Suggested change
* Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
* Test suite for resolving the flows in a [[DataflowGraph]]. These

@jonmio jonmio left a comment

flushing some comments

@@ -2025,6 +2031,18 @@
],
"sqlState" : "42613"
},
"INCOMPATIBLE_BATCH_VIEW_READ": {
"message": [
"View <datasetIdentifier> is not a streaming view and must be referenced using read. This check can be disabled by setting Spark conf pipelines.incompatibleViewCheck.enabled = false."

What is the purpose of this conf and do we really need it?
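
For reference, the quoted error message above names the escape hatch itself. Assuming the check reads that key through the session conf (an assumption; the key is copied verbatim from the message), disabling it from an active SparkSession bound to `spark` would look like:

```scala
// Assumption: the check honors this key via the session conf; key copied from the error message above.
spark.conf.set("pipelines.incompatibleViewCheck.enabled", "false")
```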

* limitations under the License.
*/

package org.apache.spark.sql.pipelines

Nit: Should this be in some other package/directory?

* limitations under the License.
*/

package org.apache.spark.sql.pipelines

Same here

}

/**
* Core processor that is responsible for analyzing each flow and sorting the nodes in

What does core mean here?


/**
* Processes the nodes of the graph, re-arranging them if they are not topologically sorted.
* Takes care of resolving the flows and virtualization if needed for the nodes.

Can you elaborate on what virtualization entails?
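
For intuition about the topological re-arrangement mentioned in the quoted doc comment above, here is a standalone sketch (Kahn's algorithm over a name-to-upstream-names map; toy code, not this PR's implementation):

```scala
// Orders graph element names so that every element appears after its upstream inputs.
// Toy sketch only; the PR's processor works on real GraphElement objects.
def topologicallySorted(upstream: Map[String, Seq[String]]): Seq[String] = {
  val indegree = scala.collection.mutable.Map(upstream.map { case (n, ups) => n -> ups.size }.toSeq: _*)
  val downstream = upstream.toSeq
    .flatMap { case (node, ups) => ups.map(_ -> node) }
    .groupMap(_._1)(_._2)
  val ready = scala.collection.mutable.Queue(indegree.collect { case (n, 0) => n }.toSeq: _*)
  val order = scala.collection.mutable.ArrayBuffer.empty[String]
  while (ready.nonEmpty) {
    val node = ready.dequeue()
    order += node
    downstream.getOrElse(node, Nil).foreach { d =>
      indegree(d) -= 1
      if (indegree(d) == 0) ready.enqueue(d)
    }
  }
  require(order.size == upstream.size, "cycle detected among graph elements")
  order.toSeq
}

// Example: "silver" reads from "bronze", so "bronze" is emitted first.
// topologicallySorted(Map("bronze" -> Nil, "silver" -> Seq("bronze")))  // Seq(bronze, silver)
```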

* @param upstreamNodes Upstream nodes for the node
* @return
*/
def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] = {

Nit: Document return. I'm especially curious why this is a Seq and when processNode would return more than one element

Comment on lines +89 to +92
// Table will be virtual in either of the following scenarios:
// 1. If table is present in context.fullRefreshTables
// 2. If table has any virtual inputs (flows or tables)
// 3. If the table pre-existing metadata is different from current metadata

Not sure I follow this comment - it seems like resolvedInputs will always contain a pointer to a VirtualTableInput for any table being resolved?

val result =
flowFunctionResult match {
case f if f.dataFrame.isSuccess =>
// Merge the flow's inputs' confs into confs for this flow, throwing if any conflict

Can you provide an example in the docs of why conflicting confs can't be supported?
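
To illustrate the merge behavior described in the quoted code comment (inputs' confs are combined, and conflicting values are rejected rather than silently overridden), a standalone toy version might look like this; the names are invented for the sketch:

```scala
// Merge per-input conf maps, failing fast if two inputs set the same key to different values.
// Toy illustration only, not the PR's implementation.
def mergeConfs(confs: Seq[Map[String, String]]): Map[String, String] =
  confs.foldLeft(Map.empty[String, String]) { (acc, conf) =>
    conf.foldLeft(acc) { case (merged, (key, value)) =>
      merged.get(key) match {
        case Some(existing) if existing != value =>
          throw new IllegalArgumentException(
            s"Conflicting values for conf '$key': '$existing' vs '$value'")
        case _ => merged + (key -> value)
      }
    }
  }

// Example: the same key set to different values by two inputs cannot be merged.
// mergeConfs(Seq(Map("k" -> "1"), Map("k" -> "2")))  // throws IllegalArgumentException
```

Keeping the merge strict matches the quoted comment's "throwing if any conflict" wording.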
