[SPARK-52283][CONNECT] Declarative Pipelines DataflowGraph creation and resolution #51003
Conversation
project/SparkBuild.scala
    )
    },

    (assembly / test) := { },
Do we need this assembly stuff?
apparently not, removed.
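For context on the sbt idiom under discussion (general sbt-assembly behavior, not anything specific to this PR): the assembly task runs the test task before packaging a fat jar by default, and the usual opt-out is to override that task with a no-op. A minimal build.sbt sketch, assuming the sbt-assembly plugin is already on the build classpath; the project name and path are illustrative:

    // build.sbt sketch: with sbt-assembly enabled (via project/plugins.sbt),
    // `assembly` runs `test` first by default. Overriding the task in the
    // assembly scope with a no-op skips tests when building the jar.
    lazy val pipelines = (project in file("sql/pipelines"))
      .settings(
        assembly / test := {}
      )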
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    /**
     * Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
Suggested change:
    - * Test suite for converting a [[PipelineDefinition]]s into a connected [[DataflowGraph]]. These
    + * Test suite for resolving the flows in a [[DataflowGraph]]. These
    import org.apache.spark.sql.types.{IntegerType, StructType}

    /**
     * Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
Suggested change:
    - * Test suite for converting one or more [[Pipeline]]s into a connected [[DataflowGraph]]. These
    + * Test suite for resolving the flows in a [[DataflowGraph]]. These
flushing some comments
    @@ -2025,6 +2031,18 @@
        ],
        "sqlState" : "42613"
      },
      "INCOMPATIBLE_BATCH_VIEW_READ": {
        "message": [
          "View <datasetIdentifier> is not a streaming view and must be referenced using read. This check can be disabled by setting Spark conf pipelines.incompatibleViewCheck.enabled = false."
What is the purpose of this conf and do we really need it?
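Whatever the answer, the error message pins down the intended mechanics: a validation that fires when a batch view is referenced as a streaming source, gated behind a boolean conf. A self-contained sketch of that pattern follows; the conf key comes from the message above, but every type here (PipelineConf, View, the check itself) is a hypothetical stand-in rather than the PR's code:

    // Hypothetical sketch of a conf-gated validation, modeled on the error
    // message above. None of these types come from the PR itself.
    object IncompatibleViewCheckSketch {

      // Stand-in for however the pipeline surfaces its Spark confs.
      final case class PipelineConf(entries: Map[String, String]) {
        def getBoolean(key: String, default: Boolean): Boolean =
          entries.get(key).map(_.toBoolean).getOrElse(default)
      }

      final case class View(identifier: String, isStreaming: Boolean)

      // Fails when a batch view is referenced as a streaming source,
      // unless the check is disabled via conf.
      def checkBatchViewRead(view: View, conf: PipelineConf): Unit = {
        val enabled =
          conf.getBoolean("pipelines.incompatibleViewCheck.enabled", default = true)
        if (enabled && !view.isStreaming) {
          throw new IllegalArgumentException(
            s"View ${view.identifier} is not a streaming view and must be " +
              "referenced using read.")
        }
      }

      def main(args: Array[String]): Unit = {
        val conf = PipelineConf(Map.empty) // check enabled by default
        checkBatchViewRead(View("sales_daily", isStreaming = true), conf) // passes
        // checkBatchViewRead(View("sales_daily", isStreaming = false), conf) // throws
      }
    }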
     * limitations under the License.
     */

    package org.apache.spark.sql.pipelines
Nit: Should this be in some other package/directory?
     * limitations under the License.
     */

    package org.apache.spark.sql.pipelines
Same here
    }

    /**
     * Core processor that is responsible for analyzing each flow and sort the nodes in
What does core mean here?
    /**
     * Processes the node of the graph, re-arranging them if they are not topologically sorted.
     * Takes care of resolving the flows and virtualization if needed for the nodes.
Can you elaborate on what virtualization entails?
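On the topological-sorting half of that doc comment: the standard way to re-arrange nodes so that each appears after its upstream dependencies is Kahn's algorithm. A self-contained sketch with a toy Node type (not the PR's GraphElement):

    // Kahn's algorithm over a toy node type: repeatedly emit nodes whose
    // upstream dependencies have all been emitted. Illustrative only.
    object TopoSortSketch {
      final case class Node(name: String, upstream: Seq[String])

      def topologicalSort(nodes: Seq[Node]): Seq[Node] = {
        val byName = nodes.map(n => n.name -> n).toMap
        // Number of not-yet-emitted upstream dependencies per node.
        val inDegree = scala.collection.mutable.Map(
          nodes.map(n => n.name -> n.upstream.count(byName.contains)): _*)
        val ready = scala.collection.mutable.Queue(
          nodes.filter(n => inDegree(n.name) == 0).map(_.name): _*)
        val ordered = scala.collection.mutable.ArrayBuffer.empty[Node]
        while (ready.nonEmpty) {
          val current = ready.dequeue()
          ordered += byName(current)
          // Everything downstream of `current` loses one pending dependency.
          for (n <- nodes if n.upstream.contains(current)) {
            inDegree(n.name) -= 1
            if (inDegree(n.name) == 0) ready.enqueue(n.name)
          }
        }
        require(ordered.size == nodes.size, "cycle detected in the graph")
        ordered.toSeq
      }

      def main(args: Array[String]): Unit = {
        val nodes = Seq(
          Node("gold", upstream = Seq("silver")),
          Node("bronze", upstream = Seq.empty),
          Node("silver", upstream = Seq("bronze")))
        println(topologicalSort(nodes).map(_.name)) // List(bronze, silver, gold)
      }
    }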
     * @param upstreamNodes Upstream nodes for the node
     * @return
     */
    def processNode(node: GraphElement, upstreamNodes: Seq[GraphElement]): Seq[GraphElement] = {
Nit: Document return. I'm especially curious why this is a Seq and when processNode would return more than one element
    // Table will be virtual in either of the following scenarios:
    // 1. If table is present in context.fullRefreshTables
    // 2. If table has any virtual inputs (flows or tables)
    // 3. If the table pre-existing metadata is different from current metadata
Not sure I follow this comment - it seems like resolvedInputs will always contain a pointer to a VirtualTableInput for any table being resolved?
    val result =
      flowFunctionResult match {
        case f if f.dataFrame.isSuccess =>
          // Merge the flow's inputs' confs into confs for this flow, throwing if any conflict
Can you provide an example in the docs of why conflicting confs can't be supported?
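To make the question concrete, here is a self-contained sketch of the "merge confs, throw on conflict" behavior the comment describes; the names (FlowInput, mergeConfs) are invented for illustration and are not the PR's API. The likely rationale for throwing is that a flow executes once under a single effective conf, so two inputs demanding different values for the same key cannot both be honored:

    // Hypothetical sketch mirroring the code comment above: union the inputs'
    // confs, failing if two inputs set the same key to different values.
    object ConfMergeSketch {
      final case class FlowInput(name: String, confs: Map[String, String])

      def mergeConfs(inputs: Seq[FlowInput]): Map[String, String] =
        inputs.foldLeft(Map.empty[String, String]) { (merged, input) =>
          input.confs.foldLeft(merged) { case (acc, (key, value)) =>
            acc.get(key) match {
              case Some(existing) if existing != value =>
                throw new IllegalStateException(
                  s"Conflicting values for conf $key: '$existing' vs '$value' " +
                    s"(from input ${input.name})")
              case _ => acc + (key -> value)
            }
          }
        }

      def main(args: Array[String]): Unit = {
        val a = FlowInput("a", Map("spark.sql.shuffle.partitions" -> "4"))
        val b = FlowInput("b", Map("spark.sql.shuffle.partitions" -> "4"))
        println(mergeConfs(Seq(a, b))) // fine: the values agree
        // Changing b's value to "8" would make mergeConfs throw.
      }
    }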
What changes were proposed in this pull request?
This PR introduces the DataflowGraph, a container for Declarative Pipelines datasets and flows, as described in the Declarative Pipelines SPIP. It also adds functionality for registering datasets and flows in a graph (via the GraphRegistrationContext) and for resolving the graph.

It also introduces various secondary changes:
- Changes to SparkBuild to support declarative pipelines.
- A pom.xml for the module.
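To make the register-then-resolve shape of the change concrete, here is a self-contained toy model; everything below is invented for illustration, and even DataflowGraph and GraphRegistrationContext are modeled loosely rather than copied from the PR:

    // Toy model of "register datasets and flows, then resolve into a graph".
    // All definitions are illustrative stand-ins, not the PR's actual API.
    object DataflowGraphSketch {
      final case class Table(name: String)
      final case class Flow(name: String, destination: String, inputs: Seq[String])

      // Mutable registration phase: pipeline definitions land here first.
      final class GraphRegistrationContext {
        private val tables = scala.collection.mutable.ArrayBuffer.empty[Table]
        private val flows = scala.collection.mutable.ArrayBuffer.empty[Flow]
        def registerTable(t: Table): Unit = tables += t
        def registerFlow(f: Flow): Unit = flows += f

        // Resolution phase: check that every flow reads from and writes to
        // known tables, then freeze everything into an immutable graph.
        def toDataflowGraph: DataflowGraph = {
          val known = tables.map(_.name).toSet
          flows.foreach { f =>
            val missing = (f.inputs :+ f.destination).filterNot(known.contains)
            require(missing.isEmpty, s"Flow ${f.name} references unknown: $missing")
          }
          DataflowGraph(tables.toSeq, flows.toSeq)
        }
      }

      final case class DataflowGraph(tables: Seq[Table], flows: Seq[Flow])

      def main(args: Array[String]): Unit = {
        val ctx = new GraphRegistrationContext
        ctx.registerTable(Table("bronze"))
        ctx.registerTable(Table("silver"))
        ctx.registerFlow(Flow("clean", destination = "silver", inputs = Seq("bronze")))
        println(ctx.toDataflowGraph)
      }
    }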
Why are the changes needed?
In order to implement Declarative Pipelines.
Does this PR introduce any user-facing change?
No changes to existing behavior.
How was this patch tested?
New test suites:
- ConnectValidPipelineSuite – test cases where the graph can be successfully resolved
- ConnectInvalidPipelineSuite – test cases where the graph fails to be resolved

Was this patch authored or co-authored using generative AI tooling?
No