Hey @kinghuang - thanks for putting this very detailed proposal together. I think you are right that we have some duplicative concepts. And the way you frame the similarities between presets and partition sets makes a lot of sense to me.

For the pipelines that you're considering this for, do they have schedules? Or do you typically kick them off manually? Schedules add another layer of configuration-mapping, so I'm wondering how that fits in for you.

Currently, we have APIs that enable people to create schedules and partition set definitions with a single invocation. That relies on the fact that partition set definitions exist outside of pipeline definitions (unlike presets). So if we wanted to do the consolidation that you're proposing, and we wanted to preserve those APIs, we might need to change preset to live outside the pipeline definition (like partition set currently does). Do you foresee issues with going in that direction?

Independently, I have found the fact that we have both presets and modes somewhat duplicative. For example, if someone wants to connect to a development database when running their pipeline in development and a production database when running their pipeline in production, they can accomplish this with both modes (using configured) and presets. That has made me wonder whether there's some way that it would make sense to fold preset into mode. However, it would require some thinking to figure out how that fits with what you've proposed here.
Motivation
Dagster pipelines can have predefined run configs for pipeline executions derived from presets and partition sets. Despite their related functions, they are treated quite differently.
The differences and lack of interplay between preset definitions and partition set definitions were very confusing to me when I was learning to use Dagster. What seemed like two different but complementary things turned out to be mostly overlapping and exclusionary. I've ended up always using PartitionSetDefinition for pipeline configurations, whether partitions are involved or not, because it is essentially a more capable version of PresetDefinition. But this complicates API calls into Dagster (see point 3 above). I believe the dichotomy between `PresetDefinition` and `PartitionSetDefinition` is unnecessary and makes pipeline configuration and execution more complicated in Dagster. I also find presets and partition sets too limiting in their current forms, making it hard to scale pipelines beyond simple configuration scenarios.

I propose that PresetDefinition and PartitionSetDefinition be unified to reduce complexity and open the way to more adaptable pipeline configurations. This will also affect related concepts like executing a pipeline versus backfilling.
The Grand Unified Theory of Pipeline Presets
`PresetDefinition` and `PartitionSetDefinition` serve the same end goal: to produce a run configuration. In Dagit's playground view, they are shown together as options in the same list. The key difference between the two is that `PartitionSetDefinition` has a variable: the partition. Given Dagster's tagline as a data orchestrator, I suspect this was meant to support common data partitioning practices, where a dataset may be partitioned for scalability, performance, security, or other reasons. A common strategy is to divide a dataset by year and month. The usage documentation for Defining a Partition Set shows date-based partitioning as an example.

So, a `PartitionSetDefinition` is essentially a `PresetDefinition` with a `partition` variable. Suppose we had a pipeline with a single solid that echoed its input text. We can pre-configure this pipeline to output “hello” or “bonjour” with the following two methods.

Here is some example code for both options and screenshots of the result in Dagit.
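As a rough, library-free sketch of the two methods (plain Python rather than the Dagster API; the `Partition` stand-in, the `echo_run_config` helper, and the `text` input name are assumptions made for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Illustrative stand-in for a partition: just a wrapped value."""
    value: str


def echo_run_config(text: str) -> dict:
    """Build a run config that feeds `text` to the `echo` solid (input name assumed)."""
    return {"solids": {"echo": {"inputs": {"text": {"value": text}}}}}


# Method 1: presets. One static run config per greeting.
presets = {
    "hello": echo_run_config("hello"),
    "bonjour": echo_run_config("bonjour"),
}


# Method 2: a partition set. One list of partitions plus one function of the partition.
def partition_fn() -> list:
    return [Partition("hello"), Partition("bonjour")]


def run_config_fn_for_partition(p: Partition) -> dict:
    return echo_run_config(p.value)


# Resolving every partition yields the same run configs as the static presets.
greetings_set_configs = {p.value: run_config_fn_for_partition(p) for p in partition_fn()}
```

Both routes end in identical run configurations; the only difference is whether the greeting is written out statically or read from `p.value`.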
Putting aside the differences in how presets and partition sets are implemented, notice the static `"hello"` and `"bonjour"` strings in the preset definitions, and the `p.value` statement referencing the chosen partition's value in the partition set definition.

The “hello” and “bonjour” values in `greetings_set` aren't really partitions in the data partitioning sense. Partitions here serve as a simple variable in the pipeline configuration, specifying which word to pass to the `echo` solid. The `partition` in `PartitionSetDefinition` is effectively an opinionated variable.

If a partition set is a preset with a single variable, then it should be possible to describe a `PartitionSetDefinition` using a `PresetDefinition` if the latter supported variables.

Here is a hypothetical preset definition with a variable.
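A minimal sketch of what such a preset could look like, written as a standalone class so it runs without Dagster. The `VariablePreset` class and its `run_config_for` method are hypothetical names invented here; only `variables_fn`, `run_config_fn`, and the `dynamic` preset name come from the proposal.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VariablePreset:
    """Hypothetical stand-in for a PresetDefinition that supports variables."""
    name: str
    variables_fn: Callable[[], Dict[str, List[str]]]
    run_config_fn: Callable[[Dict[str, str]], dict]

    def run_config_for(self, **chosen: str) -> dict:
        """Check each chosen value against the declared variables, then build the config."""
        declared = self.variables_fn()
        for key, value in chosen.items():
            if value not in declared.get(key, []):
                raise ValueError(f"{value!r} is not a valid value for {key!r}")
        return self.run_config_fn(chosen)


dynamic = VariablePreset(
    name="dynamic",
    # Plays the role of partition_fn: declares one variable and its allowed values.
    variables_fn=lambda: {"partition": ["hello", "bonjour"]},
    # Plays the role of run_config_fn_for_partition: maps chosen values to a run config.
    run_config_fn=lambda v: {
        "solids": {"echo": {"inputs": {"text": {"value": v["partition"]}}}}
    },
)

print(dynamic.run_config_for(partition="bonjour"))
```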
The `dynamic` preset definition specifies a dict of variables via `variables_fn` that has a `partition` variable with `"hello"` and `"bonjour"` as values. This is analogous to the `partition_fn` parameter in the `greetings_set` partition set definition. Then, the `run_config_fn` parameter is used to provide a function that will take in the selected variables and return a run configuration, like `run_config_fn_for_partition`.

There is no need to use “partition” as the variable name. We can name it “greeting” instead. And, for a non-dynamic dict of variables, use a `variables` parameter in place of `variables_fn`.

As the `variables` key implies, it should be possible to specify more than one variable. Returning to date-based data partitions, consider a partitioning scheme with a range of year and month components. Using a `PartitionSetDefinition`, the partitions would be specified as a linear list of values to `partition_fn` like the following.

For a large date range, this results in a very long list of partition choices.
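To make the growth of such a linear list concrete, here is a small sketch in plain Python. The 2015 to 2020 range and the `YYYY-MM` label format are assumptions chosen for illustration.

```python
def year_month_partitions(start_year: int, end_year: int) -> list:
    """Return every "YYYY-MM" label from January of start_year to December of end_year."""
    return [
        f"{year}-{month:02d}"
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]


partitions = year_month_partitions(2015, 2020)
print(len(partitions))  # 72 entries for six years
print(partitions[:3])   # ['2015-01', '2015-02', '2015-03']
```

Six years already produce 72 choices in a single flat list, whereas separate year and month variables would need only 6 + 12 declared values.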
Using `year` and `month` variables, this can be expressed as the following.

By extending `PresetDefinition` with the capability to handle variables, it can encompass the functionality of `PartitionSetDefinition`, while also enabling new configuration possibilities through the use of zero to many variables.

Even with no variables, the `run_config_fn` parameter gives a chance for the run config to be customized or generated when requested, like `run_config_fn_for_partition` in `PartitionSetDefinition`.

Proposed PresetDefinition API
Here is the proposed signature for the PresetDefinition class.
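One shape such a signature could take is sketched below. This is a guess assembled from the parameters named in this proposal (`variables`, `variables_fn`, `run_config_fn`) plus the usual preset arguments, not the exact signature and not an existing Dagster class. The class name `ProposedPresetDefinition`, the two getter methods, and the `load` solid in the usage example are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

Variables = Dict[str, List[Any]]


class ProposedPresetDefinition:
    """Hypothetical sketch of a variable-aware preset; not an existing Dagster class."""

    def __init__(
        self,
        name: str,
        run_config: Optional[dict] = None,
        run_config_fn: Optional[Callable[[Dict[str, Any]], dict]] = None,
        variables: Optional[Variables] = None,
        variables_fn: Optional[Callable[[], Variables]] = None,
        solid_selection: Optional[List[str]] = None,
        mode: Optional[str] = None,
        tags: Optional[Dict[str, str]] = None,
    ) -> None:
        # A static value and the function that generates it are mutually exclusive.
        if run_config is not None and run_config_fn is not None:
            raise ValueError("pass either run_config or run_config_fn, not both")
        if variables is not None and variables_fn is not None:
            raise ValueError("pass either variables or variables_fn, not both")
        self.name = name
        self.solid_selection = solid_selection
        self.mode = mode
        self.tags = tags or {}
        self._run_config = run_config
        self._run_config_fn = run_config_fn
        self._variables = variables
        self._variables_fn = variables_fn

    def get_variables(self) -> Variables:
        """Return the declared variables, calling variables_fn if one was given."""
        if self._variables_fn is not None:
            return self._variables_fn()
        return self._variables or {}

    def get_run_config(self, chosen: Optional[Dict[str, Any]] = None) -> dict:
        """Return the static run config, or build one from the chosen variable values."""
        if self._run_config_fn is not None:
            return self._run_config_fn(chosen or {})
        return self._run_config or {}


# Usage: a preset with two variables and a generated run config.
monthly = ProposedPresetDefinition(
    name="monthly",
    variables={"year": [2019, 2020], "month": list(range(1, 13))},
    run_config_fn=lambda v: {
        "solids": {"load": {"config": {"year": v["year"], "month": v["month"]}}}
    },
)
print(monthly.get_run_config({"year": 2020, "month": 3}))
```

With no variables and a static `run_config`, the same class behaves like a plain preset, which is the zero-variable case described above.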
Pipeline Execution versus Backfill
The differences between running a pipeline and performing a backfill are equally confusing and limiting in Dagster. As with `PresetDefinition` and `PartitionSetDefinition`, a backfill is essentially multiple instances of a pipeline over a matrix of variables. There is a similar concept in GitLab CI/CD where a job can have a matrix of variables, resulting in multiple instances of the job being run. If `PartitionSetDefinition` is absorbed into an enhanced `PresetDefinition`, it will no longer make sense to have backfill be a completely separate operation.

In order for the `dagster pipeline execute` CLI command to support variables, it will need something different than the `--partitions`, `--all`, `--from` and `--to` options found in the `dagster pipeline backfill` command. A `--var KEY=VALUE` option that can be specified multiple times, whose values are interpreted similarly to solid selection, might suffice.

Given the following variable declaration (copied from above):
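For illustration, assume a declaration along these lines (the concrete year range is an assumption picked to fit the options listed afterwards, and `expand_matrix` is a hypothetical helper showing how a backfill over every combination could be enumerated):

```python
from itertools import product

# Assumed declaration: two variables, each with an ordered list of values.
variables = {
    "year": [2015, 2016, 2017, 2018, 2019, 2020],
    "month": list(range(1, 13)),
}


def expand_matrix(variables: dict) -> list:
    """Return one {variable: value} dict per cell of the run matrix."""
    keys = list(variables)
    return [
        dict(zip(keys, combo))
        for combo in product(*(variables[key] for key in keys))
    ]


runs = expand_matrix(variables)
print(len(runs))  # 72 combinations
print(runs[0])    # {'year': 2015, 'month': 1}
```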
Here are some options and their interpretations.
--var year=2017+ --var 'month=*'
--var year=2015,2016 --var month=1,2
--var 'year=201*' --var month=12
--all
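One possible reading of these expressions, sketched in plain Python: a trailing `+` selects the named value and everything after it in declaration order, commas separate alternatives, and `*` acts as a shell-style wildcard. These rules, the helper names `parse_var` and `select_values`, and the sample year range are assumptions, not behavior of any existing Dagster command.

```python
from fnmatch import fnmatchcase


def parse_var(arg: str) -> tuple:
    """Split a KEY=VALUE argument into (key, expression)."""
    key, expr = arg.split("=", 1)
    return key, expr


def select_values(expr: str, declared: list) -> list:
    """Resolve one expression against the ordered list of declared values."""
    labels = [str(value) for value in declared]
    selected = set()
    for clause in expr.split(","):
        clause = clause.strip()
        if clause.endswith("+"):
            # "2017+": the named value and everything after it in declaration order.
            start = labels.index(clause[:-1])
            selected.update(labels[start:])
        else:
            # An exact value, or a shell-style pattern such as "*" or "201*".
            selected.update(label for label in labels if fnmatchcase(label, clause))
    # Return the original values, preserving declaration order.
    return [value for value in declared if str(value) in selected]


years = [2015, 2016, 2017, 2018, 2019, 2020]
months = list(range(1, 13))

print(select_values("2017+", years))      # [2017, 2018, 2019, 2020]
print(select_values("2015,2016", years))  # [2015, 2016]
print(select_values("201*", years))       # [2015, 2016, 2017, 2018, 2019]
print(select_values("1,2", months))       # [1, 2]
```

Under this reading, `--all` would be equivalent to passing `*` for every declared variable.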
Whither PartitionSetDefinition?
Pipeline partitions and backfills are still valuable concepts in Dagster as a data orchestrator. A `PartitionedPresetDefinition` could be created that subclasses `PresetDefinition` and provides an implementation and interface equivalent to `PartitionSetDefinition`. The key point is for it to be a subclass that provides an opinionated form of `PresetDefinition`, not an alternative to `PresetDefinition`.

There may also be other common configuration matrixes in the future that can be provided as subclasses of `PresetDefinition`.

Run Matrixes in Dagit
Specifying `n` variables in a preset will result in an `n`-dimensional matrix of possible run configurations. Dagit could display a hierarchy for the Run Matrix. Or, the preset definition could take a further parameter function that allows pipeline authors to provide a flattened and ordered form of the variables in a single dimension, like in the case of a continuous range of year-month variables.

Related Issues and Discussions