RFC: Unify Partition Sets and Presets as Presets with Variables #3904

kinghuang · 2021-03-19T18:59:25Z

Motivation

Dagster pipelines can have predefined run configs for pipeline executions derived from presets and partition sets. Despite their related functions, they are treated quite differently.

1. Preset definitions are handled as part of a pipeline definition. Partition set definitions are created as standalone objects with a pipeline name as a parameter.
2. Partition set definitions and pipelines are both top-level objects in repositories, but preset definitions are encompassed by the pipelines they belong to.
3. Pipeline runs only take a preset as an argument. Backfills only take a partition set.
4. Preset definitions have a static run config. Partition sets call a function to generate a run config.

The differences and lack of interplay between preset definitions and partition set definitions were very confusing to me when I was learning to use Dagster. What seemed like two different but complementary things turned out to be mostly overlapping and mutually exclusive. I've ended up always using PartitionSetDefinition for pipeline configurations, whether partitions are involved or not, because it is essentially a more capable version of PresetDefinition. But this complicates API calls into Dagster (see point 3 above). I believe the dichotomy between PresetDefinition and PartitionSetDefinition is unnecessary and makes pipeline configuration and execution more complicated in Dagster. I also find presets and partition sets too limiting in their current forms, making it hard to scale pipelines beyond simple configuration scenarios.

I propose that PresetDefinition and PartitionSetDefinition be unified to reduce complexity and open the way to more adaptable pipeline configurations. This will also affect related concepts like executing a pipeline versus backfilling.

The Grand Unified Theory of Pipeline Presets

PresetDefinition and PartitionSetDefinition serve the same end goal: to produce a run configuration. In Dagit's playground view, they are shown together as options in the same list. The key difference between the two is that PartitionSetDefinition has a variable: the partition. Given Dagster's tagline as a data orchestrator, I suspect this was meant to support common data partitioning practices, where a dataset may be partitioned for scalability, performance, security, or other reasons. A common strategy is to divide a dataset by year and month. The usage documentation for Defining a Partition Set shows date-based partitioning as an example.

So, a PartitionSetDefinition is essentially a PresetDefinition with a partition variable. Suppose we had a pipeline with a single solid that echoed its input text. We can pre-configure this pipeline to output “hello” or “bonjour” with the following two methods.

1. Create two preset definitions, one each for “hello” and “bonjour”.
2. Create one partition set definition, with partitions for “hello” and “bonjour”.

Here is some example code for both options and screenshots of the result in Dagit.

from dagster import repository, pipeline, solid
from dagster import PartitionSetDefinition, PresetDefinition


@solid
def echo(context, text: str):
  context.log.info(text)


@pipeline(
  preset_defs=[
    PresetDefinition(
      "hello",
      run_config={"solids": {"echo": {"inputs": {"text": "hello"}}}}
    ),
    PresetDefinition(
      "bonjour",
      run_config={"solids": {"echo": {"inputs": {"text": "bonjour"}}}}
    )
  ]
)
def greetings_pipeline():
  echo()


greetings_set = PartitionSetDefinition(
  name="greetings_set",
  pipeline_name="greetings_pipeline",
  partition_fn=lambda: ["hello", "bonjour"],
  run_config_fn_for_partition=lambda p: \
    {"solids": {"echo": {"inputs": {"text": p.value}}}}
)


@repository
def matrix_repository():
  return [greetings_pipeline, greetings_set]

Putting aside the differences in how presets and partition sets are implemented, notice the static "hello" and "bonjour" strings in the preset definitions, and the p.value statement referencing the chosen partition's value in the partition set definition.

The “hello” and “bonjour” values in greetings_set aren't really partitions in the data partitioning sense. Partitions here serve as a simple variable in the pipeline configuration, specifying which word to pass to the echo solid. The partition in PartitionSetDefinition is effectively an opinionated variable.

If a partition set is a preset with a single variable, then it should be possible to describe a PartitionSetDefinition using a PresetDefinition if the latter supported variables.

Here is a hypothetical preset definition with a variable.

@pipeline(
  preset_defs=[
    PresetDefinition(
      "dynamic",
      variables_fn=lambda: {"partition": ["hello", "bonjour"]},
      run_config_fn=lambda **vars: \
        {"solids": {"echo": {"inputs": {"text": vars["partition"]}}}}
    )
  ]
)

The dynamic preset definition specifies a dict of variables via variables_fn that has a partition variable with "hello" and "bonjour" as values. This is analogous to the partition_fn parameter in the greetings_set partition set definition. Then, the run_config_fn parameter is used to provide a function that will take in the selected variables and return a run configuration, like run_config_fn_for_partition.

There is no need to use “partition” as the variable name. We can name it “greeting” instead. And, for a non-dynamic dict of variables, use a variables parameter in place of variables_fn.

@pipeline(
  preset_defs=[
    PresetDefinition(
      "dynamic",
      variables={"greeting": ["hello", "bonjour"]},
      run_config_fn=lambda **vars: \
        {"solids": {"echo": {"inputs": {"text": vars["greeting"]}}}}
    )
  ]
)
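
To make the mechanics concrete, here is a minimal, dependency-free sketch of how a preset with variables could resolve a run config from a chosen value. The `VariablePreset` class and its `get_variables` and `resolve` methods are hypothetical stand-ins invented for illustration; they are not part of Dagster, and the checks on chosen values are an assumption about how such a preset might behave.

```python
from typing import Any, Callable, Dict, Iterable, Optional


class VariablePreset:
    """Hypothetical stand-in for a PresetDefinition that supports variables."""

    def __init__(
        self,
        name: str,
        run_config_fn: Callable[..., Dict[str, Any]],
        variables: Optional[Dict[str, Iterable[Any]]] = None,
        variables_fn: Optional[Callable[[], Dict[str, Iterable[Any]]]] = None,
    ):
        self.name = name
        self._run_config_fn = run_config_fn
        self._variables = variables
        self._variables_fn = variables_fn

    def get_variables(self) -> Dict[str, list]:
        # Prefer the dynamic function when given; otherwise use the static dict.
        source = self._variables_fn() if self._variables_fn else (self._variables or {})
        return {key: list(values) for key, values in source.items()}

    def resolve(self, **chosen: Any) -> Dict[str, Any]:
        # Reject names or values that the preset does not declare.
        allowed = self.get_variables()
        for key, value in chosen.items():
            if key not in allowed:
                raise KeyError(f"unknown variable {key!r}")
            if value not in allowed[key]:
                raise ValueError(f"{value!r} is not a valid value for {key!r}")
        return self._run_config_fn(**chosen)


dynamic = VariablePreset(
    "dynamic",
    variables={"greeting": ["hello", "bonjour"]},
    run_config_fn=lambda **vars: {
        "solids": {"echo": {"inputs": {"text": vars["greeting"]}}}
    },
)

# Picking a value for "greeting" yields an ordinary static run config.
run_config = dynamic.resolve(greeting="bonjour")
```

The result of `resolve` is the same kind of dict that a static preset carries today, so everything downstream of run config generation could stay unchanged.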

As the variables key implies, it should be possible to specify more than one variable. Returning to date-based data partitions, consider a partitioning scheme with a range of year and month components. Using a PartitionSetDefinition, the partitions would be specified as a linear list of values to partition_fn like the following.

dates_set = PartitionSetDefinition(
  name="dates_set",
  pipeline_name="greetings_pipeline",
  partition_fn=date_partition_range(
    start=datetime.datetime(2015, 1, 1),
    end=datetime.datetime(2021, 1, 1),
    delta_range="months",
  ),
  run_config_fn_for_partition=lambda p: \
    {"solids": {"echo": {"inputs": {"text": p.value.strftime("%Y-%m")}}}}
)

For a large date range, this results in a very long list of partition choices.

Using year and month variables, this can be expressed as the following.

@pipeline(
  preset_defs=[
    PresetDefinition(
      "dates",
      variables={
        "year": range(2015, 2021),
        "month": range(1, 13),
      },
      run_config_fn=lambda year, month: \
        {"solids": {"echo": {"inputs": {"text": f"{year}-{month:02d}"}}}}
    )
  ]
)

By extending PresetDefinition with the capability to handle variables, it can encompass the functionality of PartitionSetDefinition, while also enabling new configuration possibilities through the use of zero to many variables.
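
As a rough illustration of what "zero to many variables" implies, the sketch below expands a dict of variables into every combination using `itertools.product` and builds one run config per combination. The `expand` helper is a hypothetical name, not a Dagster API; the variables and config function mirror the year and month example above.

```python
import itertools

variables = {
    "year": range(2015, 2021),
    "month": range(1, 13),
}


def run_config_fn(year, month):
    return {"solids": {"echo": {"inputs": {"text": f"{year}-{month:02d}"}}}}


def expand(variables):
    """Yield one dict of chosen values per cell of the variable matrix."""
    names = list(variables)
    for combo in itertools.product(*(variables[name] for name in names)):
        yield dict(zip(names, combo))


# Six years times twelve months gives 72 run configs.
run_configs = [run_config_fn(**chosen) for chosen in expand(variables)]
```

With an empty `variables` dict, `expand` yields a single empty selection, which matches the idea that a preset with no variables still produces exactly one run config.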

Even with no variables, the run_config_fn parameter gives a chance for the run config to be customized or generated when requested like run_config_fn_for_partition in PartitionSetDefinition.

Proposed PresetDefinition API

Here is the proposed signature for the PresetDefinition class.

class PresetDefinition(
  name,
  variables=None,       # Specify variables or variables_fn, not both.
  variables_fn=None,
  validate_fn=None,     # Validate whether a chosen combination of variables is valid.
  run_config=None,      # Specify run_config or run_config_fn, not both.
  run_config_fn=None,
  solid_selection=None,
  mode=None,
  tags=None,
)
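
The "not both" comments in the signature could be enforced along the following lines. This is only a sketch: `check_preset_args` and `is_valid_choice` are hypothetical helpers, and the assumption that validate_fn receives the chosen variables as keyword arguments and returns a boolean is mine, since the proposal does not pin down its signature. The cutoff in the example validator is arbitrary.

```python
from typing import Any, Callable, Dict, Optional


def check_preset_args(
    variables: Optional[Dict[str, Any]] = None,
    variables_fn: Optional[Callable[[], Dict[str, Any]]] = None,
    run_config: Optional[Dict[str, Any]] = None,
    run_config_fn: Optional[Callable[..., Dict[str, Any]]] = None,
) -> None:
    """Enforce the 'not both' rules noted in the signature comments."""
    if variables is not None and variables_fn is not None:
        raise ValueError("Specify variables or variables_fn, not both.")
    if run_config is not None and run_config_fn is not None:
        raise ValueError("Specify run_config or run_config_fn, not both.")


def is_valid_choice(validate_fn, **chosen) -> bool:
    """Treat a missing validate_fn as 'every combination is valid'."""
    return True if validate_fn is None else bool(validate_fn(**chosen))


# Example validate_fn (assumed signature): reject months after an arbitrary cutoff.
def not_after_june_2020(year, month):
    return (year, month) <= (2020, 6)


ok = is_valid_choice(not_after_june_2020, year=2020, month=6)
too_late = is_valid_choice(not_after_june_2020, year=2020, month=7)
```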

Pipeline Execution versus Backfill

The differences between running a pipeline and performing a backfill are equally confusing and limiting in Dagster. As with PresetDefinition and PartitionSetDefinition, a backfill is essentially multiple instances of a pipeline over a matrix of variables. There is a similar concept in GitLab CI/CD where a job can have a matrix of variables, resulting in multiple instances of the job being run. If PartitionSetDefinition is absorbed into an enhanced PresetDefinition, it will no longer make sense to have backfill be a completely separate operation.

For the dagster pipeline execute CLI command to support variables, it will need something different from the --partitions, --all, --from, and --to options found in the dagster pipeline backfill command. A --var KEY=VALUE option that can be specified multiple times, with values interpreted similarly to solid selection, might suffice.

Given the following variable declaration (copied from above):

variables={
  "year": range(2015, 2021),
  "month": range(1, 13),
}

Here are some options and their interpretations.

| Options | Interpretation |
| --- | --- |
| `--var year=2017+ --var 'month=*'` | Execute pipeline runs for 2017-01 through 2020-12. |
| `--var year=2015,2016 --var month=1,2` | Execute pipeline runs for 2015-01, 2015-02, 2016-01, and 2016-02. |
| `--var 'year=201*' --var month=12` | Execute pipeline runs for 2015-12, 2016-12, 2017-12, 2018-12, and 2019-12. |
| no vars | Don't execute any pipeline runs for the given preset. |
| `--all` | Execute pipeline runs for all possible variable values (from 2015-01 to 2020-12). |
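
One way to read the interpretations in this table is sketched below. It assumes that a trailing `+` means "this value and every declared value after it", that `*` patterns are shell-style globs matched against the string form of each value, and that every declared variable needs a selector. The helper names `select_values` and `select_runs` are hypothetical, and none of this is an existing Dagster CLI behavior.

```python
import itertools
from fnmatch import fnmatchcase


def select_values(selector, declared):
    """Return the declared values matched by one --var selector string."""
    declared = list(declared)
    selected = []
    for term in selector.split(","):
        term = term.strip()
        if term.endswith("+"):
            # "2017+" keeps the named value and everything declared after it.
            start = [str(v) for v in declared].index(term[:-1])
            matches = declared[start:]
        else:
            # Plain values and glob patterns such as "*" or "201*".
            matches = [v for v in declared if fnmatchcase(str(v), term)]
        selected.extend(v for v in matches if v not in selected)
    return selected


def select_runs(variables, selectors, select_all=False):
    """Expand selectors into one dict per run; no selectors means no runs."""
    if select_all:
        selectors = {name: "*" for name in variables}
    elif not selectors:
        return []
    if set(selectors) != set(variables):
        raise ValueError("every declared variable needs a selector")
    names = list(variables)
    chosen = [select_values(selectors[name], variables[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*chosen)]


variables = {"year": range(2015, 2021), "month": range(1, 13)}

# Equivalent of: --var year=2015,2016 --var month=1,2
runs = select_runs(variables, {"year": "2015,2016", "month": "1,2"})
```

Whether `+` should follow declaration order or numeric comparison is an open design question; declaration order is used here because variables are not required to be numbers.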

Whither PartitionSetDefinition?

Pipeline partitions and backfills are still valuable concepts in Dagster as a data orchestrator. A PartitionedPresetDefinition could be created that subclasses PresetDefinition and provides an implementation and interface equivalent to PartitionSetDefinition. The key point is for it to be a subclass that provides an opinionated form of PresetDefinition, not an alternative to PresetDefinition.
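
A minimal sketch of that subclass relationship, assuming a base preset that accepts variables_fn and run_config_fn as proposed above. Both classes are hypothetical and written without Dagster imports; partitions are plain strings here rather than Dagster Partition objects, so the config function uses the value directly instead of `p.value`.

```python
class VariablePreset:
    """Hypothetical base: a preset whose run config depends on variables."""

    def __init__(self, name, variables_fn, run_config_fn):
        self.name = name
        self.variables_fn = variables_fn
        self.run_config_fn = run_config_fn

    def run_config_for(self, **chosen):
        return self.run_config_fn(**chosen)


class PartitionedPreset(VariablePreset):
    """Opinionated subclass with exactly one variable, named "partition"."""

    def __init__(self, name, partition_fn, run_config_fn_for_partition):
        # Map the partition-set style arguments onto the general preset form.
        super().__init__(
            name,
            variables_fn=lambda: {"partition": list(partition_fn())},
            run_config_fn=lambda partition: run_config_fn_for_partition(partition),
        )

    def get_partitions(self):
        return self.variables_fn()["partition"]


greetings = PartitionedPreset(
    "greetings_set",
    partition_fn=lambda: ["hello", "bonjour"],
    run_config_fn_for_partition=lambda p: {
        "solids": {"echo": {"inputs": {"text": p}}}
    },
)

hello_config = greetings.run_config_for(partition="hello")
```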

There may also be other common configuration matrixes in the future that can be provided as subclasses of PresetDefinition.

Run Matrixes in Dagit

Specifying n variables in a preset will result in an n-dimensional matrix of possible run configurations. Dagit could display a hierarchy for the Run Matrix. Or, the preset definition could take a further parameter function that allows pipeline authors to provide a flattened and ordered form of the variables in a single dimension, as in the case of a continuous range of year-month variables.
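
For the year-month case, such a flattening function might look like the sketch below. The function name, its signature, and the choice to return (label, chosen values) pairs are all assumptions made for illustration; the proposal does not define this parameter.

```python
import itertools


def flatten_year_month(variables):
    """Return (label, chosen values) pairs in chronological order."""
    cells = itertools.product(variables["year"], variables["month"])
    return [
        (f"{year}-{month:02d}", {"year": year, "month": month})
        for year, month in sorted(cells)
    ]


variables = {"year": range(2015, 2021), "month": range(1, 13)}

# A single ordered axis that a UI could render like today's partition list.
flat = flatten_year_month(variables)
```

Each label identifies one cell of the two-dimensional matrix, so a one-dimensional view can still map back to the underlying variable values.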

Related Issues and Discussions

sryza · 2021-03-25T03:25:00Z

Hey @kinghuang - thanks for putting this very detailed proposal together. I think you are right that we have some duplicative concepts. And the way you frame the similarities between presets and partition sets makes a lot of sense to me.

For the pipelines that you're considering this for, do they have schedules? Or do you typically kick them off manually? Schedules add another layer of configuration-mapping, so I'm wondering how that fits in for you.

Currently, we have APIs that enable people to create schedules and partition set definitions with a single invocation. That relies on the fact that partition set definitions exist outside of pipeline definitions (unlike presets). So if we wanted to do the consolidation that you're proposing, and we wanted to preserve those APIs, we might need to change preset to live outside the pipeline definition (like partition set currently does). Do you foresee issues with going in that direction?

Independently, I have found the fact that we have both presets and modes somewhat duplicative. For example, if someone wants to connect to a development database when running their pipeline in development and a production database when running their pipeline in production, they can accomplish this with both modes (using configured) and presets. That has made me wonder whether there's some way that it would make sense to fold preset into mode. However, it would require some thinking to figure out how that fits with what you've proposed here.

2 replies

kinghuang Mar 25, 2021
Author

We're still kicking off pipelines manually. But schedules and sensors are both among our targets for the next month. I wasn't aware of the interaction between schedules and partition set definitions!

I don't foresee any difficulties having presets live outside of the pipeline definition. In fact, I think that might be an advantage. Something that I've been thinking about (though haven't tried) is creating a repository that declares partition set definitions for a pipeline in a different repository. I'm not sure if that is technically possible today. In short, I'd publish a pipeline in my Dagster repository that some other team could configure by providing partition set definitions in their own repository.

The interaction between modes and presets/partition sets is actually something that I've half-written a proposal on, too. There's definitely some overlap there. I'm finding that the idea of using modes to switch resources for different environments hasn't worked out in my pipelines with the current interaction model. I think tweaking the role of modes and also adding assets into the mix can produce some interesting possibilities.

sryza Mar 26, 2021

> I'm finding that the idea of using modes to switch resources for different environments hasn't worked out in my pipelines with the current interaction model.

I’m interested to hear more about that. What are the difficulties you’re experiencing?

> I think tweaking the role of modes and also adding assets into the mix can produce some interesting possibilities.

If you have ideas, I’m curious to hear what you have in mind.
