Meta make

Kirill Müller edited this page Mar 7, 2018 · 19 revisions

I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL.

meta-make

Input: A drake plan. Output: A named list. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = meta_make(meta_plan)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = NULL,
  results = { meta_plan; list(
    a = x,
    b = y
  )}
)

drake::make(plan)
#> cache /tmp/RtmpArc3UM/.drake
#> connect 1 import: plan
#> connect 4 targets: x, y, meta_plan, results
#> check 1 item: list
#> check 3 items: meta_plan, x, y
#> target meta_plan
#> check 1 item: results
#> load 2 items: x, y
#> target results
drake::readd(results)
#> cache /tmp/RtmpArc3UM/.drake
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

The argument to meta_make() can itself be a target; that's where it becomes really powerful. If meta_make() is called with an up-to-date target and unchanged code, the results remain up to date too.
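To make the intended semantics concrete, here is a minimal base R sketch of what meta_make() might do with the plan it receives. It assumes the plan is a data frame with a character `target` column and a character `command` column; `meta_make_sketch()` is a hypothetical name, not a drake function, and the sketch ignores caching and dependency tracking entirely.

```r
# Hypothetical sketch, not part of drake: evaluate every command of a
# plan-like data frame and return the results as a named list.
meta_make_sketch <- function(plan, envir = parent.frame()) {
  force(envir)  # resolve the caller's environment before entering lapply()
  results <- lapply(plan$command, function(cmd) {
    eval(parse(text = cmd), envir = envir)
  })
  names(results) <- plan$target
  results
}

# Stand-ins for the upstream targets x and y.
x <- 1
y <- 2

# Stand-in for drake_plan(a = x, b = y); the column layout is an assumption.
meta_plan <- data.frame(
  target = c("a", "b"),
  command = c("x", "y"),
  stringsAsFactors = FALSE
)

results <- meta_make_sketch(meta_plan)
str(results)
```

Because `meta_plan` is an ordinary value here, it could just as well be computed by an earlier step, which is the situation the proposal is aiming at.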

unpack

Input: A named list. For each element, a target is created in the plan. Example of two equivalent plans:

# Proposed plan
drake_plan(
  results = list(a = 1, b = 2, c = 3),
  unpack(results)
)

# Equivalent plan that works in the current implementation
drake_plan(
  results = list(a = 1, b = 2, c = 3),
  a = results$a,
  b = results$b,
  c = results$c
)

The arguments to unpack() can be targets; that's where it becomes really powerful. This is related to #283 (multi-file output, and the equivalent for R objects), but I don't think #283 is a prerequisite. If unpack() is called with an up-to-date target and unchanged code, all resulting targets (from the last run) remain up to date too.

Unpacking is a declarative operation, so we don't (necessarily) need to materialize all targets. In particular, if the target is the result of a previous call to meta_make(), the results are already unpacked.
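The expansion that unpack() performs can be sketched as a plain rewrite from a list target to one plan row per element. The helper `unpack_sketch()` and the two-column `target`/`command` layout are assumptions made for illustration; the sketch only generates the rows and does not evaluate anything.

```r
# Hypothetical sketch, not part of drake: derive one plan row per element
# of a named-list target, mirroring the "equivalent plan" shown above.
unpack_sketch <- function(target_name, element_names) {
  data.frame(
    target = element_names,
    command = paste0(target_name, "$", element_names),
    stringsAsFactors = FALSE
  )
}

results <- list(a = 1, b = 2, c = 3)
unpacked <- unpack_sketch("results", names(results))
print(unpacked)
```

Note that the element names have to be known before the rows can be written down. When `results` is itself a target, they are only available after it has been built.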

pack

Semantics identical to tibble::lst(): Construct a list from a set of targets. The main difference is that this is a declarative operation that doesn't physically construct the list yet. It can be used to bundle targets together for use in a subsequent operation. Example of two equivalent plans:

# Proposed plan
drake_plan(
  a = 1,
  b = 2,
  packed = pack(a, b)
)

# Equivalent plan that works in the current implementation
drake_plan(
  a = 1,
  b = 2,
  packed = tibble::lst(a, b)
)

Essentially the opposite of unpack().
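The difference between a declarative pack() and tibble::lst() can be illustrated with a small base R sketch that records which targets belong together without reading their values. Both `pack_sketch()` and `materialize_sketch()` are hypothetical helpers, and plain variables stand in for targets.

```r
# Hypothetical sketch, not part of drake: capture the names of the
# arguments without evaluating them.
pack_sketch <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  target_names <- vapply(
    exprs,
    function(e) paste(deparse(e), collapse = ""),
    character(1)
  )
  structure(list(targets = target_names), class = "pack_sketch")
}

# Build the actual list only when a later step asks for it.
materialize_sketch <- function(packed, envir = parent.frame()) {
  mget(packed$targets, envir = envir, inherits = TRUE)
}

a <- 1
b <- 2
packed <- pack_sketch(a, b)

packed$targets              # only the names are stored
materialize_sketch(packed)  # the named list is assembled here
```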

Why three verbs?

We could do meta-make + unpack as a single operation, and not implement pack at all. I'm following the Unix philosophy here, because I feel that we can only gain by exposing these operations separately, if only for testing. From the separate verbs, we can provide a flat_meta_make() (meta-make + unpack) or even a pack + meta-make + unpack verb. These operations feel simple enough to be understood individually and in combination.

These three verbs seem the simplest possible solution to me, but maybe I'm missing a different decomposition into even simpler operations.

Challenges

  • Delayed plan evaluation, possibly a new target state "unknown"
  • Visualization: We don't always want to expand the constructed plans when visualizing them
  • Storing object hierarchies: When storing x <- list(a = 1, b = 2), we want to be able to access x$a and x$b without loading x
  • ...
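For the object-hierarchy challenge, one conceivable approach is to store every element of a named list under its own key. The sketch below uses one RDS file per element in a temporary directory; the helpers `store_elements()` and `read_element()` and the `name__element.rds` file naming are assumptions for illustration and say nothing about how drake's cache actually works.

```r
# Hypothetical helpers, not part of drake: store each element of a named
# list in its own file so that one element can be read without the others.
store_elements <- function(name, value, dir) {
  for (element in names(value)) {
    path <- file.path(dir, paste0(name, "__", element, ".rds"))
    saveRDS(value[[element]], path)
  }
  invisible(names(value))
}

read_element <- function(name, element, dir) {
  readRDS(file.path(dir, paste0(name, "__", element, ".rds")))
}

store_dir <- file.path(tempdir(), "hierarchy_sketch")
dir.create(store_dir, showWarnings = FALSE, recursive = TRUE)

x <- list(a = 1, b = 2)
store_elements("x", x, store_dir)

# Reads only the file holding x$a; the file for x$b is not opened.
x_a <- read_element("x", "a", store_dir)
x_a
```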

Implementation ideas

The new verbs can be implemented in a similar way to dbplyr: When executed, they return a lightweight data structure that contains all the information necessary to assemble the result. (In dbplyr, tbl %>% select(a, b) %>% filter(a > 5) creates an object that has a sql_render() method which composes the corresponding SQL, and only calling collect() will actually run the query.) This means that the objects returned by meta_make() et al. can just be serialized without special treatment.
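The dbplyr analogy can be sketched in base R as follows. The constructor only records the plan, a separate collect-like step does the evaluation, and the recorded object survives a serialize() round trip unchanged. The names `lazy_meta_make_sketch()` and `collect_sketch()` and the `target`/`command` data frame layout are assumptions, not existing drake or dbplyr API.

```r
# Hypothetical sketch, not part of drake: record the plan, evaluate nothing.
lazy_meta_make_sketch <- function(plan) {
  structure(list(plan = plan), class = "lazy_meta_make_sketch")
}

# Counterpart of collect(): evaluate the recorded commands on demand.
collect_sketch <- function(lazy, envir = parent.frame()) {
  force(envir)
  results <- lapply(lazy$plan$command, function(cmd) {
    eval(parse(text = cmd), envir = envir)
  })
  names(results) <- lazy$plan$target
  results
}

meta_plan <- data.frame(
  target = c("a", "b"),
  command = c("x + 1", "y * 2"),
  stringsAsFactors = FALSE
)
lazy <- lazy_meta_make_sketch(meta_plan)

# The lightweight object is plain data, so it serializes as-is.
restored <- unserialize(serialize(lazy, connection = NULL))
identical(restored, lazy)

# x and y are only needed once the results are collected.
x <- 1
y <- 2
results <- collect_sketch(restored)
str(results)
```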

Named lists vs. tibbles

The examples above use named lists for illustration. This means that names for objects/targets must be strings (just like in the current implementation, so this is not a new restriction).

Ideally I'd prefer arbitrary (multivariate) keys to describe targets, and a nested tibble as the data structure. (Let's not discuss this in too much detail for now.) If we support two-column data frames (target + x) from the start, we might be able to support multivariate keys later; I'd prefer this over the named list approach.

Alternatively, we might want to stick with named lists and provide seamless support for the enframe() and deframe() verbs that convert a named list to a two-column tibble and vice versa.
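The conversion between the two representations is already available in the tibble package. The short example below assumes tibble is installed and uses the column names `target` and `x` from the paragraph above; because the input is a list, the value column comes out as a list column.

```r
# Named list, as used in the examples above.
results <- list(a = 1, b = 2)

# Named list -> two-column tibble (target + x).
framed <- tibble::enframe(results, name = "target", value = "x")
framed

# Two-column tibble -> named list; the first column supplies the names.
round_trip <- tibble::deframe(framed)
str(round_trip)
```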

Towards a DSL?

With a data-frame-based approach and multivariate keys, the focus of the DSL will be more efficient/elegant/straightforward ways to construct plans, which then are passed on to meta_make().

On the other hand, restricting target names to simple strings may be enough if our DSL adds multivariate keys on top of that. Again, let's postpone discussion on that detail.