Meta make

I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). This proposal addresses the low-level technical part, which I think is required for any DSL.

meta-make

Input: A drake plan. Output: A named list. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = meta_make(meta_plan)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = NULL,
  results = { meta_plan; list(
    a = x,
    b = y
  )}
)

drake::make(plan)
#> target meta_plan
#> target results
drake::readd(results)
#> cache /tmp/RtmpArc3UM/.drake
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

The argument to meta_make() can itself be a target; that's where it becomes really powerful. If meta_make() is called with an up-to-date target and unchanged code, the results remain up to date too.

Subtle difference: The list returned by meta_make() is just a list of pointers, not a list of objects. Therefore, calling loadd() or readd() on such a target won't load all results into memory. See below for an implementation sketch.

unpack

Input: A named list. For each element, a target is created in the plan. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  unpack(results)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  a = results$a,
  b = results$b,
  c = results$c
)

drake::make(plan)
#> target results
#> target a
#> target b
#> target c
drake::readd(b)
#> cache /tmp/RtmpriGbJu/.drake
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

The arguments to unpack() can be targets; that's where it becomes really powerful. This is related to #283 (multi-file output, and the equivalent for R objects), but I don't think #283 is a prerequisite. If unpack() is called with an up-to-date target and unchanged code, all resulting targets (from the last run) remain up to date too.

Unpacking is a declarative operation, so we don't (necessarily) need to materialize all targets. In particular, if the target is the result of a previous call to meta_make(), the results are already unpacked.

pack

Semantics identical to tibble::lst(): Construct a list from a set of targets. The main difference is that this is a declarative operation that doesn't physically construct the list yet. It can be used to bundle targets together for use in a subsequent operation. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = pack(a, b)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = tibble::lst(a, b)
)

drake::make(plan)
#> target a
#> target b
#> target packed
drake::readd(packed)
#> cache /tmp/Rtmp6mdjtc/.drake
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

Essentially the opposite of unpack().

As with meta_make(), calling loadd() or readd() on such a target returns a list of pointers, not the objects themselves. See below for implementation details.

Why three verbs?

We could do meta-make + unpack as a single operation, and not implement pack at all. I'm following the Unix philosophy here, because I feel that we can only gain by exposing these operations separately, if only for testing. From the separate verbs, we can provide a flat_meta_make() (meta-make + unpack) or even a pack + meta-make + unpack verb. These operations feel simple enough to be understood individually and in combination.
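To illustrate how the combined verb falls out of the separate ones, here is a minimal base-R sketch. Everything in it is hypothetical: the toy_* functions are stand-ins that do not exist in drake, a "plan" is reduced to a named list of quoted expressions, and an environment plays the role of the set of targets.

```r
# Toy stand-ins for the proposed verbs; none of these exist in drake.
# A "plan" here is just a named list of quoted expressions.
toy_meta_make <- function(plan, env) {
  # Evaluate every command in `env` and return a named list of results.
  lapply(plan, eval, envir = env)
}

toy_unpack <- function(results, env) {
  # Create one binding ("target") per list element.
  list2env(results, envir = env)
  invisible(names(results))
}

toy_pack <- function(..., env) {
  # Bundle existing bindings into a named list, similar to tibble::lst().
  mget(c(...), envir = env)
}

# The combined verb is just the composition of two simple ones.
toy_flat_meta_make <- function(plan, env) {
  toy_unpack(toy_meta_make(plan, env), env)
}

targets <- new.env()
assign("x", 1, envir = targets)
assign("y", 2, envir = targets)

meta_plan <- list(a = quote(x), b = quote(x + y))
toy_flat_meta_make(meta_plan, targets)

# `a` and `b` now exist as separate "targets" and can be packed again.
packed <- toy_pack("a", "b", env = targets)
```

Because each toy verb has a single job, each can be tested on its own, which is the main argument for exposing them separately.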

These three verbs seem like the simplest possible solution to me, but maybe I'm missing a different decomposition into even simpler operations.

Challenges

- Delayed plan evaluation, possibly a new target state "unknown"
- Visualization: We don't always want to expand the constructed plans when visualizing them
- Storing object hierarchies: When storing x <- list(a = 1, b = 2), we want to be able to access x$a and x$b without loading x
- ...
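The object-hierarchy challenge can be made concrete with a small sketch. It assumes a plain environment as a stand-in for the cache (the real backend would be storr), and the helpers save_hierarchy() and read_element() as well as the "x$a" key scheme are invented for illustration only.

```r
# A plain environment stands in for the cache; the real backend would be storr.
store <- new.env()

# Hypothetical helper: save each element of a named list under its own key,
# plus a small index so the parent can be reassembled later.
save_hierarchy <- function(store, name, value) {
  for (el in names(value)) {
    assign(paste0(name, "$", el), value[[el]], envir = store)
  }
  index <- structure(list(elements = names(value)), class = "toy_index")
  assign(name, index, envir = store)
  invisible(name)
}

# Hypothetical helper: read one element without touching its siblings.
read_element <- function(store, name, element) {
  get(paste0(name, "$", element), envir = store, inherits = FALSE)
}

save_hierarchy(store, "x", list(a = 1, b = 2))
read_element(store, "x", "a")
```

The key "x" holds only the index, so reading x$a never deserializes x$b.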

Implementation ideas

The new verbs can be implemented in a similar way to dbplyr: When executed, they return a lightweight data structure that contains all the information necessary to assemble the result. (In dbplyr, tbl %>% select(a, b) %>% filter(a > 5) creates an object that has a sql_render() method which composes the corresponding SQL, and only calling collect() will actually run the query.) This means that the objects returned by meta_make() et al. can just be serialized without special treatment.
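A minimal sketch of that idea in base R, under stated assumptions: lazy_pack(), materialize() and the class "toy_lazy_pack" are hypothetical names, and an environment stands in for the cache. The point is only that the returned object is a small description of the result, not the result itself.

```r
# Hypothetical constructor: records *which* targets belong together,
# without loading any of them.
lazy_pack <- function(...) {
  structure(list(targets = c(...)), class = "toy_lazy_pack")
}

# Hypothetical generic that plays the role of dbplyr's collect().
materialize <- function(x, store) UseMethod("materialize")

materialize.toy_lazy_pack <- function(x, store) {
  mget(x$targets, envir = store)
}

store <- new.env()
assign("a", 1, envir = store)
assign("b", 2, envir = store)

packed <- lazy_pack("a", "b")

# The description is an ordinary R object, so it survives a round trip
# through serialize()/unserialize() unchanged.
restored <- unserialize(serialize(packed, NULL))

# Only this step touches the actual values.
result <- materialize(restored, store)
```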

Implementation sketch

The return value of the new verbs could be S3 objects of classes "drake_meta_make", "drake_unpack" and "drake_pack", respectively. When the scheduler sees that a command returned an object of these classes, appropriate action is taken:

- For "drake_meta_make", jobs are enqueued to the scheduler
  - results will be stored in a separate storr namespace (one result per meta-target)
  - the class will have $ and [[ methods overridden, its .Names attribute will correspond to the actual target names
    - for now this assumes that all jobs can read from the storr
  - readd() would just return the "drake_meta_make" object
- For "drake_unpack", targets are added to the plan, making sure that no duplicates are created
  - the dependency graph is rewired to account for the new targets
  - if unpack() is called on a "drake_meta_make" object, we avoid copies by creating pointers (S3 objects of yet another class, say, "drake_pointer"), which are handled specially in loadd() and readd()
- For "drake_pack", a list of "drake_pointer" objects is constructed and stored
  - the class will also have $ and [[ methods overridden
  - perhaps "drake_meta_make" can also be just a list of "drake_pointer" objects
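The pointer idea with overridden $ and [[ methods might look roughly like the following base-R sketch. The class names "toy_pointer" and "toy_pointer_list" are placeholders for "drake_pointer" and "drake_pack"/"drake_meta_make", an environment stands in for the storr namespace, and none of this reflects existing drake code.

```r
# Backing store: a plain environment standing in for a storr namespace.
store <- new.env()
assign("a", 1, envir = store)
assign("b", 2, envir = store)

# A hypothetical pointer: just a key plus a reference to the store.
toy_pointer <- function(key, store) {
  structure(list(key = key, store = store), class = "toy_pointer")
}

# A hypothetical pointer list; its names are the actual target names.
toy_pointer_list <- function(keys, store) {
  ptrs <- lapply(keys, toy_pointer, store = store)
  structure(setNames(ptrs, keys), class = "toy_pointer_list")
}

# Dereference on access: `[[` fetches the value, `$` delegates to `[[`.
`[[.toy_pointer_list` <- function(x, i) {
  ptr <- unclass(x)[[i]]
  get(ptr$key, envir = ptr$store, inherits = FALSE)
}
`$.toy_pointer_list` <- function(x, name) x[[name]]

packed <- toy_pointer_list(c("a", "b"), store)
names(packed)
packed$b
```

Holding the object costs only the pointers; a value is read from the store when an element is accessed with $ or [[.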

Named lists vs. tibbles

The examples above use named lists for illustration. This means that names for objects/targets must be strings. The current implementation already requires this, so it is not a new restriction.

Ideally I'd prefer arbitrary (multivariate) keys to describe targets, and a nested tibble as data structure. (Let's not discuss this in too much detail for now.) If we support two-column data frames (target + x) from the start, we might be able to support multivariate keys later; I'd prefer this over the named list approach.

Alternatively, we might want to stick with named lists and provide seamless support for the enframe() and deframe() verbs that convert a named list to a two-column tibble and vice versa.
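For reference, the conversion between the two representations is already a one-liner in each direction with tibble. The column names "target" and "x" below follow the two-column layout mentioned above and are otherwise an arbitrary choice.

```r
library(tibble)

results <- list(a = 1, b = 2)

# Named list -> two-column tibble; the second column is a list column.
results_df <- enframe(results, name = "target", value = "x")

# Two-column tibble -> named list.
round_trip <- deframe(results_df)
```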

Towards a DSL?

With a data-frame-based approach and multivariate keys, the focus of the DSL will be more efficient/elegant/straightforward ways to construct plans, which then are passed on to meta_make().

On the other hand, restricting target names to simple strings may be enough if our DSL adds multivariate keys on top of that. Again, let's postpone discussion on that detail.
