Meta make

Kirill Müller edited this page Mar 9, 2018 · 19 revisions

I'd like to propose three very simple verbs that might get us halfway towards the goal of a DSL (#233). The proposal covers the low-level technical part, which I think any DSL requires, and is discussed in #304.

meta-make

Input: A drake plan. Output: A named list. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = drake_plan(
    a = x,
    b = y
  ),
  results = meta_make(meta_plan)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  x = 1,
  y = 2,
  meta_plan = NULL,
  results = { meta_plan; list(
    a = x,
    b = y
  )}
)

drake::make(plan)
#> target meta_plan
#> target results
drake::readd(results)
#> cache /tmp/RtmpArc3UM/.drake
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

The argument to meta_make() can be a target; that's where it becomes really powerful. If meta_make() is called with an up-to-date target and unchanged code, the results remain up to date too.

Subtle difference: The list returned by meta_make() is just a list of pointers, not a list of objects. Therefore, calling loadd() or readd() on such a target won't load all results into memory. See below for an implementation sketch.

unpack

Input: A named list. For each element, a target is created in the plan. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  unpack(results)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  results = list(a = 1, b = 2, c = 3),
  a = results$a,
  b = results$b,
  c = results$c
)

drake::make(plan)
#> target results
#> target a
#> target b
#> target c
drake::readd(b)
#> cache /tmp/RtmpriGbJu/.drake
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

The arguments to unpack() can be targets; that's where it becomes really powerful. This is related to #283 (multi-file output, and the equivalent for R objects), but I don't think #283 is a prerequisite. If unpack() is called with an up-to-date target and unchanged code, all resulting targets (from the last run) remain up to date too.

Unpacking is a declarative operation, so we don't (necessarily) need to materialize all targets. In particular, if the target is the result of a previous call to meta_make(), the results are already unpacked.

pack

Semantics identical to tibble::lst(): Construct a list from a set of targets. The main difference is that this is a declarative operation that doesn't physically construct the list yet. It can be used to bundle targets together for use in a subsequent operation. Example of two equivalent plans:

# Proposed plan
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = pack(a, b)
)

# Equivalent plan that works in the current implementation
plan <- drake::drake_plan(
  a = 1,
  b = 2,
  packed = tibble::lst(a, b)
)

drake::make(plan)
#> target a
#> target b
#> target packed
drake::readd(packed)
#> cache /tmp/Rtmp6mdjtc/.drake
#> $a
#> [1] 1
#> 
#> $b
#> [1] 2

Created on 2018-03-07 by the reprex package (v0.2.0).

Essentially the opposite of unpack().

As with meta_make(), calling loadd() or readd() on such a target returns a list of pointers. See below for implementation details.

Why three verbs?

We could do meta-make + unpack as a single operation, and not implement pack at all. I'm following the Unix philosophy here, because I feel that we can only gain by exposing these operations separately, if only for testing. From the separate verbs, we can provide a flat_meta_make() (meta-make + unpack) or even a pack + meta-make + unpack verb. These operations feel simple enough to be understood individually and in combination.
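To make the composition concrete, here is a toy, eager model of the three verbs in base R. It is only a sketch of the semantics, not drake code: a "plan" is modelled as a named list of quoted commands, targets live in a plain environment, and all `toy_*` names are hypothetical. Nothing here is declarative or cached.

```r
# Toy, eager stand-ins for the proposed verbs (hypothetical names, not drake).

# meta-make: evaluate every command of a plan, return a named list of results.
toy_meta_make <- function(plan, envir = parent.frame()) {
  lapply(plan, eval, envir = envir)
}

# unpack: create one binding ("target") per element of a named list.
toy_unpack <- function(results, envir = parent.frame()) {
  list2env(results, envir = envir)
  invisible(names(results))
}

# pack: collect existing bindings into a named list, like tibble::lst(a, b).
toy_pack <- function(..., envir = parent.frame()) {
  nms <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  mget(nms, envir = envir)
}

# The combined verb is then a plain composition of two of the others.
toy_flat_meta_make <- function(plan, envir = parent.frame()) {
  toy_unpack(toy_meta_make(plan, envir), envir)
}

targets <- new.env()
targets$x <- 1
targets$y <- 2

plan <- list(a = quote(x), b = quote(x + y))
toy_flat_meta_make(plan, targets)    # creates targets$a and targets$b
packed <- toy_pack(a, b, envir = targets)
packed$b
#> [1] 3
```

Because each verb is an ordinary function on named lists, each can be tested on its own, which is the main argument for keeping them separate.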

These three verbs seem the simplest possible solution to me, but I may be missing a decomposition into even simpler operations.

Challenges

  • Delayed plan evaluation, possibly a new target state "unknown"
  • Visualization: We don't always want to expand the constructed plans when visualizing them
  • Storing object hierarchies: When storing x <- list(a = 1, b = 2), we want to be able to access x$a and x$b without loading x
  • ...

Implementation ideas

The new verbs can be implemented in a similar way to dbplyr: When executed, they return a lightweight data structure that contains all the information necessary to assemble the result. (In dbplyr, tbl %>% select(a, b) %>% filter(a > 5) creates an object that has a sql_render() method which composes the corresponding SQL, and only calling collect() will actually run the query.) This means that the objects returned by meta_make() et al. can just be serialized without special treatment.
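The lazy-object idea can be sketched in base R as follows. This is an illustration under assumptions, not the proposed implementation: the plan is again a named list of quoted commands, and `lazy_meta_make()` and `collect_targets()` are hypothetical names chosen to mirror the dbplyr analogy. The constructor only records what would have to be done, and evaluation happens in a separate step.

```r
# Hypothetical constructor: records the plan and does no work.
lazy_meta_make <- function(plan) {
  structure(list(plan = plan), class = "drake_meta_make")
}

# Hypothetical counterpart of dbplyr's collect(): only here are commands run.
collect_targets <- function(x, envir = parent.frame()) {
  lapply(x$plan, eval, envir = envir)
}

lazy <- lazy_meta_make(list(a = quote(1 + 1), b = quote(sqrt(16))))

# The object is plain R data, so it survives serialization unchanged.
restored <- unserialize(serialize(lazy, NULL))
identical(restored, lazy)
#> [1] TRUE

results <- collect_targets(restored)
results$b
#> [1] 4
```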

Implementation sketch

The return value of the new verbs could be S3 objects of classes "drake_meta_make", "drake_unpack" and "drake_pack", respectively. When the scheduler sees that a command returned an object of these classes, appropriate action is taken:

  • For "drake_meta_make", jobs are enqueued to the scheduler
    • results will be stored in a separate storr namespace (one result per meta-target)
    • the class will have $ and [[ methods overridden, its .Names attribute will correspond to the actual target names
      • for now this assumes that all jobs can read from the storr
    • readd() would just return the "drake_meta_make" object
  • For "drake_unpack", targets are added to the plan, making sure that no duplicates are created
    • the dependency graph is rewired to account for the new targets
    • if unpack() is called on a "drake_meta_make" object, we avoid copies by creating pointers (S3 objects of yet another class, say, "drake_pointer"), which are handled specially in loadd() and readd()
  • For "drake_pack", a list of "drake_pointer" objects is constructed and stored
    • the class will also have $ and [[ methods overridden
    • perhaps "drake_meta_make" can also be just a list of "drake_pointer" objects
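A minimal sketch of the pointer idea, assuming an environment as a stand-in for the storr cache. The class names follow the text above, but the constructors, `resolve_pointer()`, and the `cache` attribute are hypothetical. The packed object stores only keys, and the overridden `$` and `[[` methods look the value up at access time.

```r
# Stand-in for the storr cache (assumption): an environment keyed by target name.
cache <- new.env()
assign("a", 1, envir = cache)
assign("b", 2, envir = cache)

# A pointer stores only the key of a target, never its value.
drake_pointer <- function(key) {
  structure(list(key = key), class = "drake_pointer")
}

# Hypothetical helper: fetch the value a pointer refers to.
resolve_pointer <- function(ptr, cache) {
  get(ptr$key, envir = cache, inherits = FALSE)
}

# A pack is a named list of pointers; its names are the target names.
drake_pack <- function(keys, cache) {
  ptrs <- lapply(keys, drake_pointer)
  structure(ptrs, names = keys, cache = cache, class = "drake_pack")
}

# Overridden accessors resolve the pointer on access.
`[[.drake_pack` <- function(x, i) {
  resolve_pointer(unclass(x)[[i]], attr(x, "cache"))
}
`$.drake_pack` <- function(x, name) {
  x[[name]]
}

packed <- drake_pack(c("a", "b"), cache)
names(packed)
#> [1] "a" "b"
packed$a
#> [1] 1

# Pointers, not copies: a later change in the cache is visible through the pack.
assign("b", 20, envir = cache)
packed[["b"]]
#> [1] 20
```

In this sketch, creating `packed` never reads `a` or `b`, which is the property the proposal wants from loadd() and readd() on packed targets.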

Named lists vs. tibbles

The examples above use named lists for illustration. This means that names for objects/targets must be strings (as in the current implementation, so this is not a new restriction).

Ideally I'd prefer arbitrary (multivariate) keys to describe targets, and a nested tibble as data structure. (Let's not discuss this in too much detail for now.) If we support two-column data frames (target + x) from the start, we might be able to support multivariate keys later; I'd prefer this over the named list approach.

Alternatively, we might want to stick with named lists and provide seamless support for the enframe() and deframe() verbs that convert a named list to a two-column tibble and vice versa.
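For reference, this is how the round trip looks with the existing tibble::enframe() and tibble::deframe() functions. The column names `target` and `x` are taken from the two-column layout mentioned above. How drake would consume such a tibble is not shown and remains open.

```r
library(tibble)

results <- list(a = 1, b = 2)

# Named list -> two-column tibble (character column + list column).
tbl <- enframe(results, name = "target", value = "x")
names(tbl)
#> [1] "target" "x"

# Two-column tibble -> named list.
back <- deframe(tbl)
back$b
#> [1] 2
```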

Towards a DSL?

With a data-frame-based approach and multivariate keys, the DSL will focus on more efficient, elegant, and straightforward ways to construct plans, which are then passed on to meta_make().

On the other hand, restricting target names to simple strings may be enough if our DSL adds multivariate keys on top of that. Again, let's postpone discussion on that detail.