Development practices for large pipelines #921
Replies: 1 comment 2 replies
No, this would not be automatic or convenient, because it breaks the concept of a Make-like pipeline with a directed acyclic graph of dependencies. To see why, consider a simple pipeline:

```r
library(targets)
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    tar_target(model, lm(y ~ x, data = data)),
    tar_target(summary, coef(model))
  )
)
tar_make()
#> • start target data
#> • built target data
#> • start target model
#> • built target model
#> • start target summary
#> • built target summary
#> • end pipeline: 0.108 seconds
tar_visnetwork()
```
If we comment out the `model` target, the pipeline breaks:

```r
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    # tar_target(model, lm(y ~ x, data = data)),
    tar_target(summary, coef(model))
  )
)
tar_visnetwork()
tar_make()
#> • start target summary
#> ✖ error target summary
#> • end pipeline: 0.091 seconds
#> Error:
#> ! Error running targets::tar_make()
#> Target errors: targets::tar_meta(fields = error, complete_only = TRUE)
#> Tips: https://books.ropensci.org/targets/debugging.html
#> Last error: object 'model' not found
```

You would need to recode the pipeline so that either the upstream target or the downstream target takes care of the modeling step:

```r
library(targets)
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    tar_target(summary, coef(lm(y ~ x, data = data)))
  )
)
tar_visnetwork()
tar_make()
#> • start target data
#> • built target data
#> • start target summary
#> • built target summary
#> • end pipeline: 0.085 seconds
```

A pipeline is automated and encapsulated, which is the extreme opposite of a traditional interactive analysis in R: less need for computational power, but the ability to dissect and inspect anything at any moment. That's always the tradeoff.
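That said, encapsulation does not mean losing access to intermediate results after the fact. Once `tar_make()` finishes, every target is stored in the local data store (`_targets/` by default), and `tar_read()` and `tar_load()` can pull any of them back into an interactive session. A minimal sketch, assuming the first pipeline above has already run:

```r
library(targets)
# tar_read() returns the stored value of a single target.
fit <- tar_read(model)
# tar_load() assigns one or more targets into the global environment
# (it accepts tidyselect helpers as well as bare names).
tar_load(c(data, summary))
coef(fit)
```

This is usually how interactive inspection and debugging happen with `targets`: run the pipeline, then read targets back, rather than commenting pieces out of the DAG.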
-
@wlandau, in a separate conversation about a specific application of `targets`, you mentioned that most of the time the goal when using `targets` is to save the most important outputs (vs. potentially creating a bunch of targets for intermediate output). My question: is there a straightforward way to go ahead and declare targets for such intermediate output, but then "turn them off" (in terms of what gets saved in the data store) at a point of your choosing?

I ask because, with such an option, you could still easily load intermediate output during development (i.e., when piloting with a small amount of pipeline throughput), but then preclude such output from being produced during large production runs. I see that `tar_delete()` exists, so maybe that is one way to achieve this? But as a `targets` newbie, I'd also like to know if there are any pitfalls to that approach.
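For concreteness, here is a sketch of the kind of workflow I have in mind. The `tmp_` prefix is purely an illustrative naming convention I am assuming for intermediate targets, not anything `targets` requires:

```r
library(targets)
# Development runs: tar_make() saves everything, so intermediate targets
# can be inspected with tar_read() / tar_load() while piloting.
tar_make()

# Production runs: after the pipeline finishes, remove intermediate
# targets from the data store. tar_delete() accepts tidyselect helpers,
# so this assumes intermediate targets share a "tmp_" name prefix.
tar_delete(starts_with("tmp_"))
```

Is that roughly the intended use of `tar_delete()`, or would deleting targets this way cause them to be needlessly rebuilt on the next `tar_make()`?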