Development practices for large pipelines #921
Replies: 1 comment 2 replies
No, this would not be automatic or convenient, because it breaks the concept of a Make-like pipeline with a directed acyclic graph of dependencies. To see why, consider a simple pipeline:

```r
library(targets)
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    tar_target(model, lm(y ~ x, data = data)),
    tar_target(summary, coef(model))
  )
)
tar_make()
#> • start target data
#> • built target data
#> • start target model
#> • built target model
#> • start target summary
#> • built target summary
#> • end pipeline: 0.108 seconds
tar_visnetwork()
```
If we comment out the `model` target, the pipeline breaks:

```r
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    # tar_target(model, lm(y ~ x, data = data)),
    tar_target(summary, coef(model))
  )
)
tar_visnetwork()
tar_make()
#> • start target summary
#> ✖ error target summary
#> • end pipeline: 0.091 seconds
#> Error:
#> ! Error running targets::tar_make()
#> Target errors: targets::tar_meta(fields = error, complete_only = TRUE)
#> Tips: https://books.ropensci.org/targets/debugging.html
#> Last error: object 'model' not found
```

You would need to recode the pipeline so that either the upstream target or the downstream target takes care of the modeling step:

```r
library(targets)
tar_script(
  list(
    tar_target(data, data.frame(x = rnorm(100), y = rnorm(100))),
    tar_target(summary, coef(lm(y ~ x, data = data)))
  )
)
tar_visnetwork()
tar_make()
#> • start target data
#> • built target data
#> • start target summary
#> • built target summary
#> • end pipeline: 0.085 seconds
```

A pipeline is automated and encapsulated, which is the extreme opposite of a traditional interactive analysis in R: less need for computational power, but the ability to dissect and inspect anything at any moment. That's always the tradeoff.
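That said, encapsulation does not mean losing access to intermediate results after the fact. Once `tar_make()` finishes, every target is stored in the local data store (`_targets/` by default), and `tar_read()` and `tar_load()` can pull any of them back into an interactive session. A minimal sketch, assuming the first pipeline above has already run:

```r
library(targets)
# tar_read() returns the stored value of a single target.
fit <- tar_read(model)
# tar_load() assigns one or more targets into the global environment
# (it accepts tidyselect helpers as well as bare names).
tar_load(c(data, summary))
coef(fit)
```

This is usually how interactive inspection and debugging happen with `targets`: run the pipeline, then read targets back, rather than commenting pieces out of the DAG.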
-
@wlandau, in a separate conversation about a specific application of `targets`, you mentioned that most of the time the goal when using `targets` is to save the most important outputs (vs. potentially creating a bunch of targets for intermediate output). My question: is there a straightforward way to go ahead and declare targets for such intermediate output, but then "turn them off" (in terms of what gets saved in the data store) at a point of your choosing?

I ask because, with such an option, you could still easily load intermediate output during development (i.e., when piloting with a small amount of pipeline throughput), but then preclude such output from being produced during large production runs. I see that `tar_delete()` exists, so maybe that is one way to achieve this? But as a `targets` newbie, I'd also like to know if there are any pitfalls to that approach.
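For concreteness, here is a sketch of the kind of workflow I have in mind. The `tmp_` prefix is purely an illustrative naming convention I am assuming for intermediate targets, not anything `targets` requires:

```r
library(targets)
# Development runs: tar_make() saves everything, so intermediate targets
# can be inspected with tar_read() / tar_load() while piloting.
tar_make()

# Production runs: after the pipeline finishes, remove intermediate
# targets from the data store. tar_delete() accepts tidyselect helpers,
# so this assumes intermediate targets share a "tmp_" name prefix.
tar_delete(starts_with("tmp_"))
```

Is that roughly the intended use of `tar_delete()`, or would deleting targets this way cause them to be needlessly rebuilt on the next `tar_make()`?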