diff --git a/episodes/branch.Rmd b/episodes/branch.Rmd
index 23d4fcb5..9ec33475 100644
--- a/episodes/branch.Rmd
+++ b/episodes/branch.Rmd
@@ -1,6 +1,6 @@
---
title: 'Branching'
-teaching: 10
+teaching: 30
exercises: 2
---

@@ -30,6 +30,14 @@ Episode summary: Show how to use branching
library(targets)
library(tarchetypes)
library(broom)
+
+# sandpaper renders this lesson from episodes/
+# need to emulate this behavior during interactive development
+# would be preferable to use here::here() but it doesn't work for some reason
+if (interactive()) {
+  setwd("episodes")
+}
+
source("files/lesson_functions.R")

# Increase width for printing tibbles
@@ -102,15 +110,14 @@ This seems to indicate that the model is highly significant.
But wait a moment... is this really an appropriate model?
Recall that there are three species of penguins in the dataset.
It is possible that the relationship between bill depth and length **varies by species**.
-We should probably test some alternative models.
-These could include models that add a parameter for species, or add an interaction effect between species and bill length.
+Let's try making one model *per* species (three models total) to see how that does. (This is not strictly the correct statistical approach, but our focus here is learning `targets`, not statistics.)

Now our workflow is getting more complicated.
This is what a workflow for such an analysis might look like **without branching** (make sure to add `library(broom)` to `packages.R`):

```{r}
#| label = "example-model-show-1",
#| eval = FALSE,
-#| code = readLines("files/plans/plan_5.R")[2:31]
+#| code = readLines("files/plans/plan_5.R")[2:36]
```

```{r}
@@ -133,19 +140,32 @@ Let's look at the summary of one of the models:
#| eval: true
#| echo: [2]
pushd(plan_5_dir)
-tar_read(species_summary)
+tar_read(adelie_summary)
popd()
```

So this way of writing the pipeline works, but is repetitive: we have to call `glance()` each time we want to obtain summary statistics for each model.
-Furthermore, each summary target (`combined_summary`, etc.) is explicitly named and typed out manually.
+Furthermore, each summary target (`adelie_summary`, etc.) is explicitly named and typed out manually.
It would be fairly easy to make a typo and end up with the wrong model being summarized.

+Before moving on, let's define another **custom function**: `model_glance()`.
+You will need to write custom functions frequently when using `targets`, so it's good to get used to it!
+
+As the name `model_glance()` suggests (it is good to write functions with names that indicate their purpose), this will build a model, then immediately run `glance()` on it.
+The reason for doing so is that we get a **dataframe as a result**, which is very helpful for branching, as we will see in the next section.
+Save this in `R/functions.R`: + +```{r} +#| label = "model-glance", +#| eval = FALSE, +#| code = readLines("files/tar_functions/model_glance_orig.R") +``` + ## Example with branching ### First attempt -Let's see how to write the same plan using **dynamic branching**: +Let's see how to write the same plan using **dynamic branching** (after running it, we will go through the new version in detail to understand each step): ```{r} #| label = "example-model-show-3", @@ -165,63 +185,65 @@ pushd(plan_6_dir) # simulate already running the plan once write_example_plan("plan_5.R") tar_make(reporter = "silent") -write_example_plan("plan_6.R") +# run version of plan that uses `model_glance_orig()` (doesn't include species +# names in output) +write_example_plan("plan_6b.R") tar_make() -example_branch_name <- tar_branch_names(model_summaries, 1) +example_branch_name <- tar_branch_names(species_summary, 1) popd() ``` -There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `model_summaries` target. +There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `species_summary` target. That is the result of specifying targets using branching: each of the smaller targets are the "branches" that comprise the overall target. Since `targets` has no way of knowing ahead of time how many branches there will be or what they represent, it names each one using this series of numbers and letters (the "hash"). `targets` builds each branch one at a time, then combines them into the overall target. -Next, let's look in more detail about how the workflow is set up, starting with how we defined the models: +Next, let's look in more detail about how the workflow is set up, starting with how we set up the data: ```{r} #| label = "model-def", -#| code = readLines("files/plans/plan_6.R")[14:22], +#| code = readLines("files/plans/plan_6.R")[14:19], #| eval = FALSE ``` -Unlike the non-branching version, we defined the models **in a list** (instead of one target per model). -This is because dynamic branching is similar to the `base::apply()` or [`purrrr::map()`](https://purrr.tidyverse.org/reference/map.html) method of looping: it applies a function to each element of a list. -So we need to prepare the input for looping as a list. +Unlike the non-branching version, we added a step that **groups the data**. +This is because dynamic branching is similar to the [`tidyverse` approach](https://dplyr.tidyverse.org/articles/grouping.html) of applying the same function to a grouped dataframe. +So we use the `tar_group_by()` function to specify the groups in our input data: one group per species. -Next, take a look at the command to build the target `model_summaries`. +Next, take a look at the command to build the target `species_summary`. ```{r} #| label = "model-summaries", -#| code = readLines("files/plans/plan_6.R")[23:28], +#| code = readLines("files/plans/plan_6.R")[22:27], #| eval = FALSE ``` -As before, the first argument is the name of the target to build, and the second is the command to build it. +As before, the first argument to `tar_target()` is the name of the target to build, and the second is the command to build it. -Here, we apply the `glance()` function to each element of `models` (the `[[1]]` is necessary because when the function gets applied, each element is actually a nested list, and we need to remove one layer of nesting). 
+Here, we apply our custom `model_glance()` function to each group (in other words, each species) in `penguins_data_grouped`.

Finally, there is an argument we haven't seen before, `pattern`, which indicates that this target should be built using dynamic branching.
-`map` means to apply the command to each element of the input list (`models`) sequentially.
+`map` means to apply the function to each group of the input data (`penguins_data_grouped`) sequentially.

Now that we understand how the branching workflow is constructed, let's inspect the output:

```{r}
#| label: example-model-show-4
#| eval: FALSE
-tar_read(model_summaries)
+tar_read(species_summary)
```

```{r}
#| label: example-model-hide-4
#| echo: FALSE
pushd(plan_6_dir)
-tar_read(model_summaries)
+tar_read(species_summary)
popd()
```

The model summary statistics are all included in a single dataframe.
-But there's one problem: **we can't tell which row came from which model!** It would be unwise to assume that they are in the same order as the list of models.
+But there's one problem: **we can't tell which row came from which species!** It would be unwise to assume that they are in the same order as the input data.

This is due to the way dynamic branching works: by default, there is no information about the provenance of each target preserved in the output.

@@ -230,58 +252,43 @@ How can we fix this?

### Second attempt

The key to obtaining useful output from branching pipelines is to include the necessary information in the output of each individual branch.
-Here, we want to know the kind of model that corresponds to each row of the model summaries.
-To do that, we need to write a **custom function**.
-You will need to write custom functions frequently when using `targets`, so it's good to get used to it!
+Here, we want to know the species that corresponds to each row of the model summaries.

-Here is the function. Save this in `R/functions.R`:
+We can achieve this by modifying our `model_glance()` function to include a column for species (be sure to save `R/functions.R` after editing it):

```{r}
#| label: example-model-show-5
#| eval: FALSE
-#| file: files/tar_functions/glance_with_mod_name.R
+#| file: files/tar_functions/model_glance.R
```

-Our new pipeline looks almost the same as before, but this time we use the custom function instead of `glance()`.
+Our new pipeline looks exactly the same as before; we have made a modification, but to a **function**, not the pipeline.

-```{r}
-#| label = "example-model-show-6",
-#| code = readLines("files/plans/plan_7.R")[2:29],
-#| eval = FALSE
-```
+Since `targets` tracks the contents of each custom function, it realizes that it needs to recompute `species_summary` and runs this target again with the newly modified function.

```{r}
#| label: example-model-hide-6
#| echo: FALSE
pushd(plan_6_dir)
-write_example_plan("plan_7.R")
+write_example_plan("plan_6.R")
tar_make()
popd()
```

-And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (you may need to scroll to the right to see it).
+And this time, when we load `species_summary`, we can tell which species corresponds to which row (the `.before = 1` in `mutate()` ensures that the `species` column shows up before the other columns).

```{r}
#| label: example-model-7
#| echo: [2]
#| warning: false
pushd(plan_6_dir)
-tar_read(model_summaries)
+tar_read(species_summary)
popd()
```

Next we will add one more target, a prediction of bill depth based on each model.
These will be needed for plotting the models in the report.
-Such a prediction can be obtained with the `augment()` function of the `broom` package.
+Such a prediction can be obtained with the `augment()` function of the `broom` package, and we will create a custom function that outputs predicted points as a dataframe, much like we did for the model summaries.

-```{r}
-#| label: example-augment
-#| echo: [2, 3]
-#| eval: true
-pushd(plan_6_dir)
-tar_load(models)
-augment(models[[1]])
-popd()
-```

::::::::::::::::::::::::::::::::::::: {.challenge}

@@ -291,19 +298,19 @@ Can you add the model predictions using `augment()`? You will need to define a c

:::::::::::::::::::::::::::::::::: {.solution}

-Define the new function as `augment_with_mod_name()`. It is the same as `glance_with_mod_name()`, but use `augment()` instead of `glance()`:
+Define the new function as `model_augment()`. It is the same as `model_glance()`, but uses `augment()` instead of `glance()`:

```{r}
#| label: example-model-augment-func
#| eval: FALSE
-#| file: files/tar_functions/augment_with_mod_name.R
+#| file: files/tar_functions/model_augment.R
```

Add the step to the workflow:

```{r}
#| label = "example-model-augment-show",
-#| code = readLines("files/plans/plan_8.R")[2:35],
+#| code = readLines("files/plans/plan_7.R")[2:36],
#| eval = FALSE
```

::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::

@@ -311,13 +318,41 @@
+### Further simplify the workflow
+
+You may have noticed that we can further simplify the workflow: there is no need to have separate `penguins_data` and `penguins_data_grouped` dataframes.
+In general it is best to keep the number of named objects as small as possible to make it easier to reason about your code.
+Let's combine the cleaning and grouping steps into a single command:
+
+```{r}
+#| label = "example-model-show-8",
+#| eval = FALSE,
+#| code = readLines("files/plans/plan_8.R")[2:35]
+```
+
+And run it once more:
+
+```{r}
+#| label: example-model-hide-8
+#| echo: false
+pushd(plan_6_dir)
+# simulate already running the plan once
+write_example_plan("plan_7.R")
+tar_make(reporter = "silent")
+# run version of plan that combines the cleaning and grouping steps
+# into a single target
+write_example_plan("plan_8.R")
+tar_make()
+popd()
+```
+

::::::::::::::::::::::::::::::::::::: {.callout}

## Best practices for branching

-Dynamic branching is designed to work well with **dataframes** (tibbles).
+Dynamic branching is designed to work well with **dataframes** (it can also use [lists](https://books.ropensci.org/targets/dynamic.html#list-iteration), but that is more advanced, so we recommend using dataframes when possible).

-So if possible, write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.
+It is recommended to write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.
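+
+For example, a branching-friendly function might look like the sketch below (a hypothetical `summarize_bill_depth()`, not part of the lesson's plans, assuming the grouped penguins data from earlier): it takes a dataframe in, returns a dataframe out, and carries the species along as a metadata column.
+
+```{r}
+#| eval: FALSE
+summarize_bill_depth <- function(penguins_data) {
+  # Keep track of which group (species) this branch received
+  species_name <- unique(penguins_data$species)
+  # Return a dataframe: one summary row per branch, with the
+  # metadata (species) included as a column
+  penguins_data |>
+    summarize(mean_bill_depth = mean(bill_depth_mm, na.rm = TRUE)) |>
+    mutate(species = species_name, .before = 1)
+}
+```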
::::::::::::::::::::::::::::::::::::: diff --git a/episodes/files/plans/plan_10.R b/episodes/files/plans/plan_10.R index be92fd01..2dbc9d21 100644 --- a/episodes/files/plans/plan_10.R +++ b/episodes/files/plans/plan_10.R @@ -16,27 +16,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance_slow(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name_slow(models), - pattern = map(models) + species_summary, + model_glance_slow(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment_slow(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name_slow(models), - pattern = map(models) + species_predictions, + model_augment_slow(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/episodes/files/plans/plan_11.R b/episodes/files/plans/plan_11.R index 5c9af52f..6b23b0b3 100644 --- a/episodes/files/plans/plan_11.R +++ b/episodes/files/plans/plan_11.R @@ -9,34 +9,32 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ), # Generate report tar_quarto( penguin_report, path = "penguin_report.qmd", - quiet = FALSE, - packages = c("targets", "tidyverse") + quiet = FALSE ) ) diff --git a/episodes/files/plans/plan_5.R b/episodes/files/plans/plan_5.R index 882876cc..cecaae2b 100644 --- a/episodes/files/plans/plan_5.R +++ b/episodes/files/plans/plan_5.R @@ -16,16 +16,21 @@ tar_plan( bill_depth_mm ~ bill_length_mm, data = penguins_data ), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, - data = penguins_data + adelie_model = 
lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Adelie") ), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, - data = penguins_data + chinstrap_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Chinstrap") + ), + gentoo_model = lm( + bill_depth_mm ~ bill_length_mm, + data = filter(penguins_data, species == "Gentoo") ), # Get model summaries combined_summary = glance(combined_model), - species_summary = glance(species_model), - interaction_summary = glance(interaction_model) + adelie_summary = glance(adelie_model), + chinstrap_summary = glance(chinstrap_model), + gentoo_summary = glance(gentoo_model) ) diff --git a/episodes/files/plans/plan_6.R b/episodes/files/plans/plan_6.R index fad7536b..33f30d95 100644 --- a/episodes/files/plans/plan_6.R +++ b/episodes/files/plans/plan_6.R @@ -11,19 +11,18 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Build combined model with all species together + combined_summary = model_glance(penguins_data), + # Build one model per species tar_target( - model_summaries, - glance(models[[1]]), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) diff --git a/episodes/files/plans/plan_6b.R b/episodes/files/plans/plan_6b.R new file mode 100644 index 00000000..28ac909c --- /dev/null +++ b/episodes/files/plans/plan_6b.R @@ -0,0 +1,28 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species + ), + # Build combined model with all species together + combined_summary = model_glance_orig(penguins_data), + # Build one model per species + tar_target( + species_summary, + model_glance_orig(penguins_data_grouped), + pattern = map(penguins_data_grouped) + ) +) diff --git a/episodes/files/plans/plan_7.R b/episodes/files/plans/plan_7.R index 346cca74..af844230 100644 --- a/episodes/files/plans/plan_7.R +++ b/episodes/files/plans/plan_7.R @@ -11,19 +11,26 @@ tar_plan( ), # Clean data penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Group data + tar_group_by( + penguins_data_grouped, + penguins_data, + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data_grouped), + pattern = 
map(penguins_data_grouped) + ), + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data_grouped), + # Get predictions of one model per species + tar_target( + species_predictions, + model_augment(penguins_data_grouped), + pattern = map(penguins_data_grouped) ) ) diff --git a/episodes/files/plans/plan_8.R b/episodes/files/plans/plan_8.R index 8a6779ef..9d76b4a4 100644 --- a/episodes/files/plans/plan_8.R +++ b/episodes/files/plans/plan_8.R @@ -9,27 +9,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/episodes/files/plans/plan_9.R b/episodes/files/plans/plan_9.R index 164359b1..5eb6ed7b 100644 --- a/episodes/files/plans/plan_9.R +++ b/episodes/files/plans/plan_9.R @@ -16,27 +16,26 @@ tar_plan( path_to_file("penguins_raw.csv"), read_csv(!!.x, show_col_types = FALSE) ), - # Clean data - penguins_data = clean_penguin_data(penguins_data_raw), - # Build models - models = list( - combined_model = lm( - bill_depth_mm ~ bill_length_mm, data = penguins_data), - species_model = lm( - bill_depth_mm ~ bill_length_mm + species, data = penguins_data), - interaction_model = lm( - bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + # Clean and group data + tar_group_by( + penguins_data, + clean_penguin_data(penguins_data_raw), + species ), - # Get model summaries + # Get summary of combined model with all species together + combined_summary = model_glance(penguins_data), + # Get summary of one model per species tar_target( - model_summaries, - glance_with_mod_name(models), - pattern = map(models) + species_summary, + model_glance(penguins_data), + pattern = map(penguins_data) ), - # Get model predictions + # Get predictions of combined model with all species together + combined_predictions = model_augment(penguins_data), + # Get predictions of one model per species tar_target( - model_predictions, - augment_with_mod_name(models), - pattern = map(models) + species_predictions, + model_augment(penguins_data), + pattern = map(penguins_data) ) ) diff --git a/episodes/files/tar_functions/model_augment.R b/episodes/files/tar_functions/model_augment.R new file mode 100644 index 00000000..68875d00 --- /dev/null +++ b/episodes/files/tar_functions/model_augment.R @@ -0,0 +1,16 @@ +model_augment <- 
function(penguins_data) {
+  # Make model
+  model <- lm(
+    bill_depth_mm ~ bill_length_mm,
+    data = penguins_data)
+  # Get species name
+  species_name <- unique(penguins_data$species)
+  # If this is the combined dataset with multiple
+  # species, change the name to 'combined'
+  if (length(species_name) > 1) {
+    species_name <- "combined"
+  }
+  # Get model predictions and add species name
+  augment(model) |>
+    mutate(species = species_name, .before = 1)
+}
diff --git a/episodes/files/tar_functions/model_augment_slow.R b/episodes/files/tar_functions/model_augment_slow.R
new file mode 100644
index 00000000..8dd99fe6
--- /dev/null
+++ b/episodes/files/tar_functions/model_augment_slow.R
@@ -0,0 +1,17 @@
+model_augment_slow <- function(penguins_data) {
+  Sys.sleep(4)
+  # Make model
+  model <- lm(
+    bill_depth_mm ~ bill_length_mm,
+    data = penguins_data)
+  # Get species name
+  species_name <- unique(penguins_data$species)
+  # If this is the combined dataset with multiple
+  # species, change the name to 'combined'
+  if (length(species_name) > 1) {
+    species_name <- "combined"
+  }
+  # Get model predictions and add species name
+  augment(model) |>
+    mutate(species = species_name, .before = 1)
+}
diff --git a/episodes/files/tar_functions/model_glance.R b/episodes/files/tar_functions/model_glance.R
new file mode 100644
index 00000000..c324161f
--- /dev/null
+++ b/episodes/files/tar_functions/model_glance.R
@@ -0,0 +1,16 @@
+model_glance <- function(penguins_data) {
+  # Make model
+  model <- lm(
+    bill_depth_mm ~ bill_length_mm,
+    data = penguins_data)
+  # Get species name
+  species_name <- unique(penguins_data$species)
+  # If this is the combined dataset with multiple
+  # species, change the name to 'combined'
+  if (length(species_name) > 1) {
+    species_name <- "combined"
+  }
+  # Get model summary and add species name
+  glance(model) |>
+    mutate(species = species_name, .before = 1)
+}
diff --git a/episodes/files/tar_functions/model_glance_orig.R b/episodes/files/tar_functions/model_glance_orig.R
new file mode 100644
index 00000000..a0c3fdd4
--- /dev/null
+++ b/episodes/files/tar_functions/model_glance_orig.R
@@ -0,0 +1,6 @@
+model_glance_orig <- function(penguins_data) {
+  model <- lm(
+    bill_depth_mm ~ bill_length_mm,
+    data = penguins_data)
+  broom::glance(model)
+}
diff --git a/episodes/files/tar_functions/model_glance_slow.R b/episodes/files/tar_functions/model_glance_slow.R
new file mode 100644
index 00000000..ba37fe66
--- /dev/null
+++ b/episodes/files/tar_functions/model_glance_slow.R
@@ -0,0 +1,17 @@
+model_glance_slow <- function(penguins_data) {
+  Sys.sleep(4)
+  # Make model
+  model <- lm(
+    bill_depth_mm ~ bill_length_mm,
+    data = penguins_data)
+  # Get species name
+  species_name <- unique(penguins_data$species)
+  # If this is the combined dataset with multiple
+  # species, change the name to 'combined'
+  if (length(species_name) > 1) {
+    species_name <- "combined"
+  }
+  # Get model summary and add species name
+  glance(model) |>
+    mutate(species = species_name, .before = 1)
+}
diff --git a/episodes/parallel.Rmd b/episodes/parallel.Rmd
index 1bdcb79b..1fa7c573 100644
--- a/episodes/parallel.Rmd
+++ b/episodes/parallel.Rmd
@@ -1,6 +1,6 @@
---
title: 'Parallel Processing'
-teaching: 10
+teaching: 15
exercises: 2
---

@@ -30,6 +30,11 @@ Episode summary: Show how to use parallel processing
library(targets)
library(tarchetypes)
library(broom)
+
+if (interactive()) {
+  setwd("episodes")
+}
+
source("files/lesson_functions.R")

# Increase width for printing tibbles
@@ -76,7 +81,7 @@ It should now look like this:
There
is still one more thing we need to modify only for the purposes of this demo: if we ran the analysis in parallel now, you wouldn't notice any difference in compute time because the functions are so fast. -So let's make "slow" versions of `glance_with_mod_name()` and `augment_with_mod_name()` using the `Sys.sleep()` function, which just tells the computer to wait some number of seconds. +So let's make "slow" versions of `model_glance()` and `model_augment()` using the `Sys.sleep()` function, which just tells the computer to wait some number of seconds. This will simulate a long-running computation and enable us to see the difference between running sequentially and in parallel. Add these functions to `functions.R` (you can copy-paste the original ones, then modify them): @@ -85,8 +90,8 @@ Add these functions to `functions.R` (you can copy-paste the original ones, then #| label: slow-funcs #| eval: false #| file: -#| - files/tar_functions/glance_with_mod_name_slow.R -#| - files/tar_functions/augment_with_mod_name_slow.R +#| - files/tar_functions/model_glance_slow.R +#| - files/tar_functions/model_augment_slow.R ``` Then, change the plan to use the "slow" version of the functions: @@ -105,38 +110,13 @@ Finally, run the pipeline with `tar_make()` as normal. #| message: false #| echo: false -# FIXME: parallel code uses all available CPUs and hangs when rendering website -# with sandpaper::build_lesson(), even though it only uses 2 when run -# interactively -# -# plan_10_dir <- make_tempdir() -# pushd(plan_10_dir) -# write_example_plan("plan_9.R") -# tar_make(reporter = "silent") -# write_example_plan("plan_10.R") -# tar_make() -# popd() - -# Solution for now is to hard-code output -cat("✔ skip target penguins_data_raw_file -✔ skip target penguins_data_raw -✔ skip target penguins_data -✔ skip target models -• start branch model_predictions_5ad4cec5 -• start branch model_predictions_c73912d5 -• start branch model_predictions_91696941 -• start branch model_summaries_5ad4cec5 -• start branch model_summaries_c73912d5 -• start branch model_summaries_91696941 -• built branch model_predictions_5ad4cec5 [4.884 seconds] -• built branch model_predictions_c73912d5 [4.896 seconds] -• built branch model_predictions_91696941 [4.006 seconds] -• built pattern model_predictions -• built branch model_summaries_5ad4cec5 [4.011 seconds] -• built branch model_summaries_c73912d5 [4.011 seconds] -• built branch model_summaries_91696941 [4.011 seconds] -• built pattern model_summaries -• end pipeline [15.153 seconds]") +plan_10_dir <- make_tempdir() +pushd(plan_10_dir) +write_example_plan("plan_9.R") +tar_make(reporter = "silent") +write_example_plan("plan_10.R") +tar_make() +popd() ``` Notice that although the time required to build each individual target is about 4 seconds, the total time to run the entire workflow is less than the sum of the individual target times! That is proof that processes are running in parallel **and saving you time**. 
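For reference, the `_targets.R` setup referred to above ("It should now look like this") can be sketched roughly as follows, assuming the `crew` package supplies the parallel workers (the `workers` value is only an example; adjust it to your machine):

```r
# _targets.R (sketch): run targets in parallel with crew workers
library(targets)
library(tarchetypes)
library(crew)

tar_option_set(
  # launch up to 2 parallel R processes to build targets
  controller = crew_controller_local(workers = 2)
)
```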
diff --git a/episodes/quarto.Rmd b/episodes/quarto.Rmd
index e9d724a6..4f095276 100644
--- a/episodes/quarto.Rmd
+++ b/episodes/quarto.Rmd
@@ -30,6 +30,11 @@ Episode summary: Show how to write reports with Quarto
library(targets)
library(tarchetypes)
library(quarto) # don't actually need to load, but put here so renv catches it
+
+if (interactive()) {
+  setwd("episodes")
+}
+
source("files/lesson_functions.R")

# Increase width for printing tibbles
@@ -106,7 +111,6 @@ Then, add one more target to the pipeline using the `tar_quarto()` function like

tar_dir({
  library(quarto)
-  write_example_plan(9)
  readr::read_lines("https://raw.githubusercontent.com/joelnitta/penguins-targets/main/penguin_report.qmd") |>
    readr::write_lines("penguin_report.qmd")
  # Run it
@@ -130,36 +134,25 @@ How does this work?

The answer lies **inside** the `penguin_report.qmd` file.
Let's look at the start of the file:

-``````{markdown}
----
-title: "Simpson's Paradox in Palmer Penguins"
-format:
-  html:
-    toc: true
-execute:
-  echo: false
----
-
```{r}
-#| label: load
-#| message: false
-targets::tar_load(penguin_models_augmented)
-targets::tar_load(penguin_models_summary)
-
-library(tidyverse)
-```
+#| label: show-penguin-report-qmd
+#| echo: FALSE
+#| results: 'asis'

-This is an example analysis of penguins on the Palmer Archipelago in Antarctica.
+penguin_qmd <- readr::read_lines("https://raw.githubusercontent.com/joelnitta/penguins-targets/main/penguin_report.qmd")

-``````
+cat("````{.markdown}\n")
+cat(penguin_qmd[1:24], sep = "\n")
+cat("\n````")
+```

The lines in between `---` and `---` at the very beginning are called the "YAML header", and contain directions about how to render the document.

The R code to be executed is specified by the lines between `` ```{r} `` and `` ``` ``.
This is called a "code chunk", since it is a portion of code interspersed within prose text.

-Take a closer look at the R code chunk. Notice the two calls to `targets::tar_load()`. Do you remember what that function does? It loads the targets built during the workflow.
+Take a closer look at the R code chunk. Notice the use of `targets::tar_load()`. Do you remember what that function does? It loads the targets built during the workflow.

-Now things should make a bit more sense: `targets` knows that the report depends on the targets built during the workflow, `penguin_models_augmented` and `penguin_models_summary`, **because they are loaded in the report with `tar_load()`.**
+Now things should make a bit more sense: `targets` knows that the report depends on the targets built during the workflow, such as `combined_summary` and `species_summary`, **because they are loaded in the report with `tar_load()`.**

## Generating dynamic content

@@ -169,13 +162,13 @@ The call to `tar_load()` at the start of `penguin_report.qmd` is really the key

## Challenge: Spot the dynamic contents

-Read through `penguin_report.qmd` and try to find instances where the targets built during the workflow (`penguin_models_augmented` and `penguin_models_summary`) are used to dynamically produce text and plots.
+Read through `penguin_report.qmd` and try to find instances where the targets built during the workflow (`combined_summary`, etc.) are used to dynamically produce text and plots.

:::::::::::::::::::::::::::::::::: {.solution}

-- In the code chunk labeled `results-stats`, statistics from the models like *P*-value and adjusted *R* squared are extracted, then inserted into the text with in-line code like `` `r knitr::inline_expr("mod_stats$combined$r.squared")` ``.
+- In the code chunk labeled `results-stats`, statistics from the models like *R* squared are extracted, then inserted into the text with in-line code like `` `r knitr::inline_expr("combined_r2")` ``. -- There are two figures, one for the combined model and one for the separate model (code chunks labeled `fig-combined-plot` and `fig-separate-plot`, respectively). These are built using the points predicted from the model in `penguin_models_augmented`. +- There are two figures, one for the combined model and one for the separate models (code chunks labeled `fig-combined-plot` and `fig-separate-plot`, respectively). These are built using the points predicted from the model in `combined_predictions` and `species_predictions`. :::::::::::::::::::::::::::::::::: diff --git a/renv/activate.R b/renv/activate.R index d13f9932..8638f7fe 100644 --- a/renv/activate.R +++ b/renv/activate.R @@ -98,6 +98,66 @@ local({ unloadNamespace("renv") # load bootstrap tools + ansify <- function(text) { + if (renv_ansify_enabled()) + renv_ansify_enhanced(text) + else + renv_ansify_default(text) + } + + renv_ansify_enabled <- function() { + + override <- Sys.getenv("RENV_ANSIFY_ENABLED", unset = NA) + if (!is.na(override)) + return(as.logical(override)) + + pane <- Sys.getenv("RSTUDIO_CHILD_PROCESS_PANE", unset = NA) + if (identical(pane, "build")) + return(FALSE) + + testthat <- Sys.getenv("TESTTHAT", unset = "false") + if (tolower(testthat) %in% "true") + return(FALSE) + + iderun <- Sys.getenv("R_CLI_HAS_HYPERLINK_IDE_RUN", unset = "false") + if (tolower(iderun) %in% "false") + return(FALSE) + + TRUE + + } + + renv_ansify_default <- function(text) { + text + } + + renv_ansify_enhanced <- function(text) { + + # R help links + pattern <- "`\\?(renv::(?:[^`])+)`" + replacement <- "`\033]8;;ide:help:\\1\a?\\1\033]8;;\a`" + text <- gsub(pattern, replacement, text, perl = TRUE) + + # runnable code + pattern <- "`(renv::(?:[^`])+)`" + replacement <- "`\033]8;;ide:run:\\1\a\\1\033]8;;\a`" + text <- gsub(pattern, replacement, text, perl = TRUE) + + # return ansified text + text + + } + + renv_ansify_init <- function() { + + envir <- renv_envir_self() + if (renv_ansify_enabled()) + assign("ansify", renv_ansify_enhanced, envir = envir) + else + assign("ansify", renv_ansify_default, envir = envir) + + } + `%||%` <- function(x, y) { if (is.null(x)) y else x } @@ -142,7 +202,10 @@ local({ # compute common indent indent <- regexpr("[^[:space:]]", lines) common <- min(setdiff(indent, -1L)) - leave - paste(substring(lines, common), collapse = "\n") + text <- paste(substring(lines, common), collapse = "\n") + + # substitute in ANSI links for executable renv code + ansify(text) } @@ -305,8 +368,11 @@ local({ quiet = TRUE ) - if ("headers" %in% names(formals(utils::download.file))) - args$headers <- renv_bootstrap_download_custom_headers(url) + if ("headers" %in% names(formals(utils::download.file))) { + headers <- renv_bootstrap_download_custom_headers(url) + if (length(headers) && is.character(headers)) + args$headers <- headers + } do.call(utils::download.file, args) @@ -385,10 +451,21 @@ local({ for (type in types) { for (repos in renv_bootstrap_repos()) { + # build arguments for utils::available.packages() call + args <- list(type = type, repos = repos) + + # add custom headers if available -- note that + # utils::available.packages() will pass this to download.file() + if ("headers" %in% names(formals(utils::download.file))) { + headers <- renv_bootstrap_download_custom_headers(repos) + if (length(headers) && 
is.character(headers)) + args$headers <- headers + } + # retrieve package database db <- tryCatch( as.data.frame( - utils::available.packages(type = type, repos = repos), + do.call(utils::available.packages, args), stringsAsFactors = FALSE ), error = identity @@ -470,6 +547,14 @@ local({ } + renv_bootstrap_github_token <- function() { + for (envvar in c("GITHUB_TOKEN", "GITHUB_PAT", "GH_TOKEN")) { + envval <- Sys.getenv(envvar, unset = NA) + if (!is.na(envval)) + return(envval) + } + } + renv_bootstrap_download_github <- function(version) { enabled <- Sys.getenv("RENV_BOOTSTRAP_FROM_GITHUB", unset = "TRUE") @@ -477,16 +562,16 @@ local({ return(FALSE) # prepare download options - pat <- Sys.getenv("GITHUB_PAT") - if (nzchar(Sys.which("curl")) && nzchar(pat)) { + token <- renv_bootstrap_github_token() + if (nzchar(Sys.which("curl")) && nzchar(token)) { fmt <- "--location --fail --header \"Authorization: token %s\"" - extra <- sprintf(fmt, pat) + extra <- sprintf(fmt, token) saved <- options("download.file.method", "download.file.extra") options(download.file.method = "curl", download.file.extra = extra) on.exit(do.call(base::options, saved), add = TRUE) - } else if (nzchar(Sys.which("wget")) && nzchar(pat)) { + } else if (nzchar(Sys.which("wget")) && nzchar(token)) { fmt <- "--header=\"Authorization: token %s\"" - extra <- sprintf(fmt, pat) + extra <- sprintf(fmt, token) saved <- options("download.file.method", "download.file.extra") options(download.file.method = "wget", download.file.extra = extra) on.exit(do.call(base::options, saved), add = TRUE)