carpentries-incubator · joelnitta · Dec 24, 2024 · Dec 2, 2024 · Dec 11, 2024 · Dec 23, 2024
diff --git a/episodes/branch.Rmd b/episodes/branch.Rmd
@@ -1,6 +1,6 @@
 ---
 title: 'Branching'
-teaching: 10
+teaching: 30
 exercises: 2
 ---
 
@@ -30,6 +30,14 @@ Episode summary: Show how to use branching
 library(targets)
 library(tarchetypes)
 library(broom)
+
+# sandpaper renders this lesson from episodes/
+# need to emulate this behavior during interactive development
+# would be preferable to use here::here() but it doesn't work for some reason
+if (interactive()) {
+  setwd("episodes")
+}
+
 source("files/lesson_functions.R")
 
 # Increase width for printing tibbles
@@ -102,15 +110,14 @@ This seems to indicate that the model is highly significant.
 
 But wait a moment... is this really an appropriate model? Recall that there are three species of penguins in the dataset. It is possible that the relationship between bill depth and length **varies by species**.
 
-We should probably test some alternative models.
-These could include models that add a parameter for species, or add an interaction effect between species and bill length.
+Let's try making one model *per* species (three models total) to see how that does (this is technically not the correct statistical approach, but our focus here is to learn `targets`, not statistics).
 
 Now our workflow is getting more complicated. This is what a workflow for such an analysis might look like **without branching** (make sure to add `library(broom)` to `packages.R`):
 
 ```{r}
 #| label = "example-model-show-1",
 #| eval = FALSE,
-#| code = readLines("files/plans/plan_5.R")[2:31]
+#| code = readLines("files/plans/plan_5.R")[2:36]
 ```
 
 ```{r}
@@ -133,19 +140,32 @@ Let's look at the summary of one of the models:
 #| eval: true
 #| echo: [2]
 pushd(plan_5_dir)
-tar_read(species_summary)
+tar_read(adelie_summary)
 popd()
 ```
 
 So this way of writing the pipeline works, but is repetitive: we have to call `glance()` each time we want to obtain summary statistics for each model.
-Furthermore, each summary target (`combined_summary`, etc.) is explicitly named and typed out manually.
+Furthermore, each summary target (`adelie_summary`, etc.) is explicitly named and typed out manually.
 It would be fairly easy to make a typo and end up with the wrong model being summarized.
 
+Before moving on, let's define another **custom function** function: `model_glance()`.
+You will need to write custom functions frequently when using `targets`, so it's good to get used to it!
+
+As the name `model_glance()` suggests (it is good to write functions with names that indicate their purpose), this will build a model then immediately run `glance()` on it.
+The reason for doing so is that we get a **dataframe as a result**, which is very helpful for branching, as we will see in the next section.
+Save this in `R/functions.R`:
+
+```{r}
+#| label = "model-glance",
+#| eval = FALSE,
+#| code = readLines("files/tar_functions/model_glance_orig.R")
+```
+
 ## Example with branching
 
 ### First attempt
 
-Let's see how to write the same plan using **dynamic branching**:
+Let's see how to write the same plan using **dynamic branching** (after running it, we will go through the new version in detail to understand each step):
 
 ```{r}
 #| label = "example-model-show-3",
@@ -165,63 +185,65 @@ pushd(plan_6_dir)
 # simulate already running the plan once
 write_example_plan("plan_5.R")
 tar_make(reporter = "silent")
-write_example_plan("plan_6.R")
+# run version of plan that uses `model_glance_orig()` (doesn't include species
+# names in output)
+write_example_plan("plan_6b.R")
 tar_make()
-example_branch_name <- tar_branch_names(model_summaries, 1)
+example_branch_name <- tar_branch_names(species_summary, 1)
 popd()
 ```
 
-There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `model_summaries` target.
+There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `species_summary` target.
 That is the result of specifying targets using branching: each of the smaller targets are the "branches" that comprise the overall target.
 Since `targets` has no way of knowing ahead of time how many branches there will be or what they represent, it names each one using this series of numbers and letters (the "hash").
 `targets` builds each branch one at a time, then combines them into the overall target.
 
-Next, let's look in more detail about how the workflow is set up, starting with how we defined the models:
+Next, let's look in more detail about how the workflow is set up, starting with how we set up the data:
 
 ```{r}
 #| label = "model-def",
-#| code = readLines("files/plans/plan_6.R")[14:22],
+#| code = readLines("files/plans/plan_6.R")[14:19],
 #| eval = FALSE
 ```
 
-Unlike the non-branching version, we defined the models **in a list** (instead of one target per model).
-This is because dynamic branching is similar to the `base::apply()` or [`purrrr::map()`](https://purrr.tidyverse.org/reference/map.html) method of looping: it applies a function to each element of a list.
-So we need to prepare the input for looping as a list.
+Unlike the non-branching version, we added a step that **groups the data**.
+This is because dynamic branching is similar to the [`tidyverse` approach](https://dplyr.tidyverse.org/articles/grouping.html) of applying the same function to a grouped dataframe.
+So we use the `tar_group_by()` function to specify the groups in our input data: one group per species.
 
-Next, take a look at the command to build the target `model_summaries`.
+Next, take a look at the command to build the target `species_summary`.
 
 ```{r}
 #| label = "model-summaries",
-#| code = readLines("files/plans/plan_6.R")[23:28],
+#| code = readLines("files/plans/plan_6.R")[22:27],
 #| eval = FALSE
 ```
 
-As before, the first argument is the name of the target to build, and the second is the command to build it.
+As before, the first argument to `tar_target()` is the name of the target to build, and the second is the command to build it.
 
-Here, we apply the `glance()` function to each element of `models` (the `[[1]]` is necessary because when the function gets applied, each element is actually a nested list, and we need to remove one layer of nesting).
+Here, we apply our custom `model_glance()` function to each group (in other words, each species) in `penguins_data_grouped`.
 
 Finally, there is an argument we haven't seen before, `pattern`, which indicates that this target should be built using dynamic branching.
-`map` means to apply the command to each element of the input list (`models`) sequentially.
+`map` means to apply the function to each group of the input data (`penguins_data_grouped`) sequentially.
 
 Now that we understand how the branching workflow is constructed, let's inspect the output:
 
 ```{r}
 #| label: example-model-show-4
 #| eval: FALSE
-tar_read(model_summaries)
+tar_read(species_summary)
 ```
 
 ```{r}
 #| label: example-model-hide-4
 #| echo: FALSE
 pushd(plan_6_dir)
-tar_read(model_summaries)
+tar_read(species_summary)
 popd()
 ```
 
 The model summary statistics are all included in a single dataframe.
 
-But there's one problem: **we can't tell which row came from which model!** It would be unwise to assume that they are in the same order as the list of models.
+But there's one problem: **we can't tell which row came from which species!** It would be unwise to assume that they are in the same order as the input data.
 
 This is due to the way dynamic branching works: by default, there is no information about the provenance of each target preserved in the output.
 
@@ -230,58 +252,43 @@ How can we fix this?
 ### Second attempt
 
 The key to obtaining useful output from branching pipelines is to include the necessary information in the output of each individual branch.
-Here, we want to know the kind of model that corresponds to each row of the model summaries.
-To do that, we need to write a **custom function**.
-You will need to write custom functions frequently when using `targets`, so it's good to get used to it!
+Here, we want to know the species that corresponds to each row of the model summaries.
 
-Here is the function. Save this in `R/functions.R`:
+We can achieve this by modifying our `model_glance` function. Be sure to save it after modifying it to include a column for species:
 
 ```{r}
 #| label: example-model-show-5
 #| eval: FALSE
-#| file: files/tar_functions/glance_with_mod_name.R
+#| file: files/tar_functions/model_glance.R
 ```
 
-Our new pipeline looks almost the same as before, but this time we use the custom function instead of `glance()`.
+Our new pipeline looks exactly the same as before; we have made a modification, but to a **function**, not the pipeline.
 
-```{r}
-#| label = "example-model-show-6",
-#| code = readLines("files/plans/plan_7.R")[2:29],
-#| eval = FALSE
-```
+Since `targets` tracks the contents of each custom function, it realizes that it needs to recompute `species_summary` and runs this target again with the newly modified function.
 
 ```{r}
 #| label: example-model-hide-6
 #| echo: FALSE
 pushd(plan_6_dir)
-write_example_plan("plan_7.R")
+write_example_plan("plan_6.R")
 tar_make()
 popd()
 ```
 
-And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (you may need to scroll to the right to see it).
+And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (the `.before = 1` in `mutate()` ensures that it shows up before the other columns).
 
 ```{r}
 #| label: example-model-7
 #| echo: [2]
 #| warning: false
 pushd(plan_6_dir)
-tar_read(model_summaries)
+tar_read(species_summary)
 popd()
 ```
 
 Next we will add one more target, a prediction of bill depth based on each model. These will be needed for plotting the models in the report.
-Such a prediction can be obtained with the `augment()` function of the `broom` package.
+Such a prediction can be obtained with the `augment()` function of the `broom` package, and we create a custom function that outputs predicted points as a dataframe much like we did for the model summaries.
 
-```{r}
-#| label: example-augment
-#| echo: [2, 3]
-#| eval: true
-pushd(plan_6_dir)
-tar_load(models)
-augment(models[[1]])
-popd()
-```
 
 ::::::::::::::::::::::::::::::::::::: {.challenge}
 
@@ -291,33 +298,61 @@ Can you add the model predictions using `augment()`? You will need to define a c
 
 :::::::::::::::::::::::::::::::::: {.solution}
 
-Define the new function as `augment_with_mod_name()`. It is the same as `glance_with_mod_name()`, but use `augment()` instead of `glance()`:
+Define the new function as `model_augment()`. It is the same as `model_glance()`, but use `augment()` instead of `glance()`:
 
 ```{r}
 #| label: example-model-augment-func
 #| eval: FALSE
-#| file: files/tar_functions/augment_with_mod_name.R
+#| file: files/tar_functions/model_augment.R
 ```
 
 Add the step to the workflow:
 
 ```{r}
 #| label = "example-model-augment-show",
-#| code = readLines("files/plans/plan_8.R")[2:35],
+#| code = readLines("files/plans/plan_7.R")[2:36],
 #| eval = FALSE
 ```
 
 ::::::::::::::::::::::::::::::::::
 
 :::::::::::::::::::::::::::::::::::::
 
+### Further simplify the workflow
+
+You may have noticed that we can further simplify the workflow: there is no need to have separate `penguins_data` and `penguins_data_grouped` dataframes.
+In general it is best to keep the number of named objects as small as possible to make it easier to reason about your code.
+Let's combine the cleaning and grouping step into a single command:
+
+```{r}
+#| label = "example-model-show-8",
+#| eval = FALSE,
+#| code = readLines("files/plans/plan_8.R")[2:35]
+```
+
+And run it once more:
+
+```{r}
+#| label: example-model-hide-8
+#| echo: false
+pushd(plan_6_dir)
+# simulate already running the plan once
+write_example_plan("plan_7.R")
+tar_make(reporter = "silent")
+# run version of plan that uses `model_glance_orig()` (doesn't include species
+# names in output)
+write_example_plan("plan_8.R")
+tar_make()
+popd()
+```
+
 ::::::::::::::::::::::::::::::::::::: {.callout}
 
 ## Best practices for branching
 
-Dynamic branching is designed to work well with **dataframes** (tibbles).
+Dynamic branching is designed to work well with **dataframes** (it can also use [lists](https://books.ropensci.org/targets/dynamic.html#list-iteration), but that is more advanced, so we recommend using dataframes when possible).
 
-So if possible, write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.
+It is recommended to write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.
 
 :::::::::::::::::::::::::::::::::::::
 

diff --git a/episodes/files/plans/plan_10.R b/episodes/files/plans/plan_10.R
@@ -16,27 +16,26 @@ tar_plan(
     path_to_file("penguins_raw.csv"),
     read_csv(!!.x, show_col_types = FALSE)
   ),
-  # Clean data
-  penguins_data = clean_penguin_data(penguins_data_raw),
-  # Build models
-  models = list(
-    combined_model = lm(
-      bill_depth_mm ~ bill_length_mm, data = penguins_data),
-    species_model = lm(
-      bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
-    interaction_model = lm(
-      bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
+  # Clean and group data
+  tar_group_by(
+    penguins_data,
+    clean_penguin_data(penguins_data_raw),
+    species
   ),
-  # Get model summaries
+  # Get summary of combined model with all species together
+  combined_summary = model_glance_slow(penguins_data),
+  # Get summary of one model per species
   tar_target(
-    model_summaries,
-    glance_with_mod_name_slow(models),
-    pattern = map(models)
+    species_summary,
+    model_glance_slow(penguins_data),
+    pattern = map(penguins_data)
   ),
-  # Get model predictions
+  # Get predictions of combined model with all species together
+  combined_predictions = model_augment_slow(penguins_data),
+  # Get predictions of one model per species
   tar_target(
-    model_predictions,
-    augment_with_mod_name_slow(models),
-    pattern = map(models)
+    species_predictions,
+    model_augment_slow(penguins_data),
+    pattern = map(penguins_data)
   )
 )
diff --git a/episodes/files/plans/plan_11.R b/episodes/files/plans/plan_11.R
@@ -9,34 +9,32 @@ tar_plan(
     path_to_file("penguins_raw.csv"),
     read_csv(!!.x, show_col_types = FALSE)
   ),
-  # Clean data
-  penguins_data = clean_penguin_data(penguins_data_raw),
-  # Build models
-  models = list(
-    combined_model = lm(
-      bill_depth_mm ~ bill_length_mm, data = penguins_data),
-    species_model = lm(
-      bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
-    interaction_model = lm(
-      bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
+  # Clean and group data
+  tar_group_by(
+    penguins_data,
+    clean_penguin_data(penguins_data_raw),
+    species
   ),
-  # Get model summaries
+  # Get summary of combined model with all species together
+  combined_summary = model_glance(penguins_data),
+  # Get summary of one model per species
   tar_target(
-    model_summaries,
-    glance_with_mod_name(models),
-    pattern = map(models)
+    species_summary,
+    model_glance(penguins_data),
+    pattern = map(penguins_data)
   ),
-  # Get model predictions
+  # Get predictions of combined model with all species together
+  combined_predictions = model_augment(penguins_data),
+  # Get predictions of one model per species
   tar_target(
-    model_predictions,
-    augment_with_mod_name(models),
-    pattern = map(models)
+    species_predictions,
+    model_augment(penguins_data),
+    pattern = map(penguins_data)
   ),
   # Generate report
   tar_quarto(
     penguin_report,
     path = "penguin_report.qmd",
-    quiet = FALSE,
-    packages = c("targets", "tidyverse")
+    quiet = FALSE
   )
 )