Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change branching episode to loop over grouped dataframe #51

Merged
merged 10 commits into from
Dec 24, 2024
139 changes: 87 additions & 52 deletions episodes/branch.Rmd
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: 'Branching'
teaching: 10
teaching: 30
exercises: 2
---

Expand Down Expand Up @@ -30,6 +30,14 @@ Episode summary: Show how to use branching
library(targets)
library(tarchetypes)
library(broom)

# sandpaper renders this lesson from episodes/
# need to emulate this behavior during interactive development
# would be preferable to use here::here() but it doesn't work for some reason
if (interactive()) {
setwd("episodes")
}

source("files/lesson_functions.R")

# Increase width for printing tibbles
Expand Down Expand Up @@ -102,15 +110,14 @@ This seems to indicate that the model is highly significant.

But wait a moment... is this really an appropriate model? Recall that there are three species of penguins in the dataset. It is possible that the relationship between bill depth and length **varies by species**.

We should probably test some alternative models.
These could include models that add a parameter for species, or add an interaction effect between species and bill length.
Let's try making one model *per* species (three models total) to see how that does (this is technically not the correct statistical approach, but our focus here is to learn `targets`, not statistics).

Now our workflow is getting more complicated. This is what a workflow for such an analysis might look like **without branching** (make sure to add `library(broom)` to `packages.R`):

```{r}
#| label = "example-model-show-1",
#| eval = FALSE,
#| code = readLines("files/plans/plan_5.R")[2:31]
#| code = readLines("files/plans/plan_5.R")[2:36]
```

```{r}
Expand All @@ -133,19 +140,32 @@ Let's look at the summary of one of the models:
#| eval: true
#| echo: [2]
pushd(plan_5_dir)
tar_read(species_summary)
tar_read(adelie_summary)
popd()
```

So this way of writing the pipeline works, but is repetitive: we have to call `glance()` each time we want to obtain summary statistics for each model.
Furthermore, each summary target (`combined_summary`, etc.) is explicitly named and typed out manually.
Furthermore, each summary target (`adelie_summary`, etc.) is explicitly named and typed out manually.
It would be fairly easy to make a typo and end up with the wrong model being summarized.

Before moving on, let's define another **custom function** function: `model_glance()`.
You will need to write custom functions frequently when using `targets`, so it's good to get used to it!

As the name `model_glance()` suggests (it is good to write functions with names that indicate their purpose), this will build a model then immediately run `glance()` on it.
The reason for doing so is that we get a **dataframe as a result**, which is very helpful for branching, as we will see in the next section.
Save this in `R/functions.R`:

```{r}
#| label = "model-glance",
#| eval = FALSE,
#| code = readLines("files/tar_functions/model_glance_orig.R")
```

## Example with branching

### First attempt

Let's see how to write the same plan using **dynamic branching**:
Let's see how to write the same plan using **dynamic branching** (after running it, we will go through the new version in detail to understand each step):

```{r}
#| label = "example-model-show-3",
Expand All @@ -165,63 +185,65 @@ pushd(plan_6_dir)
# simulate already running the plan once
write_example_plan("plan_5.R")
tar_make(reporter = "silent")
write_example_plan("plan_6.R")
# run version of plan that uses `model_glance_orig()` (doesn't include species
# names in output)
write_example_plan("plan_6b.R")
tar_make()
example_branch_name <- tar_branch_names(model_summaries, 1)
example_branch_name <- tar_branch_names(species_summary, 1)
popd()
```

There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `model_summaries` target.
There is a series of smaller targets (branches) that are each named like `r example_branch_name`, then one overall `species_summary` target.
That is the result of specifying targets using branching: each of the smaller targets are the "branches" that comprise the overall target.
Since `targets` has no way of knowing ahead of time how many branches there will be or what they represent, it names each one using this series of numbers and letters (the "hash").
`targets` builds each branch one at a time, then combines them into the overall target.

Next, let's look in more detail about how the workflow is set up, starting with how we defined the models:
Next, let's look in more detail about how the workflow is set up, starting with how we set up the data:

```{r}
#| label = "model-def",
#| code = readLines("files/plans/plan_6.R")[14:22],
#| code = readLines("files/plans/plan_6.R")[14:19],
#| eval = FALSE
```

Unlike the non-branching version, we defined the models **in a list** (instead of one target per model).
This is because dynamic branching is similar to the `base::apply()` or [`purrrr::map()`](https://purrr.tidyverse.org/reference/map.html) method of looping: it applies a function to each element of a list.
So we need to prepare the input for looping as a list.
Unlike the non-branching version, we added a step that **groups the data**.
This is because dynamic branching is similar to the [`tidyverse` approach](https://dplyr.tidyverse.org/articles/grouping.html) of applying the same function to a grouped dataframe.
So we use the `tar_group_by()` function to specify the groups in our input data: one group per species.

Next, take a look at the command to build the target `model_summaries`.
Next, take a look at the command to build the target `species_summary`.

```{r}
#| label = "model-summaries",
#| code = readLines("files/plans/plan_6.R")[23:28],
#| code = readLines("files/plans/plan_6.R")[22:27],
#| eval = FALSE
```

As before, the first argument is the name of the target to build, and the second is the command to build it.
As before, the first argument to `tar_target()` is the name of the target to build, and the second is the command to build it.

Here, we apply the `glance()` function to each element of `models` (the `[[1]]` is necessary because when the function gets applied, each element is actually a nested list, and we need to remove one layer of nesting).
Here, we apply our custom `model_glance()` function to each group (in other words, each species) in `penguins_data_grouped`.

Finally, there is an argument we haven't seen before, `pattern`, which indicates that this target should be built using dynamic branching.
`map` means to apply the command to each element of the input list (`models`) sequentially.
`map` means to apply the function to each group of the input data (`penguins_data_grouped`) sequentially.

Now that we understand how the branching workflow is constructed, let's inspect the output:

```{r}
#| label: example-model-show-4
#| eval: FALSE
tar_read(model_summaries)
tar_read(species_summary)
```

```{r}
#| label: example-model-hide-4
#| echo: FALSE
pushd(plan_6_dir)
tar_read(model_summaries)
tar_read(species_summary)
popd()
```

The model summary statistics are all included in a single dataframe.

But there's one problem: **we can't tell which row came from which model!** It would be unwise to assume that they are in the same order as the list of models.
But there's one problem: **we can't tell which row came from which species!** It would be unwise to assume that they are in the same order as the input data.

This is due to the way dynamic branching works: by default, there is no information about the provenance of each target preserved in the output.

Expand All @@ -230,58 +252,43 @@ How can we fix this?
### Second attempt

The key to obtaining useful output from branching pipelines is to include the necessary information in the output of each individual branch.
Here, we want to know the kind of model that corresponds to each row of the model summaries.
To do that, we need to write a **custom function**.
You will need to write custom functions frequently when using `targets`, so it's good to get used to it!
Here, we want to know the species that corresponds to each row of the model summaries.

Here is the function. Save this in `R/functions.R`:
We can achieve this by modifying our `model_glance` function. Be sure to save it after modifying it to include a column for species:

```{r}
#| label: example-model-show-5
#| eval: FALSE
#| file: files/tar_functions/glance_with_mod_name.R
#| file: files/tar_functions/model_glance.R
```

Our new pipeline looks almost the same as before, but this time we use the custom function instead of `glance()`.
Our new pipeline looks exactly the same as before; we have made a modification, but to a **function**, not the pipeline.

```{r}
#| label = "example-model-show-6",
#| code = readLines("files/plans/plan_7.R")[2:29],
#| eval = FALSE
```
Since `targets` tracks the contents of each custom function, it realizes that it needs to recompute `species_summary` and runs this target again with the newly modified function.

```{r}
#| label: example-model-hide-6
#| echo: FALSE
pushd(plan_6_dir)
write_example_plan("plan_7.R")
write_example_plan("plan_6.R")
tar_make()
popd()
```

And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (you may need to scroll to the right to see it).
And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (the `.before = 1` in `mutate()` ensures that it shows up before the other columns).

```{r}
#| label: example-model-7
#| echo: [2]
#| warning: false
pushd(plan_6_dir)
tar_read(model_summaries)
tar_read(species_summary)
popd()
```

Next we will add one more target, a prediction of bill depth based on each model. These will be needed for plotting the models in the report.
Such a prediction can be obtained with the `augment()` function of the `broom` package.
Such a prediction can be obtained with the `augment()` function of the `broom` package, and we create a custom function that outputs predicted points as a dataframe much like we did for the model summaries.

```{r}
#| label: example-augment
#| echo: [2, 3]
#| eval: true
pushd(plan_6_dir)
tar_load(models)
augment(models[[1]])
popd()
```

::::::::::::::::::::::::::::::::::::: {.challenge}

Expand All @@ -291,33 +298,61 @@ Can you add the model predictions using `augment()`? You will need to define a c

:::::::::::::::::::::::::::::::::: {.solution}

Define the new function as `augment_with_mod_name()`. It is the same as `glance_with_mod_name()`, but use `augment()` instead of `glance()`:
Define the new function as `model_augment()`. It is the same as `model_glance()`, but use `augment()` instead of `glance()`:

```{r}
#| label: example-model-augment-func
#| eval: FALSE
#| file: files/tar_functions/augment_with_mod_name.R
#| file: files/tar_functions/model_augment.R
```

Add the step to the workflow:

```{r}
#| label = "example-model-augment-show",
#| code = readLines("files/plans/plan_8.R")[2:35],
#| code = readLines("files/plans/plan_7.R")[2:36],
#| eval = FALSE
```

::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::

### Further simplify the workflow

You may have noticed that we can further simplify the workflow: there is no need to have separate `penguins_data` and `penguins_data_grouped` dataframes.
In general it is best to keep the number of named objects as small as possible to make it easier to reason about your code.
Let's combine the cleaning and grouping step into a single command:

```{r}
#| label = "example-model-show-8",
#| eval = FALSE,
#| code = readLines("files/plans/plan_8.R")[2:35]
```

And run it once more:

```{r}
#| label: example-model-hide-8
#| echo: false
pushd(plan_6_dir)
# simulate already running the plan once
write_example_plan("plan_7.R")
tar_make(reporter = "silent")
# run version of plan that uses `model_glance_orig()` (doesn't include species
# names in output)
write_example_plan("plan_8.R")
tar_make()
popd()
```

::::::::::::::::::::::::::::::::::::: {.callout}

## Best practices for branching

Dynamic branching is designed to work well with **dataframes** (tibbles).
Dynamic branching is designed to work well with **dataframes** (it can also use [lists](https://books.ropensci.org/targets/dynamic.html#list-iteration), but that is more advanced, so we recommend using dataframes when possible).

So if possible, write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.
It is recommended to write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns.

:::::::::::::::::::::::::::::::::::::

Expand Down
35 changes: 17 additions & 18 deletions episodes/files/plans/plan_10.R
Original file line number Diff line number Diff line change
Expand Up @@ -16,27 +16,26 @@ tar_plan(
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
# Clean and group data
tar_group_by(
penguins_data,
clean_penguin_data(penguins_data_raw),
species
),
# Get model summaries
# Get summary of combined model with all species together
combined_summary = model_glance_slow(penguins_data),
# Get summary of one model per species
tar_target(
model_summaries,
glance_with_mod_name_slow(models),
pattern = map(models)
species_summary,
model_glance_slow(penguins_data),
pattern = map(penguins_data)
),
# Get model predictions
# Get predictions of combined model with all species together
combined_predictions = model_augment_slow(penguins_data),
# Get predictions of one model per species
tar_target(
model_predictions,
augment_with_mod_name_slow(models),
pattern = map(models)
species_predictions,
model_augment_slow(penguins_data),
pattern = map(penguins_data)
)
)
38 changes: 18 additions & 20 deletions episodes/files/plans/plan_11.R
Original file line number Diff line number Diff line change
Expand Up @@ -9,34 +9,32 @@ tar_plan(
path_to_file("penguins_raw.csv"),
read_csv(!!.x, show_col_types = FALSE)
),
# Clean data
penguins_data = clean_penguin_data(penguins_data_raw),
# Build models
models = list(
combined_model = lm(
bill_depth_mm ~ bill_length_mm, data = penguins_data),
species_model = lm(
bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
interaction_model = lm(
bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
# Clean and group data
tar_group_by(
penguins_data,
clean_penguin_data(penguins_data_raw),
species
),
# Get model summaries
# Get summary of combined model with all species together
combined_summary = model_glance(penguins_data),
# Get summary of one model per species
tar_target(
model_summaries,
glance_with_mod_name(models),
pattern = map(models)
species_summary,
model_glance(penguins_data),
pattern = map(penguins_data)
),
# Get model predictions
# Get predictions of combined model with all species together
combined_predictions = model_augment(penguins_data),
# Get predictions of one model per species
tar_target(
model_predictions,
augment_with_mod_name(models),
pattern = map(models)
species_predictions,
model_augment(penguins_data),
pattern = map(penguins_data)
),
# Generate report
tar_quarto(
penguin_report,
path = "penguin_report.qmd",
quiet = FALSE,
packages = c("targets", "tidyverse")
quiet = FALSE
)
)
Loading
Loading