diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 00000000..f19b8049 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,13 @@ +--- +title: "Contributor Code of Conduct" +--- + +As contributors and maintainers of this project, +we pledge to follow the [The Carpentries Code of Conduct][coc]. + +Instances of abusive, harassing, or otherwise unacceptable behavior +may be reported by following our [reporting guidelines][coc-reporting]. + + +[coc-reporting]: https://docs.carpentries.org/topic_folders/policies/incident-reporting.html +[coc]: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html diff --git a/LICENSE.md b/LICENSE.md new file mode 100644 index 00000000..7632871f --- /dev/null +++ b/LICENSE.md @@ -0,0 +1,79 @@ +--- +title: "Licenses" +--- + +## Instructional Material + +All Carpentries (Software Carpentry, Data Carpentry, and Library Carpentry) +instructional material is made available under the [Creative Commons +Attribution license][cc-by-human]. The following is a human-readable summary of +(and not a substitute for) the [full legal text of the CC BY 4.0 +license][cc-by-legal]. + +You are free: + +- to **Share**---copy and redistribute the material in any medium or format +- to **Adapt**---remix, transform, and build upon the material + +for any purpose, even commercially. + +The licensor cannot revoke these freedoms as long as you follow the license +terms. + +Under the following terms: + +- **Attribution**---You must give appropriate credit (mentioning that your work + is derived from work that is Copyright (c) The Carpentries and, where + practical, linking to ), provide a [link to the + license][cc-by-human], and indicate if changes were made. You may do so in + any reasonable manner, but not in any way that suggests the licensor endorses + you or your use. 
+ +- **No additional restrictions**---You may not apply legal terms or + technological measures that legally restrict others from doing anything the + license permits. With the understanding that: + +Notices: + +* You do not have to comply with the license for elements of the material in + the public domain or where your use is permitted by an applicable exception + or limitation. +* No warranties are given. The license may not give you all of the permissions + necessary for your intended use. For example, other rights such as publicity, + privacy, or moral rights may limit how you use the material. + +## Software + +Except where otherwise noted, the example programs and other software provided +by The Carpentries are made available under the [OSI][osi]-approved [MIT +license][mit-license]. + +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do +so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. 
+ +## Trademark + +"The Carpentries", "Software Carpentry", "Data Carpentry", and "Library +Carpentry" and their respective logos are registered trademarks of [Community +Initiatives][ci]. + +[cc-by-human]: https://creativecommons.org/licenses/by/4.0/ +[cc-by-legal]: https://creativecommons.org/licenses/by/4.0/legalcode +[mit-license]: https://opensource.org/licenses/mit-license.html +[ci]: https://communityin.org/ +[osi]: https://opensource.org diff --git a/basic-targets.md b/basic-targets.md new file mode 100644 index 00000000..1b0e1820 --- /dev/null +++ b/basic-targets.md @@ -0,0 +1,298 @@ +--- +title: 'First targets Workflow' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What are best practices for organizing analyses? +- What is a `_targets.R` file for? +- What is the content of the `_targets.R` file? +- How do you run a workflow? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Create a project in RStudio +- Explain the purpose of the `_targets.R` file +- Write a basic `_targets.R` file +- Use a `_targets.R` file to run a workflow + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: {.instructor} + +Episode summary: First chance to get hands dirty by writing a very simple workflow + +::::::::::::::::::::::::::::::::::::: + + + +## Create a project + +### About projects + +`targets` uses the "project" concept for organizing analyses: all of the files needed for a given project are put in a single folder, the project folder. +The project folder has additional subfolders for organization, such as folders for data, code, and results. + +By using projects, it makes it straightforward to re-orient yourself if you return to an analysis after time spent elsewhere. +This wouldn't be a problem if we only ever work on one thing at a time until completion, but that is almost never the case. 
+It is hard to remember what you were doing when you come back to a project after working on something else (a phenomenon called "context switching"). +By using a standardized organization system, you will reduce confusion and lost time... in other words, you are increasing reproducibility! + +This workshop will use RStudio, since it also works well with the project organization concept. + +### Create a project in RStudio + +Let's start a new project using RStudio. + +Click "File", then select "New Project". + +This will open the New Project Wizard, a set of menus to help you set up the project. + +![The New Project Wizard](fig/basic-rstudio-wizard.png){alt="Screenshot of RStudio New Project Wizard menu"} + +In the Wizard, click the first option, "New Directory", since we are making a brand-new project from scratch. +Click "New Project" in the next menu. +In "Directory name", enter a name that helps you remember the purpose of the project, such as "targets-demo" (follow best practices for naming files and folders). +Under "Create project as a subdirectory of...", click the "Browse" button to select a directory to put the project. +We recommend putting it on your Desktop so you can easily find it. + +You can leave "Create a git repository" and "Use renv with this project" unchecked, but these are both excellent tools to improve reproducibility, and you should consider learning them and using them in the future, if you don't already. +They can be enabled at any later time, so you don't need to worry about trying to use them immediately. + +Once you work through these steps, your RStudio session should look like this: + +![Your newly created project](fig/basic-rstudio-project.png){alt="Screenshot of RStudio with a newly created project called 'targets-demo' open containing a single file, 'targets-demo.Rproj'"} + +Our project now contains a single file, created by RStudio: `targets-demo.Rproj`. You should not edit this file by hand. 
Its purpose is to tell RStudio that this is a project folder and to store some RStudio settings (if you use version-control software, it is OK to commit this file). Also, you can open the project by double clicking on the `.Rproj` file in your file explorer (try it by quitting RStudio then navigating in your file browser to your Desktop, opening the "targets-demo" folder, and double clicking `targets-demo.Rproj`). + +OK, now that our project is set up, we are ready to start using `targets`! + +## Create a `_targets.R` file + +Every `targets` project must include a special file, called `_targets.R` in the main project folder (the "project root"). +The `_targets.R` file includes the specification of the workflow: directions for R to run your analysis, kind of like a recipe. +By using the `_targets.R` file, you won't have to remember to run specific scripts in a certain order. +Instead, R will do it for you (more reproducibility points)! + +### Anatomy of a `_targets.R` file + +We will now start to write a `_targets.R` file. Fortunately, `targets` comes with a function to help us do this. + +In the R console, first load the `targets` package with `library(targets)`, then run the command `tar_script()`. + + +``` r +library(targets) +tar_script() +``` + +Nothing will happen in the console, but in the file viewer, you should see a new file, `_targets.R` appear. Open it using the File menu or by clicking on it. + +We can see this default `_targets.R` file includes three main parts: + +- Loading packages with `library()` +- Defining a custom function with `function()` +- Defining a list with `list()`. + +The last part, the list, is the most important part of the `_targets.R` file. +It defines the steps in the workflow. +The `_targets.R` file must always end with this list. + +Furthermore, each item in the list is a call of the `tar_target()` function. 
+The first argument of `tar_target()` is the name of the target to build, and the second argument is the command used to build it. +Note that the name of the target is **unquoted**, that is, it is written without any surrounding quotation marks. + +## Set up `_targets.R` file to run example analysis + +### Background: non-`targets` version + +We will use this template to start building our analysis of bill shape in penguins. +First though, to get familiar with the functions and packages we'll use, let's run the code as you would in a "normal" R script without using `targets`. + +Recall that we are using the `palmerpenguins` R package to obtain the data. +This package actually includes two variations of the dataset: one is an external CSV file with the raw data, and another is the cleaned data loaded into R. +In real life you probably have externally stored raw data, so **let's use the raw penguin data** as the starting point for our analysis too. + +The `path_to_file()` function in `palmerpenguins` provides the path to the raw data CSV file (it is inside the `palmerpenguins` R package source code that you downloaded to your computer when you installed the package). + + +``` r +library(palmerpenguins) + +# Get path to CSV file +penguins_csv_file <- path_to_file("penguins_raw.csv") + +penguins_csv_file +``` + +``` output +[1] "/home/runner/.local/share/renv/cache/v5/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/palmerpenguins/0.1.1/6c6861efbc13c1d543749e9c7be4a592/palmerpenguins/extdata/penguins_raw.csv" +``` + +We will use the `tidyverse` set of packages for loading and manipulating the data. We don't have time to cover all the details about using `tidyverse` now, but if you want to learn more about it, please see the ["Manipulating, analyzing and exporting data with tidyverse" lesson](https://datacarpentry.org/R-ecology-lesson/03-dplyr.html). + +Let's load the data with `read_csv()`. 
+ + +``` r +library(tidyverse) + +# Read CSV file into R +penguins_data_raw <- read_csv(penguins_csv_file) + +penguins_data_raw +``` + + +``` output +Rows: 344 Columns: 17 +── Column specification ──────────────────────────────────────────────────────── +Delimiter: "," +chr (9): studyName, Species, Region, Island, Stage, Individual ID, Clutch C... +dbl (7): Sample Number, Culmen Length (mm), Culmen Depth (mm), Flipper Leng... +date (1): Date Egg + +ℹ Use `spec()` to retrieve the full column specification for this data. +ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. +``` + +``` output +# A tibble: 344 × 17 + studyName `Sample Number` Species Region Island Stage `Individual ID` + + 1 PAL0708 1 Adelie Penguin… Anvers Torge… Adul… N1A1 + 2 PAL0708 2 Adelie Penguin… Anvers Torge… Adul… N1A2 + 3 PAL0708 3 Adelie Penguin… Anvers Torge… Adul… N2A1 + 4 PAL0708 4 Adelie Penguin… Anvers Torge… Adul… N2A2 + 5 PAL0708 5 Adelie Penguin… Anvers Torge… Adul… N3A1 + 6 PAL0708 6 Adelie Penguin… Anvers Torge… Adul… N3A2 + 7 PAL0708 7 Adelie Penguin… Anvers Torge… Adul… N4A1 + 8 PAL0708 8 Adelie Penguin… Anvers Torge… Adul… N4A2 + 9 PAL0708 9 Adelie Penguin… Anvers Torge… Adul… N5A1 +10 PAL0708 10 Adelie Penguin… Anvers Torge… Adul… N5A2 +# ℹ 334 more rows +# ℹ 10 more variables: `Clutch Completion` , `Date Egg` , +# `Culmen Length (mm)` , `Culmen Depth (mm)` , +# `Flipper Length (mm)` , `Body Mass (g)` , Sex , +# `Delta 15 N (o/oo)` , `Delta 13 C (o/oo)` , Comments +``` + +We see the raw data has some awkward column names with spaces (these are hard to type out and can easily lead to mistakes in the code), and far more columns than we need. +For the purposes of this analysis, we only need species name, bill length, and bill depth. +In the raw data, the rather technical term "culmen" is used to refer to the bill. + +![Illustration of bill (culmen) length and depth. 
Artwork by @allison_horst.](https://allisonhorst.github.io/palmerpenguins/reference/figures/culmen_depth.png) + +Let's clean up the data to make it easier to use for downstream analyses. +We will also remove any rows with missing data, because this could cause errors for some functions later. + + +``` r +# Clean up raw data +penguins_data <- penguins_data_raw |> + # Rename columns for easier typing and + # subset to only the columns needed for analysis + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + # Delete rows with missing data + remove_missing(na.rm = TRUE) + +penguins_data +``` + +``` output +# A tibble: 342 × 3 + species bill_length_mm bill_depth_mm + + 1 Adelie Penguin (Pygoscelis adeliae) 39.1 18.7 + 2 Adelie Penguin (Pygoscelis adeliae) 39.5 17.4 + 3 Adelie Penguin (Pygoscelis adeliae) 40.3 18 + 4 Adelie Penguin (Pygoscelis adeliae) 36.7 19.3 + 5 Adelie Penguin (Pygoscelis adeliae) 39.3 20.6 + 6 Adelie Penguin (Pygoscelis adeliae) 38.9 17.8 + 7 Adelie Penguin (Pygoscelis adeliae) 39.2 19.6 + 8 Adelie Penguin (Pygoscelis adeliae) 34.1 18.1 + 9 Adelie Penguin (Pygoscelis adeliae) 42 20.2 +10 Adelie Penguin (Pygoscelis adeliae) 37.8 17.1 +# ℹ 332 more rows +``` + +That's better! + +### `targets` version + +What does this look like using `targets`? + +The biggest difference is that we need to **put each step of the workflow into the list at the end**. + +We also define a custom function for the data cleaning step. +That is because the list of targets at the end **should look like a high-level summary of your analysis**. +You want to avoid lengthy chunks of code when defining the targets; instead, put that code in the custom functions. +The other steps (setting the file path and loading the data) are each just one function call so there's not much point in putting those into their own custom functions. + +Finally, each step in the workflow is defined with the `tar_target()` function. 
+ + +``` r +library(targets) +library(tidyverse) +library(palmerpenguins) + +clean_penguin_data <- function(penguins_data_raw) { + penguins_data_raw |> + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + remove_missing(na.rm = TRUE) +} + +list( + tar_target(penguins_csv_file, path_to_file("penguins_raw.csv")), + tar_target(penguins_data_raw, read_csv( + penguins_csv_file, show_col_types = FALSE)), + tar_target(penguins_data, clean_penguin_data(penguins_data_raw)) +) +``` + +I have set `show_col_types = FALSE` in `read_csv()` because we know from the earlier code that the column types were set correctly by default (character for species and numeric for bill length and depth), so we don't need to see the warning it would otherwise issue. + +## Run the workflow + +Now that we have a workflow, we can run it with the `tar_make()` function. +Try running it, and you should see something like this: + + +``` r +tar_make() +``` + +``` output +▶ dispatched target penguins_csv_file +● completed target penguins_csv_file [0.001 seconds, 190 bytes] +▶ dispatched target penguins_data_raw +● completed target penguins_data_raw [0.188 seconds, 10.403 kilobytes] +▶ dispatched target penguins_data +● completed target penguins_data [0.007 seconds, 1.609 kilobytes] +▶ ended pipeline [0.341 seconds] +``` + +Congratulations, you've run your first workflow with `targets`! 
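As an optional check, you can inspect the workflow before and after running it. This is only a sketch: it assumes your working directory is the project root and that it contains the `_targets.R` file shown above. The `tar_manifest()` and `tar_outdated()` functions are part of the `targets` package but are not otherwise used in this episode.

``` r
library(targets)

# List each target and the command that builds it, without running anything
tar_manifest()

# Run the workflow defined in _targets.R
tar_make()

# List the targets that are out of date and would be rebuilt by tar_make()
tar_outdated()
```

Checking the manifest first is a quick way to catch a misnamed target before spending time on a full run.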
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- Projects help keep our analyses organized so we can easily re-run them later +- Use the RStudio Project Wizard to create projects +- The `_targets.R` file is a special file that must be included in all `targets` projects, and defines the workflow +- Use `tar_script()` to create a default `_targets.R` file +- Use `tar_make()` to run the workflow + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/branch.md b/branch.md new file mode 100644 index 00000000..e982fd81 --- /dev/null +++ b/branch.md @@ -0,0 +1,524 @@ +--- +title: 'Branching' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can we specify many targets without typing everything out? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Be able to specify targets using branching + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to use branching + +::::::::::::::::::::::::::::::::::::: + + + +## Why branching? + +One of the major strengths of `targets` is the ability to define many targets from a single line of code ("branching"). +This not only saves you typing, it also **reduces the risk of errors** since there is less chance of making a typo. + +## Types of branching + +There are two types of branching, **dynamic branching** and **static branching**. +"Branching" refers to the idea that you can provide a single specification for how to make targets (the "pattern"), and `targets` generates multiple targets from it ("branches"). +"Dynamic" means that the branches that result from the pattern do not have to be defined ahead of time---they are a dynamic result of the code. 
+ +In this workshop, we will only cover dynamic branching since it is generally easier to write (static branching requires use of [meta-programming](https://books.ropensci.org/targets/static.html#metaprogramming), an advanced topic). For more information about each and when you might want to use one or the other (or some combination of the two), [see the `targets` package manual](https://books.ropensci.org/targets/dynamic.html). + +## Example without branching + +To see how this works, let's continue our analysis of the `palmerpenguins` dataset. + +**Our hypothesis is that bill depth decreases with bill length.** +We will test this hypothesis with a linear model. + +For example, this is a model of bill depth dependent on bill length: + + +``` r +lm(bill_depth_mm ~ bill_length_mm, data = penguins_data) +``` + +We can add this to our pipeline. We will call it the `combined_model` because it combines all the species together without distinction: + + +``` r +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build model + combined_model = lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data + ) +) +``` + + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +▶ dispatched target combined_model +● completed target combined_model [0.024 seconds, 11.201 kilobytes] +▶ ended pipeline [0.273 seconds] +``` + +Let's have a look at the model. We will use the `glance()` function from the `broom` package. Unlike base R `summary()`, this function returns output as a tibble (the tidyverse equivalent of a dataframe), which as we will see later is quite useful for downstream analyses. 
+ + +``` r +library(broom) +tar_load(combined_model) +glance(combined_model) +``` + +``` output +# A tibble: 1 × 12 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 0.0552 0.0525 1.92 19.9 0.0000112 1 -708. 1422. 1433. 1256. 340 342 +``` + +Notice the small *P*-value. +This seems to indicate that the model is highly significant. + +But wait a moment... is this really an appropriate model? Recall that there are three species of penguins in the dataset. It is possible that the relationship between bill depth and length **varies by species**. + +We should probably test some alternative models. +These could include models that add a parameter for species, or add an interaction effect between species and bill length. + +Now our workflow is getting more complicated. This is what a workflow for such an analysis might look like **without branching** (make sure to add `library(broom)` to `packages.R`): + + +``` r +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + combined_model = lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data + ), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, + data = penguins_data + ), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, + data = penguins_data + ), + # Get model summaries + combined_summary = glance(combined_model), + species_summary = glance(species_model), + interaction_summary = glance(interaction_model) +) +``` + + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped target combined_model +▶ dispatched target interaction_model +● completed target interaction_model [0.003 seconds, 19.283 kilobytes] +▶ dispatched target 
species_model +● completed target species_model [0.001 seconds, 15.439 kilobytes] +▶ dispatched target combined_summary +● completed target combined_summary [0.006 seconds, 348 bytes] +▶ dispatched target interaction_summary +● completed target interaction_summary [0.003 seconds, 348 bytes] +▶ dispatched target species_summary +● completed target species_summary [0.003 seconds, 347 bytes] +▶ ended pipeline [0.28 seconds] +``` + +Let's look at the summary of one of the models: + + +``` r +tar_read(species_summary) +``` + +``` output +# A tibble: 1 × 12 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 +``` + +So this way of writing the pipeline works, but is repetitive: we have to call `glance()` each time we want to obtain summary statistics for each model. +Furthermore, each summary target (`combined_summary`, etc.) is explicitly named and typed out manually. +It would be fairly easy to make a typo and end up with the wrong model being summarized. + +## Example with branching + +### First attempt + +Let's see how to write the same plan using **dynamic branching**: + + +``` r +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance(models[[1]]), + pattern = map(models) + ) +) +``` + +What is going on here? + +First, let's look at the messages provided by `tar_make()`. 
+ + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +▶ dispatched target models +● completed target models [0.005 seconds, 43.009 kilobytes] +▶ dispatched branch model_summaries_812e3af782bee03f +● completed branch model_summaries_812e3af782bee03f [0.006 seconds, 348 bytes] +▶ dispatched branch model_summaries_2b8108839427c135 +● completed branch model_summaries_2b8108839427c135 [0.003 seconds, 347 bytes] +▶ dispatched branch model_summaries_533cd9a636c3e05b +● completed branch model_summaries_533cd9a636c3e05b [0.003 seconds, 348 bytes] +● completed pattern model_summaries +▶ ended pipeline [0.302 seconds] +``` + +There is a series of smaller targets (branches) that are each named like model_summaries_812e3af782bee03f, then one overall `model_summaries` target. +That is the result of specifying targets using branching: the smaller targets are the "branches" that comprise the overall target. +Since `targets` has no way of knowing ahead of time how many branches there will be or what they represent, it names each one using this series of numbers and letters (the "hash"). +`targets` builds each branch one at a time, then combines them into the overall target. + +Next, let's look in more detail at how the workflow is set up, starting with how we defined the models: + + +``` r + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), +``` + +Unlike the non-branching version, we defined the models **in a list** (instead of one target per model). +This is because dynamic branching is similar to the `base::lapply()` or [`purrr::map()`](https://purrr.tidyverse.org/reference/map.html) method of looping: it applies a function to each element of a list. 
+So we need to prepare the input for looping as a list. + +Next, take a look at the command to build the target `model_summaries`. + + +``` r + # Get model summaries + tar_target( + model_summaries, + glance(models[[1]]), + pattern = map(models) + ) +``` + +As before, the first argument is the name of the target to build, and the second is the command to build it. + +Here, we apply the `glance()` function to each element of `models` (the `[[1]]` is necessary because when the function gets applied, each element is actually a nested list, and we need to remove one layer of nesting). + +Finally, there is an argument we haven't seen before, `pattern`, which indicates that this target should be built using dynamic branching. +`map` means to apply the command to each element of the input list (`models`) sequentially. + +Now that we understand how the branching workflow is constructed, let's inspect the output: + + +``` r +tar_read(model_summaries) +``` + + +``` output +# A tibble: 3 × 12 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs + +1 0.0552 0.0525 1.92 19.9 1.12e- 5 1 -708. 1422. 1433. 1256. 340 342 +2 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 +3 0.770 0.766 0.955 225. 8.52e-105 5 -466. 947. 974. 306. 336 342 +``` + +The model summary statistics are all included in a single dataframe. + +But there's one problem: **we can't tell which row came from which model!** It would be unwise to assume that they are in the same order as the list of models. + +This is due to the way dynamic branching works: by default, there is no information about the provenance of each target preserved in the output. + +How can we fix this? + +### Second attempt + +The key to obtaining useful output from branching pipelines is to include the necessary information in the output of each individual branch. +Here, we want to know the kind of model that corresponds to each row of the model summaries. 
+To do that, we need to write a **custom function**. +You will need to write custom functions frequently when using `targets`, so it's good to get used to it! + +Here is the function. Save this in `R/functions.R`: + + +``` r +glance_with_mod_name <- function(model_in_list) { + model_name <- names(model_in_list) + model <- model_in_list[[1]] + glance(model) |> + mutate(model_name = model_name) +} +``` + +Our new pipeline looks almost the same as before, but this time we use the custom function instead of `glance()`. + + +``` r +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ) +) +``` + + +``` output +✔ skipped target penguins_data_raw_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped target models +▶ dispatched branch model_summaries_812e3af782bee03f +● completed branch model_summaries_812e3af782bee03f [0.012 seconds, 374 bytes] +▶ dispatched branch model_summaries_2b8108839427c135 +● completed branch model_summaries_2b8108839427c135 [0.007 seconds, 371 bytes] +▶ dispatched branch model_summaries_533cd9a636c3e05b +● completed branch model_summaries_533cd9a636c3e05b [0.004 seconds, 377 bytes] +● completed pattern model_summaries +▶ ended pipeline [0.281 seconds] +``` + +And this time, when we load the `model_summaries`, we can tell which model corresponds to which row (you may need to scroll to the right to 
see it). + + +``` r +tar_read(model_summaries) +``` + +``` output +# A tibble: 3 × 13 + r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs model_name + +1 0.0552 0.0525 1.92 19.9 1.12e- 5 1 -708. 1422. 1433. 1256. 340 342 combined_model +2 0.769 0.767 0.953 375. 3.65e-107 3 -467. 944. 963. 307. 338 342 species_model +3 0.770 0.766 0.955 225. 8.52e-105 5 -466. 947. 974. 306. 336 342 interaction_model +``` + +Next we will add one more target, a prediction of bill depth based on each model. These will be needed for plotting the models in the report. +Such a prediction can be obtained with the `augment()` function of the `broom` package. + + +``` r +tar_load(models) +augment(models[[1]]) +``` + +``` output +# A tibble: 342 × 8 + bill_depth_mm bill_length_mm .fitted .resid .hat .sigma .cooksd .std.resid + + 1 18.7 39.1 17.6 1.14 0.00521 1.92 0.000924 0.594 + 2 17.4 39.5 17.5 -0.127 0.00485 1.93 0.0000107 -0.0663 + 3 18 40.3 17.5 0.541 0.00421 1.92 0.000168 0.282 + 4 19.3 36.7 17.8 1.53 0.00806 1.92 0.00261 0.802 + 5 20.6 39.3 17.5 3.06 0.00503 1.92 0.00641 1.59 + 6 17.8 38.9 17.6 0.222 0.00541 1.93 0.0000364 0.116 + 7 19.6 39.2 17.6 2.05 0.00512 1.92 0.00293 1.07 + 8 18.1 34.1 18.0 0.114 0.0124 1.93 0.0000223 0.0595 + 9 20.2 42 17.3 2.89 0.00329 1.92 0.00373 1.50 +10 17.1 37.8 17.7 -0.572 0.00661 1.92 0.000296 -0.298 +# ℹ 332 more rows +``` + +::::::::::::::::::::::::::::::::::::: {.challenge} + +## Challenge: Add model predictions to the workflow + +Can you add the model predictions using `augment()`? You will need to define a custom function just like we did for `glance()`. + +:::::::::::::::::::::::::::::::::: {.solution} + +Define the new function as `augment_with_mod_name()`. 
It is the same as `glance_with_mod_name()`, but uses `augment()` instead of `glance()`: + + +``` r +augment_with_mod_name <- function(model_in_list) { + model_name <- names(model_in_list) + model <- model_in_list[[1]] + augment(model) |> + mutate(model_name = model_name) +} +``` + +Add the step to the workflow: + + +``` r +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ) +) +``` + +:::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: {.callout} + +## Best practices for branching + +Dynamic branching is designed to work well with **dataframes** (tibbles). + +So if possible, write your custom functions to accept dataframes as input and return them as output, and always include any necessary metadata as a column or columns. + +::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: {.challenge} + +## Challenge: What other kinds of patterns are there? + +So far, we have only used a single function in conjunction with the `pattern` argument, `map()`, which applies the function to each element of its input in sequence. + +Can you think of any other ways you might want to apply a branching pattern? 
+ +:::::::::::::::::::::::::::::::::: {.solution} + +Some other ways of applying branching patterns include: + +- crossing: one branch per combination of elements (`cross()` function) +- slicing: one branch for each of a manually selected set of elements (`slice()` function) +- sampling: one branch for each of a randomly selected set of elements (`sample()` function) + +You can [find out more about different branching patterns in the `targets` manual](https://books.ropensci.org/targets/dynamic.html#patterns). + +:::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: keypoints + +- Dynamic branching creates multiple targets with a single command +- You usually need to write custom functions so that the output of the branches includes necessary metadata + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/cache.md b/cache.md new file mode 100644 index 00000000..437d429e --- /dev/null +++ b/cache.md @@ -0,0 +1,145 @@ +--- +title: 'Loading Workflow Objects' +teaching: 10 +exercises: 2 +--- + + + +:::::::::::::::::::::::::::::::::::::: questions + +- Where does the workflow happen? +- How can we inspect the objects built by the workflow? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain where `targets` runs the workflow and why +- Be able to load objects built by the workflow into your R session + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to get at the objects that we built + +::::::::::::::::::::::::::::::::::::: + +## Where does the workflow happen? + +So we just finished running our first workflow. +Now you probably want to look at its output. +But, if we just call the name of the object (for example, `penguins_data`), we get an error. 
+ +``` r +penguins_data +``` + +``` error +Error: object 'penguins_data' not found +``` + +Where are the results of our workflow? + +::::::::::::::::::::::::::::::::::::: instructor + +- To reinforce the concept of `targets` running in a separate R session, you may want to pretend to run `penguins_data`, feign surprise when it doesn't work, and use it as a teaching moment (errors are pedagogy!). + +:::::::::::::::::::::::::::::::::::::::::::::::: + +We don't see the workflow results because `targets` **runs the workflow in a separate R session** that we can't interact with. +This is for reproducibility---the objects built by the workflow should only depend on the code in your project, not on any commands you may have interactively given to R. + +Fortunately, `targets` has two functions that can be used to load objects built by the workflow into our current session, `tar_load()` and `tar_read()`. +Let's see how these work. + +## tar_load() + +`tar_load()` loads an object built by the workflow into the current session. +Its first argument is the name of the object you want to load. +Let's use this to load `penguins_data` and get an overview of the data with `summary()`. + + + + +``` r +tar_load(penguins_data) +summary(penguins_data) +``` + +``` output + species bill_length_mm bill_depth_mm + Length:342 Min. :32.10 Min. :13.10 + Class :character 1st Qu.:39.23 1st Qu.:15.60 + Mode :character Median :44.45 Median :17.30 + Mean :43.92 Mean :17.15 + 3rd Qu.:48.50 3rd Qu.:18.70 + Max. :59.60 Max. :21.50 +``` + +Note that `tar_load()` is used for its **side-effect**---loading the desired object into the current R session. +It doesn't actually return a value. + +## tar_read() + +`tar_read()` is similar to `tar_load()` in that it is used to retrieve objects built by the workflow, but unlike `tar_load()`, it returns them directly as output. + +Let's try it with `penguins_csv_file`. 
+ + +``` r +tar_read(penguins_csv_file) +``` + +``` output +[1] "/home/runner/.local/share/renv/cache/v5/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/palmerpenguins/0.1.1/6c6861efbc13c1d543749e9c7be4a592/palmerpenguins/extdata/penguins_raw.csv" +``` + +We immediately see the contents of `penguins_csv_file`. +But it has not been loaded into the environment. +If you try to run `penguins_csv_file` now, you will get an error: + + +``` r +penguins_csv_file +``` + +``` error +Error: object 'penguins_csv_file' not found +``` + +## When to use which function + +`tar_load()` tends to be more useful when you want to load objects and do things with them. +`tar_read()` is more useful when you just want to immediately inspect an object. + +## The targets cache + +If you close your R session, then re-start it and use `tar_load()` or `tar_read()`, you will notice that it can still load the workflow objects. +In other words, the workflow output is **saved across R sessions**. +How is this possible? + +You may have noticed a new folder has appeared in your project, called `_targets`. +This is the **targets cache**. +It contains all of the workflow output; that is how we can load the targets built by the workflow even after quitting then restarting R. + +**You should not edit the contents of the cache by hand** (with one exception). +Doing so would make your analysis non-reproducible. + +The one exception to this rule is a special subfolder called `_targets/user`. +This folder does not exist by default. +You can create it if you want, and put whatever you want inside. + +Generally, `_targets/user` is a good place to store files that are not code, like data and output. + +Note that if you don't have anything in `_targets/user` that you need to keep around, it is possible to "reset" your workflow by simply deleting the entire `_targets` folder. Of course, this means you will need to run everything over again, so don't do this lightly! 
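The rules about `_targets/user` and resetting the cache can be tried out safely with plain folder operations. The following is a minimal sketch in base R, not part of the lesson's pipeline: it assumes a throwaway directory (`project_dir` is a hypothetical name) and creates an empty `_targets` folder by hand as a stand-in for a cache that `tar_make()` would normally build.

```r
# Work in a throwaway directory so no real project is touched
# (project_dir is a hypothetical name used only for this sketch)
project_dir <- tempfile("targets_demo_")
dir.create(project_dir)

# Stand-in for the cache that tar_make() would normally create
cache_dir <- file.path(project_dir, "_targets")
dir.create(cache_dir)

# _targets/user does not exist by default, so create it by hand
user_dir <- file.path(cache_dir, "user")
dir.create(user_dir)
writeLines("Hello World", file.path(user_dir, "hello.txt"))

# "Reset" the workflow by deleting the entire cache folder.
# Note that this also removes everything stored in _targets/user.
unlink(cache_dir, recursive = TRUE)
dir.exists(cache_dir)
```

In a real project the same reset amounts to deleting `_targets` from the project root. `targets` also provides `tar_destroy()` for removing the cache from within R; check its help page for the available options before relying on it.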
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- `targets` workflows are run in a separate, non-interactive R session +- `tar_load()` loads a workflow object into the current R session +- `tar_read()` reads a workflow object and returns its value +- The `_targets` folder is the cache and generally should not be edited by hand + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/config.yaml b/config.yaml new file mode 100644 index 00000000..63382b36 --- /dev/null +++ b/config.yaml @@ -0,0 +1,87 @@ +#------------------------------------------------------------ +# Values for this lesson. +#------------------------------------------------------------ + +# Which carpentry is this (swc, dc, lc, or cp)? +# swc: Software Carpentry +# dc: Data Carpentry +# lc: Library Carpentry +# cp: Carpentries (to use for instructor training for instance) +# incubator: The Carpentries Incubator +carpentry: 'incubator' + +# Overall title for pages. +title: 'Introduction to targets' + +# Date the lesson was created (YYYY-MM-DD, this is empty by default) +created: ~ + +# Comma-separated list of keywords for the lesson +keywords: 'reproducibility, data, targets, R' + +# Life cycle stage of the lesson +# possible values: pre-alpha, alpha, beta, stable +life_cycle: 'pre-alpha' + +# License of the lesson +license: 'CC-BY 4.0' + +# Link to the source repository for this lesson +source: 'https://github.com/carpentries-incubator/targets-workshop' + +# Default branch of your lesson +branch: 'main' + +# Who to contact if there are any issues +contact: 'joelnitta@gmail.com' + +# Navigation ------------------------------------------------ +# +# Use the following menu items to specify the order of +# individual pages in each dropdown section. Leave blank to +# include all pages in the folder. 
+# +# Example ------------- +# +# episodes: +# - introduction.md +# - first-steps.md +# +# learners: +# - setup.md +# +# instructors: +# - instructor-notes.md +# +# profiles: +# - one-learner.md +# - another-learner.md + +# Order of episodes in your lesson +episodes: +- introduction.Rmd +- basic-targets.Rmd +- cache.Rmd +- lifecycle.Rmd +- organization.Rmd +- packages.Rmd +- files.Rmd +- branch.Rmd +- parallel.Rmd +- quarto.Rmd + +# Information for Learners +learners: + +# Information for Instructors +instructors: + +# Learner Profiles +profiles: + +# Customisation --------------------------------------------- +# +# This space below is where custom yaml items (e.g. pinning +# sandpaper and varnish versions) should live + + diff --git a/fig/basic-rstudio-project.png b/fig/basic-rstudio-project.png new file mode 100644 index 00000000..335768fc Binary files /dev/null and b/fig/basic-rstudio-project.png differ diff --git a/fig/basic-rstudio-wizard.png b/fig/basic-rstudio-wizard.png new file mode 100644 index 00000000..f3c6d5ce Binary files /dev/null and b/fig/basic-rstudio-wizard.png differ diff --git a/fig/lifecycle-visnetwork.png b/fig/lifecycle-visnetwork.png new file mode 100644 index 00000000..5187a62f Binary files /dev/null and b/fig/lifecycle-visnetwork.png differ diff --git a/files.md b/files.md new file mode 100644 index 00000000..d4544f9c --- /dev/null +++ b/files.md @@ -0,0 +1,301 @@ +--- +title: 'Working with External Files' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can we load external data? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Be able to load external data into a workflow +- Configure the workflow to rerun if the contents of the external data change + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to read and write external files + +::::::::::::::::::::::::::::::::::::: + + + +## Treating external files as a dependency + +Almost all workflows will start by importing data, which is typically stored as an external file. + +As a simple example, let's create an external data file in RStudio with the "New File" menu option. Enter a single line of text, "Hello World", and save it as a text file named "hello.txt" in `_targets/user/data/`. + +We will read in the contents of this file and store it as `some_data` in the workflow by writing the following plan and running `tar_make()`: + +::::::::::::::::::::::::::::::::::::: {.callout} + +## Save your progress + +You can only have one active `_targets.R` file at a time in a given project. + +We are about to create a new `_targets.R` file, but you probably don't want to lose your progress in the one we have been working on so far (the penguins bill analysis). You can temporarily rename that one to something like `_targets_old.R` so that you don't overwrite it with the new example `_targets.R` file below. Then, rename it back to `_targets.R` when you are ready to work on it again. + +::::::::::::::::::::::::::::::::::::: + + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + some_data = readLines("_targets/user/data/hello.txt") +) +``` + + +``` output +▶ dispatched target some_data +● completed target some_data [0 seconds, 64 bytes] +▶ ended pipeline [0.089 seconds] +``` + +If we inspect the contents of `some_data` with `tar_read(some_data)`, it will contain the string `"Hello World"` as expected. + +Now say we edit "hello.txt", perhaps add some text: "Hello World. 
How are you?". Edit this in the RStudio text editor and save it. Now run the pipeline again. + + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + some_data = readLines("_targets/user/data/hello.txt") +) +``` + + +``` output +✔ skipped target some_data +✔ skipped pipeline [0.087 seconds] +``` + +The target `some_data` was skipped, even though the contents of the file changed. + +That is because right now, targets is only tracking the **name** of the file, not its contents. We need to use a special function for that, `tar_file()` from the `tarchetypes` package. `tar_file()` will calculate the "hash" of a file---a unique digital signature that is determined by the file's contents. If the contents change, the hash will change, and this will be detected by `targets`. + + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + tar_file(data_file, "_targets/user/data/hello.txt"), + some_data = readLines(data_file) +) +``` + + +``` output +▶ dispatched target data_file +● completed target data_file [0.001 seconds, 26 bytes] +▶ dispatched target some_data +● completed target some_data [0 seconds, 78 bytes] +▶ ended pipeline [0.109 seconds] +``` + +This time we see that `targets` does successfully re-build `some_data` as expected. + +## A shortcut (or, About target factories) + +However, also notice that this means we need to write two targets instead of one: one target to track the contents of the file (`data_file`), and one target to store what we load from the file (`some_data`). + +It turns out that this is a common pattern in `targets` workflows, so `tarchetypes` provides a shortcut to express this more concisely, `tar_file_read()`. 
+ + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + tar_file_read( + hello, + "_targets/user/data/hello.txt", + readLines(!!.x) + ) +) +``` + +Let's inspect this pipeline with `tar_manifest()`: + + +``` r +tar_manifest() +``` + + +``` output +# A tibble: 2 × 2 + name command + +1 hello_file "\"_targets/user/data/hello.txt\"" +2 hello "readLines(hello_file)" +``` + +Notice that even though we only specified one target in the pipeline (`hello`, with `tar_file_read()`), the pipeline actually includes **two** targets, `hello_file` and `hello`. + +That is because `tar_file_read()` is a special function called a **target factory**, so called because it makes **multiple** targets at once. One of the main purposes of the `tarchetypes` package is to provide target factories to make writing pipelines easier and less error-prone. + +## Non-standard evaluation + +What is the deal with the `!!.x`? That may look unfamiliar even if you are used to using R. It is known as "non-standard evaluation," and gets used in some special contexts. We don't have time to go into the details now, but just remember that you will need to use this special notation with `tar_file_read()`. If you forget how to write it (this happens frequently!), look at the examples in the help file by running `?tar_file_read`. + +## Other data loading functions + +Although we used `readLines()` as an example here, you can use the same pattern for other functions that load data from external files, such as `readr::read_csv()`, `readxl::read_excel()`, and others (for example, `read_csv(!!.x)`, `read_excel(!!.x)`, etc.). + +This is generally recommended so that your pipeline stays up to date with your input data. + +::::::::::::::::::::::::::::::::::::: {.challenge} + +## Challenge: Use `tar_file_read()` with the penguins example + +We didn't know about `tar_file_read()` yet when we started on the penguins bill analysis. 
+ +How can you use `tar_file_read()` to load the CSV file while tracking its contents? + +:::::::::::::::::::::::::::::::::: {.solution} + + +``` r +source("R/packages.R") +source("R/functions.R") + +tar_plan( + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + penguins_data = clean_penguin_data(penguins_data_raw) +) +``` + + +``` output +▶ dispatched target penguins_data_raw_file +● completed target penguins_data_raw_file [0.001 seconds, 53.098 kilobytes] +▶ dispatched target penguins_data_raw +● completed target penguins_data_raw [0.099 seconds, 10.403 kilobytes] +▶ dispatched target penguins_data +● completed target penguins_data [0.015 seconds, 1.495 kilobytes] +▶ ended pipeline [0.369 seconds] +``` + +:::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: + +## Writing out data + +Writing to files is similar to loading in files: we will use the `tar_file()` function. There is one important caveat: in this case, the second argument of `tar_file()` (the command to build the target) **must return the path to the file**. Not all functions that write files do this (some return nothing; these treat the output file as a side-effect of running the function), so you may need to define a custom function that writes out the file and then returns its path. + +Let's do this for `writeLines()`, the R function that writes character data to a file. Normally, its return value is `NULL` (nothing), as we can see here: + + +``` r +x <- writeLines("some text", "test.txt") +x +``` + + +``` output +NULL +``` + +Here is our modified function that writes character data to a file and returns the name of the file (the `...` means "pass the rest of these arguments to `writeLines()`"): + + +``` r +write_lines_file <- function(text, file, ...) { + writeLines(text = text, con = file, ...) 
+ file +} +``` + +Let's try it out: + + +``` r +x <- write_lines_file("some text", "test.txt") +x +``` + + +``` output +[1] "test.txt" +``` + +We can now use this in a pipeline. For example, let's change the text to upper case, then write it out again: + + +``` r +library(targets) +library(tarchetypes) + +source("R/functions.R") + +tar_plan( + tar_file_read( + hello, + "_targets/user/data/hello.txt", + readLines(!!.x) + ), + hello_caps = toupper(hello), + tar_file( + hello_caps_out, + write_lines_file(hello_caps, "_targets/user/results/hello_caps.txt") + ) +) +``` + + +``` output +▶ dispatched target hello_file +● completed target hello_file [0 seconds, 26 bytes] +▶ dispatched target hello +● completed target hello [0 seconds, 78 bytes] +▶ dispatched target hello_caps +● completed target hello_caps [0.001 seconds, 78 bytes] +▶ dispatched target hello_caps_out +● completed target hello_caps_out [0 seconds, 26 bytes] +▶ ended pipeline [0.111 seconds] +``` + +Take a look at `hello_caps.txt` in the `results` folder and verify it is as you expect. + +::::::::::::::::::::::::::::::::::::: {.challenge} + +## Challenge: What happens to file output if it's modified? + +Delete or change the contents of `hello_caps.txt` in the `results` folder. +What do you think will happen when you run `tar_make()` again? +Try it and see. + +:::::::::::::::::::::::::::::::::: {.solution} + +`targets` detects that `hello_caps_out` has changed (is "invalidated"), and re-runs the code to make it, thus writing out `hello_caps.txt` to `results` again. + +So this way of writing out results makes your pipeline more robust: we have a guarantee that the contents of the file in `results` are generated solely by the code in your plan. 
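The idea behind this detection can be illustrated outside of `targets` with a content hash. The sketch below is only an illustration under stated assumptions: it uses `tools::md5sum()` from base R as a stand-in, which is not necessarily the hashing method `targets` uses internally, and it writes to a temporary file rather than to the `results` folder.

```r
# Write a small file and record a hash of its contents
# (md5 is an assumed stand-in for whatever hash targets uses internally)
path <- tempfile(fileext = ".txt")
writeLines("HELLO WORLD", path)
hash_before <- unname(tools::md5sum(path))

# Rewriting identical contents leaves the hash unchanged
writeLines("HELLO WORLD", path)
hash_same <- unname(tools::md5sum(path))

# Editing the file by hand changes the hash
writeLines("HELLO WORLD. HOW ARE YOU?", path)
hash_after <- unname(tools::md5sum(path))

# Compare the recorded hashes
identical(hash_before, hash_same)
identical(hash_before, hash_after)
```

Comparing a stored hash with a freshly computed one is what lets a tool notice that a tracked file no longer matches what the pipeline last wrote.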
+ +:::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: keypoints + +- `tarchetypes::tar_file()` tracks the contents of a file +- Use `tarchetypes::tar_file_read()` in combination with data loading functions like `read_csv()` to keep the pipeline in sync with your input data +- Use `tarchetypes::tar_file()` in combination with a function that writes to a file and returns its path to write out data + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/files/functions.R b/files/functions.R new file mode 100644 index 00000000..c063edc2 --- /dev/null +++ b/files/functions.R @@ -0,0 +1,357 @@ +#' Write an example targets plan +#' +#' To save on repetition and errors when repeatedly running examples +#' +#' @param plan_select Plan template to choose from; an integer from 1 to 9. +#' +#' @return Writes out the plan to _targets.R in the working directory. +#' WILL OVERWRITE ANY EXISTING FILE WITH THE SAME NAME +#' @examples +#' library(targets) +#' tar_dir({ +#' write_example_plan(1) +#' tar_make() +#' }) +#' +write_example_plan <- function(plan_select) { + # functions + glance_with_mod_name_func <- c( + "glance_with_mod_name <- function(model_in_list) {", + "model_name <- names(model_in_list)", + "model <- model_in_list[[1]]", + "broom::glance(model) |>", + " mutate(model_name = model_name)", + "}" + ) + augment_with_mod_name_func <- c( + "augment_with_mod_name <- function(model_in_list) {", + "model_name <- names(model_in_list)", + "model <- model_in_list[[1]]", + "broom::augment(model) |>", + " mutate(model_name = model_name)", + "}" + ) + glance_slow_func <- c( + "glance_with_mod_name_slow <- function(model_in_list) {", + "Sys.sleep(4)", + "model_name <- names(model_in_list)", + "model <- model_in_list[[1]]", + "broom::glance(model) |>", + " mutate(model_name = model_name)", + "}" + ) + augment_slow_func <- c( + "augment_with_mod_name_slow <- function(model_in_list) {", + "Sys.sleep(4)", + "model_name <- 
names(model_in_list)", + "model <- model_in_list[[1]]", + "broom::augment(model) |>", + " mutate(model_name = model_name)", + "}" + ) + clean_penguin_data_func <- c( + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE) |>", + " separate(species, into = 'species', extra = 'drop')", + "}" + ) + # original plan + plan_1 <- c( + "library(targets)", + "library(palmerpenguins)", + "suppressPackageStartupMessages(library(tidyverse))", + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE)", + "}", + "list(", + " tar_target(penguins_csv_file, path_to_file('penguins_raw.csv')),", + " tar_target(penguins_data_raw, read_csv(", + " penguins_csv_file, show_col_types = FALSE)),", + " tar_target(penguins_data, clean_penguin_data(penguins_data_raw))", + ")" + ) + # separate species names + plan_2 <- c( + "library(targets)", + "library(palmerpenguins)", + "suppressPackageStartupMessages(library(tidyverse))", + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE) |>", + " separate(species, into = 'species', extra = 'drop')", + "}", + "list(", + " tar_target(penguins_csv_file, path_to_file('penguins_raw.csv')),", + " tar_target(penguins_data_raw, read_csv(", + " penguins_csv_file, show_col_types = FALSE)),", + " tar_target(penguins_data, clean_penguin_data(penguins_data_raw))", + ")" + ) + # tar_file_read + plan_3 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", 
+ "suppressPackageStartupMessages(library(tidyverse))", + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE) |>", + " separate(species, into = 'species', extra = 'drop')", + "}", + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw)", + ")" + ) + # add one model + plan_4 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "suppressPackageStartupMessages(library(tidyverse))", + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE) |>", + " separate(species, into = 'species', extra = 'drop')", + "}", + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw),", + " combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data)", + ")" + ) + # add multiple models + plan_5 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "library(broom)", + "suppressPackageStartupMessages(library(tidyverse))", + "clean_penguin_data <- function(penguins_data_raw) {", + " penguins_data_raw |>", + " select(", + " species = Species,", + " bill_length_mm = `Culmen Length (mm)`,", + " bill_depth_mm = `Culmen Depth (mm)`", + " ) |>", + " remove_missing(na.rm = TRUE) |>", + " separate(species, into = 'species', extra = 'drop')", + "}", + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " 
path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw),", + " combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data),", + " species_model = lm(", + " bill_depth_mm ~ bill_length_mm + species, data = penguins_data),", + " interaction_model = lm(", + " bill_depth_mm ~ bill_length_mm * species, data = penguins_data),", + " combined_summary = glance(combined_model),", + " species_summary = glance(species_model),", + " interaction_summary = glance(interaction_model)", + ")" + ) + # add multiple models with branching + plan_6 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "library(broom)", + "suppressPackageStartupMessages(library(tidyverse))", + clean_penguin_data_func, + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw),", + " models = list(", + " combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data),", + " species_model = lm(", + " bill_depth_mm ~ bill_length_mm + species, data = penguins_data),", + " interaction_model = lm(", + " bill_depth_mm ~ bill_length_mm * species, data = penguins_data)", + " ),", + " tar_target(", + " model_summaries,", + " glance(models[[1]]),", + " pattern = map(models)", + " )", + ")" + ) + # add multiple models with branching, custom glance func + plan_7 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "library(broom)", + "suppressPackageStartupMessages(library(tidyverse))", + glance_with_mod_name_func, + clean_penguin_data_func, + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw),", + " models = list(", + " 
combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data),", + " species_model = lm(", + " bill_depth_mm ~ bill_length_mm + species, data = penguins_data),", + " interaction_model = lm(", + " bill_depth_mm ~ bill_length_mm * species, data = penguins_data)", + " ),", + " tar_target(", + " model_summaries,", + " glance_with_mod_name(models),", + " pattern = map(models)", + " )", + ")" + ) + # adds future and predictions + plan_8 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "library(broom)", + "library(crew)", + "suppressPackageStartupMessages(library(tidyverse))", + glance_slow_func, + augment_slow_func, + clean_penguin_data_func, + "tar_option_set(controller = crew_controller_local(workers = 2))", + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = clean_penguin_data(penguins_data_raw),", + " models = list(", + " combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data),", + " species_model = lm(", + " bill_depth_mm ~ bill_length_mm + species, data = penguins_data),", + " interaction_model = lm(", + " bill_depth_mm ~ bill_length_mm * species, data = penguins_data)", + " ),", + " tar_target(", + " model_summaries,", + " glance_with_mod_name_slow(models),", + " pattern = map(models)", + " ),", + " tar_target(", + " model_predictions,", + " augment_with_mod_name_slow(models),", + " pattern = map(models)", + " ),", + ")" + ) + # adds report + plan_9 <- c( + "library(targets)", + "library(palmerpenguins)", + "library(tarchetypes)", + "library(broom)", + "suppressPackageStartupMessages(library(tidyverse))", + glance_with_mod_name_func, + augment_with_mod_name_func, + clean_penguin_data_func, + "tar_plan(", + " tar_file_read(", + " penguins_data_raw,", + " path_to_file('penguins_raw.csv'),", + " read_csv(!!.x, show_col_types = FALSE)", + " ),", + " penguins_data = 
clean_penguin_data(penguins_data_raw),", + " models = list(", + " combined_model = lm(", + " bill_depth_mm ~ bill_length_mm, data = penguins_data),", + " species_model = lm(", + " bill_depth_mm ~ bill_length_mm + species, data = penguins_data),", + " interaction_model = lm(", + " bill_depth_mm ~ bill_length_mm * species, data = penguins_data)", + " ),", + " tar_target(", + " model_summaries,", + " glance_with_mod_name(models),", + " pattern = map(models)", + " ),", + " tar_target(", + " model_predictions,", + " augment_with_mod_name(models),", + " pattern = map(models)", + " ),", + " tar_quarto(", + " penguin_report,", + " path = 'penguin_report.qmd',", + " quiet = FALSE,", + " packages = c('targets', 'tidyverse')", + " )", + ")" + ) + switch( + as.character(plan_select), + "1" = readr::write_lines(plan_1, "_targets.R"), + "2" = readr::write_lines(plan_2, "_targets.R"), + "3" = readr::write_lines(plan_3, "_targets.R"), + "4" = readr::write_lines(plan_4, "_targets.R"), + "5" = readr::write_lines(plan_5, "_targets.R"), + "6" = readr::write_lines(plan_6, "_targets.R"), + "7" = readr::write_lines(plan_7, "_targets.R"), + "8" = readr::write_lines(plan_8, "_targets.R"), + "9" = readr::write_lines(plan_9, "_targets.R"), + stop("plan_select must be 1, 2, 3, 4, 5, 6, 7, 8, or 9") + ) +} + +glance_with_mod_name <- function(model_in_list) { + model_name <- names(model_in_list) + model <- model_in_list[[1]] + broom::glance(model) |> + mutate(model_name = model_name) +} diff --git a/files/lesson_functions.R b/files/lesson_functions.R new file mode 100644 index 00000000..5e903b7a --- /dev/null +++ b/files/lesson_functions.R @@ -0,0 +1,59 @@ +# Functions used in the lesson `.Rmd` files, but that learners +# aren't exposed to, and aren't used inside the Targets pipelines + +make_tempdir <- function() { + x <- tempfile() + dir.create(x, showWarnings = FALSE) + x +} + +files_root <- normalizePath("files") +plan_root <- file.path(files_root, "plans") +utility_funcs <- 
file.path(files_root, "tar_functions") |> + list.files(full.names = TRUE, pattern = "\\.R$") |> + lapply(readLines) |> + unlist() +package_script <- file.path(files_root, "packages.R") + +#' @param file The path to another file to use as a workflow +#' @param chunk The chunk name to use as a targets workflow +write_example_plan <- function(file = NULL, chunk = NULL) { + # Write the utility functions into the R/ directory + + if (!dir.exists("R")) { + dir.create("R") + + # Write the functions.R script + file.path("R", "functions.R") |> + writeLines(utility_funcs, con = _) + + # Copy the packages.R script + file.path("R", "packages.R") |> + file.copy(from = package_script, to = _) + } + + # Write the workflow + if (!is.null(file)) { + file.path(plan_root, file) |> + file.copy(from = _, to = "_targets.R", overwrite = TRUE) + } + if (!is.null(chunk)) { + writeLines(text = knitr::knit_code$get(chunk), con = "_targets.R") + } + + invisible() +} + +directory_stack <- getwd() + +pushd <- function(dir) { + directory_stack <<- c(dir, directory_stack) + setwd(directory_stack[1]) + invisible() +} + +popd <- function() { + directory_stack <<- directory_stack[-1] + setwd(directory_stack[1]) + invisible() +} diff --git a/files/packages.R b/files/packages.R new file mode 100644 index 00000000..6f8bcc90 --- /dev/null +++ b/files/packages.R @@ -0,0 +1,6 @@ +library(targets) +library(tarchetypes) +library(palmerpenguins) +library(tidyverse) +library(broom) +library(htmlwidgets) diff --git a/files/plans/README.md b/files/plans/README.md new file mode 100644 index 00000000..a19d96ea --- /dev/null +++ b/files/plans/README.md @@ -0,0 +1 @@ +Plans that are re-used between multiple episodes are placed here \ No newline at end of file diff --git a/files/plans/plan_1.R b/files/plans/plan_1.R new file mode 100644 index 00000000..7c5575e3 --- /dev/null +++ b/files/plans/plan_1.R @@ -0,0 +1,21 @@ +options(tidyverse.quiet = TRUE) +library(targets) +library(tidyverse) +library(palmerpenguins) + 
+clean_penguin_data <- function(penguins_data_raw) { + penguins_data_raw |> + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + remove_missing(na.rm = TRUE) +} + +list( + tar_target(penguins_csv_file, path_to_file("penguins_raw.csv")), + tar_target(penguins_data_raw, read_csv( + penguins_csv_file, show_col_types = FALSE)), + tar_target(penguins_data, clean_penguin_data(penguins_data_raw)) +) diff --git a/files/plans/plan_10.R b/files/plans/plan_10.R new file mode 100644 index 00000000..be92fd01 --- /dev/null +++ b/files/plans/plan_10.R @@ -0,0 +1,42 @@ +options(tidyverse.quiet = TRUE) +suppressPackageStartupMessages(library(crew)) +source("R/functions.R") +source("R/packages.R") + +# Set up parallelization +library(crew) +tar_option_set( + controller = crew_controller_local(workers = 2) +) + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name_slow(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name_slow(models), + pattern = map(models) + ) +) diff --git a/files/plans/plan_11.R b/files/plans/plan_11.R new file mode 100644 index 00000000..5c9af52f --- /dev/null +++ b/files/plans/plan_11.R @@ -0,0 +1,42 @@ +options(tidyverse.quiet = TRUE) +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + 
read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ), + # Generate report + tar_quarto( + penguin_report, + path = "penguin_report.qmd", + quiet = FALSE, + packages = c("targets", "tidyverse") + ) +) diff --git a/files/plans/plan_2.R b/files/plans/plan_2.R new file mode 100644 index 00000000..35536ff0 --- /dev/null +++ b/files/plans/plan_2.R @@ -0,0 +1,10 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +list( + tar_target(penguins_csv_file, path_to_file('penguins_raw.csv')), + tar_target(penguins_data_raw, read_csv( + penguins_csv_file, show_col_types = FALSE)), + tar_target(penguins_data, clean_penguin_data(penguins_data_raw)) +) diff --git a/files/plans/plan_2b.R b/files/plans/plan_2b.R new file mode 100644 index 00000000..4bd68020 --- /dev/null +++ b/files/plans/plan_2b.R @@ -0,0 +1,9 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + penguins_csv_file = path_to_file("penguins_raw.csv"), + penguins_data_raw = read_csv(penguins_csv_file, show_col_types = FALSE), + penguins_data = clean_penguin_data(penguins_data_raw) +) diff --git a/files/plans/plan_3.R b/files/plans/plan_3.R new file mode 100644 index 00000000..0f913938 --- /dev/null +++ b/files/plans/plan_3.R @@ -0,0 +1,12 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + 
tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + penguins_data = clean_penguin_data(penguins_data_raw) +) diff --git a/files/plans/plan_4.R b/files/plans/plan_4.R new file mode 100644 index 00000000..86863344 --- /dev/null +++ b/files/plans/plan_4.R @@ -0,0 +1,19 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build model + combined_model = lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data + ) +) diff --git a/files/plans/plan_5.R b/files/plans/plan_5.R new file mode 100644 index 00000000..882876cc --- /dev/null +++ b/files/plans/plan_5.R @@ -0,0 +1,31 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + combined_model = lm( + bill_depth_mm ~ bill_length_mm, + data = penguins_data + ), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, + data = penguins_data + ), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, + data = penguins_data + ), + # Get model summaries + combined_summary = glance(combined_model), + species_summary = glance(species_model), + interaction_summary = glance(interaction_model) +) diff --git a/files/plans/plan_6.R b/files/plans/plan_6.R new file mode 100644 index 00000000..fad7536b --- /dev/null +++ b/files/plans/plan_6.R @@ -0,0 +1,29 @@ +options(tidyverse.quiet = TRUE) +source("R/packages.R") +source("R/functions.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + 
path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance(models[[1]]), + pattern = map(models) + ) +) diff --git a/files/plans/plan_7.R b/files/plans/plan_7.R new file mode 100644 index 00000000..346cca74 --- /dev/null +++ b/files/plans/plan_7.R @@ -0,0 +1,29 @@ +options(tidyverse.quiet = TRUE) +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ) +) diff --git a/files/plans/plan_8.R b/files/plans/plan_8.R new file mode 100644 index 00000000..8a6779ef --- /dev/null +++ b/files/plans/plan_8.R @@ -0,0 +1,35 @@ +options(tidyverse.quiet = TRUE) +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ 
bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ) +) diff --git a/files/plans/plan_9.R b/files/plans/plan_9.R new file mode 100644 index 00000000..164359b1 --- /dev/null +++ b/files/plans/plan_9.R @@ -0,0 +1,42 @@ +options(tidyverse.quiet = TRUE) +suppressPackageStartupMessages(library(crew)) +source("R/functions.R") +source("R/packages.R") + +# Set up parallelization +library(crew) +tar_option_set( + controller = crew_controller_local(workers = 2) +) + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ) +) diff --git a/files/tar_functions/README.md b/files/tar_functions/README.md new file mode 100644 index 00000000..e7e4db4c --- /dev/null +++ b/files/tar_functions/README.md @@ -0,0 +1,3 @@ +These are functions that are used inside the targets pipelines. +All of them are automatically included in every plan written by `execute_plan`. 
+However they are split into separate files so they can be included as code chunks and thereby shown to the learners. diff --git a/files/tar_functions/augment_with_mod_name.R b/files/tar_functions/augment_with_mod_name.R new file mode 100644 index 00000000..9b2695b1 --- /dev/null +++ b/files/tar_functions/augment_with_mod_name.R @@ -0,0 +1,6 @@ +augment_with_mod_name <- function(model_in_list) { + model_name <- names(model_in_list) + model <- model_in_list[[1]] + augment(model) |> + mutate(model_name = model_name) +} diff --git a/files/tar_functions/augment_with_mod_name_slow.R b/files/tar_functions/augment_with_mod_name_slow.R new file mode 100644 index 00000000..2d765c3f --- /dev/null +++ b/files/tar_functions/augment_with_mod_name_slow.R @@ -0,0 +1,7 @@ +augment_with_mod_name_slow <- function(model_in_list) { + Sys.sleep(4) + model_name <- names(model_in_list) + model <- model_in_list[[1]] + broom::augment(model) |> + mutate(model_name = model_name) +} diff --git a/files/tar_functions/clean_penguin_data.R b/files/tar_functions/clean_penguin_data.R new file mode 100644 index 00000000..ab7feaf4 --- /dev/null +++ b/files/tar_functions/clean_penguin_data.R @@ -0,0 +1,11 @@ +clean_penguin_data <- function(penguins_data_raw) { + penguins_data_raw |> + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + remove_missing(na.rm = TRUE) |> + # Split "species" apart on spaces, and only keep the first word + separate(species, into = "species", extra = "drop") +} diff --git a/files/tar_functions/glance_with_mod_name.R b/files/tar_functions/glance_with_mod_name.R new file mode 100644 index 00000000..c125a2e4 --- /dev/null +++ b/files/tar_functions/glance_with_mod_name.R @@ -0,0 +1,6 @@ +glance_with_mod_name <- function(model_in_list) { + model_name <- names(model_in_list) + model <- model_in_list[[1]] + glance(model) |> + mutate(model_name = model_name) +} diff --git 
a/files/tar_functions/glance_with_mod_name_slow.R b/files/tar_functions/glance_with_mod_name_slow.R new file mode 100644 index 00000000..1b9b8682 --- /dev/null +++ b/files/tar_functions/glance_with_mod_name_slow.R @@ -0,0 +1,7 @@ +glance_with_mod_name_slow <- function(model_in_list) { + Sys.sleep(4) + model_name <- names(model_in_list) + model <- model_in_list[[1]] + broom::glance(model) |> + mutate(model_name = model_name) +} diff --git a/files/tar_functions/write_lines_file.R b/files/tar_functions/write_lines_file.R new file mode 100644 index 00000000..7fbc8f70 --- /dev/null +++ b/files/tar_functions/write_lines_file.R @@ -0,0 +1,4 @@ +write_lines_file <- function(text, file, ...) { + writeLines(text = text, con = file, ...) + file +} diff --git a/index.md b/index.md new file mode 100644 index 00000000..544b633d --- /dev/null +++ b/index.md @@ -0,0 +1,5 @@ +--- +site: sandpaper::sandpaper_site +--- + +This is a lesson about how to use the [targets](https://docs.ropensci.org/targets/) R package for maintaining efficient data analysis workflows. diff --git a/instructor-notes.md b/instructor-notes.md new file mode 100644 index 00000000..697f3a03 --- /dev/null +++ b/instructor-notes.md @@ -0,0 +1,7 @@ +--- +title: 'Instructor Notes' +--- + +## General notes + +The examples gradually build up to a [full analysis](https://github.com/joelnitta/penguins-targets) of the [Palmer Penguins dataset](https://allisonhorst.github.io/palmerpenguins/). However, there are a few places where completely different code is demonstrated to explain certain concepts. Since a given `targets` project can only have one `_targets.R` file, this means the participants may have to delete their existing `_targets.R` file and write a new one to follow along with the examples. This may cause frustration if they can't keep a record of what they have done so far. One solution would be to save the old `_targets.R` file as `_targets_old.R` or similar, then rename it when it should be run again. 
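The renaming approach suggested above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the throwaway directory and the placeholder file contents are hypothetical stand-ins, and in a real workshop the files would live in the learner's project directory.

```r
# Work in a throwaway directory so that no real project is touched
demo_dir <- tempfile("targets_demo_")
dir.create(demo_dir)
old_wd <- setwd(demo_dir)

# Hypothetical stand-in for the learner's existing workflow
writeLines("list()", "_targets.R")

# Set the current workflow aside before writing a different example
file.rename("_targets.R", "_targets_old.R")
writeLines("list()", "_targets.R") # stand-in for the temporary example

# Later: discard the temporary example and restore the saved workflow
file.remove("_targets.R")
file.rename("_targets_old.R", "_targets.R")

restored <- file.exists("_targets.R") && !file.exists("_targets_old.R")
setwd(old_wd)
```

Keeping the saved copy under a distinct name such as `_targets_old.R` means `targets` ignores it until it is renamed back.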
+ diff --git a/introduction.md b/introduction.md new file mode 100644 index 00000000..6f2259ec --- /dev/null +++ b/introduction.md @@ -0,0 +1,100 @@ +--- +title: "Introduction" +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- Why should we care about reproducibility? +- How can `targets` help us achieve reproducibility? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain why reproducibility is important for science +- Describe the features of `targets` that enhance reproducibility + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: {.instructor} + +Episode summary: Introduce the idea of reproducibility and why / who would want to use `targets` + +::::::::::::::::::::::::::::::::::::: + +## What is reproducibility? + +Reproducibility is the ability for others (including your future self) to reproduce your analysis. + +We can only have confidence in the results of scientific analyses if they can be reproduced. + +However, reproducibility is not a binary concept (not reproducible vs. reproducible); rather, there is a scale from **less** reproducible to **more** reproducible. + +`targets` goes a long way toward making your analyses **more reproducible**. + +Other practices you can use to further enhance reproducibility include controlling your computing environment with tools like Docker, conda, or renv, but we don't have time to cover those in this workshop. + +## What is `targets`? + +`targets` is a workflow management package for the R programming language developed and maintained by Will Landau. 
+ +The major features of `targets` include: + +- **Automation** of workflow +- **Caching** of workflow steps +- **Batch creation** of workflow steps +- **Parallelization** at the level of the workflow + +This allows you to do the following: + +- return to a project after working on something else and immediately pick up where you left off without confusion or trying to remember what you were doing +- change the workflow, then only re-run the parts that are affected by the change +- massively scale up the workflow without changing individual functions + +... and of course, it will help others reproduce your analysis. + +## Who should use `targets`? + +`targets` is by no means the only workflow management software. +There is a large number of similar tools, each with varying features and use-cases. +For example, [snakemake](https://snakemake.readthedocs.io/en/stable/) is a popular workflow tool for Python, and [`make`](https://www.gnu.org/software/make/) is a tool that has been around for a very long time for automating bash scripts. +`targets` is designed to work specifically with R, so it makes the most sense to use it if you primarily use R, or intend to. +If you mostly code with other tools, you may want to consider an alternative. + +The **goal** of this workshop is to **learn how to use `targets` to do reproducible data analysis in R**. + +## Where to get more information + +`targets` is a sophisticated package and there is a lot more to learn than we can cover in this workshop. + +Here are some recommended resources for continuing on your `targets` journey: + +- [The `targets` R package user manual](https://books.ropensci.org/targets/) by the author of `targets`, Will Landau, should be considered required reading for anyone seriously interested in `targets`. +- [The `targets` discussion board](https://github.com/ropensci/targets/discussions) is a great place for asking questions and getting help. 
Before you ask a question though, be sure to [read the policy on asking for help](https://books.ropensci.org/targets/help.html). +- [The `targets` package webpage](https://docs.ropensci.org/targets/) includes documentation of all `targets` functions. +- [The `tarchetypes` package webpage](https://docs.ropensci.org/tarchetypes/) includes documentation of all `tarchetypes` functions. You will almost certainly use `tarchetypes` along with `targets`, so it's good to consult both. +- [Reproducible computation at scale in R with `targets`](https://github.com/wlandau/targets-tutorial) is a tutorial by Will Landau analyzing customer churn with Keras. +- [Recorded talks](https://github.com/ropensci/targets#recorded-talks) and [example projects](https://github.com/ropensci/targets#example-projects) listed on the `targets` README. + +## About the example dataset + +For this workshop, we will analyze an example dataset of measurements taken on adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago, Antarctica. + +The data are available from the `palmerpenguins` R package. You can get more information about the data by running `?palmerpenguins`. + +![The three species of penguins in the `palmerpenguins` dataset. Artwork by @allison_horst.](https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png) + +The goal of the analysis is to determine the relationship between bill length and depth by using linear models. + +We will gradually build up the analysis through this lesson, but you can see the final version at . 
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- We can only have confidence in the results of scientific analyses if they can be reproduced by others (including your future self) +- `targets` helps achieve reproducibility by automating workflow +- `targets` is designed for use with the R programming language +- The example dataset for this workshop includes measurements taken on penguins in Antarctica + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/learner-profiles.md b/learner-profiles.md new file mode 100644 index 00000000..75f3015a --- /dev/null +++ b/learner-profiles.md @@ -0,0 +1,15 @@ +--- +title: Learner Profiles +--- + +These are fictional examples of the sort of learner expected to take this workshop. + +**Dayja** is a graduate student in evolutionary biology. +She is familiar with R and writes many R scripts to conduct her analyses, but she often finds that it is difficult to remember which scripts to run in what order when she updates her data. + +**Jessie** is an undergraduate who is using Quarto to write their graduate thesis. +They want to make sure all the results presented in the thesis come directly from code to avoid any errors and to make it easier for submission to a journal later. + +**Vincent** is a post-doc in bioinformatics. +He has to orchestrate large workflows that run the same set of steps over many samples. +He wants to simplify his code to avoid repetition. diff --git a/lifecycle.md b/lifecycle.md new file mode 100644 index 00000000..fbdbc204 --- /dev/null +++ b/lifecycle.md @@ -0,0 +1,290 @@ +--- +title: 'The Workflow Lifecycle' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What happens if we re-run a workflow? +- How does `targets` know what steps to re-run? +- How can we inspect the state of the workflow? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain how `targets` helps increase efficiency +- Be able to inspect a workflow to see what parts are outdated + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: {.instructor} + +Episode summary: Demonstrate typical cycle of running `targets`: make, inspect, adjust, make... + +::::::::::::::::::::::::::::::::::::: + + + +## Re-running the workflow + +One of the features of `targets` is that it maximizes efficiency by only running the parts of the workflow that need to be run. + +This is easiest to understand by trying it yourself. Let's try running the workflow again: + + +``` r +tar_make() +``` + +``` output +✔ skipped target penguins_csv_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped pipeline [0.238 seconds] +``` + +Remember how the first time we ran the pipeline, `targets` printed out a list of each target as it was being built? + +This time, it tells us it is skipping those targets; they have already been built, so there's no need to run that code again. + +Remember, the fastest code is the code you don't have to run! + +## Re-running the workflow after modification + +What happens when we change one part of the workflow then run it again? + +Say that we decide the species names should be shorter. +Right now they include the common name and the scientific name, but we really only need the first part of the common name to distinguish them. 
+ +Edit `_targets.R` so that the `clean_penguin_data()` function looks like this: + + +``` r +clean_penguin_data <- function(penguins_data_raw) { + penguins_data_raw |> + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + remove_missing(na.rm = TRUE) |> + # Split "species" apart on spaces, and only keep the first word + separate(species, into = "species", extra = "drop") +} +``` + +Then run it again. + + +``` r +tar_make() +``` + +``` output +✔ skipped target penguins_csv_file +✔ skipped target penguins_data_raw +▶ dispatched target penguins_data +● completed target penguins_data [0.012 seconds, 1.495 kilobytes] +▶ ended pipeline [0.271 seconds] +``` + +What happened? + +This time, it skipped `penguins_csv_file` and `penguins_data_raw` and only ran `penguins_data`. + +Of course, since our example workflow is so short we don't even notice the amount of time saved. +But imagine using this in a series of computationally intensive analysis steps. +The ability to automatically skip steps results in a massive increase in efficiency. + +::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 1: Inspect the output + +How can you inspect the contents of `penguins_data`? + +:::::::::::::::::::::::::::::::::: solution + +With `tar_read(penguins_data)` or by running `tar_load(penguins_data)` followed by `penguins_data`. + +:::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::::::::: + +## Under the hood + +How does `targets` keep track of which targets are up-to-date vs. outdated? + +For each target in the workflow (items in the list at the end of the `_targets.R` file) and any custom functions used in the workflow, `targets` calculates a **hash value**, or unique combination of letters and digits that represents an object in the computer's memory. +You can think of the hash value (or "hash" for short) as **a unique fingerprint** for a target or function. 
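The fingerprint idea can be illustrated with a minimal base R sketch. Note the assumptions: `fingerprint()` is a hypothetical helper written for this illustration, and hashing the deparsed code with MD5 is a stand-in technique, not the hashing that `targets` performs internally.

```r
# Hypothetical helper: fingerprint an R object by hashing its deparsed code.
# This is an illustration only, not how `targets` computes its hashes.
fingerprint <- function(x) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(deparse(x), tmp)
  unname(tools::md5sum(tmp))
}

clean_v1 <- function(x) x[!is.na(x)]
clean_v1_copy <- function(x) x[!is.na(x)]
clean_v2 <- function(x) unique(x[!is.na(x)])

# Identical code gives identical fingerprints
same_code <- fingerprint(clean_v1) == fingerprint(clean_v1_copy)

# Edited code gives a different fingerprint
edited_code <- fingerprint(clean_v1) == fingerprint(clean_v2)
```

Comparing a stored fingerprint with a freshly computed one is enough to tell whether a function has been edited, without running the function itself.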
+ +The first time you run `tar_make()`, `targets` calculates the hashes for each target and function as it runs the code and stores them in the targets cache (the `_targets` folder). +Then, for each subsequent call of `tar_make()`, it calculates the hashes again and compares them to the stored values. +It detects which have changed, and this is how it knows which targets are out of date. + +:::::::::::::::::::::::::::::::::::::::: callout + +## Where the hashes live + +If you are curious about what the hashes look like, you can see them in the file `_targets/meta/meta`, but **do not edit this file by hand**---that would ruin your workflow! + +:::::::::::::::::::::::::::::::::::::::: + +This information is used in combination with the dependency relationships (in other words, how each target depends on the others) to re-run the workflow in the most efficient way possible: code is only run for targets that need to be re-built, and others are skipped. + +## Visualizing the workflow + +Typically, you will be making edits to various places in your code, adding new targets, and running the workflow periodically. +It is good to be able to visualize the state of the workflow. + +This can be done with `tar_visnetwork()`: + + +``` r +tar_visnetwork() +``` + +![](fig/lifecycle-visnetwork.png){alt="Visualization of the targets workflow, showing 'penguins_data' connected by lines to 'penguins_data_raw', 'penguins_csv_file' and 'clean_penguin_data'"} + +You should see the network show up in the plot area of RStudio. + +It is an HTML widget, so you can zoom in and out (this isn't important for the current example since it is so small, but is useful for larger, "real-life" workflows). + +Here, we see that all of the targets are dark green, indicating that they are up-to-date and would be skipped if we were to run the workflow again. 
+ +::::::::::::::::::::::::::::::::::::: prereq + +## Installing visNetwork + +You may encounter an error message `The package "visNetwork" is required.` + +In this case, install it first with `install.packages("visNetwork")`. + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: challenge + +## Challenge 2: What else can the visualization tell us? + +Modify the workflow in `_targets.R`, then run `tar_visnetwork()` again **without** running `tar_make()`. +What color indicates that a target is out of date? + +:::::::::::::::::::::::::::::::::: solution + +Light blue indicates the target is out of date. + +Depending on how you modified the code, any or all of the targets may now be light blue. + +:::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: callout + +## 'Outdated' does not always mean 'will be run' + +Just because a target appears as light blue (is "outdated") in the network visualization, this does not guarantee that it will be re-built during the next run. Rather, it means that **at least one of the targets that it depends on has changed**. + +For example, if the workflow state looked like this: + +`A -> B* -> C -> D` + +where the `*` indicates that `B` has changed compared to the last time the workflow was run, the network visualization will show `B`, `C`, and `D` all as light blue. + +But if re-running the workflow results in the exact same value for `C` as before, `D` will not be re-run (will be "skipped"). + +Most of the time, a single change will cascade to the rest of the downstream targets and cause them to be re-built, but this is not always the case. `targets` has no way of knowing ahead of time what the actual output will be, so it cannot provide a network visualization that completely predicts the future! 
+ +::::::::::::::::::::::::::::::::::::::::::::::: + +## Other ways to check workflow status + +The visualization is very useful, but sometimes you may be working on a server that doesn't provide graphical output, or you just want a quick textual summary of the workflow. +There are some other useful functions that can do that. + +`tar_outdated()` lists only the outdated targets; that is, targets that will be built during the next run, or depend on such a target. +If everything is up to date, it will return a zero-length character vector (`character(0)`). + + +``` r +tar_outdated() +``` + +``` output +character(0) +``` + +`tar_progress()` shows the current status of the workflow as a dataframe. +You may find it helpful to further manipulate the dataframe to obtain useful summaries of the workflow, for example using `dplyr` (such data manipulation is beyond the scope of this lesson but the instructor may demonstrate its use). + + +``` r +tar_progress() +``` + +``` output +# A tibble: 3 × 2 + name progress + +1 penguins_csv_file skipped +2 penguins_data_raw skipped +3 penguins_data completed +``` + +## Granular control of targets + +It is possible to only make a particular target instead of running the entire workflow. + +To do this, type the name of the target you wish to build after `tar_make()` (note that any targets required by the one you specify will also be built). +For example, `tar_make(penguins_data_raw)` would **only** build `penguins_data_raw`, not `penguins_data`. + +Furthermore, if you want to manually "reset" a target and make it appear out-of-date, you can do so with `tar_invalidate()`. This means that target (and any that depend on it) will be re-run next time. + +Let's give this a try. 
Remember that our pipeline is currently up to date, so `tar_make()` will skip everything: + + +``` r +tar_make() +``` + +``` output +✔ skipped target penguins_csv_file +✔ skipped target penguins_data_raw +✔ skipped target penguins_data +✔ skipped pipeline [0.237 seconds] +``` + +Let's invalidate `penguins_data` and run it again: + + +``` r +tar_invalidate(penguins_data) +tar_make() +``` + +``` output +✔ skipped target penguins_csv_file +✔ skipped target penguins_data_raw +▶ dispatched target penguins_data +● completed target penguins_data [0.012 seconds, 1.495 kilobytes] +▶ ended pipeline [0.264 seconds] +``` + +If you want to reset **everything** and start fresh, you can use `tar_invalidate(everything())` (`tar_invalidate()` [accepts `tidyselect` expressions](https://docs.ropensci.org/targets/reference/tar_invalidate.html) to specify target names). + +**Caution should be exercised** when using granular methods like this, though, since you may end up with your workflow in an unexpected state. The surest way to maintain an up-to-date workflow is to run `tar_make()` frequently. + +## How this all works in practice + +In practice, you will likely be switching between running the workflow with `tar_make()`, loading the targets you built with `tar_load()`, and editing your custom functions by running code in an interactive R session. It takes some time to get used to it, but soon you will feel that your code isn't "real" until it is embedded in a `targets` workflow. 
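The cycle described above can be tried end to end on a toy pipeline. The following is a minimal sketch, assuming the `targets` package is installed. The target names `raw`, `clean`, and `total` are hypothetical and are not part of the penguins workflow. `tar_dir()` runs the code in a temporary directory so that the current project is left untouched.

```r
library(targets)

result <- tar_dir({
  # Write a small, hypothetical _targets.R into the temporary directory
  tar_script({
    list(
      tar_target(raw, c(1, 2, NA, 4)),
      tar_target(clean, raw[!is.na(raw)]),
      tar_target(total, sum(clean))
    )
  }, ask = FALSE)

  # First run: every target is built
  tar_make()

  # Reset one target by hand; it and its downstream target are now reported as outdated
  tar_invalidate(clean)
  print(tar_outdated())

  # Second run: `clean` is rebuilt, but its value is unchanged,
  # so `total` should be skipped even though it was listed as outdated
  tar_make()

  # Inspect the final value interactively
  tar_read(total)
})
```

Comparing the printed result of `tar_outdated()` with the log of the second `tar_make()` shows the difference between a target being reported as outdated and a target actually being re-run.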
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- `targets` only runs the steps that have been affected by a change to the code +- `tar_visnetwork()` shows the current state of the workflow as a network +- `tar_progress()` shows the current state of the workflow as a data frame +- `tar_outdated()` lists outdated targets +- `tar_invalidate()` can be used to invalidate (re-run) specific targets + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/links.md b/links.md new file mode 100644 index 00000000..4c5cd2f9 --- /dev/null +++ b/links.md @@ -0,0 +1,10 @@ + + +[pandoc]: https://pandoc.org/MANUAL.html +[r-markdown]: https://rmarkdown.rstudio.com/ +[rstudio]: https://www.rstudio.com/ +[carpentries-workbench]: https://carpentries.github.io/sandpaper-docs/ + diff --git a/md5sum.txt b/md5sum.txt new file mode 100644 index 00000000..e93c694e --- /dev/null +++ b/md5sum.txt @@ -0,0 +1,21 @@ +"file" "checksum" "built" "date" +"CODE_OF_CONDUCT.md" "c93c83c630db2fe2462240bf72552548" "site/built/CODE_OF_CONDUCT.md" "2024-12-13" +"LICENSE.md" "b24ebbb41b14ca25cf6b8216dda83e5f" "site/built/LICENSE.md" "2024-12-13" +"config.yaml" "183abc0bde40be8c6a757ac191bd2c1c" "site/built/config.yaml" "2024-12-13" +"index.md" "06bbfd5ab0e2353032361b3321342d13" "site/built/index.md" "2024-12-13" +"links.md" "8184cf4149eafbf03ce8da8ff0778c14" "site/built/links.md" "2024-12-13" +"episodes/introduction.Rmd" "feaacfccab344eb1fa6e702aded97924" "site/built/introduction.md" "2024-12-13" +"episodes/basic-targets.Rmd" "90190eae899db41c64b69320e3f72365" "site/built/basic-targets.md" "2024-12-13" +"episodes/cache.Rmd" "b487d6d792469641faec63c838541aac" "site/built/cache.md" "2024-12-13" +"episodes/lifecycle.Rmd" "7974a62cc37ac1138647d043fe1e4a26" "site/built/lifecycle.md" "2024-12-13" +"episodes/organization.Rmd" "74df25779b74013eeb6a8ca7b8934efe" "site/built/organization.md" "2024-12-13" +"episodes/packages.Rmd" "2c0eb6138ea6685a0ee279c89b381bc4" "site/built/packages.md" "2024-12-13" 
+"episodes/files.Rmd" "b7f4ef83379a58d5c30d8e011e3b2c0d" "site/built/files.md" "2024-12-13" +"episodes/branch.Rmd" "6f1187d6df3310eb042aaae3a44328dc" "site/built/branch.md" "2024-12-13" +"episodes/parallel.Rmd" "3ec032e9a527138e70e2efb4e5a10410" "site/built/parallel.md" "2024-12-13" +"episodes/quarto.Rmd" "76b257de72894ab24e1d1852b6149bf9" "site/built/quarto.md" "2024-12-13" +"instructors/instructor-notes.md" "df3784ee5c0436a9e171071f7965d3fc" "site/built/instructor-notes.md" "2024-12-13" +"learners/reference.md" "3f06251c1f932e767ae8f22db25eb5a2" "site/built/reference.md" "2024-12-13" +"learners/setup.md" "2c9965f182c4d73141cbf0bef2990f16" "site/built/setup.md" "2024-12-13" +"profiles/learner-profiles.md" "44d8b9d8aca7963e6577e8c67d23eac0" "site/built/learner-profiles.md" "2024-12-13" +"renv/profiles/lesson-requirements/renv.lock" "156bb9b842e08b6f652f7119897c96b0" "site/built/renv.lock" "2024-12-13" diff --git a/organization.md b/organization.md new file mode 100644 index 00000000..e892d66e --- /dev/null +++ b/organization.md @@ -0,0 +1,234 @@ +--- +title: 'Best Practices for targets Project Organization' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- What are best practices for organizing `targets` projects? +- How does the organization of a `targets` workflow differ from a script-based analysis? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Explain how to organize `targets` projects for maximal reproducibility +- Understand how to use functions in the context of `targets` + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Demonstrate best-practices for project organization + +::::::::::::::::::::::::::::::::::::: + + + +## A simpler way to write workflow plans + +The default way to specify targets in the plan is with the `tar_target()` function. 
+But this way of writing plans can be a bit verbose. + +There is an alternative provided by the `tarchetypes` package, also written by the creator of `targets`, Will Landau. + +::::::::::::::::::::::::::::::::::::: prereq + +## Install `tarchetypes` + +If you haven't done so yet, install `tarchetypes` with `install.packages("tarchetypes")`. + +::::::::::::::::::::::::::::::::::::: + +The purpose of `tarchetypes` is to provide various shortcuts that make writing `targets` pipelines easier. +We will introduce just one for now, `tar_plan()`. This is used in place of `list()` at the end of the `_targets.R` script. +By using `tar_plan()`, instead of specifying targets with `tar_target()`, we can use a syntax like this: `target_name = target_command`. + +Let's edit the penguins workflow to use the `tar_plan()` syntax: + + + +``` r +library(targets) +library(tarchetypes) +library(palmerpenguins) +library(tidyverse) + +clean_penguin_data <- function(penguins_data_raw) { + penguins_data_raw |> + select( + species = Species, + bill_length_mm = `Culmen Length (mm)`, + bill_depth_mm = `Culmen Depth (mm)` + ) |> + remove_missing(na.rm = TRUE) |> + # Split "species" apart on spaces, and only keep the first word + separate(species, into = "species", extra = "drop") +} + +tar_plan( + penguins_csv_file = path_to_file("penguins_raw.csv"), + penguins_data_raw = read_csv(penguins_csv_file, show_col_types = FALSE), + penguins_data = clean_penguin_data(penguins_data_raw) +) +``` + +I think it is easier to read, don't you? + +Notice that using `tar_plan()` does not mean you have to write *all* targets this way; you can still use the `tar_target()` format within `tar_plan()`. +This flexibility is useful because the `=` syntax, while short and easy to read, does not provide all of the customization that `targets` is capable of. +This doesn't matter so much for now, but it will become important when you start to create more advanced `targets` workflows.
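To make the mixing of the two styles concrete, here is a small sketch of a plan that uses both. The targets `raw_values` and `mean_value` are hypothetical, and `cue = tar_cue(mode = "always")` is just one example of an option that can only be set through `tar_target()`.

```r
library(targets)
library(tarchetypes)

# Contents of a hypothetical _targets.R that mixes both styles
plan <- tar_plan(
  # Short form: target_name = target_command
  raw_values = c(1, 2, NA),
  # Long form: needed here because we pass an extra option (`cue`)
  tar_target(
    mean_value,
    mean(raw_values, na.rm = TRUE),
    cue = tar_cue(mode = "always") # always re-run this target
  )
)

# The last expression in _targets.R must be the list of targets
plan
```

Storing the plan in a variable is not required; it is done here only so the result can be inspected. In a real `_targets.R` you would normally end the script with the `tar_plan()` call itself.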
+ +## Organizing files and folders + +So far, we have been doing everything with a single `_targets.R` file. +This is OK for a small workflow, but does not work very well when the workflow gets bigger. +There are better ways to organize your code. + +First, let's create a directory called `R` to store R code *other than* `_targets.R` (remember, `_targets.R` must be placed in the overall project directory, not in a subdirectory). +Create a new R file in `R/` called `functions.R`. +This is where we will put our custom functions. +Let's go ahead and put `clean_penguin_data()` in there now and save it. + +Similarly, let's put the `library()` calls in their own script in `R/` called `packages.R` (this isn't the only way to do it though; see the ["Managing Packages" episode](https://joelnitta.github.io/targets-workshop/packages.html) for alternative approaches). + +We will also need to modify our `_targets.R` script to call these scripts with `source`: + + +``` r +source("R/packages.R") +source("R/functions.R") + +tar_plan( + penguins_csv_file = path_to_file("penguins_raw.csv"), + penguins_data_raw = read_csv(penguins_csv_file, show_col_types = FALSE), + penguins_data = clean_penguin_data(penguins_data_raw) +) +``` + +Now `_targets.R` is much more streamlined: it is focused just on the workflow and immediately tells us what happens in each step. + +Finally, let's make some directories for storing data and output---files that are not code. +Create a new directory inside the targets cache called `user`: `_targets/user`. +Within `user`, create two more directories, `data` and `results`. +(If you use version control, you will probably want to ignore the `_targets` directory). + +## A word about functions + +We mentioned custom functions earlier in the lesson, but this is an important topic that deserves further clarification. 
+If you are used to analyzing data in R with a series of scripts instead of a single workflow like `targets`, you may not write many functions (using the `function()` function). + +This is a major difference from `targets`. +It would be quite difficult to write an efficient `targets` pipeline without the use of custom functions, because each target you build has to be the output of a single command. + +We don't have time in this curriculum to cover how to write functions in R, but the [Software Carpentry lesson](https://swcarpentry.github.io/r-novice-gapminder/10-functions) is recommended for reviewing this topic. + +Another major difference is that **each target must have a unique name**. +You may be used to writing code that looks like this: + + +``` r +# Store a person's height in cm, then convert to inches +height <- 160 +height <- height / 2.54 +``` + +You would get an error if you tried to run the equivalent targets pipeline: + + +``` r +tar_plan( + height = 160, + height = height / 2.54 +) +``` + + +``` output + +``` + +``` output +── Debugging ─────────────────────────────────────────────────────────────────── +``` + +``` output + +``` + +``` output +── How to ────────────────────────────────────────────────────────────────────── +``` + +``` output + +``` + +``` output +── Last error message ────────────────────────────────────────────────────────── +``` + +``` output + +``` + +``` output +── Last error traceback ──────────────────────────────────────────────────────── +``` + +``` error +Error: +! targets::tar_make() error + • tar_errored() + • tar_meta(fields = any_of("error"), complete_only = TRUE) + • tar_workspace() + • tar_workspaces() + • Debug: https://books.ropensci.org/targets/debugging.html + • Help: https://books.ropensci.org/targets/help.html + duplicated target names: height + base::tryCatch(base::withCallingHandlers({ NULL base::saveRDS(base::do.c... 
+ tryCatchList(expr, classes, parentenv, handlers) + tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), na... + doTryCatch(return(expr), name, parentenv, handler) + tryCatchList(expr, names[-nh], parentenv, handlers[-nh]) + tryCatchOne(expr, names, parentenv, handlers[[1L]]) + doTryCatch(return(expr), name, parentenv, handler) + base::withCallingHandlers({ NULL base::saveRDS(base::do.call(base::do.ca... + base::saveRDS(base::do.call(base::do.call, base::c(base::readRDS("/tmp/R... + base::do.call(base::do.call, base::c(base::readRDS("/tmp/RtmpBcneVt/call... + (function (what, args, quote = FALSE, envir = parent.frame()) { if (!is.... + (function (targets_function, targets_arguments, options, envir = NULL, s... + tryCatch(out <- withCallingHandlers(targets::tar_callr_inner_try(targets... + tryCatchList(expr, classes, parentenv, handlers) + tryCatchOne(expr, names, parentenv, handlers[[1L]]) + doTryCatch(return(expr), name, parentenv, handler) + withCallingHandlers(targets::tar_callr_inner_try(targets_function = targ... + targets::tar_callr_inner_try(targets_function = targets_function, target... + pipeline_from_list(targets) + pipeline_from_list.default(targets) + pipeline_init(out) + pipeline_targets_init(targets, clone_targets) + tar_assert_unique_targets(names) + tar_throw_validate(message) + tar_error(message = paste0(...), class = c("tar_condition_validate", "ta... + rlang::abort(message = message, class = class, call = tar_envir_base) + signal_abort(cnd, .file) +``` + +**A major part of working with `targets` pipelines is writing custom functions that are the right size.** +They should not be so small that each is just a single line of code; this would make your pipeline difficult to understand and to maintain. +On the other hand, they should not be so big that each has a large number of inputs and is thus overly sensitive to changes. + +Striking this balance is more of an art than a science, and only comes with practice.
I find a good rule of thumb is no more than three inputs per target. + +::::::::::::::::::::::::::::::::::::: keypoints + +- Put code in the `R/` folder +- Put functions in `R/functions.R` +- Specify packages in `R/packages.R` +- Put other miscellaneous files in `_targets/user` +- Writing functions is a key skill for `targets` pipelines + +:::::::::::::::::::::::::::::::::::::::::::::::: + diff --git a/packages.md b/packages.md new file mode 100644 index 00000000..1464b870 --- /dev/null +++ b/packages.md @@ -0,0 +1,228 @@ +--- +title: 'Managing Packages' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How should I manage packages for my `targets` project? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Demonstrate best practices for managing packages + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to load packages and maintain package versions + +::::::::::::::::::::::::::::::::::::: + + + +## Loading packages + +Almost every R analysis relies on packages for functions beyond those available in base R. + +There are three main ways to load packages in `targets` workflows. + +### Method 1: `library()` {#method-1} + +This is the method you are almost certainly most familiar with, and is the method we have been using by default so far. + +As in any other R script, include `library()` calls near the top of the `_targets.R` script. Alternatively (and as the recommended best practice for project organization), you can put all of the `library()` calls in a separate script---this is typically called `packages.R` and stored in the `R/` directory of your project.
+ +The potential downside to this approach is that if you have a long list of packages to load, certain functions like `tar_visnetwork()`, `tar_outdated()`, etc., may take an unnecessarily long time to run because they have to load all the packages, even though they don't necessarily use them. + +### Method 2: `tar_option_set()` {#method-2} + +In this method, use the `tar_option_set()` function in `_targets.R` to specify the packages to load when running the workflow. + +This will be demonstrated using the pre-cleaned dataset from the `palmerpenguins` package. Let's say we want to filter it down to just data for the Adelie penguin. + +::::::::::::::::::::::::::::::::::::: {.callout} + +## Save your progress + +You can only have one active `_targets.R` file at a time in a given project. + +We are about to create a new `_targets.R` file, but you probably don't want to lose your progress in the one we have been working on so far (the penguins bill analysis). You can temporarily rename that one to something like `_targets_old.R` so that you don't overwrite it with the new example `_targets.R` file below. Then, rename them when you are ready to work on it again. + +::::::::::::::::::::::::::::::::::::: + +This is what using the `tar_option_set()` method looks like: + + +``` r +library(targets) +library(tarchetypes) + +tar_option_set(packages = c("dplyr", "palmerpenguins")) + +tar_plan( + adelie_data = filter(penguins, species == "Adelie") +) +``` + + +``` output +▶ dispatched target adelie_data +● completed target adelie_data [0.017 seconds, 1.544 kilobytes] +▶ ended pipeline [0.106 seconds] +``` + +This method gets around the slow-downs that may sometimes be experienced with Method 1. + +### Method 3: `packages` argument of `tar_target()` {#method-3} + +The main function for defining targets, `tar_target()` includes a `packages` argument that will load the specified packages **only for that target**. 
+ +Here is how we could use this method, modified from the same example as above. + + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + tar_target( + adelie_data, + filter(penguins, species == "Adelie"), + packages = c("dplyr", "palmerpenguins") + ) +) +``` + + +``` output +▶ dispatched target adelie_data +● completed target adelie_data [0.016 seconds, 1.544 kilobytes] +▶ ended pipeline [0.106 seconds] +``` + +This can be more memory efficient in some cases than loading all packages, since not every target is always made during a typical run of the workflow. +However, it can be tedious to remember and specify the packages needed on a per-target basis. + +### One more option + +Another alternative that does not actually involve loading packages is to specify the package associated with each function by using the `::` notation, for example, `dplyr::mutate()`. +This means you can **avoid loading packages altogether**. + +Here is how to write the plan using this method: + + +``` r +library(targets) +library(tarchetypes) + +tar_plan( + adelie_data = dplyr::filter(palmerpenguins::penguins, species == "Adelie") +) +``` + + +``` output +▶ dispatched target adelie_data +● completed target adelie_data [0.009 seconds, 1.544 kilobytes] +▶ ended pipeline [0.098 seconds] +``` + +The benefit of this approach is that the origin of every function is explicit, so you could browse your code (for example, by looking at its source on GitHub) and immediately know where each function comes from. +The downside is that it is rather verbose because you need to type the package name every time you use one of its functions. + +### Which is the right way? + +**There is no "right" answer about how to load packages**---it is a matter of what works best for your particular situation.
+ +Often a reasonable approach is to load your most commonly used packages with `library()` (such as `tidyverse`) in `packages.R`, then use `::` notation for less frequently used functions whose origins you may otherwise forget. + +## Maintaining package versions + +### Tracking of custom functions vs. functions from packages + +A critical thing to understand about `targets` is that **it only tracks custom functions and targets**, not functions provided by packages. + +However, the content of packages can change, and packages typically get updated on a regular basis. **The output of your workflow may depend not only on the packages you use, but also on their versions**. + +Therefore, it is a good idea to track package versions. + +### About `renv` + +Fortunately, you don't have to do this by hand: there are R packages available that can help automate this process. We recommend [renv](https://rstudio.github.io/renv/index.html), but there are others available as well (e.g., [groundhog](https://groundhogr.com/)). We don't have the time to cover detailed usage of `renv` in this lesson. To get started with `renv`, see the ["Introduction to renv" vignette](https://rstudio.github.io/renv/articles/renv.html). + +You can generally use `renv` in a `targets` project the same way you would in any other R project. However, there is one exception: if you load packages using `tar_option_set()` or the `packages` argument of `tar_target()` ([Method 2](#method-2) or [Method 3](#method-3), respectively), `renv` will not detect them (because it expects packages to be loaded with `library()`, `require()`, etc.). + +The solution in this case is to use the [`tar_renv()` function](https://docs.ropensci.org/targets/reference/tar_renv.html). This will write a separate file with `library()` calls for each package used in the workflow so that `renv` will properly detect them.
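As a minimal sketch of what `tar_renv()` does, the code below writes a toy `_targets.R` (based on the Method 2 example) in a temporary directory, calls `tar_renv()`, and prints the file it generates. We are assuming the default output file name, `_targets_packages.R`; check the `tar_renv()` help page if your version of `targets` behaves differently.

```r
library(targets)

# Work in a temporary directory so the real project is left untouched
tar_dir({
  # A toy _targets.R that declares packages with tar_option_set()
  tar_script({
    tar_option_set(packages = c("dplyr", "palmerpenguins"))
    list(
      tar_target(adelie_data, filter(penguins, species == "Adelie"))
    )
  }, ask = FALSE)

  # Write the helper file of library() calls for renv to detect
  tar_renv()

  # Assumption: the default output file is `_targets_packages.R`
  renv_lines <- readLines("_targets_packages.R")
  print(renv_lines)
})
```

The generated file may also list a few packages that `targets` itself can optionally use, so expect more lines than just the two packages named in `tar_option_set()`.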
+ +### Selective tracking of functions from packages + +Because `targets` doesn't track functions from packages, if you update a package and the contents of one of its functions changes, `targets` **will not re-build the target that was generated by that function**. + +However, it is possible to change this behavior on a per-package basis. +This is best done only for a small number of packages, since adding too many would add too much computational overhead to `targets` when it has to calculate dependencies. +For example, you may want to do this if you are using your own custom package that you update frequently. + +The way to do so is by using `tar_option_set()`, specifying the **same** package name in both `packages` and `imports`. Here is a modified version of the earlier code that demonstrates this for `dplyr` and `palmerpenguins`. + + +``` r +library(targets) +library(tarchetypes) + +tar_option_set( + packages = c("dplyr", "palmerpenguins"), + imports = c("dplyr", "palmerpenguins") +) + +tar_plan( + adelie_data = filter(penguins, species == "Adelie") +) +``` + +If we were to re-install either `dplyr` or `palmerpenguins` and one of the functions used from those in the pipeline changes (for example, `filter()`), any target depending on that function will be rebuilt. + +## Resolving namespace conflicts + +There is one final best-practice to mention related to packages: resolving namespace conflicts. + +"Namespace" refers to the idea that a certain set of unique names are only unique **within a particular context**. +For example, all the function names of a package have to be unique, but only within that package. +Function names could be duplicated across packages. + +As you may imagine, this can cause confusion. +For example, the `filter()` function appears in both the `stats` package and the `dplyr` package, but does completely different things in each. +This is a **namespace conflict**: how do we know which `filter()` we are talking about? 
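To see the conflict in action, here is a short sketch that calls both versions explicitly with `::`. The input vector `x` is made up for illustration.

```r
x <- c(1, 2, 3, 4)

# stats::filter() applies a linear filter to a numeric series
# (here, a filter with two equal weights of 1)
stats_result <- stats::filter(x, rep(1, 2))

# dplyr::filter() keeps the rows of a data frame that match a condition
dplyr_result <- dplyr::filter(data.frame(x = x), x > 2)

stats_result
dplyr_result
```

Same function name, entirely different behavior: one returns a filtered series of the same length as `x`, the other returns a subset of rows.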
+ +The `conflicted` package can help prevent such confusion by stopping you if you try to use an ambiguous function, and help you be explicit about which package to use. +We don't have time to cover the details here, but you can read more about how to use `conflicted` at its [website](https://conflicted.r-lib.org/). + +When you use `conflicted`, you will typically run a series of commands to explicitly resolve namespace conflicts, like `conflicts_prefer(dplyr::filter)` (this would tell R that we want to use `filter` from `dplyr`, not `stats`). + +To use this in a `targets` workflow, you should put all calls to `conflicts_prefer` in a special file called `.Rprofile` that is located in the main folder of your project. This will ensure that the conflicts are always resolved for each target. + +The recommended way to edit your `.Rprofile` is to use `usethis::edit_r_profile("project")`. +This will open `.Rprofile` in your editor, where you can edit it and save it. + +For example, your `.Rprofile` could include this: + + +``` r +library(conflicted) +conflicts_prefer(dplyr::filter) +``` + +Note that you don't need to run `source()` to run the code in `.Rprofile`. +It will always get run at the start of each R session automatically. + +::::::::::::::::::::::::::::::::::::: keypoints + +- There are multiple ways to load packages with `targets` +- `targets` only tracks user-defined functions, not packages +- Use `renv` to manage package versions +- Use the `conflicted` package to manage namespace conflicts + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/parallel.md b/parallel.md new file mode 100644 index 00000000..dc47fa28 --- /dev/null +++ b/parallel.md @@ -0,0 +1,211 @@ +--- +title: 'Parallel Processing' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can we build targets in parallel? 
+ +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Be able to build targets in parallel + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to use parallel processing + +::::::::::::::::::::::::::::::::::::: + + + +Once a pipeline starts to include many targets, you may want to think about parallel processing. +This takes advantage of multiple processors in your computer to build multiple targets at the same time. + +::::::::::::::::::::::::::::::::::::: {.callout} + +## When to use parallel processing + +Parallel processing should only be used if your workflow has independent tasks---if your workflow only consists of a linear sequence of targets, then there is nothing to parallelize. +Most workflows that use branching can benefit from parallelism. + +::::::::::::::::::::::::::::::::::::: + +`targets` includes support for high-performance computing, cloud computing, and various parallel backends. +Here, we assume you are running this analysis on a laptop and so will use a relatively simple backend. +If you are interested in high-performance computing, [see the `targets` manual](https://books.ropensci.org/targets/hpc.html). + +### Set up workflow + +To enable parallel processing with `crew`, you only need to load the `crew` package, then tell `targets` to use it with `tar_option_set()`. +Specifically, the following lines enable `crew` and tell it to use 2 parallel workers. +You can increase this number on more powerful machines: + +```r +library(crew) +tar_option_set( + controller = crew_controller_local(workers = 2) +) +``` + +Make these changes to the penguins analysis.
+It should now look like this: + + +``` r +source("R/functions.R") +source("R/packages.R") + +# Set up parallelization +library(crew) +tar_option_set( + controller = crew_controller_local(workers = 2) +) + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ) +) +``` + +There is still one more thing we need to modify only for the purposes of this demo: if we ran the analysis in parallel now, you wouldn't notice any difference in compute time because the functions are so fast. + +So let's make "slow" versions of `glance_with_mod_name()` and `augment_with_mod_name()` using the `Sys.sleep()` function, which just tells the computer to wait some number of seconds. +This will simulate a long-running computation and enable us to see the difference between running sequentially and in parallel. 
+ +Add these functions to `functions.R` (you can copy-paste the original ones, then modify them): + + +``` r +glance_with_mod_name_slow <- function(model_in_list) { + Sys.sleep(4) + model_name <- names(model_in_list) + model <- model_in_list[[1]] + broom::glance(model) |> + mutate(model_name = model_name) +} +augment_with_mod_name_slow <- function(model_in_list) { + Sys.sleep(4) + model_name <- names(model_in_list) + model <- model_in_list[[1]] + broom::augment(model) |> + mutate(model_name = model_name) +} +``` + +Then, change the plan to use the "slow" version of the functions: + + +``` r +source("R/functions.R") +source("R/packages.R") + +# Set up parallelization +library(crew) +tar_option_set( + controller = crew_controller_local(workers = 2) +) + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name_slow(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name_slow(models), + pattern = map(models) + ) +) +``` + +Finally, run the pipeline with `tar_make()` as normal. 
+ + +``` output +✔ skip target penguins_data_raw_file +✔ skip target penguins_data_raw +✔ skip target penguins_data +✔ skip target models +• start branch model_predictions_5ad4cec5 +• start branch model_predictions_c73912d5 +• start branch model_predictions_91696941 +• start branch model_summaries_5ad4cec5 +• start branch model_summaries_c73912d5 +• start branch model_summaries_91696941 +• built branch model_predictions_5ad4cec5 [4.884 seconds] +• built branch model_predictions_c73912d5 [4.896 seconds] +• built branch model_predictions_91696941 [4.006 seconds] +• built pattern model_predictions +• built branch model_summaries_5ad4cec5 [4.011 seconds] +• built branch model_summaries_c73912d5 [4.011 seconds] +• built branch model_summaries_91696941 [4.011 seconds] +• built pattern model_summaries +• end pipeline [15.153 seconds] +``` + +Notice that although the time required to build each individual target is about 4 seconds, the total time to run the entire workflow is less than the sum of the individual target times! That is proof that processes are running in parallel **and saving you time**. + +The unique and powerful thing about targets is that **we did not need to change our custom function to run it in parallel**. We only adjusted *the workflow*. This means it is relatively easy to refactor (modify) a workflow for running sequentially locally or running in parallel in a high-performance context. + +Now that we have demonstrated how this works, you can change your analysis plan back to the original versions of the functions you wrote. 
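As a rough check on the timings reported above, we can work out how long the six "slow" branches would take sequentially versus with two workers. This is a back-of-the-envelope sketch that ignores startup and communication overhead, which is why the observed total is somewhat higher than the ideal figure.

```r
# Figures taken from the pipeline and output above
n_branches <- 6          # 3 model_summaries + 3 model_predictions branches
seconds_per_branch <- 4  # the Sys.sleep(4) in each "slow" function
n_workers <- 2           # workers = 2 in crew_controller_local()

# One branch after another
sequential_seconds <- n_branches * seconds_per_branch

# Branches spread evenly across workers, no overhead
ideal_parallel_seconds <- ceiling(n_branches / n_workers) * seconds_per_branch

sequential_seconds      # 24
ideal_parallel_seconds  # 12
```

The observed run time of about 15 seconds falls between these two figures: well under the 24 seconds a sequential run would need, and a little over the 12-second ideal.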
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- Dynamic branching creates multiple targets with a single command +- You usually need to write custom functions so that the output of the branches includes necessary metadata +- Parallel computing works at the level of the workflow, not the function + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/quarto.md b/quarto.md new file mode 100644 index 00000000..1782fa2b --- /dev/null +++ b/quarto.md @@ -0,0 +1,204 @@ +--- +title: 'Reproducible Reports with Quarto' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How can we create reproducible reports? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Be able to generate a report using `targets` + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: instructor + +Episode summary: Show how to write reports with Quarto + +::::::::::::::::::::::::::::::::::::: + + + +## Copy-paste vs. dynamic documents + +Typically, you will want to communicate the results of a data analysis to a broader audience. + +You may have done this before by copying and pasting statistics, plots, and other results into a text document or presentation. +This may be fine if you only ever do the analysis once. +But that is rarely the case---it is much more likely that you will tweak parts of the analysis or add new data and re-run your pipeline. +With the copy-paste method, you'd have to remember what results changed and manually make sure everything is up-to-date. +This is a perilous exercise! + +Fortunately, `targets` provides functions for keeping a document in sync with pipeline results, so you can avoid such pitfalls. +The main tool we will use to generate documents is **Quarto**. 
+Quarto can be used separately from `targets` (and is a large topic on its own), but it also happens to be an excellent way to dynamically generate reports with `targets`. + +Quarto allows you to insert the results of R code directly into your documents so that there is no danger of copy-and-paste mistakes. +Furthermore, it can generate output from the same underlying script in multiple formats including PDF, HTML, and Microsoft Word. + +::::::::::::::::::::::::::::::::::::: {.prereq} + +## Installing Quarto + +As of v2022.07.1, [RStudio comes with Quarto](https://docs.posit.co/ide/user/ide/guide/documents/quarto-project.html), so you don't need to install it separately. If you can't run Quarto from RStudio, we recommend installing the latest version of RStudio. + +::::::::::::::::::::::::::::::::::::: + +## About Quarto files + +`.qmd` or `.Qmd` is the extension for Quarto files, and stands for "Quarto markdown". +Quarto files invert the normal way of writing code and comments: in a typical R script, all text is assumed to be R code, unless you preface it with a `#` to show that it is a comment. +In Quarto, all text is assumed to be prose, and you use special notation to indicate which lines are R code to be evaluated. +Once the code is evaluated, the results get inserted into a final, rendered document, which could be one of various formats. + +![Quarto workflow](https://ucsbcarpentry.github.io/Reproducible-Publications-with-RStudio-Quarto/fig/03-qmd-workflow.png) + +We don't have the time to go into the details of Quarto during this lesson, but recommend the ["Introduction to Reproducible Publications with RStudio" incubator (in-development) lesson](https://ucsbcarpentry.github.io/Reproducible-Publications-with-RStudio-Quarto/) for more on this topic. + +## Recommended workflow + +Dynamic documents like Quarto (or Rmarkdown, the predecessor to Quarto) can actually be used to manage data analysis pipelines. 
+But that is not recommended because it doesn't scale well and lacks the sophisticated dependency tracking offered by `targets`. + +Our suggested approach is to conduct the vast majority of data analysis (in other words, the "heavy lifting") in the `targets` pipeline, then use the Quarto document to **summarize** and **plot** the results. + +## Report on bill size in penguins + +Continuing our penguin bill size analysis, let's write a report evaluating each model. + +To save time, the report is already available at . + +Copy the [raw code from here](https://raw.githubusercontent.com/joelnitta/penguins-targets/main/penguin_report.qmd) and save it as a new file `penguin_report.qmd` in your project folder (you may also be able to right click in your browser and select "Save As"). + +Then, add one more target to the pipeline using the `tar_quarto()` function like this: + + +``` r +source("R/functions.R") +source("R/packages.R") + +tar_plan( + # Load raw data + tar_file_read( + penguins_data_raw, + path_to_file("penguins_raw.csv"), + read_csv(!!.x, show_col_types = FALSE) + ), + # Clean data + penguins_data = clean_penguin_data(penguins_data_raw), + # Build models + models = list( + combined_model = lm( + bill_depth_mm ~ bill_length_mm, data = penguins_data), + species_model = lm( + bill_depth_mm ~ bill_length_mm + species, data = penguins_data), + interaction_model = lm( + bill_depth_mm ~ bill_length_mm * species, data = penguins_data) + ), + # Get model summaries + tar_target( + model_summaries, + glance_with_mod_name(models), + pattern = map(models) + ), + # Get model predictions + tar_target( + model_predictions, + augment_with_mod_name(models), + pattern = map(models) + ), + # Generate report + tar_quarto( + penguin_report, + path = "penguin_report.qmd", + quiet = FALSE, + packages = c("targets", "tidyverse") + ) +) +``` + + + +The function to generate the report is `tar_quarto()`, from the `tarchetypes` package. 
+ +As you can see, the "heavy" analysis of running the models is done in the workflow, then there is a single call to render the report at the end with `tar_quarto()`. + +## How does `targets` know when to render the report? + +It is not immediately apparent just from this how `targets` knows to generate the report **at the end of the workflow** (recall that build order is not determined by the order of how targets are written in the workflow, but rather by their dependencies). +`penguin_report` does not appear to depend on any of the other targets, since they do not show up in the `tar_quarto()` call. + +How does this work? + +The answer lies **inside** the `penguin_report.qmd` file. Let's look at the start of the file: + + +```` markdown +--- +title: "Simpson's Paradox in Palmer Penguins" +format: + html: + toc: true +execute: + echo: false +--- + +```{r} +#| label: load +#| message: false +targets::tar_load(penguin_models_augmented) +targets::tar_load(penguin_models_summary) + +library(tidyverse) +``` + +This is an example analysis of penguins on the Palmer Archipelago in Antarctica. + +```` + +The lines in between `---` and `---` at the very beginning are called the "YAML header", and contain directions about how to render the document. + +The R code to be executed is specified by the lines between `` ```{r} `` and `` ``` ``. This is called a "code chunk", since it is a portion of code interspersed within prose text. + +Take a closer look at the R code chunk. Notice the two calls to `targets::tar_load()`. Do you remember what that function does? It loads the targets built during the workflow. 
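
As a reminder of how loading works, here is a minimal sketch that is separate from the penguin pipeline. The target name `example_value` and the throwaway one-target pipeline are assumptions made up for illustration; `tar_dir()` runs everything in a temporary folder so your project is not touched. It contrasts `tar_load()`, which creates an object named after the target, with `tar_read()`, which returns the stored value so you can name it yourself.

```r
library(targets)

result <- tar_dir({
  # Hypothetical one-target pipeline, written to a temporary _targets.R
  tar_script(list(tar_target(example_value, 1 + 1)), ask = FALSE)

  # Build the pipeline so the target is saved in the data store
  tar_make()

  # tar_load() assigns the stored value to an object called `example_value`
  tar_load(example_value)

  # tar_read() returns the stored value instead of assigning it
  read_value <- tar_read(example_value)

  list(loaded = example_value, read = read_value)
})

# Both approaches retrieve the same stored value
identical(result$loaded, result$read)
```

Inside a Quarto report the same two functions are called in a code chunk, and those calls are what the report's dependencies are inferred from.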
+ +Now things should make a bit more sense: `targets` knows that the report depends on the targets built during the workflow, `penguin_models_augmented` and `penguin_models_summary`, **because they are loaded in the report with `tar_load()`.** + +## Generating dynamic content + +The call to `tar_load()` at the start of `penguin_report.qmd` is really the key to generating an up-to-date report---once those are loaded from the workflow, we know that they are in sync with the data, and can use them to produce "polished" text and plots. + +::::::::::::::::::::::::::::::::::::: {.challenge} + +## Challenge: Spot the dynamic contents + +Read through `penguin_report.qmd` and try to find instances where the targets built during the workflow (`penguin_models_augmented` and `penguin_models_summary`) are used to dynamically produce text and plots. + +:::::::::::::::::::::::::::::::::: {.solution} + +- In the code chunk labeled `results-stats`, statistics from the models like *P*-value and adjusted *R* squared are extracted, then inserted into the text with in-line code like `` `r mod_stats$combined$r.squared` ``. + +- There are two figures, one for the combined model and one for the separate model (code chunks labeled `fig-combined-plot` and `fig-separate-plot`, respectively). These are built using the points predicted from the model in `penguin_models_augmented`. + +:::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: + +You should also interactively run the code in `penguin_report.qmd` to better understand what is going on, starting with `tar_load()`. In fact, that is how this report was written: the code was run in an interactive session, and saved to the report as it was gradually tweaked to obtain the desired results. + +The best way to learn this approach to generating reports is to **try it yourself**. + +So your final Challenge is to construct a `targets` workflow using your own data and generate a report. Good luck! 
+ +::::::::::::::::::::::::::::::::::::: keypoints + +- `tarchetypes::tar_quarto()` is used to render Quarto documents +- You should load targets within the Quarto document using `tar_load()` and `tar_read()` +- It is recommended to do heavy computations in the main targets workflow, and lighter formatting and plot generation in the Quarto document + +:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/reference.md b/reference.md new file mode 100644 index 00000000..014470ef --- /dev/null +++ b/reference.md @@ -0,0 +1,15 @@ +--- +title: 'Reference' +--- + +## Glossary + +branch +: A set of targets that are programmatically defined in the `targets` workflow + +reproducibility +: The ability for others (including your future self) to be able to re-run an analysis and obtain the same results + +target +: An object built by the `targets` workflow + diff --git a/renv.lock b/renv.lock new file mode 100644 index 00000000..ab363187 --- /dev/null +++ b/renv.lock @@ -0,0 +1,1776 @@ +{ + "R": { + "Version": "4.4.2", + "Repositories": [ + { + "Name": "carpentries", + "URL": "https://carpentries.r-universe.dev" + }, + { + "Name": "carpentries_archive", + "URL": "https://carpentries.github.io/drat" + }, + { + "Name": "CRAN", + "URL": "https://cran.rstudio.com" + } + ] + }, + "Packages": { + "DBI": { + "Package": "DBI", + "Version": "1.2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "065ae649b05f1ff66bb0c793107508f5" + }, + "MASS": { + "Package": "MASS", + "Version": "7.3-61", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats", + "utils" + ], + "Hash": "0cafd6f0500e5deba33be22c46bf6055" + }, + "Matrix": { + "Package": "Matrix", + "Version": "1.7-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "lattice", + "methods", + "stats", + "utils" + ], + "Hash": 
"5122bb14d8736372411f955e1b16bc8a" + }, + "R6": { + "Package": "R6", + "Version": "2.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "470851b6d5d0ac559e9d01bb352b4021" + }, + "RColorBrewer": { + "Package": "RColorBrewer", + "Version": "1.1-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "45f0398006e83a5b10b72a90663d8d8c" + }, + "Rcpp": { + "Package": "Rcpp", + "Version": "1.0.13-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "utils" + ], + "Hash": "6b868847b365672d6c1677b1608da9ed" + }, + "askpass": { + "Package": "askpass", + "Version": "1.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "sys" + ], + "Hash": "c39f4155b3ceb1a9a2799d700fbd4b6a" + }, + "backports": { + "Package": "backports", + "Version": "1.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "e1e1b9d75c37401117b636b7ae50827a" + }, + "base64enc": { + "Package": "base64enc", + "Version": "0.1-3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "543776ae6848fde2f48ff3816d0628bc" + }, + "base64url": { + "Package": "base64url", + "Version": "1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "backports" + ], + "Hash": "0c54cf3a08cc0e550fbd64ad33166143" + }, + "bit": { + "Package": "bit", + "Version": "4.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "5dc7b2677d65d0e874fc4aaf0e879987" + }, + "bit64": { + "Package": "bit64", + "Version": "4.5.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit", + "methods", + "stats", + "utils" + ], + "Hash": "e84984bf5f12a18628d9a02322128dfd" + }, + "blob": { + "Package": "blob", + "Version": "1.2.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods", + "rlang", + "vctrs" + ], + 
"Hash": "40415719b5a479b87949f3aa0aee737c" + }, + "broom": { + "Package": "broom", + "Version": "1.0.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "backports", + "dplyr", + "generics", + "glue", + "lifecycle", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyr" + ], + "Hash": "8fcc818f3b9887aebaf206f141437cc9" + }, + "bslib": { + "Package": "bslib", + "Version": "0.8.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "cachem", + "fastmap", + "grDevices", + "htmltools", + "jquerylib", + "jsonlite", + "lifecycle", + "memoise", + "mime", + "rlang", + "sass" + ], + "Hash": "b299c6741ca9746fb227debcb0f9fb6c" + }, + "cachem": { + "Package": "cachem", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "fastmap", + "rlang" + ], + "Hash": "cd9a672193789068eb5a2aad65a0dedf" + }, + "callr": { + "Package": "callr", + "Version": "3.7.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "processx", + "utils" + ], + "Hash": "d7e13f49c19103ece9e58ad2d83a7354" + }, + "cellranger": { + "Package": "cellranger", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "rematch", + "tibble" + ], + "Hash": "f61dbaec772ccd2e17705c1e872e9e7c" + }, + "cli": { + "Package": "cli", + "Version": "3.6.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "b21916dd77a27642b447374a5d30ecf3" + }, + "clipr": { + "Package": "clipr", + "Version": "0.8.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "utils" + ], + "Hash": "3f038e5ac7f41d4ac41ce658c85e3042" + }, + "codetools": { + "Package": "codetools", + "Version": "0.2-20", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "61e097f35917d342622f21cdc79c256e" + }, + "colorspace": { + "Package": "colorspace", + "Version": 
"2.1-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "methods", + "stats" + ], + "Hash": "d954cb1c57e8d8b756165d7ba18aa55a" + }, + "conflicted": { + "Package": "conflicted", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "memoise", + "rlang" + ], + "Hash": "bb097fccb22d156624fd07cd2894ddb6" + }, + "cpp11": { + "Package": "cpp11", + "Version": "0.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "91570bba75d0c9d3f1040c835cee8fba" + }, + "crayon": { + "Package": "crayon", + "Version": "1.5.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "methods", + "utils" + ], + "Hash": "859d96e65ef198fd43e82b9628d593ef" + }, + "crew": { + "Package": "crew", + "Version": "0.10.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "cli", + "data.table", + "getip", + "later", + "mirai", + "nanonext", + "processx", + "promises", + "ps", + "rlang", + "stats", + "tibble", + "tidyselect", + "tools", + "utils" + ], + "Hash": "40745863e75317c534992c4796af8c58" + }, + "curl": { + "Package": "curl", + "Version": "6.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "e8ba62486230951fcd2b881c5be23f96" + }, + "data.table": { + "Package": "data.table", + "Version": "1.16.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "2e00b378fc3be69c865120d9f313039a" + }, + "dbplyr": { + "Package": "dbplyr", + "Version": "2.5.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "DBI", + "R", + "R6", + "blob", + "cli", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "methods", + "pillar", + "purrr", + "rlang", + "tibble", + "tidyr", + "tidyselect", + "utils", + "vctrs", + "withr" + ], + "Hash": "39b2e002522bfd258039ee4e889e0fd1" + 
}, + "digest": { + "Package": "digest", + "Version": "0.6.37", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "33698c4b3127fc9f506654607fb73676" + }, + "dplyr": { + "Package": "dplyr", + "Version": "1.1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "cli", + "generics", + "glue", + "lifecycle", + "magrittr", + "methods", + "pillar", + "rlang", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "fedd9d00c2944ff00a0e2696ccf048ec" + }, + "dtplyr": { + "Package": "dtplyr", + "Version": "1.3.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "data.table", + "dplyr", + "glue", + "lifecycle", + "rlang", + "tibble", + "tidyselect", + "vctrs" + ], + "Hash": "54ed3ea01b11e81a86544faaecfef8e2" + }, + "evaluate": { + "Package": "evaluate", + "Version": "1.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "3fd29944b231036ad67c3edb32e02201" + }, + "fansi": { + "Package": "fansi", + "Version": "1.0.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "utils" + ], + "Hash": "962174cf2aeb5b9eea581522286a911f" + }, + "farver": { + "Package": "farver", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "680887028577f3fa2a81e410ed0d6e42" + }, + "fastmap": { + "Package": "fastmap", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "aa5e1cd11c2d15497494c5292d7ffcc8" + }, + "fontawesome": { + "Package": "fontawesome", + "Version": "0.5.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "htmltools", + "rlang" + ], + "Hash": "bd1297f9b5b1fc1372d19e2c4cd82215" + }, + "forcats": { + "Package": "forcats", + "Version": "1.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "magrittr", + 
"rlang", + "tibble" + ], + "Hash": "1a0a9a3d5083d0d573c4214576f1e690" + }, + "fs": { + "Package": "fs", + "Version": "1.6.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "7f48af39fa27711ea5fbd183b399920d" + }, + "gargle": { + "Package": "gargle", + "Version": "1.5.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "fs", + "glue", + "httr", + "jsonlite", + "lifecycle", + "openssl", + "rappdirs", + "rlang", + "stats", + "utils", + "withr" + ], + "Hash": "fc0b272e5847c58cd5da9b20eedbd026" + }, + "generics": { + "Package": "generics", + "Version": "0.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "15e9634c0fcd294799e9b2e929ed1b86" + }, + "getip": { + "Package": "getip", + "Version": "0.1-4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "0e81e7f976441581680fa2ebc5866212" + }, + "ggplot2": { + "Package": "ggplot2", + "Version": "3.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "MASS", + "R", + "cli", + "glue", + "grDevices", + "grid", + "gtable", + "isoband", + "lifecycle", + "mgcv", + "rlang", + "scales", + "stats", + "tibble", + "vctrs", + "withr" + ], + "Hash": "44c6a2f8202d5b7e878ea274b1092426" + }, + "glue": { + "Package": "glue", + "Version": "1.8.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "methods" + ], + "Hash": "5899f1eaa825580172bb56c08266f37c" + }, + "googledrive": { + "Package": "googledrive", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "gargle", + "glue", + "httr", + "jsonlite", + "lifecycle", + "magrittr", + "pillar", + "purrr", + "rlang", + "tibble", + "utils", + "uuid", + "vctrs", + "withr" + ], + "Hash": "e99641edef03e2a5e87f0a0b1fcc97f4" + }, + "googlesheets4": { + "Package": "googlesheets4", + "Version": 
"1.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cellranger", + "cli", + "curl", + "gargle", + "glue", + "googledrive", + "httr", + "ids", + "lifecycle", + "magrittr", + "methods", + "purrr", + "rematch2", + "rlang", + "tibble", + "utils", + "vctrs", + "withr" + ], + "Hash": "d6db1667059d027da730decdc214b959" + }, + "gtable": { + "Package": "gtable", + "Version": "0.3.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "grid", + "lifecycle", + "rlang", + "stats" + ], + "Hash": "de949855009e2d4d0e52a844e30617ae" + }, + "haven": { + "Package": "haven", + "Version": "2.5.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "forcats", + "hms", + "lifecycle", + "methods", + "readr", + "rlang", + "tibble", + "tidyselect", + "vctrs" + ], + "Hash": "9171f898db9d9c4c1b2c745adc2c1ef1" + }, + "highr": { + "Package": "highr", + "Version": "0.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "xfun" + ], + "Hash": "d65ba49117ca223614f71b60d85b8ab7" + }, + "hms": { + "Package": "hms", + "Version": "1.1.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "lifecycle", + "methods", + "pkgconfig", + "rlang", + "vctrs" + ], + "Hash": "b59377caa7ed00fa41808342002138f9" + }, + "htmltools": { + "Package": "htmltools", + "Version": "0.5.8.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "base64enc", + "digest", + "fastmap", + "grDevices", + "rlang", + "utils" + ], + "Hash": "81d371a9cc60640e74e4ab6ac46dcedc" + }, + "htmlwidgets": { + "Package": "htmlwidgets", + "Version": "1.6.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grDevices", + "htmltools", + "jsonlite", + "knitr", + "rmarkdown", + "yaml" + ], + "Hash": "04291cc45198225444a397606810ac37" + }, + "httr": { + "Package": "httr", + "Version": "1.4.7", + "Source": "Repository", 
+ "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "curl", + "jsonlite", + "mime", + "openssl" + ], + "Hash": "ac107251d9d9fd72f0ca8049988f1d7f" + }, + "ids": { + "Package": "ids", + "Version": "1.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "openssl", + "uuid" + ], + "Hash": "99df65cfef20e525ed38c3d2577f7190" + }, + "igraph": { + "Package": "igraph", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "cli", + "cpp11", + "grDevices", + "graphics", + "lifecycle", + "magrittr", + "methods", + "pkgconfig", + "rlang", + "stats", + "utils", + "vctrs" + ], + "Hash": "c03878b48737a0e2da3b772d7b2e22da" + }, + "isoband": { + "Package": "isoband", + "Version": "0.2.7", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "grid", + "utils" + ], + "Hash": "0080607b4a1a7b28979aecef976d8bc2" + }, + "jquerylib": { + "Package": "jquerylib", + "Version": "0.1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "htmltools" + ], + "Hash": "5aab57a3bd297eee1c1d862735972182" + }, + "jsonlite": { + "Package": "jsonlite", + "Version": "1.8.9", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "methods" + ], + "Hash": "4e993b65c2c3ffbffce7bb3e2c6f832b" + }, + "knitr": { + "Package": "knitr", + "Version": "1.49", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "evaluate", + "highr", + "methods", + "tools", + "xfun", + "yaml" + ], + "Hash": "9fcb189926d93c636dea94fbe4f44480" + }, + "labeling": { + "Package": "labeling", + "Version": "0.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "graphics", + "stats" + ], + "Hash": "b64ec208ac5bc1852b285f665d6368b3" + }, + "later": { + "Package": "later", + "Version": "1.4.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Rcpp", + "rlang" + ], + "Hash": "501744395cac0bab0fbcfab9375ae92c" + }, + 
"lattice": { + "Package": "lattice", + "Version": "0.22-6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics", + "grid", + "stats", + "utils" + ], + "Hash": "cc5ac1ba4c238c7ca9fa6a87ca11a7e2" + }, + "lifecycle": { + "Package": "lifecycle", + "Version": "1.0.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "rlang" + ], + "Hash": "b8552d117e1b808b09a832f589b79035" + }, + "lubridate": { + "Package": "lubridate", + "Version": "1.9.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "generics", + "methods", + "timechange" + ], + "Hash": "680ad542fbcf801442c83a6ac5a2126c" + }, + "magrittr": { + "Package": "magrittr", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "7ce2733a9826b3aeb1775d56fd305472" + }, + "memoise": { + "Package": "memoise", + "Version": "2.0.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "cachem", + "rlang" + ], + "Hash": "e2817ccf4a065c5d9d7f2cfbe7c1d78c" + }, + "mgcv": { + "Package": "mgcv", + "Version": "1.9-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "Matrix", + "R", + "graphics", + "methods", + "nlme", + "splines", + "stats", + "utils" + ], + "Hash": "110ee9d83b496279960e162ac97764ce" + }, + "mime": { + "Package": "mime", + "Version": "0.12", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "tools" + ], + "Hash": "18e9c28c1d3ca1560ce30658b22ce104" + }, + "mirai": { + "Package": "mirai", + "Version": "1.3.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "nanonext" + ], + "Hash": "0746cbcb4e0a198d26b48fc64c61e710" + }, + "modelr": { + "Package": "modelr", + "Version": "0.1.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "broom", + "magrittr", + "purrr", + "rlang", + "tibble", + "tidyr", + 
"tidyselect", + "vctrs" + ], + "Hash": "4f50122dc256b1b6996a4703fecea821" + }, + "munsell": { + "Package": "munsell", + "Version": "0.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "colorspace", + "methods" + ], + "Hash": "4fd8900853b746af55b81fda99da7695" + }, + "nanonext": { + "Package": "nanonext", + "Version": "1.4.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "2f9a62823a91f75349099d95c182787c" + }, + "nlme": { + "Package": "nlme", + "Version": "3.1-166", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "graphics", + "lattice", + "stats", + "utils" + ], + "Hash": "ccbb8846be320b627e6aa2b4616a2ded" + }, + "openssl": { + "Package": "openssl", + "Version": "2.2.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "askpass" + ], + "Hash": "d413e0fef796c9401a4419485f709ca1" + }, + "palmerpenguins": { + "Package": "palmerpenguins", + "Version": "0.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "6c6861efbc13c1d543749e9c7be4a592" + }, + "pillar": { + "Package": "pillar", + "Version": "1.9.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "cli", + "fansi", + "glue", + "lifecycle", + "rlang", + "utf8", + "utils", + "vctrs" + ], + "Hash": "15da5a8412f317beeee6175fbc76f4bb" + }, + "pkgconfig": { + "Package": "pkgconfig", + "Version": "2.0.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "utils" + ], + "Hash": "01f28d4278f15c76cddbea05899c5d6f" + }, + "prettyunits": { + "Package": "prettyunits", + "Version": "1.2.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "6b01fc98b1e86c4f705ce9dcfd2f57c7" + }, + "processx": { + "Package": "processx", + "Version": "3.8.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "ps", + "utils" + ], + "Hash": 
"0c90a7d71988856bad2a2a45dd871bb9" + }, + "progress": { + "Package": "progress", + "Version": "1.2.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "crayon", + "hms", + "prettyunits" + ], + "Hash": "f4625e061cb2865f111b47ff163a5ca6" + }, + "promises": { + "Package": "promises", + "Version": "1.3.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "Rcpp", + "fastmap", + "later", + "magrittr", + "rlang", + "stats" + ], + "Hash": "c84fd4f75ea1f5434735e08b7f50fbca" + }, + "ps": { + "Package": "ps", + "Version": "1.8.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "b4404b1de13758dea1c0484ad0d48563" + }, + "purrr": { + "Package": "purrr", + "Version": "1.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "lifecycle", + "magrittr", + "rlang", + "vctrs" + ], + "Hash": "1cba04a4e9414bdefc9dcaa99649a8dc" + }, + "quarto": { + "Package": "quarto", + "Version": "1.4.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "jsonlite", + "later", + "processx", + "rlang", + "rmarkdown", + "rstudioapi", + "tools", + "utils", + "yaml" + ], + "Hash": "af456d7a181750812bd8b2bfedb3ea4e" + }, + "ragg": { + "Package": "ragg", + "Version": "1.3.3", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "systemfonts", + "textshaping" + ], + "Hash": "0595fe5e47357111f29ad19101c7d271" + }, + "rappdirs": { + "Package": "rappdirs", + "Version": "0.3.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "5e3c5dc0b071b21fa128676560dbe94d" + }, + "readr": { + "Package": "readr", + "Version": "2.1.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "cli", + "clipr", + "cpp11", + "crayon", + "hms", + "lifecycle", + "methods", + "rlang", + "tibble", + "tzdb", + "utils", + "vroom" + ], + "Hash": 
"9de96463d2117f6ac49980577939dfb3" + }, + "readxl": { + "Package": "readxl", + "Version": "1.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cellranger", + "cpp11", + "progress", + "tibble", + "utils" + ], + "Hash": "8cf9c239b96df1bbb133b74aef77ad0a" + }, + "rematch": { + "Package": "rematch", + "Version": "2.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "cbff1b666c6fa6d21202f07e2318d4f1" + }, + "rematch2": { + "Package": "rematch2", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "tibble" + ], + "Hash": "76c9e04c712a05848ae7a23d2f170a40" + }, + "renv": { + "Package": "renv", + "Version": "1.0.11", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "utils" + ], + "Hash": "47623f66b4e80b3b0587bc5d7b309888" + }, + "reprex": { + "Package": "reprex", + "Version": "2.1.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "callr", + "cli", + "clipr", + "fs", + "glue", + "knitr", + "lifecycle", + "rlang", + "rmarkdown", + "rstudioapi", + "utils", + "withr" + ], + "Hash": "97b1d5361a24d9fb588db7afe3e5bcbf" + }, + "rlang": { + "Package": "rlang", + "Version": "1.1.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "utils" + ], + "Hash": "3eec01f8b1dee337674b2e34ab1f9bc1" + }, + "rmarkdown": { + "Package": "rmarkdown", + "Version": "2.29", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bslib", + "evaluate", + "fontawesome", + "htmltools", + "jquerylib", + "jsonlite", + "knitr", + "methods", + "tinytex", + "tools", + "utils", + "xfun", + "yaml" + ], + "Hash": "df99277f63d01c34e95e3d2f06a79736" + }, + "rstudioapi": { + "Package": "rstudioapi", + "Version": "0.17.1", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "5f90cd73946d706cfe26024294236113" + }, + "rvest": { + "Package": "rvest", + "Version": "1.0.4", + "Source": "Repository", + "Repository": 
"CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "httr", + "lifecycle", + "magrittr", + "rlang", + "selectr", + "tibble", + "xml2" + ], + "Hash": "0bcf0c6f274e90ea314b812a6d19a519" + }, + "sass": { + "Package": "sass", + "Version": "0.4.9", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R6", + "fs", + "htmltools", + "rappdirs", + "rlang" + ], + "Hash": "d53dbfddf695303ea4ad66f86e99b95d" + }, + "scales": { + "Package": "scales", + "Version": "1.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "RColorBrewer", + "cli", + "farver", + "glue", + "labeling", + "lifecycle", + "munsell", + "rlang", + "viridisLite" + ], + "Hash": "c19df082ba346b0ffa6f833e92de34d1" + }, + "secretbase": { + "Package": "secretbase", + "Version": "1.0.3", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "eaf84737a6da68c1e843979963c09a6b" + }, + "selectr": { + "Package": "selectr", + "Version": "0.4-2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "methods", + "stringr" + ], + "Hash": "3838071b66e0c566d55cc26bd6e27bf4" + }, + "stringi": { + "Package": "stringi", + "Version": "1.8.4", + "Source": "Repository", + "Repository": "RSPM", + "Requirements": [ + "R", + "stats", + "tools", + "utils" + ], + "Hash": "39e1144fd75428983dc3f63aa53dfa91" + }, + "stringr": { + "Package": "stringr", + "Version": "1.5.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "magrittr", + "rlang", + "stringi", + "vctrs" + ], + "Hash": "960e2ae9e09656611e0b8214ad543207" + }, + "sys": { + "Package": "sys", + "Version": "3.4.3", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "de342ebfebdbf40477d0758d05426646" + }, + "systemfonts": { + "Package": "systemfonts", + "Version": "1.1.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cpp11", + "lifecycle" + ], + 
"Hash": "213b6b8ed5afbf934843e6c3b090d418" + }, + "tarchetypes": { + "Package": "tarchetypes", + "Version": "0.11.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "dplyr", + "fs", + "parallel", + "rlang", + "secretbase", + "targets", + "tibble", + "tidyselect", + "utils", + "vctrs", + "withr" + ], + "Hash": "cf140014f9d00f97f4bd22d961e20471" + }, + "targets": { + "Package": "targets", + "Version": "1.9.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "R6", + "base64url", + "callr", + "cli", + "codetools", + "data.table", + "igraph", + "knitr", + "ps", + "rlang", + "secretbase", + "stats", + "tibble", + "tidyselect", + "tools", + "utils", + "vctrs", + "yaml" + ], + "Hash": "788a40e60a695237faca8af4015fd703" + }, + "textshaping": { + "Package": "textshaping", + "Version": "0.4.0", + "Source": "Repository", + "RemoteType": "repository", + "RemoteUrl": "https://github.com/r-lib/textshaping", + "RemoteRef": "v0.4.0", + "RemoteSha": "76682df21dce8ef29e905a90dd05732a58b1249f", + "Requirements": [ + "R", + "cpp11", + "lifecycle", + "systemfonts" + ], + "Hash": "038ff7c8cbd6d1ab9c328b2659900e6b" + }, + "tibble": { + "Package": "tibble", + "Version": "3.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "fansi", + "lifecycle", + "magrittr", + "methods", + "pillar", + "pkgconfig", + "rlang", + "utils", + "vctrs" + ], + "Hash": "a84e2cc86d07289b3b6f5069df7a004c" + }, + "tidyr": { + "Package": "tidyr", + "Version": "1.3.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "cpp11", + "dplyr", + "glue", + "lifecycle", + "magrittr", + "purrr", + "rlang", + "stringr", + "tibble", + "tidyselect", + "utils", + "vctrs" + ], + "Hash": "915fb7ce036c22a6a33b5a8adb712eb1" + }, + "tidyselect": { + "Package": "tidyselect", + "Version": "1.2.1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + 
"lifecycle", + "rlang", + "vctrs", + "withr" + ], + "Hash": "829f27b9c4919c16b593794a6344d6c0" + }, + "tidyverse": { + "Package": "tidyverse", + "Version": "2.0.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "broom", + "cli", + "conflicted", + "dbplyr", + "dplyr", + "dtplyr", + "forcats", + "ggplot2", + "googledrive", + "googlesheets4", + "haven", + "hms", + "httr", + "jsonlite", + "lubridate", + "magrittr", + "modelr", + "pillar", + "purrr", + "ragg", + "readr", + "readxl", + "reprex", + "rlang", + "rstudioapi", + "rvest", + "stringr", + "tibble", + "tidyr", + "xml2" + ], + "Hash": "c328568cd14ea89a83bd4ca7f54ae07e" + }, + "timechange": { + "Package": "timechange", + "Version": "0.3.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cpp11" + ], + "Hash": "c5f3c201b931cd6474d17d8700ccb1c8" + }, + "tinytex": { + "Package": "tinytex", + "Version": "0.54", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "xfun" + ], + "Hash": "3ec7e3ddcacc2d34a9046941222bf94d" + }, + "tzdb": { + "Package": "tzdb", + "Version": "0.4.0", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cpp11" + ], + "Hash": "f561504ec2897f4d46f0c7657e488ae1" + }, + "utf8": { + "Package": "utf8", + "Version": "1.2.4", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "62b65c52671e6665f803ff02954446e9" + }, + "uuid": { + "Package": "uuid", + "Version": "1.2-1", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R" + ], + "Hash": "34e965e62a41fcafb1ca60e9b142085b" + }, + "vctrs": { + "Package": "vctrs", + "Version": "0.6.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "glue", + "lifecycle", + "rlang" + ], + "Hash": "c03fa420630029418f7e6da3667aac4a" + }, + "viridisLite": { + "Package": "viridisLite", + "Version": "0.4.2", + "Source": "Repository", + "Repository": "CRAN", + 
"Requirements": [ + "R" + ], + "Hash": "c826c7c4241b6fc89ff55aaea3fa7491" + }, + "visNetwork": { + "Package": "visNetwork", + "Version": "2.1.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "htmltools", + "htmlwidgets", + "jsonlite", + "magrittr", + "methods", + "stats", + "utils" + ], + "Hash": "3e48b097e8d9a91ecced2ed4817a678d" + }, + "vroom": { + "Package": "vroom", + "Version": "1.6.5", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "bit64", + "cli", + "cpp11", + "crayon", + "glue", + "hms", + "lifecycle", + "methods", + "progress", + "rlang", + "stats", + "tibble", + "tidyselect", + "tzdb", + "vctrs", + "withr" + ], + "Hash": "390f9315bc0025be03012054103d227c" + }, + "withr": { + "Package": "withr", + "Version": "3.0.2", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "graphics" + ], + "Hash": "cc2d62c76458d425210d1eb1478b30b4" + }, + "xfun": { + "Package": "xfun", + "Version": "0.49", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "grDevices", + "stats", + "tools" + ], + "Hash": "8687398773806cfff9401a2feca96298" + }, + "xml2": { + "Package": "xml2", + "Version": "1.3.6", + "Source": "Repository", + "Repository": "CRAN", + "Requirements": [ + "R", + "cli", + "methods", + "rlang" + ], + "Hash": "1d0336142f4cd25d8d23cd3ba7a8fb61" + }, + "yaml": { + "Package": "yaml", + "Version": "2.3.10", + "Source": "Repository", + "Repository": "CRAN", + "Hash": "51dab85c6c98e50a18d7551e9d49f76c" + } + } +} diff --git a/setup.md b/setup.md new file mode 100644 index 00000000..3ed2b206 --- /dev/null +++ b/setup.md @@ -0,0 +1,32 @@ +--- +title: Setup +--- + +## Local setup + +Follow these instructions to install the required software on your computer. + +- [Download and install the latest version of R](https://www.r-project.org/). 
+- [Download and install RStudio](https://www.rstudio.com/products/rstudio/download/#download). RStudio is an application (an integrated development environment, or IDE) that facilitates the use of R and offers several additional features, including the [Quarto](https://quarto.org/) publishing system. You will need the free Desktop version for your computer. +- Install the necessary R packages with the following command: + +```r +install.packages( + c( + "conflicted", + "crew", + "palmerpenguins", + "quarto", + "tarchetypes", + "targets", + "tidyverse", + "visNetwork" + ) +) +``` + +## Alternative: In the cloud + +A [Posit Cloud](https://posit.cloud/) instance with RStudio and all necessary packages pre-installed is available, so you don't need to install anything on your own computer. You may need to create a free account. + +Click this link to open: