HPC Episode #18

Open
wants to merge 41 commits into main
Changes from 8 commits

Commits (41)
52b79be
future -> crew
multimeric Jun 15, 2023
a93a6f3
Fix typo
multimeric Jun 15, 2023
e3fee83
HPC draft
multimeric Jun 30, 2023
3af1a0d
Start worker section
multimeric Jun 30, 2023
f8928bc
Finished draft of heterogenous workers
multimeric Jul 7, 2023
0f8e054
Update dependencies to have crew, finish first draft of HPC
multimeric Jul 10, 2023
5255346
Update knit_exit comment
multimeric Jul 10, 2023
f246ac1
Merge branch 'main' of github.com:carpentries-incubator/targets-works…
multimeric Jul 10, 2023
9c054c4
Better wording when discussing HPC
multimeric Jul 10, 2023
46fc239
Don't assume learner expectations
multimeric Jul 10, 2023
589075b
Update episodes/hpc.Rmd
multimeric Jul 10, 2023
629c7aa
Update episodes/hpc.Rmd
multimeric Jul 10, 2023
5a67422
Update episodes/hpc.Rmd
multimeric Jul 10, 2023
180f290
Update episodes/hpc.Rmd
multimeric Jul 11, 2023
31353d6
Code review, and update dependencies
multimeric Jul 12, 2023
5cbd438
Merge branch 'hpc' of github.com:multimeric/targets-workshop into hpc
multimeric Jul 12, 2023
a482945
Minor improvements and rewording
multimeric Jul 12, 2023
540385e
Install crew.cluster
multimeric Jul 20, 2023
704759d
Elaborate on `sacct`
multimeric Jul 20, 2023
414a742
Tweaks to the HPC episode
multimeric Jul 24, 2023
4092364
Use crew_controller_group, and add memory example
multimeric Jul 25, 2023
06b2cb9
Update episodes/hpc.Rmd
multimeric Jul 25, 2023
141e774
Update episodes/hpc.Rmd
multimeric Jul 25, 2023
4622cbb
Merge branch 'main' of github.com:carpentries-incubator/targets-works…
multimeric Jul 9, 2024
9bf64e5
Merge branch 'hpc' of github.com:multimeric/targets-workshop into hpc
multimeric Jul 9, 2024
a9c04a5
Remove some superfluous files
multimeric Jul 9, 2024
3dd8238
Remove unused figures
multimeric Jul 9, 2024
6cc5c25
I guess these files should be tracked?
multimeric Jul 9, 2024
4723369
Add missing activate file
multimeric Jul 10, 2024
d872dfb
Add missing .gitignore
multimeric Jul 10, 2024
24aa767
Fix execution typo
multimeric Jul 10, 2024
5e40be1
Start slurm for build
multimeric Jul 10, 2024
4eb5f06
Force dep on {htmlwidgets}
joelnitta Jul 10, 2024
9489bea
Update renv and packages
joelnitta Jul 10, 2024
e012aaf
Update renv.lock
joelnitta Jul 10, 2024
7548c49
Reduce memory requirement
multimeric Jul 10, 2024
66c3e5b
Merge branch 'hpc' of github.com:multimeric/targets-workshop into hpc
multimeric Jul 10, 2024
2c9a19a
Re-run CI
multimeric Dec 13, 2024
905a3b7
Merge branch 'main' of github.com:carpentries-incubator/targets-works…
multimeric Dec 13, 2024
0ca5b00
Revert .github and renv changes
multimeric Dec 13, 2024
54e1b53
Merge branch 'main' of github.com:carpentries-incubator/targets-works…
multimeric Dec 13, 2024
1 change: 1 addition & 0 deletions config.yaml
@@ -69,6 +69,7 @@ episodes:
- branch.Rmd
- parallel.Rmd
- quarto.Rmd
- hpc.Rmd

# Information for Learners
learners:
295 changes: 295 additions & 0 deletions episodes/hpc.Rmd
@@ -0,0 +1,295 @@
---
title: 'Deploying Targets on HPC'
teaching: 10
exercises: 2
---

```{R, echo=FALSE}
# Exit sensibly when Slurm isn't installed
if (!nzchar(Sys.which("sbatch"))) {
  knitr::knit_exit("sbatch was not detected. Likely Slurm is not installed. Exiting.")
}
```

:::::::::::::::::::::::::::::::::::::: questions

- Why would we use HPC to run Targets workflows?
- How can we run Targets workflows on Slurm?

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: objectives

- Be able to run a `targets` workflow on a High Performance Computing (HPC) cluster with Slurm

::::::::::::::::::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::: instructor

Episode summary: Show how to deploy a targets workflow on an HPC cluster using Slurm and `crew.cluster`

::::::::::::::::::::::::::::::::::::::::::::::::

```{r}
#| label: setup
#| echo: FALSE
#| message: FALSE
#| warning: FALSE
library(targets)
library(tarchetypes)
library(quarto) # don't actually need to load, but put here so renv catches it
source("https://raw.githubusercontent.com/joelnitta/targets-workshop/main/episodes/files/functions.R?token=$(date%20+%s)") # nolint

# Increase width for printing tibbles
options(width = 140)
```

## Advantages of HPC

If your analysis involves computationally intensive or long-running tasks, such as training machine learning models or processing very large amounts of data, it can quickly become infeasible to run it on a single machine.
If you are part of an organisation with access to a High Performance Computing (HPC) cluster, you can use `targets` to spread the work across the cluster's many machines and scale up your analysis.
This differs from the parallel execution we have learned about so far, which spawns extra R processes on the *same machine* to speed up execution.

## Configuring Targets for Slurm

Fortunately, using HPC is as simple as changing the Targets `controller`.
In this section we will assume that our HPC uses Slurm as its job scheduler, but you can easily use other schedulers such as PBS/TORQUE, Sun Grid Engine (SGE) or LSF.

In the Parallel Processing section, we used the following configuration:
```{R}
library(crew)
tar_option_set(
  controller = crew_controller_local(workers = 2)
)
```
To configure this for Slurm, we just swap out the controller with a new one from the `crew.cluster` package:

```{R}
library(crew.cluster)
tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    script_lines = "module load R"
  )
)
```

There are a number of options you can pass to `crew_controller_slurm()` to fine-tune the Slurm execution, [which you can find here](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html).
Here we are only using two:

* `workers` sets the number of jobs that are submitted to Slurm to process targets.
* `script_lines` adds some lines to the Slurm submit script used by Targets. This is useful for loading Environment Modules and adding `#SBATCH` options, as the short sketch after this list shows.
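
For example, `script_lines` accepts a character vector, so you can combine extra `#SBATCH` flags with module loading. This is only a sketch: the partition name and the exact module name are placeholders for whatever your cluster actually provides.

```{R, eval=FALSE}
library(crew.cluster)
crew_controller_slurm(
  workers = 3,
  script_lines = c(
    "#SBATCH --partition=compute",  # extra scheduler options for the worker jobs
    "module load R"                 # software environment available to each worker
  )
)
```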

Let's run the modified workflow:

```{R, eval=FALSE}
source("R/packages.R")
source("R/functions.R")

library(crew.cluster)
tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    script_lines = "module load R"
  )
)

tar_plan(
  # Load raw data
  tar_file_read(
    penguins_data_raw,
    path_to_file("penguins_raw.csv"),
    read_csv(!!.x, show_col_types = FALSE)
  ),
  # Clean data
  penguins_data = clean_penguin_data(penguins_data_raw),
  # Build models
  models = list(
    combined_model = lm(
      bill_depth_mm ~ bill_length_mm, data = penguins_data),
    species_model = lm(
      bill_depth_mm ~ bill_length_mm + species, data = penguins_data),
    interaction_model = lm(
      bill_depth_mm ~ bill_length_mm * species, data = penguins_data)
  ),
  # Get model summaries
  tar_target(
    model_summaries,
    glance_with_mod_name_slow(models),
    pattern = map(models)
  ),
  # Get model predictions
  tar_target(
    model_predictions,
    augment_with_mod_name_slow(models),
    pattern = map(models)
  )
)
```

::: challenge
## Increasing Resources

Q: How would you modify your `_targets.R` if your targets needed 200GB of RAM?

::: hint
Check the arguments for [`crew_controller_slurm`](https://wlandau.github.io/crew.cluster/reference/crew_controller_slurm.html#arguments-1).
:::
::: solution
```R
tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    script_lines = "module load R",
    # Added: request 200 GB of memory for the single CPU of each worker
    slurm_memory_gigabytes_per_cpu = 200,
    slurm_cpus_per_task = 1
  )
)
```
:::
:::

## HPC Workers

Despite what you might expect, `crew` does not submit one Slurm job for each target.
Instead, it uses persistent workers, meaning that you define a pool of workers when configuring the workflow.
In our example above we used 3 workers.
For each worker, `crew` submits a single Slurm job, and these workers will process multiple targets over their lifetime.
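
Because the workers are persistent, each worker's Slurm job holds onto its allocation until the pipeline shuts it down. If you would rather not keep idle workers around, `crew` controllers can be told to scale down automatically. A sketch, assuming your installed version of `crew.cluster` supports the `seconds_idle` argument (current versions do):

```{R, eval=FALSE}
library(crew.cluster)
tar_option_set(
  controller = crew_controller_slurm(
    workers = 3,
    script_lines = "module load R",
    seconds_idle = 60  # a worker shuts down after 60 seconds with no targets to run
  )
)
```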

We can verify that these worker jobs have been submitted using `sacct`:

```{bash}
sacct
```
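
The default `sacct` listing is fairly terse. If you want more detail about what the worker jobs did, you can ask for specific fields; this is plain Slurm usage rather than anything specific to `crew`, and the field list here is just one reasonable choice:

```bash
# Job ID, job name (widened to 30 characters), state, wall time and peak memory
sacct --format=JobID,JobName%30,State,Elapsed,MaxRSS
```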

The upside of this approach is that we don't have to work out the minutiae of how long each target takes to build, or what resources it needs.
It also means that we don't submit a lot of jobs, making our Slurm usage more efficient and easier to monitor.

The downside of this mechanism is that **the resources of the worker have to be sufficient to build each of your targets**.

::: challenge
## Choosing a Worker

Q: Say we have two targets. One uses 100 GB of RAM and 1 CPU, and the other needs 10 GB of RAM and 8 CPUs to run a multi-threaded function. What worker configuration do we use?

::: solution
If we have a single worker configuration, it needs the maximum of each resource across all targets: 100 GB of RAM and 8 CPUs.
We might use a controller a bit like this:
```{R, results="hide"}
crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "
#SBATCH --cpus-per-task=8
module load R",
  # ~13 GB per CPU across the 8 CPUs gives roughly the 100 GB we need
  slurm_memory_gigabytes_per_cpu = 13
)
```
:::
:::

## Heterogeneous Workers

In some cases we may prefer heterogeneous workers, for example when some of our targets need a GPU while others only need a CPU.
To do this, we first define each worker configuration, giving it a `name` argument in `crew_controller_slurm()`.
Note that this time we aren't passing the controller straight into `tar_option_set()`:

```{R, results="hide"}
library(crew.cluster)
crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "module load R",
  slurm_memory_gigabytes_per_cpu = 200,
  slurm_cpus_per_task = 1
)
```
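
For `targets` to be able to send work to a controller by its name, the named controllers still need to be registered with the pipeline: save each controller to a variable, combine them with `crew::crew_controller_group()`, and pass the group to `tar_option_set()`. A minimal sketch, repeating the controller above so it can be stored in a variable (any further controllers, such as a GPU worker, would be added to the same group):

```{R, eval=FALSE}
library(crew)
library(crew.cluster)

cpu_controller <- crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "module load R",
  slurm_memory_gigabytes_per_cpu = 200,
  slurm_cpus_per_task = 1
)

# Register the named controller(s) with the pipeline as a single group
tar_option_set(
  controller = crew_controller_group(cpu_controller)
)
```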

Then we specify this controller by name in each target definition:

```{R, results="hide"}
tar_target(
  name = cpu_task,
  command = run_model2(data),
  resources = tar_resources(
    crew = tar_resources_crew(controller = "cpu_worker")
  )
)
```

::: challenge
## Mixing GPU and CPU targets

Q: Say we have the following targets workflow. How would we modify it so that `gpu_task` is only run in a GPU Slurm job?
```{R, eval=FALSE}
graphics_devices <- function() {
  system2("lshw", c("-class", "display"), stdout = TRUE, stderr = FALSE)
}

tar_plan(
  tar_target(
    cpu_hardware,
    graphics_devices()
  ),
  tar_target(
    gpu_hardware,
    graphics_devices()
  )
)
```

::: hint
You will need to define two different crew controllers.
:::
::: solution
```R
graphics_devices <- function() {
  system2("lshw", c("-class", "display"), stdout = TRUE, stderr = FALSE)
}

library(crew)
library(crew.cluster)

cpu_controller <- crew_controller_slurm(
  name = "cpu_worker",
  workers = 3,
  script_lines = "module load R",
  slurm_memory_gigabytes_per_cpu = 200,
  slurm_cpus_per_task = 1
)
gpu_controller <- crew_controller_slurm(
  name = "gpu_worker",
  workers = 3,
  script_lines = "#SBATCH --gres=gpu:1
module load R",
  slurm_memory_gigabytes_per_cpu = 200,
  slurm_cpus_per_task = 1
)

# Register both controllers so that targets can dispatch to them by name
tar_option_set(
  controller = crew_controller_group(cpu_controller, gpu_controller)
)

tar_plan(
  tar_target(
    cpu_hardware,
    graphics_devices(),
    resources = tar_resources(
      crew = tar_resources_crew(controller = "cpu_worker")
    )
  ),
  tar_target(
    gpu_hardware,
    graphics_devices(),
    resources = tar_resources(
      crew = tar_resources_crew(controller = "gpu_worker")
    )
  )
)
```
:::
:::

::::::::::::::::::::::::::::::::::::: keypoints

- `crew.cluster::crew_controller_slurm()` is used to configure a workflow to use Slurm
- Crew uses persistent workers on HPC, and you need to choose your resources accordingly
- You can create heterogeneous workers by making multiple named calls to `crew_controller_slurm()` and combining them with `crew_controller_group()`

::::::::::::::::::::::::::::::::::::::::::::::::
2 changes: 1 addition & 1 deletion learners/setup.md
@@ -31,4 +31,4 @@ install.packages(

There is a [Posit Cloud](https://posit.cloud/) instance with RStudio and all necessary packages pre-installed available, so you don't need to install anything on your own computer. You may need to create an account (free).

Click this link to open: <https://posit.cloud/content/6064275>
Click this link to open: <https://posit.cloud/content/6064275>