From fb732513182c36c676f1e082498df71a1e4fc2ae Mon Sep 17 00:00:00 2001 From: Geraldine Van der Auwera Date: Tue, 7 Jan 2025 23:35:23 -0500 Subject: [PATCH] Completed update of Hello World --- docs/hello_nextflow/01_hello_world.md | 291 +++++++++++++++----------- 1 file changed, 165 insertions(+), 126 deletions(-) diff --git a/docs/hello_nextflow/01_hello_world.md b/docs/hello_nextflow/01_hello_world.md index 9bca9060..2c1b9f48 100644 --- a/docs/hello_nextflow/01_hello_world.md +++ b/docs/hello_nextflow/01_hello_world.md @@ -36,7 +36,7 @@ cat output.txt !!! tip - In the Gitpod environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the `code` command to open the file for viewing. + In the training environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the `code` command to open the file for viewing. ```bash code output.txt @@ -48,7 +48,7 @@ You now know how to run a simple command in the terminal that outputs some text, ### What's next? -Discover what that would look like written as a Nextflow workflow. +Find out what that would look like written as a Nextflow workflow. --- @@ -56,7 +56,7 @@ Discover what that would look like written as a Nextflow workflow. As mentioned in the orientation, we provide you with a fully functional if minimalist workflow script named `hello-world.nf` that does the same thing as before (write out 'Hello World!') but with Nextflow. -To get you started, we'll first open up the workflow script so you can get a sense of how it's structured +To get you started, we'll first open up the workflow script so you can get a sense of how it's structured. ### 1.1. Examine the overall code structure @@ -146,7 +146,7 @@ The workflow definition starts with the keyword `workflow`, followed by an optio Here we have a **workflow** that consists of one call to the `sayHello` process. 
-```groovy title="hello-world.nf" linenums="16" +```groovy title="hello-world.nf" linenums="17" workflow { // emit a greeting @@ -156,7 +156,7 @@ workflow { This a very minimal **workflow** definition. In a real-world pipeline, the workflow typically contains multiple calls to **processes** connected by **channels**. -You'll learn how to add more processes and connect them by channels in a little bit. +You'll learn how to add more processes and connect them by channels in Part 3 of this course. ### Takeaway @@ -174,7 +174,7 @@ Looking at code is not nearly as fun as running it, so let's try this out in pra ### 2.1. Launch the workflow and monitor execution -In the terminal, run the following command. +In the terminal, run the following command: ```bash nextflow run hello-world.nf @@ -193,215 +193,254 @@ executor > local (1) Congratulations, you just ran your first Nextflow workflow! -The most important output here is the last line (line 6), which reports that the `sayHello` process was successfully executed once. +The most important output here is the last line (line 6), which reports that the `sayHello` process was successfully executed once (`1 of 1 ✔`). -Okay, that's great, but where do we find the output? -The `sayHello` process definition said that the output would be sent to a file, but where is it? +```console title="Output" +[1c/7d08e6] sayHello [100%] 1 of 1 ✔ +``` + +Importantly, this line also tells you where to find the output of the `sayHello` process call. +Let's look at that now. ### 2.2. Find the output and logs in the `work` directory -When you run Nextflow for the first time in a given directory, it creates a directory called `work` where it will write all files (and symlinks) generated in the course of execution. -Have a look inside; you'll find a subdirectory named with a hash (in order to make it unique; we'll discuss why in a bit), nested two levels deep and containing a handful of log files. 
+When you run Nextflow for the first time in a given directory, it creates a directory called `work` where it will write all files (and any symlinks) generated in the course of execution. + +Within the `work` directory, Nextflow organizes outputs and logs per process call. +For each process call, Nextflow creates a nested subdirectory, named with a hash in order to make it unique, where it will stage all necessary inputs (using symlinks by default), write helper files, and write out logs and any outputs of the process. + +The path to that subdirectory is shown in truncated form in square brackets in the console output. +Looking at what we got for the run shown above, the console log line for the sayHello process starts with `[1c/7d08e6]`. That corresponds to the following directory path: `work/`**`1c/7d08e6`**`85a7aa7060b9c21667924824` + +Let's take a look at what's in there. !!! tip - If you browse the contents of the task subdirectory in the Gitpod's VSCode file explorer, you'll see all these files right away. - However, these files are set to be invisible in the terminal, so if you want to use `ls` or `tree` to view them, you'll need to set the relevant option for displaying invisible files. + If you browse the contents of the task subdirectory in the VSCode file explorer, you'll see all the files right away. + However, the log files are set to be invisible in the terminal, so if you want to use `ls` or `tree` to view them, you'll need to set the relevant option for displaying invisible files. ```bash tree -a work ``` - You should see something like this, though the exact subdirectory names will be different on your system. 
- - ```console title="Directory contents" - work - └── 1c - └── 7d08e685a7aa7060b9c21667924824 - ├── .command.begin - ├── .command.err - ├── .command.log - ├── .command.out - ├── .command.run - ├── .command.sh - └── .exitcode - ``` - -You may have noticed that the subdirectory names appeared (in truncated form) in the output from the workflow run, in the line that says: - -```console title="Output" -[1c/7d08e6] sayHello [100%] 1 of 1 ✔ +You should see something like this, though the exact subdirectory names will be different on your system: + +```console title="Directory contents" +work +└── 1c + └── 7d08e685a7aa7060b9c21667924824 + ├── .command.begin + ├── .command.err + ├── .command.log + ├── .command.out + ├── .command.run + ├── .command.sh + ├── .exitcode + └── output.txt ``` -This tells you what is the subdirectory path for that specific process call (sometimes called task). - -!!! note - - Nextflow creates a separate unique subdirectory for each process call. - It stages the relevant input files, script, and other helper files there, and writes any output files and logs there as well. 
+These are the helper and log files: -If we look inside the subdirectory, we find the following log files: - -- **`.command.begin`**: Metadata related to the beginning of the execution of the process task -- **`.command.err`**: Error messages (stderr) emitted by the process task -- **`.command.log`**: Complete log output emitted by the process task -- **`.command.out`**: Regular output (stdout) by the process task -- **`.command.sh`**: The command that was run by the process task call +- **`.command.begin`**: Metadata related to the beginning of the execution of the process call +- **`.command.err`**: Error messages (`stderr`) emitted by the process call +- **`.command.log`**: Complete log output emitted by the process call +- **`.command.out`**: Regular output (`stdout`) emitted by the process call +- **`.command.run`**: Full script run by Nextflow to execute the process call +- **`.command.sh`**: The command that was run by the process call - **`.exitcode`**: The exit code resulting from the command -[TODO] UPDATE DESCRIPTION TO LOOK AT ACTUAL OUTPUT FILE INSTEAD OF STDOUT +The `.command.sh` file is especially useful because it tells you what command Nextflow actually executed. +In this case it's very straightforward, but later in the course you'll see commands that involve some interpolation of variables. +When you're dealing with that, you need to be able to check exactly what was run, especially when troubleshooting an issue. -In this case, you can look for your output in the `output.txt` file. +The actual output of the `sayHello` process is `output.txt`. Open it and you will find the `Hello World!` greeting, which was the expected result of our minimalist workflow. -It's also worth having a look at the `.command.sh` file, which tells you what command Nextflow actually executed. In this case it's very straightforward, but later in the course you'll see commands that involve some interpolation of variables. 
When you're dealing with that, you need to be able to check exactly what was run, especially when troubleshooting an issue. - ### Takeaway -You know how to decipher a simple Nextflow script, run it and find the output and logs in the work directory. +You know how to decipher a simple Nextflow script, run it and find the output and relevant log files in the work directory. ### What's next? -Learn how to manage your workflow executions conveniently. +Learn to 'publish' outputs to a location outside the work directory for convenience. --- -## 3. Manage workflow executions +## 3. Publish outputs -[TODO] A FEW USEFUL TIPS +As you have just learned, the output produced by our pipeline is buried in a working directory several layers deep. +This is done on purpose; Nextflow is in control of this directory and we are not supposed to interact with it. -### 3.1. Re-launch a workflow with `-resume` +But that makes it inconvenient to retrieve outputs that we care about. -Nextflow has an option called `-resume` that allows you to re-run a pipeline you've already launched previously. -When launched with `-resume` any processes that have already been run with the exact same code, settings and inputs will be skipped. -Using this mode means Nextflow will only run processes that are either new, have been modified or are being provided new settings or inputs. +Fortunately, Nextflow provides a way to manage this more conveniently, called the `publishDir` directive, which acts at the process level. +This directive tells Nextflow to copy the output(s) of the process to a designated output directory. +It allows us to retrieve the desired output file without having to drill down into the work directory. -There are two key advantages to doing this: +### 3.1. 
Add a `publishDir` directive to the `sayHello` process -- If you're in the middle of developing your pipeline, you can iterate more rapidly since you only effectively have to run the process(es) you're actively working on in order to test your changes. -- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute. +In the workflow script file `hello-world.nf`, make the following code modification: -To use it, simply add `-resume` to your command: +_Before:_ -```bash -nextflow run hello-world.nf -resume +```groovy title="hello-world.nf" linenums="6" +process sayHello { + + output: + path 'output.txt' ``` -The console output should look similar. +_After:_ -```console title="Output" - N E X T F L O W ~ version 24.10.0 +```groovy title="hello-world.nf" linenums="6" +process sayHello { - ┃ Launching `hello-world.nf` [thirsty_gautier] DSL2 - revision: 6654bc1327 + publishDir 'results', mode: 'copy' -[10/15498d] sayHello [100%] 1 of 1, cached: 1 ✔ + output: + path 'output.txt' ``` -Notice the additional `cached:` bit in the process status line, which means that Nextflow has recognized that it has already done this work and simply re-used the result from the last run. +### 3.2. Run the workflow again -!!! note +Now run the modified workflow script: + +```bash +nextflow run hello-world.nf +``` + +The log output should look very familiar: - When your re-run a pipeline with `resume`, Nextflow does not overwrite any files written to a publishDir directory by any process call that was previously run successfully. +```console title="Output" + N E X T F L O W ~ version 24.10.0 -### 3.2. Delete older work directories + ┃ Launching `hello-world.nf` [mighty_lovelace] DSL2 - revision: 6654bc1327 -[TODO] SHOW HOW TO USE `nextflow clean -before -n` (dry run) THEN `nextflow clean -before ` (do it). 
Using dry run first because deletion is permanent. +executor > local (1) +[10/15498d] sayHello [100%] 1 of 1 ✔ +``` -[TODO] Using `-before` bc it feels like the most likely to be commonly useful. Link to other options here: https://www.nextflow.io/docs/latest/reference/cli.html#clean +This time, Nextflow has created a new directory called `results/`. +Our `output.txt` file is in this directory. +If you check its contents, they should match the output in the work subdirectory. +This is how we conveniently make results files available outside of the working directories. -[TODO] Note on how deleting work directories breaks ability to resume from those directories and deletes outputs so save your outputs! +It is also possible to set the `publishDir` directive to make a symbolic link to the file instead of actually copying it. +This is preferable when you're dealing with very large files. +However, if you delete the work directory as part of a cleanup operation, you will lose access to the file, so always make sure you have actual copies of everything you care about before deleting anything. + +!!! note + + A newer syntax option has been proposed to make it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs). + This will eventually make using `publishDir` at the process level redundant for completed pipelines. + However, we expect that `publishDir` will remain very useful during pipeline development. ### Takeaway -You know how to to relaunch a pipeline without repeating steps that were already run in an identical way, and how to use the `nextflow clean` command to clean up old work directories. +You know how to use the `publishDir` directive to copy files outside of the Nextflow working directory. ### What's next? -Learn to 'publish' outputs to a location outside the work directory for convenience. +Learn how to manage your workflow executions conveniently. --- -## 4. 
Publish outputs +## 4. Manage workflow executions -You'll have noticed that the output is buried in a working directory several layers deep. -Nextflow is in control of this directory and we are not supposed to interact with it. +Knowing how to launch workflows and retrieve outputs is great, but you'll quickly find there are a few other aspects of workflow management that will make your life easier, especially if you're developing your own workflows. -Let's look at how to use the `publishDir` directive for managing this more conveniently. +Here we show you how to use the `resume` feature for when you need to re-launch the same workflow, and how to delete older work directories. -### 4.1. Add a `publishDir` directive to the process +### 4.1. Re-launch a workflow with `-resume` -To make the output file more accessible, we can utilize the `publishDir` directive. -This directive tells Nextflow to automatically copy the output file to a designated output directory. -It allows us to get easy access to the desired output file without having to drill down into the work directory. +Sometimes, you're going to want to re-run a pipeline that you've already launched previously without redoing any steps that already completed successfully. -_Before:_ +Nextflow has an option called `-resume` that allows you to do this. +Specifically, in this mode, any processes that have already been run with the exact same code, settings and inputs will be skipped. +This means Nextflow will only run processes that you've added or modified since the last run, or to which you're providing new settings or inputs. -```groovy title="hello-world.nf" linenums="6" -process sayHello { +There are two key advantages to doing this: - output: - path 'output.txt' +- If you're in the middle of developing your pipeline, you can iterate more rapidly since you only have to run the process(es) you're actively working on in order to test your changes. 
+- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute. + +To use it, simply add `-resume` to your command and run it: + +```bash +nextflow run hello-world.nf -resume ``` -_After:_ +The console output should look similar. -```groovy title="hello-world.nf" linenums="6" -process sayHello { +```console title="Output" + N E X T F L O W ~ version 24.10.0 - publishDir 'results', mode: 'copy' + ┃ Launching `hello-world.nf` [thirsty_gautier] DSL2 - revision: 6654bc1327 - output: - path 'output.txt' +[1c/7d08e6] sayHello [100%] 1 of 1, cached: 1 ✔ ``` -### 4.2. Run the workflow again +Look for the `cached:` bit that has been added in the process status line, which means that Nextflow has recognized that it has already done this work and simply re-used the result from the previous successful run. + +You can also see that the work subdirectory hash is the same as in the previous run. +Nextflow is literally pointing you to the previous execution and saying "I already did that over there." + +!!! note + + When you re-run a pipeline with `-resume`, Nextflow does not overwrite any files written to a `publishDir` directory by any process call that was previously run successfully. + +### 4.2. Delete older work directories + +During the development process, you'll typically run your draft pipelines a large number of times, which can lead to an accumulation of very many files across many subdirectories. +Since the subdirectories are named with hashes, it is difficult to tell from their names which belong to older vs. more recent runs. + +Nextflow includes a convenient `clean` command that can automatically delete the work subdirectories for past runs that you no longer care about, with several [options](https://www.nextflow.io/docs/latest/reference/cli.html#clean) to control what will be deleted. 
+ +Here we show you an example that deletes all subdirectories from runs before a given run, specified using its run name. +The run name is the machine-generated two-part string shown in square brackets in the `Launching (...)` console output line. + +First we use the dry run flag `-n` to check what will be deleted given the command: ```bash -nextflow run hello-world.nf +nextflow clean -before thirsty_gautier -n ``` -The log output should start looking very familiar: +The output should look like this: ```console title="Output" - N E X T F L O W ~ version 24.10.0 +Would remove /workspace/gitpod/hello-nextflow/work/4f/a45bb02e84760ca0fe6b15b3dcd1ed +Would remove /workspace/gitpod/hello-nextflow/work/25/8584e6dea14dab9d8b138d8761c363 +``` - ┃ Launching `hello-world.nf` [mighty_lovelace] DSL2 - revision: 6654bc1327 +If you don't see any lines output, you either did not provide a valid run name or there are no past runs to delete. -executor > local (1) -[10/15498d] sayHello [100%] 1 of 1 ✔ +If the output looks as expected and you want to proceed with the deletion, re-run the command with the `-f` flag instead of `-n`: + +```bash +nextflow clean -before thirsty_gautier -f ``` -This time, Nextflow will have created a new directory called `results/`. -Our `output.txt` file is in this directory. -If you check the contents it should match the output in our work/task directory. -This is how we move results files outside of the working directories. +You should now see the following: -It is also possible to set the `publishDir` directive to make a symbolic link to the file instead of actually copying it. -This is useful when you're dealing with very large files. -However, if you delete the work directory as part of a cleanup operation, you will lost access to the file, so always make sure you have actual copies of everything you care about before deleting anything. 
+```console title="Output" +Removed /workspace/gitpod/hello-nextflow/work/4f/a45bb02e84760ca0fe6b15b3dcd1ed +Removed /workspace/gitpod/hello-nextflow/work/25/8584e6dea14dab9d8b138d8761c363 +``` -!!! note +!!! warning - A newer syntax option had been proposed to make it possible to declare and publish workflow-level outputs, documented [here](https://www.nextflow.io/docs/latest/workflow.html#publishing-outputs). - This will eventually make using `publishDir` at the process level redundant for completed pipelines. - However, we expect that `publishDir` will still remain very useful during pipeline development. + Deleting work subdirectories from past runs removes them from Nextflow's cache and deletes any outputs that were stored in those directories. + That means it breaks Nextflow's ability to resume execution without re-running the corresponding processes. + + You are responsible for saving any outputs that you care about or plan to rely on! If you're using the `publishDir` directive for that purpose, make sure to use the `copy` mode, not the `symlink` mode. ### Takeaway -You know how to use the `publishDir` directive to move files outside of the Nextflow working directory. +You know how to relaunch a pipeline without repeating steps that were already run in an identical way, and how to use the `nextflow clean` command to clean up old work directories. More generally, you know how to interpret a simple Nextflow workflow, manage its execution, and retrieve outputs. ### What's next? -[TODO] - -More generally, you've learned how to use the essential components of Nextflow and you have a basic grasp of the logic of how to build a workflow and retrieve the desired outputs. - -### What's next? - -Take a break! - -[TODO] - -When you're ready, move on to Part X to learn about [TODO]. +Take a short break! +When you're ready, move on to Part 2 to learn how to use channels to feed inputs into your workflow.