From 9e9f9c41ea833d6e32a44823abf8eee3bc0fe9d4 Mon Sep 17 00:00:00 2001 From: Tim Booth Date: Mon, 22 Jul 2024 18:04:54 +0100 Subject: [PATCH] Justify use of -F flag Add some ideas on how to introduce Snakemake Fix some formatting --- episodes/01-introduction.md | 7 +++-- episodes/09-performance.md | 4 +-- index.md | 7 +++-- instructors/instructor-notes.md | 48 +++++++++++++++++++++++++++------ 4 files changed, 52 insertions(+), 14 deletions(-) diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index bec7a26..40de088 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -27,7 +27,7 @@ For now we'll just look at one single file, `ref1_1.fq`. In the terminal: ```bash -$ cd yeast +$ cd snakemake_data/yeast $ ls reads $ head -n8 reads/ref1_1.fq @@ -121,6 +121,9 @@ indents, etc. we may see an error. $ snakemake -j1 -F -p ref1_1.fq.count ``` +For these early examples, we'll always run Snakemake with the `-j1`, `-F` and `-p` options. Later +we'll look more deeply at these and other available command-line options to Snakemake. + ::::::::::::::::::::::::::::::::::::::: challenge ## Running Snakemake @@ -143,7 +146,7 @@ What does the `-p` option in the `snakemake` command above do? This is such a useful thing we don't know why it isn't the default! The `-j1` option is what tells Snakemake to only run one process at a time, and we'll stick with this for now as it -makes things simpler. The `-F` option tells Snakemake to always overwrite output files, and +makes things simpler. The `-F` option tells Snakemake to always recreate output files, and we'll learn about protected outputs much later in the course. Answer 4 is a total red-herring, as Snakemake never prompts interactively for user input. diff --git a/episodes/09-performance.md b/episodes/09-performance.md index 5e9f3e4..59f022d 100644 --- a/episodes/09-performance.md +++ b/episodes/09-performance.md @@ -225,11 +225,11 @@ A this point in the course there may be a cluster demo... 
:::::::::::::::::::::::::::::::::::::::::::::::::: -{% comment %} +[comment]: # ( Photo credit: Cskiran Sourced from Wikimedia Commons CC-BY-SA-4.0 -{% endcomment %} +) *For reference, [this is a Snakefile](files/ep09.Snakefile) incorporating the changes made in this episode.* diff --git a/index.md b/index.md index 48a0cc3..2cbc56e 100644 --- a/index.md +++ b/index.md @@ -26,7 +26,10 @@ In the planning phase of writing this course material we outlined some [learner :::::::::::::::::::::::::::::::::::::::::: prereq -## Prerequisites +## Learner Prerequisites + +See the [prerequisites](prereqs.html) page for a full list of skills and concepts we assume that +learners will know prior to taking this lesson. In brief: This is an intermediate lesson and assumes learners have some prior experience in bioinformatics: @@ -35,7 +38,7 @@ This is an intermediate lesson and assumes learners have some prior experience i - Knowing about bioinformatics fundamentals like the [FASTQ file format ](https://en.wikipedia.org/wiki/FASTQ_format) and [read mapping ](https://en.wikipedia.org/wiki/Read_\(biology\)#NGS_and_read_mapping), - in order to understand the example workflow. + in order to understand the example workflows. No previous knowledge of Snakemake or workflow systems, or Python programming, is assumed. diff --git a/instructors/instructor-notes.md b/instructors/instructor-notes.md index 8b9107b..44f4c61 100644 --- a/instructors/instructor-notes.md +++ b/instructors/instructor-notes.md @@ -10,15 +10,17 @@ Prior to beginning the first lesson you want to say something about Snakemake. A saying how you yourself came across Snakemake and how you use it in your own work is probably the best approach. -Otherwise, the info on [https://snakemake.readthedocs.io](https://snakemake.readthedocs.io) should have everything you need, and the -"rolling paper" has a nice graphic showing the history of the Snakemake project. 
+Otherwise, the info on [https://snakemake.readthedocs.io](https://snakemake.readthedocs.io) should +have everything you need, and the [rolling paper](https://f1000research.com/articles/10-33/v2) +has a nice graphic (fig. 2) showing the history of the Snakemake project. As of July 2024 this +paper has over 1000 citations. ## When to use a workflow system? Learners may ask when it is appropriate to use a system like Snakemake. -The paper [Workflow systems turn raw data into scientific knowledge](https://pubmed.ncbi.nlm.nih.gov/31477884/) -has a view on this: +The paper [Workflow systems turn raw data into scientific knowledge]( +https://pubmed.ncbi.nlm.nih.gov/31477884/) has a view on this: :::::::::::::::::::::::::::::::::::::: discussion @@ -35,17 +37,22 @@ defined in one or two rules. Once you understand the fundamentals you are likely Snakemake for even these simple tasks. Having said this, not every data analysis task is suited to Snakemake, or in some cases you may -only want to use Snakemake for part of a task, and do the rest with regular scripting. +only want to use Snakemake for part of a task, and do the rest with, say, regular scripting. ## Which is the best workflow system to use? -Snakemake 🐍 +Snakemake! 🐍 But, in seriousness, other workflow systems are available. Some are better suited to different tasks, and some users have a preference for one over another. For a large task, it is worth investigating multiple options before committing to an approach. [This GIT repository and associated paper](https://github.com/GoekeLab/bioinformatics-workflows) -comparing eight workflow systems is a good place to start. +comparing eight workflow systems is a good place to start. And in fact the previously mentioned +[Snakemake rolling paper](https://f1000research.com/articles/10-33/v2) compares Snakemake to +several other workflow systems. 
+ +You should also look through existing workflows on resources like [WorkflowHub]( +https://workflowhub.eu), as someone may have already solved all or part of your problem. ## About the sample data files @@ -58,14 +65,39 @@ really add anything to the course. It's possible that a learner will accidentally delete or overwite the input files. In this case, note that a copy is available to download - see the link on [the setup page](../learners/setup.md). +## Choice of bioinformatics software + +Like the toy dataset, the tools in this course are chosen to illustrate the workings of Snakemake. +The choice of older and simpler tools like *fastx toolkit* is deliberate, and reduces the burden of +maintenance of this course material as tools are updated. + +In practice, learners may ask to go into more depth on the choice, configuration, and functionality +of the bioinformatics software. If you have the time and are confident talking about this then do +so, but if not then it is valid to reiterate that the focus of the course is on the orchestration +of analysis steps with Snakemake, not the choice of what software is best for any given analysis. + # Notes on specific episodes +## Episode 01 - Running commands with Snakemake + +In the first few episodes we always run Snakemake with the `-F` flag, and it's not explained what +this does until Ep. 04. The rationale is that the default Snakemake behaviour when pruning the DAG +leads to learners seeing different output (typically the message "nothing to be done") when +repeating the exact same command. This can seem strange to learners who are used to scripting and +imperative programming. + +The internal rules used by Snakemake to determine which jobs in the DAG are to be run, and which +skipped, are pretty complex, but the behaviour seen under `-F` is much simpler and more consistent; +Snakemake simply runs every job in the DAG every time. 
You can think of `-F` as disabling the lazy +evaluation feature of Snakemake, until we are ready to properly introduce and understand it. + ## Episode 03 - Chaining rules There is a figure to illustrate the way Snakemake finds rules by wildcard matching and then tracks back until it runs out of rule matches and finds a file that it already has. You may find that showing an animated version of this is helpful, in which case -[there are some slides here](https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/files/9299078/wildcard_demo.pptx). +[there are some slides here]( +https://github.com/carpentries-incubator/snakemake-novice-bioinformatics/files/9299078/wildcard_demo.pptx).