diff --git a/.nojekyll b/.nojekyll index c972477..afb93ac 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -d08babab \ No newline at end of file +1dc31dd5 \ No newline at end of file diff --git a/index.html b/index.html index 4bbef65..16631a6 100644 --- a/index.html +++ b/index.html @@ -144,6 +144,10 @@
  • Nextflow Operators +
  • +
  • + + Outputs, Scatter, and Gather
  • diff --git a/search.json b/search.json index 1ff2f7f..d3385dc 100644 --- a/search.json +++ b/search.json @@ -4,7 +4,7 @@ "href": "sessions/2_nf_dev_intro.html", "title": "Developing bioinformatics workflows with Nextflow", "section": "", - "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to nextflow channels, processes, data types and workflows\n29th May 2024\n\n\nDeveloping Modularised Workflows\nIntroduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions\n5th Jun 2024\n\n\nWorking with nf-core Templates\nIntroduction to developing nextflow workflow with nf-core templates\n12th Jun 2024\n\n\nWorking with Nextflow Built-in Functions\nIntroduction to nextflow operators, metadata propagation, grouping, and splitting\n19th Jun 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." + "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. 
This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to nextflow channels, processes, data types and workflows\n29th May 2024\n\n\nDeveloping Modularised Workflows\nIntroduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions\n5th Jun 2024\n\n\nWorking with nf-core Templates\nIntroduction to developing nextflow workflow with nf-core templates\n12th Jun 2024\n\n\nWorking with Nextflow Built-in Functions operators [metadata] output-scatter-gather\nIntroduction to nextflow operators, metadata propagation, scatter, and gather\n19th Jun 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." }, { "objectID": "index.html", @@ -69,6 +69,216 @@ "section": "6.1.4 flatten ", "text": "6.1.4 flatten \nThe flatten operator flattens each item from a source channel and emits the elements separately. 
Deeply nested inputs are also flattened.\nChannel.of( [1, [2, 3]], 4, [5, [6]] )\n .flatten()\n .view()\nOutput:\n1\n2\n3\n4\n5\n6\n\nWithin the script block of the QUANTIFICATION process in the RNA-seq pipeline, we are assuming the reads are paired, and specify -1 ${reads[0]} -2 ${reads[1]} as inputs to salmon quant.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(salmon_index), path(reads)\n\n output:\n tuple val(sample_id) path(\"$sample_id\")\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nNow that the input reads can be either single or paired, the QUANTIFICATION process needs to be modified to allow for either input type. This can be done using the flatten() operator, and conditional script statements. Additionally, the size() method can be used to calculate the size of a list.\nThe script block can be changed to the following:\n script:\n def input_reads = [reads]\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -r $reads -o $sample_id\n \"\"\"\n else \n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\nFirst, a new variable input_reads is defined, which consists of the reads input being converted into a list. This has to be done since Nextflow will automatically convert a list of length 1 into a path within process. If the size() method was used on a path type input, it will return the size of the file in bytes, and not the list size. Therefore, all inputs must first be converted into a list in order to correctly caculate the number of files.\ndef input_reads = [reads]\nFor reads that are already in a list (ie. paired reads), this will nest the input into another list, for example:\n[ [ file1, file2 ] ]\nIf the size() operator is used on this input, it will always return 1 since the encompassing list only contains one element. Therefore, the flatten() operator has to be used to emit the files as separate elements.\nThe final definition to obtain the number of files in reads becomes:\ninput_reads.flatten().size()\nFor single reads, the input to salmon quant becomes -r $reads\n\nExercise\nCurrently the TRIMGALORE process only accounts for paired reads.\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nModify the process such that both single and paired reads can be used. 
For single reads, the following script block can be used:\n\"\"\"\ntrim_galore \\\\\n --gzip \\\\\n $reads\n\"\"\"\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n def input_reads = [reads]\n\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n trim_galore \\\\\n --gzip \\\\\n $reads\n \"\"\"\n else\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n\n}\n\n\n\nExtension\nModify the FASTQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the fastqc logs.\nModify the MULTIQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the generated html file.\nFinally, run the entire workflow, specifying an --outdir. The workflow block should look like this:\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n\n quant_inputs_ch = index_ch.combine(reads_ch, by: 0)\n quant_ch = QT(quant_inputs_ch)\n\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n\n fastqc_ch = FASTQC_one(reads_ch)\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe output block of both processes have been modified to be tuples containing a grouping key.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"fastqc_${sample_id}_logs\")\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(quantification)\n tuple val(sample_id), path(fastqc)\n\n output:\n tuple val(sample_id), path(\"*.html\")\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, Nextflow Patterns materials from Nextflow, nf-core nf-core tools documentation and nf-validation" }, + { + "objectID": "workshops/8.1_scatter_gather_output.html", + "href": "workshops/8.1_scatter_gather_output.html", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "", + "text": "Objectives\n\n\n\n\nGain an understanding of how to structure nextflow published outputs\nGain an understanding of how to do scatter & gather processes" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#environment-setup", + "href": "workshops/8.1_scatter_gather_output.html#environment-setup", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "Environment Setup", + "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\nThe training data can be cloned from:\ngit clone https://github.com/nextflow-io/training.git" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", + "href": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "RNA-seq Workflow and Module Files ", + "text": "RNA-seq Workflow and Module Files \nPreviously, we created three Nextflow files and one config file:\n├── nextflow.config\n├── rnaseq.nf\n├── modules.nf\n└── modules\n └── trimgalore.nf\n\nrnaseq.nf: main workflow script where parameters are defined and processes were called.\n\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nmodules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n 
\"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\n\nmodules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\n\nnextflow.config: config file that enables singularity\n\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nRun the pipeline, specifying --outdir:\n>>> nextflow run rnaseq.nf --outdir output\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d\nexecutor > local (16)\n[93/d37ef0] process > INDEX [100%] 1 of 1 ✔\n[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔\n[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#organise-outputs", + "href": "workshops/8.1_scatter_gather_output.html#organise-outputs", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1. Organise outputs", + "text": "8.1. Organise outputs\nThe output declaration block defines the channels used by the process to send out the results produced. However, this output only stays in the work/ directory if there is no publishDir directive specified.\nGiven each task is being executed in separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files in a results folder.\nTo store our workflow result files, you need to explicitly mark them using the directive publishDir in the process that’s creating the files. 
For example:\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nThe above example will copy all html files created by the MULTIQC process into the directory path specified in the params.outdir" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", + "href": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1.1. Store outputs matching a glob pattern", + "text": "8.1.1. Store outputs matching a glob pattern\nYou can use more than one publishDir to keep different outputs in separate directories. For each directive specify a different glob pattern using the pattern option to store into each directory only the files that match the provided pattern.\nFor example:\nreads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')\n\nprocess FOO {\n publishDir \"results/bam\", pattern: \"*.bam\"\n publishDir \"results/bai\", pattern: \"*.bai\"\n\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path(\"*.bam\")\n tuple val(sample_id), path(\"*.bai\")\n\n script:\n \"\"\"\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bam\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bai\n \"\"\"\n}\nExercise\nUse publishDir and pattern to keep the outputs from the trimgalore.nf into separate directories.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n publishDir \"$params.outdir/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"$params.outdir/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nOutput should now look like\n>>> tree ./output\n./output\n├── gut.html\n├── liver.html\n├── lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n2 directories, 15 files" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", + "href": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1.2. 
Store outputs renaming files or in a sub-directory", + "text": "8.1.2. Store outputs renaming files or in a sub-directory\nThe publishDir directive also allow the use of saveAs option to give each file a name of your choice, providing a custom rule as a closure.\nprocess foo {\n publishDir 'results', saveAs: { filename -> \"foo_$filename\" }\n\n output: \n path '*.txt'\n\n '''\n touch this.txt\n touch that.txt\n '''\n}\nThe same pattern can be used to store specific files in separate directories depending on the actual name.\nprocess foo {\n publishDir 'results', saveAs: { filename -> filename.endsWith(\".zip\") ? \"zips/$filename\" : filename }\n\n output: \n path '*'\n\n '''\n touch this.txt\n touch that.zip\n '''\n}\nExercise\nModify the MULTIQC output with saveAs such that resulting folder is as follow:\n./output\n├── MultiQC\n│   ├── multiqc_gut.html\n│   ├── multiqc_liver.html\n│   └── multiqc_lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n3 directories, 15 files\n\n\n\n\n\n\nWarning\n\n\n\nYou need to remove existing output folder/files if you want to have a clean output. By default, nextflow will overwrite existing files, and keep all the remaining files in the same specified output directory.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? \"MultiQC/multiqc_$filename\" : filename }\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\n\n\nChallenge\nModify all the processes in rnaseq.nf such that we will have the following output structure\n./output\n├── gut\n│   ├── QC\n│   │   ├── fastqc_gut_logs\n│   │   │   ├── gut_1_fastqc.html\n│   │   │   ├── gut_1_fastqc.zip\n│   │   │   ├── gut_2_fastqc.html\n│   │   │   └── gut_2_fastqc.zip\n│   │   └── gut.html\n│   ├── report\n│   │   ├── gut_1.fq_trimming_report.txt\n│   │   └── gut_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── gut_1_val_1.fq.gz\n│   └── gut_2_val_2.fq.gz\n├── liver\n│   ├── QC\n│   │   ├── fastqc_liver_logs\n│   │   │   ├── liver_1_fastqc.html\n│   │   │   ├── liver_1_fastqc.zip\n│   │   │   ├── liver_2_fastqc.html\n│   │   │   └── liver_2_fastqc.zip\n│   │   └── liver.html\n│   ├── report\n│   │   ├── liver_1.fq_trimming_report.txt\n│   │   └── liver_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── liver_1_val_1.fq.gz\n│   └── liver_2_val_2.fq.gz\n└── lung\n ├── QC\n │   ├── fastqc_lung_logs\n │   │   ├── lung_1_fastqc.html\n │   │   ├── lung_1_fastqc.zip\n │   │   ├── lung_2_fastqc.html\n │   │   └── lung_2_fastqc.zip\n │   └── lung.html\n ├── report\n │   ├── lung_1.fq_trimming_report.txt\n │   └── lung_2.fq_trimming_report.txt\n └── trimmed_fastq\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n15 directories, 27 files\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess FASTQC {\n publishDir \"$params.outdir/$sample_id/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n //publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? \"MultiQC/multiqc_$filename\" : filename }\n publishDir \"$params.outdir/$quantification/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img'\n publishDir \"${params.outdir}/${sample_id}/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"${params.outdir}/${sample_id}/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#scatter", + "href": "workshops/8.1_scatter_gather_output.html#scatter", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.2 Scatter", + "text": "8.2 Scatter\nThe scatter operation involves distributing large input data into smaller chunks that can be analysed across multiple processes in parallel.\nOne very simple example of native scatter is how nextflow handles Channel factories with the Channel.fromPath or Channel.fromFilePairs method, where multiple input data is processed in parallel.\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { FASTQC as FASTQC_one } from './modules.nf'\n\nworkflow {\n fastqc_ch = FASTQC_one(reads_ch)\n}\nFrom the above snippet from our rnaseq.nf, we will get three execution of FASTQC_one for each pairs of our input data.\nOther than natively splitting execution by input data, Nextflow also provides operators to scatter existing input data for various benefits, such as faster processing. 
For example:\n\nsplitText\nsplitFasta\nsplitFastq\nmap with from or fromList\nflatten" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", + "href": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.2.1 Process per file chunk", + "text": "8.2.1 Process per file chunk\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess count_line {\n debug true\n input: \n file x\n\n script:\n \"\"\"\n wc -l $x \n \"\"\"\n}\n\nworkflow {\n Channel.fromPath(params.infile) \\\n | splitText(by: params.size, file: true) \\\n | count_line\n}\nExercise\nparams.infile = \"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.size = 1000\n\nworkflow {\n Channel.fromFilePairs(params.infile, flat: true) \\\n | splitFastq(by: params.size, pe: true, file: true) \\\n | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-range", + "href": "workshops/8.1_scatter_gather_output.html#process-per-file-range", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.2.1 Process per file range", + "text": "8.2.1 Process per file range\nExercise\nChannel.from(1..22) \\\n | map { chr -> [\"sample${chr}\", file(\"${chr}.indels.vcf\"), file(\"${chr}.vcf\")] } \\\n | view()\n>> nextflow run test_scatter.nf\n\n[sample1, /scratch/users/${users}/1.indels.vcf, /scratch/users/${users}/1.vcf]\n[sample2, /scratch/users/${users}/2.indels.vcf, /scratch/users/${users}/2.vcf]\n[sample3, /scratch/users/${users}/3.indels.vcf, /scratch/users/${users}/3.vcf]\n[sample4, /scratch/users/${users}/4.indels.vcf, /scratch/users/${users}/4.vcf]\n[sample5, /scratch/users/${users}/5.indels.vcf, /scratch/users/${users}/5.vcf]\n[sample6, /scratch/users/${users}/6.indels.vcf, /scratch/users/${users}/6.vcf]\n[sample7, /scratch/users/${users}/7.indels.vcf, /scratch/users/${users}/7.vcf]\n[sample8, /scratch/users/${users}/8.indels.vcf, /scratch/users/${users}/8.vcf]\n[sample9, /scratch/users/${users}/9.indels.vcf, /scratch/users/${users}/9.vcf]\n[sample10, /scratch/users${users}/10.indels.vcf, /scratch/users${users}/10.vcf]\n[sample11, /scratch/users${users}/11.indels.vcf, /scratch/users${users}/11.vcf]\n[sample12, /scratch/users${users}/12.indels.vcf, /scratch/users${users}/12.vcf]\n[sample13, /scratch/users${users}/13.indels.vcf, /scratch/users${users}/13.vcf]\n[sample14, /scratch/users${users}/14.indels.vcf, /scratch/users${users}/14.vcf]\n[sample15, /scratch/users${users}/15.indels.vcf, /scratch/users${users}/15.vcf]\n[sample16, /scratch/users${users}/16.indels.vcf, /scratch/users${users}/16.vcf]\n[sample17, /scratch/users${users}/17.indels.vcf, /scratch/users${users}/17.vcf]\n[sample18, /scratch/users${users}/18.indels.vcf, /scratch/users${users}/18.vcf]\n[sample19, /scratch/users${users}/19.indels.vcf, /scratch/users${users}/19.vcf]\n[sample20, /scratch/users${users}/20.indels.vcf, /scratch/users${users}/20.vcf]\n[sample21, /scratch/users${users}/21.indels.vcf, /scratch/users${users}/21.vcf]\n[sample22, /scratch/users${users}/22.indels.vcf, /scratch/users${users}/22.vcf]\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > 
${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22)) | view()\n}\nChallenge\nHow do we include chr X and Y into the above split by chromosome?\n\n\n\n\n\n\nSolution\n\n\n\n\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#gather", + "href": "workshops/8.1_scatter_gather_output.html#gather", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3 Gather", + "text": "8.3 Gather\nThe gather operation consolidates results from parallel computations (can be from scatter) into a centralized process for aggregation and further processing.\nSome of the Nextflow provided operators that facilitate this gather operation, include:\n\ncollect\ncollectFile\nmap + groupTuple" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", + "href": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3.1. Process all outputs altogether", + "text": "8.3.1. Process all outputs altogether\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collect | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", + "href": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3.2. Collect outputs into a file", + "text": "8.3.2. Collect outputs into a file\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collectFile(name: 'merged.bed', newLine:true) | view()\n}\nExercise\nworkflow {\n Channel.fromPath(\"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_1.fq\", checkIfExists: true) \\\n | collectFile(name: 'combined_1.fq', newLine:true) \\\n | view\n}" + }, + { + "objectID": "workshops/2.2_troubleshooting.html", + "href": "workshops/2.2_troubleshooting.html", + "title": "Troubleshooting Nextflow run", + "section": "", + "text": "2.2.1. Nextflow log\nIt is important to keep a record of the commands you have run to generate your results. Nextflow helps with this by creating and storing metadata and logs about the run in hidden files and folders in your current directory (unless otherwise specified). This data can be used by Nextflow to generate reports. It can also be queried using the Nextflow log command:\nnextflow log\nThe log command has multiple options to facilitate the queries and is especially useful while debugging a workflow and inspecting execution metadata. 
You can view all of the possible log options with -h flag:\nnextflow log -h\nTo query a specific execution you can use the RUN NAME or a SESSION ID:\nnextflow log <run name>\nTo get more information, you can use the -f option with named fields. For example:\nnextflow log <run name> -f process,hash,duration\nThere are many other fields you can query. You can view a full list of fields with the -l option:\nnextflow log -l\n\n\n\n\n\n\nChallenge\n\n\n\nUse the log command to view with process, hash, and script fields for your tasks from your most recent Nextflow execution.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the log command to get a list of you recent executions:\nnextflow log\nTIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND \n2023-11-21 22:43:14 14m 17s jovial_angela OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:05:49 1m 36s marvelous_shannon OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:10:00 1m 35s deadly_babbage OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\nQuery the process, hash, and script using the -f option for the most recent run:\nnextflow log marvelous_shannon -f process,hash,script\n\n[... truncated ...]\n\nNFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS 7c/f936d4 \n featureCounts \\\n -B -C -g gene_biotype -t exon \\\n -p \\\n -T 2 \\\n -a chr22_with_ERCC92.gtf \\\n -s 2 \\\n -o HBR_Rep1_ERCC.featureCounts.txt \\\n HBR_Rep1_ERCC.markdup.sorted.bam\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS\":\n subread: $( echo $(featureCounts -v 2>&1) | sed -e \"s/featureCounts v//g\")\n END_VERSIONS\n\n[... truncated ... ]\n\nNFCORE_RNASEQ:RNASEQ:MULTIQC 7a/8449d7 \n multiqc \\\n -f \\\n \\\n \\\n .\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:MULTIQC\":\n multiqc: $( multiqc --version | sed -e \"s/multiqc, version //g\" )\n END_VERSIONS\n \n\n\n\n\n\n2.2.2. Execution cache and resume\nTask execution caching is an essential feature of modern workflow managers. As such, Nextflow provides an automated caching mechanism for every execution. When using the Nextflow -resume option, successfully completed tasks from previous executions are skipped and the previously cached results are used in downstream tasks.\nNextflow caching mechanism works by assigning a unique ID to each task. The task unique ID is generated as a 128-bit hash value composing the the complete file path, file size, and last modified timestamp. These ID’s are used to create a separate execution directory where the tasks are executed and the outputs are stored. Nextflow will take care of the inputs and outputs in these folders for you.\nYou can re-launch the previously executed nf-core/rnaseq workflow again, but with a -resume flag, and observe the progress. 
Notice the time it takes to complete the workflow.\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \n\n[80/ec6ff8] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF2BED (chr22_with_ERCC92.gtf) [100%] 1 of 1, cached: 1 ✔\n[1a/7bec9c] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_GENE_FILTER (chr22_with_ERCC92.fa) [100%] 1 of 1, cached: 1 ✔\nExecuting this workflow will create a my_results directory with selected results files and add some further sub-directories into the work directory\nIn the schematic above, the hexadecimal numbers, such as 80/ec6ff8, identify the unique task execution. These numbers are also the prefix of the work directories where each task is executed.\nYou can inspect the files produced by a task by looking inside the work directory and using these numbers to find the task-specific execution path:\nls work/80/ec6ff8ba69a8b5b8eede3679e9f978/\nIf you look inside the work directory of a FASTQC task, you will find the files that were staged and created when this task was executed:\n>>> ls -la work/e9/60b2e80b2835a3e1ad595d55ac5bf5/ \n\ntotal 15895\ndrwxrwxr-x 2 rlupat rlupat 4096 Nov 22 03:39 .\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 03:38 ..\n-rw-rw-r-- 1 rlupat rlupat 0 Nov 22 03:39 .command.begin\n-rw-rw-r-- 1 rlupat rlupat 9509 Nov 22 03:39 .command.err\n-rw-rw-r-- 1 rlupat rlupat 9609 Nov 22 03:39 .command.log\n-rw-rw-r-- 1 rlupat rlupat 100 Nov 22 03:39 .command.out\n-rw-rw-r-- 1 rlupat rlupat 10914 Nov 22 03:39 .command.run\n-rw-rw-r-- 1 rlupat rlupat 671 Nov 22 03:39 .command.sh\n-rw-rw-r-- 1 rlupat rlupat 231 Nov 22 03:39 .command.trace\n-rw-rw-r-- 1 rlupat rlupat 1 Nov 22 03:39 .exitcode\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2368 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 697080 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 490526 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 6735205 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2688 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 695591 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 485732 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 7088948 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 109 Nov 22 03:39 versions.yml\nThe FASTQC process runs twice, executing in a different work directories for each set of inputs. 
Therefore, in the previous example, the work directory [e9/60b2e8] represents just one of the four sets of input data that was processed.\nIt’s very likely you will execute a workflow multiple times as you find the parameters that best suit your data. You can save a lot of spaces (and time) by resuming a workflow from the last step that was completed successfully and/or unmodified.\nIn practical terms, the workflow is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files. If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.\nNotably, the -resume functionality is very sensitive. Even touching a file in the work directory can invalidate the cache.\n\n\n\n\n\n\nChallenge\n\n\n\nInvalidate the cache by touching a .fastq.gz file in a FASTQC task work directory (you can use the touch command). Execute the workflow again with the -resume option to show that the cache has been invalidated.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nExecute the workflow for the first time (if you have not already).\nUse the task ID shown for the FASTQC process and use it to find and touch a the sample1_R1.fastq.gz file:\ntouch work/ff/21abfa87cc7cdec037ce4f36807d32/HBR_Rep1_ERCC_1.fastq.gz\nExecute the workflow again with the -resume command option:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nYou should see that some task were invalid and were executed again.\nWhy did this happen?\nIn this example, the cache of two FASTQC tasks were invalid. The fastq file we touch is used by in the pipeline in multiple places. Thus, touching the symlink for this file and changing the date of last modification disrupted two task executions.\n\n\n\n\n\n2.2.3. Troubleshoot warning and error messages\nWhile our previous workflow execution completed successfully, there were a couple of warning messages that may be cause for concern:\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 20-Nov-2023 00:29:04\nDuration : 10m 15s\nCPU hours : 0.3 \nSucceeded : 72\n\n\n\n\n\n\nHandling dodgy error messages 🤬\n\n\n\nThe first warning message isn’t very descriptive (see this pull request). You might come across issues like this when running nf-core pipelines, too. Bug reports and user feedback is very important to open source software communities like nf-core. If you come across any issues, submit a GitHub issue or start a discussion in the relevant nf-core Slack channel so others are aware and it can be addressed by the pipeline’s developers.\n\n\n➤ Take a look at the MultiQC report, as directed by the second message. 
You can find the MultiQC report in the lesson2.1/ directory:\nls -la lesson2.1/multiqc/star_salmon/\ntotal 1402\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 00:29 .\ndrwxrwxr-x 3 rlupat rlupat 4096 Nov 22 00:29 ..\ndrwxrwxr-x 2 rlupat rlupat 8192 Nov 22 00:29 multiqc_data\ndrwxrwxr-x 5 rlupat rlupat 4096 Nov 22 00:29 multiqc_plots\n-rw-rw-r-- 1 rlupat rlupat 1419998 Nov 22 00:29 multiqc_report.html\n➤ Download the multiqc_report.html the file navigator panel on the left side of your VS Code window by right-clicking on it and then selecting Download. Open the file on your computer.\nTake a look a the section labelled WARNING: Fail Strand Check\nThe warning we have received is indicating that the read strandedness we specified in our samplesheet.csv and inferred strandedness identified by the RSeqQC process in the pipeline do not match. It looks like the test samplesheet have incorrectly specified strandedness as forward in the samplesheet.csv when our raw reads actually show an equal distribution of sense and antisense reads.\nFor those who are not familiar with RNAseq data, incorrectly specified strandedness may negatively impact the read quantification step (process: Salmon quant) and give us inaccurate results. So, let’s clarify how the Salmon quant process is gathering strandedness information for our input files by default and find a way to address this with the parameters provided by the nf-core/rnaseq pipeline.\n\n\n\n2.2.4. Identify the run command for a process\nTo observe exactly what command is being run for a process, we can attempt to infer this information from the module’s main.nf script in the modules/ directory. However, given all the different parameters that may be applied at the process level, this may not be very clear.\n➤ Take a look at the Salmon quant main.nf file:\nnf-core-rnaseq-3.11.1/workflow/modules/nf-core/salmon/quant/main.nf\nUnless you are familiar with developing nf-core pipelines, it can be very hard to see what is actually happening in the code, given all the different variables and conditional arguments inside this script. Above the script block we can see strandedness is being applied using a few different conditional arguments. Instead of trying to infer how the $strandedness variable is being defined and applied to the process, let’s use the hidden command files saved for this task in the work/ directory.\n\n\n\n\n\n\nHidden files in the work directory!\n\n\n\nRemember that the pipeline’s results are cached in the work directory. In addition to the cached files, each task execution directories inside the work directory contains a number of hidden files:\n\n.command.sh: The command script run for the task.\n.command.run: The command wrapped used to run the task.\n.command.out: The task’s standard output log.\n.command.err: The task’s standard error log.\n.command.log: The wrapper execution output.\n.command.begin: A file created as soon as the job is launched.\n.exitcode: A file containing the task exit code (0 if successful)\n\n\n\nWith nextflow log command that we discussed previously, there are multiple options to facilitate the queries and is especially useful while debugging a pipeline and while inspecting pipeline execution metadata.\nTo understand how Salmon quant is interpreting strandedness, we’re going to use this command to track down the hidden .command.sh scripts for each Salmon quant task that was run. 
This will allow us to find out how Salmon quant handles strandedness and if there is a way for us to override this.\n➤ Use the Nextflow log command to get the unique run name information of the previously executed pipelines:\nnextflow log <run-name>\nThat command will list out all the work subdirectories for all processes run.\nAnd we now need to find the specific hidden.command.sh for Salmon tasks. But how to find them? 🤔\n➤ Let’s add some custom bash code to query a Nextflow run with the run name from the previous lesson. First, save your run name in a bash variable. For example:\nrun_name=marvelous_shannon\n➤ And let’s save the tool of interest (salmon) in another bash variable to pull it from a run command:\ntool=salmon\n➤ Next, run the following bash command:\nnextflow log ${run_name} | while read line;\n do\n cmd=$(ls ${line}/.command.sh 2>/dev/null);\n if grep -q $tool $cmd;\n then \n echo $cmd; \n fi; \n done \nThat will list all process .command.sh scripts containing ‘salmon’. There are a few different processes that run Salmon to perform other steps in the workflow. We are looking for Salmon quant which performs the read quantification:\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/57/fba8f9a2385dac5fa31688ba1afa9b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/30/0113a58c14ca8d3099df04ebf388f3/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/ec/95d6bd12d578c3bce22b5de4ed43fe/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/49/6fedcb09e666432ae6ddf8b1e8f488/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/b4/2ca8d05b049438262745cde92955e9/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/38/875d68dae270504138bb3d72d511a7/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/72/776810a99695b1c114cbb103f4a0e6/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/1c/dc3f54cc7952bf55e6742dd4783392/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/f3/5116a5b412bde7106645671e4c6ffb/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/17/fb0c791810f42a438e812d5c894ebf/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/4c/931a9b60b2f3cf770028854b1c673b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/91/e1c99d8acb5adf295b37fd3bbc86a5/.command.sh\nCompared with the salmon quant main.nf file, we get a lot more fine scale details from the .command.sh process scripts:\n>>> cat main.nf\nsalmon quant \\\\\n --geneMap $gtf \\\\\n --threads $task.cpus \\\\\n --libType=$strandedness \\\\\n $reference \\\\\n $input_reads \\\\\n $args \\\\\n -o $prefix\n>>> cat .command.sh\nsalmon quant \\\n --geneMap chr22_with_ERCC92.gtf \\\n --threads 2 \\\n --libType=ISF \\\n -t genome.transcripts.fa \\\n -a HBR_Rep1_ERCC.Aligned.toTranscriptome.out.bam \\\n \\\n -o HBR_Rep1_ERCC\nLooking at the nf-core/rnaseq Parameter documentation and Salmon documentation, we found that we can override this default using the --salmon_quant_libtype A parameter to indicate our data is unstranded and override samplesheet.csv input.\n\n\n\n\n\n\nHow do I get rid of the strandedness check warning message?\n\n\n\nIf we want to get rid of the warning message Please check MultiQC report: 2/2 samples failed strandedness check, we’ll have to change the strandedness fields in our samplesheet.csv. Keep in mind, doing this will invalidate the pipeline’s cache and cause the pipeline to run from the beginning.\n\n\n\n\n\n2.2.5. 
Write a parameter file\nFrom the previous section we learn that Nextflow accepts either yaml or json formats for parameter files. Any of the pipeline-specific parameters can be supplied to a Nextflow pipeline in this way.\n\n\n\n\n\n\nChallenge\n\n\n\nFill in the parameters file below and save as workshop-params.yaml. This time, include the --salmon_quant_libtype A parameter.\n💡 YAML formatting tips!\n\nStrings need to be inside double quotes\nBooleans (true/false) and numbers do not require quotes\n\ninput: \"\"\noutdir: \"lesson2.2\"\nfasta: \"\"\ngtf: \"\"\nstar_index: \"\"\nsalmon_index: \"\"\nskip_markduplicates: \nsave_trimmed: \nsave_unaligned: \nsalmon_quant_libtype: \"A\" \n\n\n\n\n2.2.6. Apply the parameter file\n➤ Once your params file has been saved, run:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nThe number of pipeline-specific parameters we’ve added to our run command has been significantly reduced. The only -- parameters we’ve provided to the run command relate to how the pipeline is executed on our interative job. These resource limits won’t be applicable to others who will run the pipeline on a different infrastructure.\nAs the workflow runs a second time, you will notice 4 things:\n\nThe command is much tidier thanks to offloading some parameters to the params file\nThe -resume flag. Nextflow has lots of run options including the ability to use cached output!\nSome processes will be pulled from the cache. These processes remain unaffected by our addition of a new parameter.\n\nThis run of the pipeline will complete in a much shorter time.\n\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 21-Apr-2023 05:58:06\nDuration : 1m 51s\nCPU hours : 0.3 (82.2% cached)\nSucceeded : 11\nCached : 55\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" + }, + { + "objectID": "workshops/5.1_nf_core_template.html", + "href": "workshops/5.1_nf_core_template.html", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "", + "text": "Objectives\n\n\n\n\nDevelop a basic Nextflow workflow with nf-core templates\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes, and conditional scripts within a processs\nRead data of different types into a Nextflow workflow" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#environment-setup", + "href": "workshops/5.1_nf_core_template.html#environment-setup", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "Environment Setup", + "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. 
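For example, you could append the export line used above to your ~/.bashrc so the cache path is set automatically at login (a minimal sketch using the same path as this workshop):\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\n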
A complete list of environment variables can be found here.\nSet up a python virtual environment with nf-core/tools installed:\nmodule load python/3.11.2\npython -m venv /scratch/users/${USER}/nfcorevenv\n\nsource /scratch/users/${USER}/nfcorevenv/bin/activate\n\npip install nf-core==2.14.1" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core", + "href": "workshops/5.1_nf_core_template.html#nf-core", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5. Nf-core", + "text": "5. Nf-core\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core-tools", + "href": "workshops/5.1_nf_core_template.html#nf-core-tools", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.1 Nf-core tools", + "text": "5.1 Nf-core tools\nnf-core-tools is a python package with helper tools for the nf-core community.\nThese helper tools can be used for both building and running nf-core workflows.\nToday we will be focusing on the developer commands to build a workflow using nf-core templates and structures.\nTake a look at what is within with nf-core-tools suite\nnf-core -h\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n \n Usage: nf-core [OPTIONS] COMMAND [ARGS]... \n \n nf-core/tools provides a set of helper tools for use with nf-core Nextflow pipelines. 
\n It is designed for both end-users running pipelines and also developers creating new pipelines. \n \n╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮\n│ --version Show the version and exit. │\n│ --verbose -v Print verbose output to the console. │\n│ --hide-progress Don't show progress bars. │\n│ --log-file -l <filename> Save a verbose log to a file. │\n│ --help -h Show this message and exit. │\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Commands for users ─────────────────────────────────────────────────────────────────────────────╮\n│ list List available nf-core pipelines with local info. │\n│ launch Launch a pipeline using a web GUI or command line prompts. │\n│ create-params-file Build a parameter file for a pipeline. │\n│ download Download a pipeline, nf-core/configs and pipeline singularity images. │\n│ licences List software licences for a given workflow (DSL1 only). │\n│ tui Open Textual TUI. │\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Commands for developers ────────────────────────────────────────────────────────────────────────╮\n│ create Create a new pipeline using the nf-core template. │\n│ lint Check pipeline code against nf-core guidelines. │\n│ modules Commands to manage Nextflow DSL2 modules (tool wrappers). │\n│ subworkflows Commands to manage Nextflow DSL2 subworkflows (tool wrappers). │\n│ schema Suite of tools for developers to manage pipeline schema. │\n│ create-logo Generate a logo with the nf-core logo template. │\n│ bump-version Update nf-core pipeline version number. │\n│ sync Sync a pipeline TEMPLATE branch with the nf-core template. │\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\nToday we will be predominately focusing on most of the tools for developers." + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core-pipeline", + "href": "workshops/5.1_nf_core_template.html#nf-core-pipeline", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.2 Nf-core Pipeline", + "text": "5.2 Nf-core Pipeline\nLet’s review the structure of the nf-core/rnaseq pipeline.\nAlmost all of the structure provided here is from the nf-core templates. As we briefly covered last week in Developing Modularised Workflows, it is good practice to separate your workflow from subworkflows and modules. As this allows you to modularise your workflows and reuse modules.\nNf-core assists in enforcing this structure with the subfolders:\n\nworkflows - contains the main workflow\nsubworkflows - contains subworkflows either as written by the nf-core community or self-written\nmodules - contains modules either as written by the nf-core community or self-written\n\nIn our Introduction to Nextflow and running nf-core workflows workshop in Customising & running nf-core pipelines, we briefly touched on configuration files in the conf/ folder and nextflow.config.\nToday we will be working on files in these locations and expanding our use of the nf-core template to include:\n\nfiles in the assets folder\nnextflow_schema.json\n\n\n\n5.2.1 nf-core create\nThe create subcommand makes a new pipeline using the nf-core base template. 
With a given pipeline name, description and author, it makes a starter pipeline which follows nf-core best practices.\nAfter creating the files, the command initialises the folder as a git repository and makes an initial commit. This first “vanilla” commit which is identical to the output from the templating tool is important, as it allows us to keep your pipeline in sync with the base template in the future. See the nf-core syncing docs for more information.\nLet’s set up the nf-core template for today’s workshop:\nnf-core create\nAs we progress through the interactive prompts, we will use the following values below: \nRemember to swap out the Author name with your own!\nThe creates a pipeline called myrnaseq in the directory pmcc-myrnaseq (<prefix>-<name>) with mmyeung as the author. If selected exclude the following:\n\ngithub: removed all files required for GitHub hosting of the pipeline. Specifically, the .github folder and .gitignore file.\nci: removes the GitHub continuous integration tests from the pipeline. Specifically, the .github/workflows/ folder.\ngithub_badges: removes GitHub badges from the README.md file.\nigenomes: removes pipeline options related to iGenomes. Including the conf/igenomes.config file and all references to it.\nnf_core_configs: excludes nf_core/configs repository options, which make multiple config profiles for various institutional clusters available.\n\nTo run the pipeline creation silently (i.e. without any prompts) with the nf-core template, you can use the --plain option.\n\n\n\n\n\n\nAuthor name\n\n\n\nTypically, we would use your github username as the value here, this allows an extra layer of traceability.\n\n\n\n\n\n\n\n\nCustomised pipeline prefix\n\n\n\nRemember we are currently only making the most of the nf-core templates and not contributing back to nf-core. As such, we should not use the nf-core prefix to our pipeline.\n\n\n\n\n\n\n\n\nSkipped templates\n\n\n\nNote that the highlighted values under Skip template areas? are the sections that will be skipped. As this is a test pipeline we are skipping the set up of github CI and badges\n\n\nAs we have requested GitHub hosting, on completion of the command, you will note there are suggested github commands included in the output. Use these commands to push the commits from your computer. You can then continue to edit, commit and push normally as you build your pipeline.\n\n\n\nnf-core template\nLet’s see what has been minimally provided by nf-core create\nll pmcc-myrnaseq/\ntotal 47\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 assets\n-rw-rw-r-- 1 myeung myeung 372 Jun 11 15:00 CHANGELOG.md\n-rw-rw-r-- 1 myeung myeung 2729 Jun 11 15:00 CITATIONS.md\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 conf\ndrwxrwxr-x 3 myeung myeung 4096 Jun 11 15:00 docs\n-rw-rw-r-- 1 myeung myeung 1060 Jun 11 15:00 LICENSE\n-rw-rw-r-- 1 myeung myeung 3108 Jun 11 15:00 main.nf\ndrwxrwxr-x 3 myeung myeung 4096 Jun 11 15:00 modules\n-rw-rw-r-- 1 myeung myeung 1561 Jun 11 15:00 modules.json\n-rw-rw-r-- 1 myeung myeung 9982 Jun 11 15:00 nextflow.config\n-rw-rw-r-- 1 myeung myeung 16657 Jun 11 15:00 nextflow_schema.json\n-rw-rw-r-- 1 myeung myeung 3843 Jun 11 15:00 README.md\ndrwxrwxr-x 4 myeung myeung 4096 Jun 11 15:00 subworkflows\n-rw-rw-r-- 1 myeung myeung 165 Jun 11 15:00 tower.yml\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 workflows\nAs you take look through the files created you will see many comments through the files starting with // TODO nf-core. 
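For example, the generated conf/test.config contains hints such as the two below (illustrative only; the exact comments differ from file to file and between template versions):
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed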
These are pointers from nf-core towards areas of the pipeline that you may be intersted in changing.\nThey are also the “key word” used by the nf-core lint.\n\nAlternative setups for nf-core create\nAside from the interactive setup we have just completed for nf-core create, there are two alternative methods.\n\nProvide the option using the optional flags from nf-core create\nProvide a template.yaml via the --template-yaml option\n\n\n\n\n\n\n\nChallenge\n\n\n\nCreate a second pipeline template using the optional flags with the name “myworkflow”, provide a description, author name and set the version to “0.0.1”\nWhat options are still you still prompted for?\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nRun the following:\nnf-core create --name myworkflow --description \"my workflow test\" --author \"@mmyeung\" --version \"0.0.1\"\nNote that you are still prompted for any additional customisations such as the pipeline prefix and steps to skip\n\n\n\n\n\n\n\n\n\nAdvanced Challange\n\n\n\nCreate another pipeline template using a yaml file called mytemplate.yaml\nHint: the key values in the yaml should be name, description, author, prefix and skip\nSet the pipeline to skip ci, igenomes and nf_core_configs\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nRun the following:\nvim mytemplate.yaml\nValues in mytemplate.yaml\nname: coolpipe\ndescription: A cool pipeline\nauthor: me\nprefix: myorg\nskip:\n - ci\n - igenomes\n - nf_core_configs\nnf-core create --template-yaml mytemplate.yaml" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#test-profile", + "href": "workshops/5.1_nf_core_template.html#test-profile", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.3 Test Profile", + "text": "5.3 Test Profile\nnf-core tries to encourage software engineering concepts such as minimal test sets, this can be set up using the conf/test.config and conf/test_full.config\nFor the duration of this workshop we will be making use of the conf/test.config, to test our pipeline.\nLet’s take a look at what is currently in the conf/test.config.\ncat pmcc-myrnaseq/conf/test.config\n/*\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n Nextflow config file for running minimal tests\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n Defines input files and everything required to run a fast and simple pipeline test.\n\n Use as follows:\n nextflow run pmcc/myrnaseq -profile test,<docker/singularity> --outdir <OUTDIR>\n\n----------------------------------------------------------------------------------------\n*/\n\nparams {\n config_profile_name = 'Test profile'\n config_profile_description = 'Minimal test dataset to check pipeline function'\n\n // Limit resources so that this can run on GitHub Actions\n max_cpus = 2\n max_memory = '6.GB'\n max_time = '6.h'\n\n // Input data\n // TODO nf-core: Specify the paths to your test data on nf-core/test-datasets\n // TODO nf-core: Give any required params for the test so that command line flags are not needed\n input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'\n\n // Genome references\n genome = 'R64-1-1'\n}\nFrom this, we can see that this config uses the params scope to define:\n\nMaximal values for resources\nDirects the input parameter to a sample sheet hosted in the nf-core/testdata github\nSets the genome to “R64-1-1”\n\n\n\n\n\n\n\nHow does setting the parameter genome set all the genome 
references?\n\n\n\nThis is possible due to us using the igenomes configs from nf-core.\nYou can see in the conf/igenomes.config how nested within each genome definition are paths to various reference files.\nTo find out more about the igenomes project here\n\n\nFor the duration of this workshop we are going to use the data from nf-training that was cloned in the first workshop. We are also going to update our test.config to contain the igenomes_base parameter, as we have a local cache on the cluster.\ninput = \"/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/samplesheet.csv\"\noutdir = \"/scratch/users/${USER}/myrnaseqtest\"\n\n// genome references\ngenome = \"GRCh38\"\nigenomes_base = \"/data/janis/nextflow/references/genomes/ngi-igenomes\"\nAlso, we will need to change the value, custom_config_base to null, in nextflow.config\ncustom_config_base = null\nLet’s quickly check that our pipeline runs with the test profile.\ncd ..\nnextflow run ./pmcc-myrnaseq -profile test,singularity\n\n\n\n\n\n\nWhat’s the difference between the test.config and the test_full.config\n\n\n\nTypically the test.config contains the minimal test example, while the test_full.config contains at least one full sized example data." + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core-modules", + "href": "workshops/5.1_nf_core_template.html#nf-core-modules", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.4 Nf-core modules", + "text": "5.4 Nf-core modules\nYou can find all the nf-core modules that have been accepted and peer-tested by the community in nf-core modules.\nor with\nnf-core modules list remote\nyou can check which modules are installed localling in your pipeline by running nf-core modules list local, within the pipeline folder.\ncd pmcc-myrnaseq\n\nnf-core modules list local\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Modules installed in '.':\n┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Module Name ┃ Repository ┃ Version SHA ┃ Message ┃ Date ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ fastqc │ nf-core/modules │ 285a505 │ Fix FastQC memory allocation (#5432) │ 2024-04-05 │\n│ multiqc │ nf-core/modules │ b7ebe95 │ Update MQC container (#5006) │ 2024-02-29 │\n└─────────────┴─────────────────┴─────────────┴──────────────────────────────────────┴────────────┘\n\n\n\n\n\n\nOverall Challenge\n\n\n\nWe are going to replicate sections of the workflow from last week.\nFASTQC -> Trimgalore -> FASTQC -> MULTIQC\n\n\n\n5.3.1 Installing nf-core modules\nThe general format for installing modules is as below.\nnf-core modules install <tool>/<subcommand>\n\n\n\n\n\n\nTip\n\n\n\nNote that if you search for the modules on the nf-core modules website, you can find the install command at the top of the tool\n\n\n\n\n\n\n\n\nTip\n\n\n\nRemember to run the install commands from within the nf-core pipeline folder (in this case the pmcc-myrnaseq folder)\nIf you are not in an nf-core folder you will see the following error\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nWARNING 'repository_type' not defined in 
.nf-core.yml\n? Is this repository an nf-core pipeline or a fork of nf-core/modules? (Use arrow keys)\n » Pipeline\n nf-core/modules\n\n\n\n\n\n\n\n\nChallenge\n\n\n\nInstall the following nf-core modules\n\ntrimgalore\nsalmon quant\nfastqc\n\nWhat happens when we try to install the fastqc module?\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUnfortunately, nf-core does not allow the installation of multiple modules in one line therefore we mush provide the commands separately for each module.\nnf-core modules install trimgalore\nnf-core modules install salmon/quant\nnf-core modules install fastqc\nNote that from above, when we checked which modules have been installed locally in our pipeline, fastqc was already installed. As such, we see the following output warning us that fastqc is installed and we can either force the reinstallation or we can update the module\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\nINFO Module 'fastqc' is already installed.\nINFO To update 'fastqc' run 'nf-core modules update fastqc'. To force reinstallation use '--force'. \n\n\n\n\n\n\n\n\n\nAdvanced Challenge\n\n\n\nCan you think of a way to streamline the installation of modules?\n\n\nfollowing the installation what files changed, check with\ngit status\nOn branch master\nChanges not staged for commit:\n (use \"git add <file>...\" to update what will be committed)\n (use \"git restore <file>...\" to discard changes in working directory)\n modified: modules.json\n\nUntracked files:\n (use \"git add <file>...\" to include in what will be committed)\n modules/nf-core/salmon/\n modules/nf-core/trimgalore/\n\nno changes added to commit (use \"git add\" and/or \"git commit -a\")\nmodules.json is a running record of the modules installed and should be included in your pipeline. 
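As a rough illustration (assuming the pmcc-myrnaseq pipeline created above; the exact fields, names and SHAs will reflect your own installation), a trimmed modules.json looks something like:
{
    "name": "pmcc/myrnaseq",
    "repos": {
        "https://github.com/nf-core/modules.git": {
            "modules": {
                "nf-core": {
                    "fastqc": {
                        "branch": "master",
                        "git_sha": "285a50500f9e02578d90b3ce6382ea3c30216acd",
                        "installed_by": ["modules"]
                    }
                }
            }
        }
    }
}
Each installed module is recorded against the remote repository it came from, together with the branch and commit it was installed at.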
Note: you can find the github SHA for the exact “version” of the module installed.\nThis insulates your pipeline from when a module is deleted.\nrm -r modules/nf-core/salmon/quant\n\nnf-core modules list local\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Reinstalling modules found in 'modules.json' but missing from directory: 'modules/nf-core/salmon/quant'\nINFO Modules installed in '.':\n┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Module Name ┃ Repository ┃ Version SHA ┃ Message ┃ Date ┃\n┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ fastqc │ nf-core/modules │ 285a505 │ Fix FastQC memory allocation (#5432) │ 2024-04-05 │\n│ multiqc │ nf-core/modules │ b7ebe95 │ Update MQC container (#5006) │ 2024-02-29 │\n│ salmon/quant │ nf-core/modules │ cb6b2b9 │ fix stubs salmon (#5517) │ 2024-04-24 │\n│ trimgalore │ nf-core/modules │ a984184 │ run nf-core lint on trimgalore (#5129) │ 2024-03-15 │\n└──────────────┴─────────────────┴─────────────┴────────────────────────────────────────┴────────────┘\n\n\n\n\n\n\nAdvanced Challenge\n\n\n\nHow would you look up previous versions of the module?\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nThere are a few ways to approach this.\n\nYou could hop onto github and search throught the git history for the main.nf of the particular module, to identify the git SHA and provide it to the --sha flag.\nYou could run the install command with the --prompt flag, as seen below\n\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Module 'fastqc' is already installed.\n? Module fastqc is already installed.\nDo you want to force the reinstallation? Yes\n? Select 'fastqc' commit: (Use arrow keys)\n Fix FastQC memory allocation (#5432) 285a50500f9e02578d90b3ce6382ea3c30216acd (installed version)\n Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\n CHORES: update fasqc tests with new data organisation (#4760) c9488585ce7bd35ccd2a30faa2371454c8112fb9\n fix fastqc tests n snap (#4669) 617777a807a1770f73deb38c80004bac06807eef\n Update version strings (#4556) 65ad3e0b9a4099592e1102e92e10455dc661cf53\n Remove pytest-workflow tests for modules covered by nf-test (#4521) 3e8b0c1144ccf60b7848efbdc2be285ff20b49ee\n Add conda environment names (#4327) 3f5420aa22e00bd030a2556dfdffc9e164ec0ec5\n Fix conda declaration (#4252) 8fc1d24c710ebe1d5de0f2447ec9439fd3d9d66a\n Move conda environment to yml (#4079) 516189e968feb4ebdd9921806988b4c12b4ac2dc\n authors => maintainers (#4173) cfd937a668919d948f6fcbf4218e79de50c2f36f\n » older commits\n\n\n\n\n\n5.3.2 Updating nf-core modules\nAbove we got and error message for fastq because the module was already installed. 
As listed in the output, one of the suggested solutions is that we might be looking to update the module\nnf-core modules update fastqc\nAfter running the command you will find that you are prompted for whether you wish to view the differences between the current installation and the update.\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n? Do you want to view diffs of the proposed changes? (Use arrow keys)\n » No previews, just update everything\n Preview diff in terminal, choose whether to update files\n Just write diffs to a patch file\nFor the sake of this exercise, we are going to roll fastqc back by one commit.\nIf you select the 2nd option Preview diff in terminal, choose whether to update files\nnf-core modules update fastqc -p\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n? Do you want to view diffs of the proposed changes? Preview diff in terminal, choose whether to update files\n? Select 'fastqc' commit: (Use arrow keys)\n Fix FastQC memory allocation (#5432) 285a50500f9e02578d90b3ce6382ea3c30216acd (installed version)\n » Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\n CHORES: update fasqc tests with new data organisation (#4760) c9488585ce7bd35ccd2a30faa2371454c8112fb9\n fix fastqc tests n snap (#4669) 617777a807a1770f73deb38c80004bac06807eef\n Update version strings (#4556) 65ad3e0b9a4099592e1102e92e10455dc661cf53\n Remove pytest-workflow tests for modules covered by nf-test (#4521) 3e8b0c1144ccf60b7848efbdc2be285ff20b49ee\n Add conda environment names (#4327) 3f5420aa22e00bd030a2556dfdffc9e164ec0ec5\n Fix conda declaration (#4252) 8fc1d24c710ebe1d5de0f2447ec9439fd3d9d66a\n Move conda environment to yml (#4079) 516189e968feb4ebdd9921806988b4c12b4ac2dc\n authors => maintainers (#4173) cfd937a668919d948f6fcbf4218e79de50c2f36f\n older commits\n? Select 'fastqc' commit: Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\nINFO Changes in module 'nf-core/fastqc' between (285a50500f9e02578d90b3ce6382ea3c30216acd) and (f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c)\nINFO Changes in 'fastqc/main.nf':\n --- modules/nf-core/fastqc/main.nf\n +++ modules/nf-core/fastqc/main.nf\n @@ -25,11 +25,6 @@\n def old_new_pairs = reads instanceof Path || reads.size() == 1 ? [[ reads, \"${prefix}.${reads.extension}\" ]] : reads.withIndex().collect { entry, index -> [ entry, \"${prefix}_${index + 1}.${entry.extension}\" ] }\n def rename_to = old_new_pairs*.join(' ').join(' ')\n def renamed_files = old_new_pairs.collect{ old_name, new_name -> new_name }.join(' ')\n -\n - def memory_in_mb = MemoryUnit.of(\"${task.memory}\").toUnit('MB')\n - // FastQC memory value allowed range (100 - 10000)\n - def fastqc_memory = memory_in_mb > 10000 ? 10000 : (memory_in_mb < 100 ? 
100 : memory_in_mb)\n -\n \"\"\"\n printf \"%s %s\\\\n\" $rename_to | while read old_name new_name; do\n [ -f \"\\${new_name}\" ] || ln -s \\$old_name \\$new_name\n @@ -38,7 +33,6 @@\n fastqc \\\\\n $args \\\\\n --threads $task.cpus \\\\\n - --memory $fastqc_memory \\\\\n $renamed_files\n\n cat <<-END_VERSIONS > versions.yml\nINFO 'modules/nf-core/fastqc/meta.yml' is unchanged\nINFO 'modules/nf-core/fastqc/environment.yml' is unchanged\nINFO 'modules/nf-core/fastqc/tests/main.nf.test.snap' is unchanged\nINFO 'modules/nf-core/fastqc/tests/tags.yml' is unchanged\nINFO 'modules/nf-core/fastqc/tests/main.nf.test' is unchanged\n? Update module 'fastqc'? No\nINFO Updates complete ✨ \n\n\n5.3.3 Removing nf-core modules\nAs mentioned above, if you decide that you don’t need a module anymore, you can’t just remove the folder with rm -r.\nFor nf-core to no longer register the module is to be distributed with your pipeline you need to use:\nnf-core modules remove\nAs an exercise, we are going to install the samtools/sort module\nnf-core modules install samtools/sort\nQuickly view the modules.json or use nf-core modules list local to view the changes from installing the module.\nNow remove the samtools/sort module\nnf-core modules remove samtools/sort\n\n\n\n\n\n\nOverall Challenge\n\n\n\nNow add the include module statements to the our workflows/myrnaseq.nf\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\ninclude { FASTQC as FASTQC_one } from '../modules/nf-core/fastq/main' \ninclude { FASTQC as FASTQC_two } from '../modules/nf-core/fastq/main' \n\ninclude { TRIMGALORE } from '../modules/nf-core/trimgalore/main'\n\n\n\n\n\n5.3.4 Writing modules with nf-core template\nFor this section we are going to refer to the nf-core guidelines for modules.\nWhile these are the full guidelines for contributing back to nf-core, there are still some general components that are good practice even if you are NOT planning to contribute.\n\n\n\n\n\n\nSummary of guidelines\n\n\n\n\nAll required and optional input files must be included in the input as a path variable\nThe command should run without any additional argument, any required flag values should be included as an input val variable\ntask.ext.args must be provided as a variable\nWhere possible all input and output files should be compressed (i.e. fastq.gz and .bam)\nA versions.yml file is output\nNaming conventions include using all lowercase without puntuation and follows the convention of software/tool (i.e. bwa/mem)\nAll outputs must include an emit definition\n\n\n\nWe are going to write up our own samtools/view module.\nnf-core modules create \n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Press enter to use default values (shown in brackets) or type your own responses. ctrl+click underlined text to open links.\nName of tool/subtool: samtools/view\nINFO Using Bioconda package: 'bioconda::samtools=1.20'\nINFO Could not find a Docker/Singularity container (Unexpected response code `500` for https://api.biocontainers.pro/ga4gh/trs/v2/tools/samtools/versions/samtools-1.20) ## Cluster\nGitHub Username: (@author): @mmyeung\nINFO Provide an appropriate resource label for the process, taken from the nf-core pipeline template.\n For example: process_single, process_low, process_medium, process_high, process_long\n? 
Process resource label: process_low\nINFO Where applicable all sample-specific information e.g. 'id', 'single_end', 'read_group' MUST be provided as an input via a Groovy Map called\n 'meta'. This information may not be required in some instances, for example indexing reference genome files.\nWill the module require a meta map of sample information? [y/n] (y): y\nINFO Created component template: 'samtools/view'\nINFO Created following files:\n modules/local/samtools/view.nf\nAs we progressed through the interactive prompt, you will have noticed that nf-core always attempts to locate the corresponding bioconda package and singularity/Docker container.\n\n\n\n\n\n\nWhat happens when there is no bioconda package or container?\n\n\n\n\n\nnf-core modules create --author @mmyeung --label process_single --meta testscript\nThe command will indicate that the there is no bioconda package with the software name, and prompt you for a package name you might wish to use.\nINFO Repository type: pipeline\nINFO Press enter to use default values (shown in brackets) or type your own responses. ctrl+click underlined text to open links.\nWARNING Could not find Conda dependency using the Anaconda API: 'testscript'\nDo you want to enter a different Bioconda package name? [y/n]: n\nWARNING Could not find Conda dependency using the Anaconda API: 'testscript'\n Building module without tool software and meta, you will need to enter this information manually.\nINFO Created component template: 'testscript'\nINFO Created following files:\n modules/local/testscript.nf \nwithin the module .nf script you will note that the definitions for the conda and container are incomplete for the tool.\n conda \"${moduleDir}/environment.yml\"\n container \"${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?\n 'https://depot.galaxyproject.org/singularity/YOUR-TOOL-HERE':\n 'biocontainers/YOUR-TOOL-HERE' }\"\nnf-core has a large cache of containers here. Though you can also provide a simple path to docker hub.\n container \"mmyeung/trccustomunix:0.0.1\"\n\n\n\nThe resource labels, are those as defined in conf/base.config\n\n\n\n\n\n\nChallenge\n\n\n\nWrite up the inputs, outputs and script for samtools/view.\nAssume that all the inputs will be .bam and the outputs will also be .bam.\nFor reference look at the documentation for samtools/view\nAre there optional flags that take file inputs? What options need to set to ensure that the command runs without error?\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nprocess SAMTOOLS_VIEW {\n tag \"$meta.id\"\n label 'process_low'\n\n conda \"${moduleDir}/environment.yml\"\n container \"${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?\n 'https://depot.galaxyproject.org/singularity/samtools:1.20--h50ea8bc_0' :\n 'biocontainers/samtools:1.20--h50ea8bc_0' }\"\n\n input:\n tuple val(meta), path(input), path(index)\n tuple val(meta2), path(fasta)\n path bed\n path qname\n\n output:\n tuple val(meta), path(\"*.bam\"), emit: bam\n path \"versions.yml\", emit: versions\n\n when:\n task.ext.when == null || task.ext.when\n\n script:\n def args = task.ext.args ?: ''\n def args2 = task.ext.args2 ?: ''\n def prefix = task.ext.prefix ?: \"${meta.id}\"\n def reference = fasta ? \"--reference ${fasta}\" : \"\"\n def readnames = qname ? \"--qname-file ${qname}\": \"\"\n def regions = bed ? 
\"-L ${bed}\": \"\"\n if (\"$input\" == \"${prefix}.${file_type}\") error \"Input and output names are the same, use \\\"task.ext.prefix\\\" to disambiguate!\"\n \"\"\"\n samtools \\\\\n view \\\\\n -hb \\\\\n --threads ${task.cpus-1} \\\\\n ${reference} \\\\\n ${readnames} \\\\\n ${regions} \\\\\n $args \\\\\n -o ${prefix}.bam \\\\\n $input \\\\\n $args2\n\n cat <<-END_VERSIONS > versions.yml\n \"${task.process}\":\n samtools: \\$(echo \\$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\\$//')\n END_VERSIONS\n \"\"\"\n\n stub:\n def args = task.ext.args ?: ''\n def prefix = task.ext.prefix ?: \"${meta.id}\"\n def file_type = args.contains(\"--output-fmt sam\") ? \"sam\" :\n args.contains(\"--output-fmt bam\") ? \"bam\" :\n args.contains(\"--output-fmt cram\") ? \"cram\" :\n input.getExtension()\n if (\"$input\" == \"${prefix}.${file_type}\") error \"Input and output names are the same, use \\\"task.ext.prefix\\\" to disambiguate!\"\n\n def index = args.contains(\"--write-index\") ? \"touch ${prefix}.csi\" : \"\"\n\n \"\"\"\n touch ${prefix}.${file_type}\n ${index}\n\n cat <<-END_VERSIONS > versions.yml\n \"${task.process}\":\n samtools: \\$(echo \\$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\\$//')\n END_VERSIONS\n \"\"\"\n\n\n\nSimilar to nf-core create you can minimise a the number of prompts by using optional flags.\n\n\n\n\n\n\nOverall Challenge\n\n\n\nWrite up the short workflow as discussed above\nFASTQC -> trimgalore -> FASTQC -> MULTIQC" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core-subworkflow", + "href": "workshops/5.1_nf_core_template.html#nf-core-subworkflow", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.4 Nf-core subworkflow", + "text": "5.4 Nf-core subworkflow\nnf-core subworkflows\nor with\nnf-core subworkflows list remote\n\n5.4.1 Installing nf-core subworkflows\nSubworkflows can be updated/removed like modules\n\n\n\n\n\n\nChallenge\n\n\n\nInstall the subworkflow fastq_subsample_fq_salmon into the workflow\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nnf-core subworkflows install fastq_subsample_fq_salmon\n\n\n\n\n\n5.4.2 Writing subworkflows with nf-core template\n\n\n\n\n\n\nChallenge\n\n\n\nWrite up the QC_WF subworkflow from last week." + }, + { + "objectID": "workshops/5.1_nf_core_template.html#nf-core-schema-and-input-validation", + "href": "workshops/5.1_nf_core_template.html#nf-core-schema-and-input-validation", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "5.5 Nf-core schema and input validation", + "text": "5.5 Nf-core schema and input validation\nRelies on plugins written by nf-core community\nIn particular nf-validation\nnextflow_schmea.json is for pipeline parameters\nnf-core schema build\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO [✓] Default parameters match schema validation\nINFO [✓] Pipeline schema looks valid (found 32 params)\nINFO Writing schema with 32 params: 'nextflow_schema.json'\n🚀 Launch web builder for customisation and editing? [y/n]: y\nINFO Opening URL: https://nf-co.re/pipeline_schema_builder?id=1718112529_0841fa08f86f\nINFO Waiting for form to be completed in the browser. 
Remember to click Finished when you're done.\n⢿ Use ctrl+c to stop waiting and force exit.\nRecommend writing in web browser\njson format details additional reading\n\n\n\n\n\n\nChallenge\n\n\n\nWe are going add the input parameter for the transcript.fa\nThen install salmon/index and write up quant_wf subworkflow from last week.git\n\n\n\n5.5.2 Nf-core inputs\nnested in this schema is the input or samplesheet schema. unfortunately there isn’t a nice interface to help you write this schema yet.\n\nmeta: Allows you to predesignate the “key” with in the “meta”\nrequired: value must be included\ndependency: value is dependant on other value existing in samplesheet (i.e. fastq_2 must imply there is a fastq_1)\n\n\n\n5.6 Nf-core tools for launching\ncreate-params-file\n\n\n5.7 Nf-core for pipeline management\nbump-version ==> good software management to note down versions" + }, + { + "objectID": "workshops/5.1_nf_core_template.html#contributing-to-nf-core", + "href": "workshops/5.1_nf_core_template.html#contributing-to-nf-core", + "title": "Nextflow Development - Developing Modularised Workflows", + "section": "Contributing to nf-core", + "text": "Contributing to nf-core\nFull pipelines Please see the nf-core documentation for a full walkthrough of how to create a new nf-core workflow.\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, Nextflow Patterns materials from Nextflow, nf-core nf-core tools documentation and nf-validation" + }, + { + "objectID": "workshops/3.1_creating_a_workflow.html", + "href": "workshops/3.1_creating_a_workflow.html", + "title": "Nextflow Development - Creating a Nextflow Workflow", + "section": "", + "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channels and processes\nGain an understanding of Nextflow syntax\nRead data of different types into a Nextflow workflow\nCreate Nextflow processes consisting of multiple scripting languages\n\n\n\n\n\nClone the training materials repository on GitHub:\ngit clone https://github.com/nextflow-io/training.git\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nMake sure to always use version 23 and above, as we have encountered problems running nf-core workflows with older versions.\nSince we are using a shared storage, we should consider including common shared paths to where software is stored. These variables can be accessed using the NXF_SINGULARITY_CACHEDIR or the NXF_CONDA_CACHEDIR environment variables.\nCurrently we set the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\n\n\n\n\nA workflow can be defined as sequence of steps through which computational tasks are chained together. Steps may be dependent on other tasks to complete, or they can be run in parallel.\n\nIn Nextflow, each step that will execute a single computational task is known as a process. 
Channels are used to join processes, and pass the outputs from one task into another task.\n\n\n\nChannels are a key data structure of Nextflow, used to pass data between processes.\n\n\nA queue channel connects two processes or operators, and is implicitly created by process outputs, or using channel factories such as Channel.of or Channel.fromPath.\nThe training/nf-training/snippet.nf script creates a channel where each element in the channel is an arguments provided to it. This script uses the Channel.of channel factory, which creates a channel from parameters such as strings or integers.\nch = Channel.of(1, 2, 3)\nch.view()\nThe following will be returned:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [shrivelled_brattain] DSL2 - revision: 7e2661e10b\n1\n2\n3\n\n\n\nA value channel differs from a queue channel in that it is bound to a single value, and it can be read unlimited times without consuming its contents. To see the difference between value and queue channels, you can modify training/nf-training/snippet.nf to the following:\nch1 = Channel.of(1, 2, 3)\nch2 = Channel.of(1)\n\nprocess SUM {\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n echo \\$(($x+$y))\n \"\"\"\n}\n\nworkflow {\n SUM(ch1, ch2).view()\n}\nThis workflow creates two queue channels, ch1 and ch2, that are input into the SUM process. The SUM process sums the two inputs and prints the result to the standard output using the view() channel operator.\nAfter running the script, the only output is 2, as below:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [modest_pike] DSL2 - revision: 7e2661e10b\n2\nSince ch1 and ch2 are queue channels, the single element of ch2 has been consumed when it was initially passed to the SUM process with the first element of ch1. Even though there are other elements to be consumed in ch1, no new process instances will be launched. This is because a process waits until it receives an input value from all the channels declared as an input. The channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even though there are values in other channels.\nTo use the single element in ch2 multiple times, you can use the Channel.value channel factory. Modify the second line of training/nf-training/snippet.nf to the following: ch2 = Channel.value(1) and run the script.\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\nNow that ch2 has been read in as a value channel, its value can be read unlimited times without consuming its contents.\nIn many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation. When a process is invoked using a workflow parameter, it is automatically cast into a value channel. 
Modify the invocation of the SUM process to the following: SUM(ch1, 1).view() and run the script”\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\n\n\n\n\n\nIn Nextflow, a process is the basic computing task to execute functions (i.e., custom scripts or tools).\nThe process definition starts with the keyword process, followed by the process name, commly written in upper case by convention, and finally the process body delimited by curly brackets.\nThe process body can contain many definition blocks:\nprocess < name > {\n [ directives ] \n\n input: \n < process inputs >\n\n output: \n < process outputs >\n\n [script|shell|exec]: \n \"\"\"\n < user script to be executed >\n \"\"\"\n}\n\nDirectives are optional declarations of settings such as cpus, time, executor, container.\nInput defines the expected names and qualifiers of variables into the process\nOutput defines the expected names and qualifiers of variables output from the process\nScript is a string statement that defines the command to be executed by the process\n\nInside the script block, all $ characters need to be escaped with a \\. This is true for both referencing Bash variables created inside the script block (ie. echo \\$z) as well as performing commands (ie. echo \\$(($x+$y))), but not when referencing Nextflow variables (ie. $x+$y).\nprocess SUM {\n debug true \n\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n z='SUM'\n echo \\$z\n echo \\$(($x+$y))\n \"\"\"\n}\nBy default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. To reference Python variables created inside the Python script, no $ is required. For example:\nprocess PYSTUFF {\n debug true \n\n script:\n \"\"\"\n #!/usr/bin/env python\n\n x = 'Hello'\n y = 'world!'\n print (\"%s - %s\" % (x, y))\n \"\"\"\n}\n\nworkflow {\n PYSTUFF()\n}\n\n\nThe val qualifier allows any data type to be received as input. In the example below, num queue channel is created from integers 1, 2 and 3, and input into the BASICEXAMPLE process, where it is declared with the qualifier val and assigned to the variable x. Within this process, the channel input is referred to and accessed locally by the specified variable name x, prepended with $.\nnum = Channel.of(1, 2, 3)\n\nprocess BASICEXAMPLE {\n debug true\n\n input:\n val x\n\n script:\n \"\"\"\n echo process job $x\n \"\"\"\n}\n\nworkflow {\n BASICEXAMPLE(num)\n}\nIn the above example the process is executed three times, for each element in the channel num. Thus, it results in an output similar to the one shown below:\nprocess job 1\nprocess job 2\nprocess job 3\nThe val qualifier can also be used to specify the process output. In this example, the Hello World! string is implicitly converted into a channel that is input to the FOO process. 
This process prints the input to a file named file.txt, and returns the same input value as the output.\nprocess FOO {\n input:\n val x\n\n output:\n val x\n\n script:\n \"\"\"\n echo $x > file.txt\n \"\"\"\n}\n\nworkflow {\n out_ch = FOO(\"Hello world!\")\n out_ch.view()\n}\nThe output from FOO is assigned to out_ch, and its contents printed using the view() channel operator.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [dreamy_turing] DSL2 - revision: 0d1a07970e\nexecutor > local (1)\n[a4/f710b3] process > FOO [100%] 1 of 1 ✔\nHello world!\n\n\n\n\nThe path qualifier allows the handling of files inside a process. When a new instance of a process is executed, a new process execution director will be created just for that process. When the path qualifier is specified as the input, Nextflow will stage the file inside the process execution directory, allowing it to be accessed by the script using the specified name in the input declaration.\nIn this example, the reads channel is created from multiple .fq files inside training/nf-training/data/ggal, and input into process FOO. In the input declaration of the process, the file is referred to as sample.fastq.\nThe training/nf-training/data/ggal folder contains multiple .fq files, along with a .fa file. The wildcard *is used to match only .fq to be used as input.\n>>> ls training/nf-training/data/ggal\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nSave the following code block as foo.nf.\nreads = Channel.fromPath('training/nf-training/data/ggal/*.fq')\n\nprocess FOO {\n debug true\n\n input:\n path 'sample.fastq'\n\n script:\n \"\"\"\n ls sample.fastq\n \"\"\"\n}\n\nworkflow {\n FOO(reads)\n}\nWhen the script is ran, the FOO process is executed six times and will print the name of the file sample.fastq six times, since this is the name assigned in the input declaration.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nasty_lamport] DSL2 - revision: b214838b82\n[78/a8a52d] process > FOO [100%] 6 of 6 ✔\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nInside the process execution directory (ie. work/78/a8a52d...), the input file has been staged (symbolically linked) under the input declaration name. This allows the script to access the file within the execution directory via the declaration name.\n>>> ll work/78/a8a52d...\nsample.fastq -> /.../training/nf-training/data/ggal/liver_1.fq\nSimilarly, the path qualifier can also be used to specify one or more files that will be output by the process. In this example, the RANDOMNUM process creates a file results.txt containing a random number. Note that the Bash function is escaped with a back-slash character (ie. \\$RANDOM).\nprocess RANDOMNUM {\n output:\n path \"*.txt\"\n\n script:\n \"\"\"\n echo \\$RANDOM > result.txt\n \"\"\"\n}\n\nworkflow {\n receiver_ch = RANDOMNUM()\n receiver_ch.view()\n}\nThe output file is declared with the path qualifier, and specified using the wildcard * that will output all files with .txt extension. 
The output of the RANDOMNUM process is assigned to receiver_ch, which can be used for downstream processes.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nostalgic_cajal] DSL2 - revision: 9e260eead5\nexecutor > local (1)\n[76/7e8e36] process > RANDOMNUM [100%] 1 of 1 ✔\n/...work/8c/792157d409524d06b89faf2c1e6d75/result.txt\n\n\n\n\nTo define paired/grouped input and output information, the tuple qualifier can be used. The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.\nIn the example below, reads_ch is a channel created using the fromFilePairs channel factory, which automatically creates a tuple from file pairs.\nreads_ch = Channel.fromFilePairs(\"training/nf-training/data/ggal/*_{1,2}.fq\")\nreads_ch.view()\nThe created tuple consists of two elements – the first element is always the grouping key of the matching pair (based on similarities in the file name), and the second is a list of paths to each file.\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nTo input a tuple into a process, the tuple qualifier must be used in the input block. Below, the first element of the tuple (ie. the grouping key) is declared with the val qualifier, and the second element of the tuple is declared with the path qualifier. The FOO process then prints the .fq file paths to a file called sample.txt, and returns it as a tuple containing the same grouping key, declared with val, and the output file created inside the process, declared with path.\nprocess FOO {\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path('sample.txt')\n\n script:\n \"\"\"\n echo $sample_id_paths > sample.txt\n \"\"\"\n}\n\nworkflow {\n sample_ch = FOO(reads_ch)\n sample_ch.view()\n}\nUpdate foo.nf to the above, and run the script.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [sharp_becquerel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[65/54124a] process > FOO (3) [100%] 3 of 3 ✔\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\nIt’s worth noting that the FOO process is executed three times in parallel, so there’s no guarantee of a particular execution order. 
Therefore, if the script was ran again, the final result may be printed out in a different order:\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [high_mendel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[82/71a961] process > FOO (1) [100%] 3 of 3 ✔\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\nThus, if the output of a process is being used as an input into another process, the use of the tuple qualifier that contains metadata information is especially important to ensure the correct inputs are being used for downstream processes.\n\n\n\n\n\n\nKey points\n\n\n\n\nThe contents of value channels can be consumed an unlimited amount of times, wheres queue channels cannot\nDifferent channel factories can be used to read different input types\n$ characters need to be escaped with \\ when referencing Bash variables and functions, while Nextflow variables do not\nThe scripting language within a process can be altered by starting the script with the desired Shebang declaration" + }, + { + "objectID": "workshops/3.1_creating_a_workflow.html#nextflow-channels-and-processes", + "href": "workshops/3.1_creating_a_workflow.html#nextflow-channels-and-processes", + "title": "Nextflow Development - Creating a Nextflow Workflow", + "section": "", + "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channels and processes\nGain an understanding of Nextflow syntax\nRead data of different types into a Nextflow workflow\nCreate Nextflow processes consisting of multiple scripting languages\n\n\n\n\n\nClone the training materials repository on GitHub:\ngit clone https://github.com/nextflow-io/training.git\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nMake sure to always use version 23 and above, as we have encountered problems running nf-core workflows with older versions.\nSince we are using a shared storage, we should consider including common shared paths to where software is stored. These variables can be accessed using the NXF_SINGULARITY_CACHEDIR or the NXF_CONDA_CACHEDIR environment variables.\nCurrently we set the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\n\n\n\n\nA workflow can be defined as sequence of steps through which computational tasks are chained together. Steps may be dependent on other tasks to complete, or they can be run in parallel.\n\nIn Nextflow, each step that will execute a single computational task is known as a process. 
Channels are used to join processes, and pass the outputs from one task into another task.\n\n\n\nChannels are a key data structure of Nextflow, used to pass data between processes.\n\n\nA queue channel connects two processes or operators, and is implicitly created by process outputs, or using channel factories such as Channel.of or Channel.fromPath.\nThe training/nf-training/snippet.nf script creates a channel where each element in the channel is an arguments provided to it. This script uses the Channel.of channel factory, which creates a channel from parameters such as strings or integers.\nch = Channel.of(1, 2, 3)\nch.view()\nThe following will be returned:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [shrivelled_brattain] DSL2 - revision: 7e2661e10b\n1\n2\n3\n\n\n\nA value channel differs from a queue channel in that it is bound to a single value, and it can be read unlimited times without consuming its contents. To see the difference between value and queue channels, you can modify training/nf-training/snippet.nf to the following:\nch1 = Channel.of(1, 2, 3)\nch2 = Channel.of(1)\n\nprocess SUM {\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n echo \\$(($x+$y))\n \"\"\"\n}\n\nworkflow {\n SUM(ch1, ch2).view()\n}\nThis workflow creates two queue channels, ch1 and ch2, that are input into the SUM process. The SUM process sums the two inputs and prints the result to the standard output using the view() channel operator.\nAfter running the script, the only output is 2, as below:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [modest_pike] DSL2 - revision: 7e2661e10b\n2\nSince ch1 and ch2 are queue channels, the single element of ch2 has been consumed when it was initially passed to the SUM process with the first element of ch1. Even though there are other elements to be consumed in ch1, no new process instances will be launched. This is because a process waits until it receives an input value from all the channels declared as an input. The channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even though there are values in other channels.\nTo use the single element in ch2 multiple times, you can use the Channel.value channel factory. Modify the second line of training/nf-training/snippet.nf to the following: ch2 = Channel.value(1) and run the script.\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\nNow that ch2 has been read in as a value channel, its value can be read unlimited times without consuming its contents.\nIn many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation. When a process is invoked using a workflow parameter, it is automatically cast into a value channel. 
Modify the invocation of the SUM process to the following: SUM(ch1, 1).view() and run the script”\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\n\n\n\n\n\nIn Nextflow, a process is the basic computing task to execute functions (i.e., custom scripts or tools).\nThe process definition starts with the keyword process, followed by the process name, commly written in upper case by convention, and finally the process body delimited by curly brackets.\nThe process body can contain many definition blocks:\nprocess < name > {\n [ directives ] \n\n input: \n < process inputs >\n\n output: \n < process outputs >\n\n [script|shell|exec]: \n \"\"\"\n < user script to be executed >\n \"\"\"\n}\n\nDirectives are optional declarations of settings such as cpus, time, executor, container.\nInput defines the expected names and qualifiers of variables into the process\nOutput defines the expected names and qualifiers of variables output from the process\nScript is a string statement that defines the command to be executed by the process\n\nInside the script block, all $ characters need to be escaped with a \\. This is true for both referencing Bash variables created inside the script block (ie. echo \\$z) as well as performing commands (ie. echo \\$(($x+$y))), but not when referencing Nextflow variables (ie. $x+$y).\nprocess SUM {\n debug true \n\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n z='SUM'\n echo \\$z\n echo \\$(($x+$y))\n \"\"\"\n}\nBy default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. To reference Python variables created inside the Python script, no $ is required. For example:\nprocess PYSTUFF {\n debug true \n\n script:\n \"\"\"\n #!/usr/bin/env python\n\n x = 'Hello'\n y = 'world!'\n print (\"%s - %s\" % (x, y))\n \"\"\"\n}\n\nworkflow {\n PYSTUFF()\n}\n\n\nThe val qualifier allows any data type to be received as input. In the example below, num queue channel is created from integers 1, 2 and 3, and input into the BASICEXAMPLE process, where it is declared with the qualifier val and assigned to the variable x. Within this process, the channel input is referred to and accessed locally by the specified variable name x, prepended with $.\nnum = Channel.of(1, 2, 3)\n\nprocess BASICEXAMPLE {\n debug true\n\n input:\n val x\n\n script:\n \"\"\"\n echo process job $x\n \"\"\"\n}\n\nworkflow {\n BASICEXAMPLE(num)\n}\nIn the above example the process is executed three times, for each element in the channel num. Thus, it results in an output similar to the one shown below:\nprocess job 1\nprocess job 2\nprocess job 3\nThe val qualifier can also be used to specify the process output. In this example, the Hello World! string is implicitly converted into a channel that is input to the FOO process. 
This process prints the input to a file named file.txt, and returns the same input value as the output.\nprocess FOO {\n input:\n val x\n\n output:\n val x\n\n script:\n \"\"\"\n echo $x > file.txt\n \"\"\"\n}\n\nworkflow {\n out_ch = FOO(\"Hello world!\")\n out_ch.view()\n}\nThe output from FOO is assigned to out_ch, and its contents printed using the view() channel operator.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [dreamy_turing] DSL2 - revision: 0d1a07970e\nexecutor > local (1)\n[a4/f710b3] process > FOO [100%] 1 of 1 ✔\nHello world!\n\n\n\n\nThe path qualifier allows the handling of files inside a process. When a new instance of a process is executed, a new process execution director will be created just for that process. When the path qualifier is specified as the input, Nextflow will stage the file inside the process execution directory, allowing it to be accessed by the script using the specified name in the input declaration.\nIn this example, the reads channel is created from multiple .fq files inside training/nf-training/data/ggal, and input into process FOO. In the input declaration of the process, the file is referred to as sample.fastq.\nThe training/nf-training/data/ggal folder contains multiple .fq files, along with a .fa file. The wildcard *is used to match only .fq to be used as input.\n>>> ls training/nf-training/data/ggal\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nSave the following code block as foo.nf.\nreads = Channel.fromPath('training/nf-training/data/ggal/*.fq')\n\nprocess FOO {\n debug true\n\n input:\n path 'sample.fastq'\n\n script:\n \"\"\"\n ls sample.fastq\n \"\"\"\n}\n\nworkflow {\n FOO(reads)\n}\nWhen the script is ran, the FOO process is executed six times and will print the name of the file sample.fastq six times, since this is the name assigned in the input declaration.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nasty_lamport] DSL2 - revision: b214838b82\n[78/a8a52d] process > FOO [100%] 6 of 6 ✔\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nInside the process execution directory (ie. work/78/a8a52d...), the input file has been staged (symbolically linked) under the input declaration name. This allows the script to access the file within the execution directory via the declaration name.\n>>> ll work/78/a8a52d...\nsample.fastq -> /.../training/nf-training/data/ggal/liver_1.fq\nSimilarly, the path qualifier can also be used to specify one or more files that will be output by the process. In this example, the RANDOMNUM process creates a file results.txt containing a random number. Note that the Bash function is escaped with a back-slash character (ie. \\$RANDOM).\nprocess RANDOMNUM {\n output:\n path \"*.txt\"\n\n script:\n \"\"\"\n echo \\$RANDOM > result.txt\n \"\"\"\n}\n\nworkflow {\n receiver_ch = RANDOMNUM()\n receiver_ch.view()\n}\nThe output file is declared with the path qualifier, and specified using the wildcard * that will output all files with .txt extension. 
The output of the RANDOMNUM process is assigned to receiver_ch, which can be used for downstream processes.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nostalgic_cajal] DSL2 - revision: 9e260eead5\nexecutor > local (1)\n[76/7e8e36] process > RANDOMNUM [100%] 1 of 1 ✔\n/...work/8c/792157d409524d06b89faf2c1e6d75/result.txt\n\n\n\n\nTo define paired/grouped input and output information, the tuple qualifier can be used. The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.\nIn the example below, reads_ch is a channel created using the fromFilePairs channel factory, which automatically creates a tuple from file pairs.\nreads_ch = Channel.fromFilePairs(\"training/nf-training/data/ggal/*_{1,2}.fq\")\nreads_ch.view()\nThe created tuple consists of two elements – the first element is always the grouping key of the matching pair (based on similarities in the file name), and the second is a list of paths to each file.\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nTo input a tuple into a process, the tuple qualifier must be used in the input block. Below, the first element of the tuple (ie. the grouping key) is declared with the val qualifier, and the second element of the tuple is declared with the path qualifier. The FOO process then prints the .fq file paths to a file called sample.txt, and returns it as a tuple containing the same grouping key, declared with val, and the output file created inside the process, declared with path.\nprocess FOO {\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path('sample.txt')\n\n script:\n \"\"\"\n echo $sample_id_paths > sample.txt\n \"\"\"\n}\n\nworkflow {\n sample_ch = FOO(reads_ch)\n sample_ch.view()\n}\nUpdate foo.nf to the above, and run the script.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [sharp_becquerel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[65/54124a] process > FOO (3) [100%] 3 of 3 ✔\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\nIt’s worth noting that the FOO process is executed three times in parallel, so there’s no guarantee of a particular execution order. 
Therefore, if the script was ran again, the final result may be printed out in a different order:\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [high_mendel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[82/71a961] process > FOO (1) [100%] 3 of 3 ✔\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\nThus, if the output of a process is being used as an input into another process, the use of the tuple qualifier that contains metadata information is especially important to ensure the correct inputs are being used for downstream processes.\n\n\n\n\n\n\nKey points\n\n\n\n\nThe contents of value channels can be consumed an unlimited amount of times, wheres queue channels cannot\nDifferent channel factories can be used to read different input types\n$ characters need to be escaped with \\ when referencing Bash variables and functions, while Nextflow variables do not\nThe scripting language within a process can be altered by starting the script with the desired Shebang declaration" + }, + { + "objectID": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", + "href": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", + "title": "Nextflow Development - Creating a Nextflow Workflow", + "section": "Creating an RNAseq Workflow", + "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/data/ggal/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is access with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. 
The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced to by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the comand standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. 
Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config files does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs, the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. 
grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been ran three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second variable is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. 
The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC has been ran three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. 
copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core\n^*Draft for Future Sessions" + }, + { + "objectID": "workshops/1.1_intro_nextflow.html", + "href": "workshops/1.1_intro_nextflow.html", + "title": "Introduction to Nextflow", + "section": "", + "text": "Objectives\n\n\n\n\nLearn about the benefits of a workflow manager.\nLearn Nextflow terminology.\nLearn basic commands and options to run a Nextflow workflow" + }, + { + "objectID": "workshops/1.1_intro_nextflow.html#footnotes", + "href": "workshops/1.1_intro_nextflow.html#footnotes", + "title": "Introduction to Nextflow", + "section": "Footnotes", + "text": "Footnotes\n\n\nhttps://www.lexico.com/definition/workflow↩︎" + }, + { + "objectID": "workshops/4.1_draft_future_sess.html", + "href": "workshops/4.1_draft_future_sess.html", + "title": "Nextflow Development - Metadata Parsing", + "section": "", + "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) 
MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. 
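For example, with the samplesheet above, the first element passed into the closure is the map [sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq], so inside the closure row.sample_name evaluates to gut_sample.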
We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." + }, + { + "objectID": "workshops/4.1_draft_future_sess.html#metadata-parsing", + "href": "workshops/4.1_draft_future_sess.html#metadata-parsing", + "title": "Nextflow Development - Metadata Parsing", + "section": "", + "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) 
MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. 
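For example, with the samplesheet above, the first element passed into the closure is the map [sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq], so inside the closure row.sample_name evaluates to gut_sample.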
We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." + }, { "objectID": "workshops/4.1_modules.html", "href": "workshops/4.1_modules.html", @@ -244,132 +454,6 @@ "section": "7.2. nf-test", "text": "7.2. nf-test\nIt is critical for reproducibility and long-term maintenance to have a way to systematically test that every part of your workflow is doing what it’s supposed to do. To that end, people often focus on top-level tests, in which the workflow is un on some test data from start to finish. This is useful but unfortunately incomplete. You should also implement module-level tests (equivalent to what is called ‘unit tests’ in general software engineering) to verify the functionality of individual components of your workflow, ensuring that each module performs as expected under different conditions and inputs.\nThe nf-test package provides a testing framework that integrates well with Nextflow and makes it straightforward to add both module-level and workflow-level tests to your pipeline. For more background information, read the blog post about nf-test on the nf-core blog.\nSee this tutorial for some examples.\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core" }, - { - "objectID": "workshops/4.1_draft_future_sess.html", - "href": "workshops/4.1_draft_future_sess.html", - "title": "Nextflow Development - Metadata Parsing", - "section": "", - "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. 
The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. 
Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." - }, - { - "objectID": "workshops/4.1_draft_future_sess.html#metadata-parsing", - "href": "workshops/4.1_draft_future_sess.html#metadata-parsing", - "title": "Nextflow Development - Metadata Parsing", - "section": "", - "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. 
Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.) MUST be wrapped in single-quoted characters to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string to a .csv file containing the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. 
Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row. We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequte grouping key is creaed for that sample." 
- }, - { - "objectID": "workshops/1.1_intro_nextflow.html", - "href": "workshops/1.1_intro_nextflow.html", - "title": "Introduction to Nextflow", - "section": "", - "text": "Objectives\n\n\n\n\nLearn about the benefits of a workflow manager.\nLearn Nextflow terminology.\nLearn basic commands and options to run a Nextflow workflow" - }, - { - "objectID": "workshops/1.1_intro_nextflow.html#footnotes", - "href": "workshops/1.1_intro_nextflow.html#footnotes", - "title": "Introduction to Nextflow", - "section": "Footnotes", - "text": "Footnotes\n\n\nhttps://www.lexico.com/definition/workflow↩︎" - }, - { - "objectID": "workshops/3.1_creating_a_workflow.html", - "href": "workshops/3.1_creating_a_workflow.html", - "title": "Nextflow Development - Creating a Nextflow Workflow", - "section": "", - "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channels and processes\nGain an understanding of Nextflow syntax\nRead data of different types into a Nextflow workflow\nCreate Nextflow processes consisting of multiple scripting languages\n\n\n\n\n\nClone the training materials repository on GitHub:\ngit clone https://github.com/nextflow-io/training.git\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nMake sure to always use version 23 and above, as we have encountered problems running nf-core workflows with older versions.\nSince we are using a shared storage, we should consider including common shared paths to where software is stored. These variables can be accessed using the NXF_SINGULARITY_CACHEDIR or the NXF_CONDA_CACHEDIR environment variables.\nCurrently we set the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\n\n\n\n\nA workflow can be defined as sequence of steps through which computational tasks are chained together. Steps may be dependent on other tasks to complete, or they can be run in parallel.\n\nIn Nextflow, each step that will execute a single computational task is known as a process. Channels are used to join processes, and pass the outputs from one task into another task.\n\n\n\nChannels are a key data structure of Nextflow, used to pass data between processes.\n\n\nA queue channel connects two processes or operators, and is implicitly created by process outputs, or using channel factories such as Channel.of or Channel.fromPath.\nThe training/nf-training/snippet.nf script creates a channel where each element in the channel is an arguments provided to it. This script uses the Channel.of channel factory, which creates a channel from parameters such as strings or integers.\nch = Channel.of(1, 2, 3)\nch.view()\nThe following will be returned:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [shrivelled_brattain] DSL2 - revision: 7e2661e10b\n1\n2\n3\n\n\n\nA value channel differs from a queue channel in that it is bound to a single value, and it can be read unlimited times without consuming its contents. 
To see the difference between value and queue channels, you can modify training/nf-training/snippet.nf to the following:\nch1 = Channel.of(1, 2, 3)\nch2 = Channel.of(1)\n\nprocess SUM {\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n echo \\$(($x+$y))\n \"\"\"\n}\n\nworkflow {\n SUM(ch1, ch2).view()\n}\nThis workflow creates two queue channels, ch1 and ch2, that are input into the SUM process. The SUM process sums the two inputs and prints the result to the standard output using the view() channel operator.\nAfter running the script, the only output is 2, as below:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [modest_pike] DSL2 - revision: 7e2661e10b\n2\nSince ch1 and ch2 are queue channels, the single element of ch2 has been consumed when it was initially passed to the SUM process with the first element of ch1. Even though there are other elements to be consumed in ch1, no new process instances will be launched. This is because a process waits until it receives an input value from all the channels declared as an input. The channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even though there are values in other channels.\nTo use the single element in ch2 multiple times, you can use the Channel.value channel factory. Modify the second line of training/nf-training/snippet.nf to the following: ch2 = Channel.value(1) and run the script.\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\nNow that ch2 has been read in as a value channel, its value can be read unlimited times without consuming its contents.\nIn many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation. When a process is invoked using a workflow parameter, it is automatically cast into a value channel. Modify the invocation of the SUM process to the following: SUM(ch1, 1).view() and run the script”\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\n\n\n\n\n\nIn Nextflow, a process is the basic computing task to execute functions (i.e., custom scripts or tools).\nThe process definition starts with the keyword process, followed by the process name, commly written in upper case by convention, and finally the process body delimited by curly brackets.\nThe process body can contain many definition blocks:\nprocess < name > {\n [ directives ] \n\n input: \n < process inputs >\n\n output: \n < process outputs >\n\n [script|shell|exec]: \n \"\"\"\n < user script to be executed >\n \"\"\"\n}\n\nDirectives are optional declarations of settings such as cpus, time, executor, container.\nInput defines the expected names and qualifiers of variables into the process\nOutput defines the expected names and qualifiers of variables output from the process\nScript is a string statement that defines the command to be executed by the process\n\nInside the script block, all $ characters need to be escaped with a \\. This is true for both referencing Bash variables created inside the script block (ie. echo \\$z) as well as performing commands (ie. echo \\$(($x+$y))), but not when referencing Nextflow variables (ie. 
$x+$y).\nprocess SUM {\n debug true \n\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n z='SUM'\n echo \\$z\n echo \\$(($x+$y))\n \"\"\"\n}\nBy default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. To reference Python variables created inside the Python script, no $ is required. For example:\nprocess PYSTUFF {\n debug true \n\n script:\n \"\"\"\n #!/usr/bin/env python\n\n x = 'Hello'\n y = 'world!'\n print (\"%s - %s\" % (x, y))\n \"\"\"\n}\n\nworkflow {\n PYSTUFF()\n}\n\n\nThe val qualifier allows any data type to be received as input. In the example below, num queue channel is created from integers 1, 2 and 3, and input into the BASICEXAMPLE process, where it is declared with the qualifier val and assigned to the variable x. Within this process, the channel input is referred to and accessed locally by the specified variable name x, prepended with $.\nnum = Channel.of(1, 2, 3)\n\nprocess BASICEXAMPLE {\n debug true\n\n input:\n val x\n\n script:\n \"\"\"\n echo process job $x\n \"\"\"\n}\n\nworkflow {\n BASICEXAMPLE(num)\n}\nIn the above example the process is executed three times, for each element in the channel num. Thus, it results in an output similar to the one shown below:\nprocess job 1\nprocess job 2\nprocess job 3\nThe val qualifier can also be used to specify the process output. In this example, the Hello World! string is implicitly converted into a channel that is input to the FOO process. This process prints the input to a file named file.txt, and returns the same input value as the output.\nprocess FOO {\n input:\n val x\n\n output:\n val x\n\n script:\n \"\"\"\n echo $x > file.txt\n \"\"\"\n}\n\nworkflow {\n out_ch = FOO(\"Hello world!\")\n out_ch.view()\n}\nThe output from FOO is assigned to out_ch, and its contents printed using the view() channel operator.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [dreamy_turing] DSL2 - revision: 0d1a07970e\nexecutor > local (1)\n[a4/f710b3] process > FOO [100%] 1 of 1 ✔\nHello world!\n\n\n\n\nThe path qualifier allows the handling of files inside a process. When a new instance of a process is executed, a new process execution director will be created just for that process. When the path qualifier is specified as the input, Nextflow will stage the file inside the process execution directory, allowing it to be accessed by the script using the specified name in the input declaration.\nIn this example, the reads channel is created from multiple .fq files inside training/nf-training/data/ggal, and input into process FOO. In the input declaration of the process, the file is referred to as sample.fastq.\nThe training/nf-training/data/ggal folder contains multiple .fq files, along with a .fa file. 
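As an aside (a sketch, not part of the training script), the glob pattern controls exactly which files are read in; for instance, matching only the first read of each pair would look like:\nreads = Channel.fromPath('training/nf-training/data/ggal/*_1.fq')\nA broader pattern is used in the example that follows.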
The wildcard *is used to match only .fq to be used as input.\n>>> ls training/nf-training/data/ggal\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nSave the following code block as foo.nf.\nreads = Channel.fromPath('training/nf-training/data/ggal/*.fq')\n\nprocess FOO {\n debug true\n\n input:\n path 'sample.fastq'\n\n script:\n \"\"\"\n ls sample.fastq\n \"\"\"\n}\n\nworkflow {\n FOO(reads)\n}\nWhen the script is ran, the FOO process is executed six times and will print the name of the file sample.fastq six times, since this is the name assigned in the input declaration.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nasty_lamport] DSL2 - revision: b214838b82\n[78/a8a52d] process > FOO [100%] 6 of 6 ✔\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nInside the process execution directory (ie. work/78/a8a52d...), the input file has been staged (symbolically linked) under the input declaration name. This allows the script to access the file within the execution directory via the declaration name.\n>>> ll work/78/a8a52d...\nsample.fastq -> /.../training/nf-training/data/ggal/liver_1.fq\nSimilarly, the path qualifier can also be used to specify one or more files that will be output by the process. In this example, the RANDOMNUM process creates a file results.txt containing a random number. Note that the Bash function is escaped with a back-slash character (ie. \\$RANDOM).\nprocess RANDOMNUM {\n output:\n path \"*.txt\"\n\n script:\n \"\"\"\n echo \\$RANDOM > result.txt\n \"\"\"\n}\n\nworkflow {\n receiver_ch = RANDOMNUM()\n receiver_ch.view()\n}\nThe output file is declared with the path qualifier, and specified using the wildcard * that will output all files with .txt extension. The output of the RANDOMNUM process is assigned to receiver_ch, which can be used for downstream processes.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nostalgic_cajal] DSL2 - revision: 9e260eead5\nexecutor > local (1)\n[76/7e8e36] process > RANDOMNUM [100%] 1 of 1 ✔\n/...work/8c/792157d409524d06b89faf2c1e6d75/result.txt\n\n\n\n\nTo define paired/grouped input and output information, the tuple qualifier can be used. The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.\nIn the example below, reads_ch is a channel created using the fromFilePairs channel factory, which automatically creates a tuple from file pairs.\nreads_ch = Channel.fromFilePairs(\"training/nf-training/data/ggal/*_{1,2}.fq\")\nreads_ch.view()\nThe created tuple consists of two elements – the first element is always the grouping key of the matching pair (based on similarities in the file name), and the second is a list of paths to each file.\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nTo input a tuple into a process, the tuple qualifier must be used in the input block. Below, the first element of the tuple (ie. the grouping key) is declared with the val qualifier, and the second element of the tuple is declared with the path qualifier. 
The FOO process then prints the .fq file paths to a file called sample.txt, and returns it as a tuple containing the same grouping key, declared with val, and the output file created inside the process, declared with path.\nprocess FOO {\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path('sample.txt')\n\n script:\n \"\"\"\n echo $sample_id_paths > sample.txt\n \"\"\"\n}\n\nworkflow {\n sample_ch = FOO(reads_ch)\n sample_ch.view()\n}\nUpdate foo.nf to the above, and run the script.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [sharp_becquerel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[65/54124a] process > FOO (3) [100%] 3 of 3 ✔\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\nIt’s worth noting that the FOO process is executed three times in parallel, so there’s no guarantee of a particular execution order. Therefore, if the script was ran again, the final result may be printed out in a different order:\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [high_mendel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[82/71a961] process > FOO (1) [100%] 3 of 3 ✔\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\nThus, if the output of a process is being used as an input into another process, the use of the tuple qualifier that contains metadata information is especially important to ensure the correct inputs are being used for downstream processes.\n\n\n\n\n\n\nKey points\n\n\n\n\nThe contents of value channels can be consumed an unlimited amount of times, wheres queue channels cannot\nDifferent channel factories can be used to read different input types\n$ characters need to be escaped with \\ when referencing Bash variables and functions, while Nextflow variables do not\nThe scripting language within a process can be altered by starting the script with the desired Shebang declaration" - }, - { - "objectID": "workshops/3.1_creating_a_workflow.html#nextflow-channels-and-processes", - "href": "workshops/3.1_creating_a_workflow.html#nextflow-channels-and-processes", - "title": "Nextflow Development - Creating a Nextflow Workflow", - "section": "", - "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channels and processes\nGain an understanding of Nextflow syntax\nRead data of different types into a Nextflow workflow\nCreate Nextflow processes consisting of multiple scripting languages\n\n\n\n\n\nClone the training materials repository on GitHub:\ngit clone https://github.com/nextflow-io/training.git\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nMake sure to always use version 23 and above, as we have encountered problems running nf-core workflows with older versions.\nSince we are using a shared storage, we should consider including common shared paths to where software is stored. 
These variables can be accessed using the NXF_SINGULARITY_CACHEDIR or the NXF_CONDA_CACHEDIR environment variables.\nCurrently we set the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\n\n\n\n\nA workflow can be defined as sequence of steps through which computational tasks are chained together. Steps may be dependent on other tasks to complete, or they can be run in parallel.\n\nIn Nextflow, each step that will execute a single computational task is known as a process. Channels are used to join processes, and pass the outputs from one task into another task.\n\n\n\nChannels are a key data structure of Nextflow, used to pass data between processes.\n\n\nA queue channel connects two processes or operators, and is implicitly created by process outputs, or using channel factories such as Channel.of or Channel.fromPath.\nThe training/nf-training/snippet.nf script creates a channel where each element in the channel is an arguments provided to it. This script uses the Channel.of channel factory, which creates a channel from parameters such as strings or integers.\nch = Channel.of(1, 2, 3)\nch.view()\nThe following will be returned:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [shrivelled_brattain] DSL2 - revision: 7e2661e10b\n1\n2\n3\n\n\n\nA value channel differs from a queue channel in that it is bound to a single value, and it can be read unlimited times without consuming its contents. To see the difference between value and queue channels, you can modify training/nf-training/snippet.nf to the following:\nch1 = Channel.of(1, 2, 3)\nch2 = Channel.of(1)\n\nprocess SUM {\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n echo \\$(($x+$y))\n \"\"\"\n}\n\nworkflow {\n SUM(ch1, ch2).view()\n}\nThis workflow creates two queue channels, ch1 and ch2, that are input into the SUM process. The SUM process sums the two inputs and prints the result to the standard output using the view() channel operator.\nAfter running the script, the only output is 2, as below:\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [modest_pike] DSL2 - revision: 7e2661e10b\n2\nSince ch1 and ch2 are queue channels, the single element of ch2 has been consumed when it was initially passed to the SUM process with the first element of ch1. Even though there are other elements to be consumed in ch1, no new process instances will be launched. This is because a process waits until it receives an input value from all the channels declared as an input. The channel values are consumed serially one after another and the first empty channel causes the process execution to stop, even though there are values in other channels.\nTo use the single element in ch2 multiple times, you can use the Channel.value channel factory. 
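(As an aside, not part of the training script: a queue channel can also be converted into a value channel with the first() operator, which emits the first item of the source channel and returns a value channel that can be read repeatedly. A minimal sketch, assuming the same snippet.nf variables:
ch2 = Channel.of(1).first()
Either approach lets the single element be reused across every execution of SUM.)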
Modify the second line of training/nf-training/snippet.nf to the following: ch2 = Channel.value(1) and run the script.\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\nNow that ch2 has been read in as a value channel, its value can be read unlimited times without consuming its contents.\nIn many situations, Nextflow will implicitly convert variables to value channels when they are used in a process invocation. When a process is invoked using a workflow parameter, it is automatically cast into a value channel. Modify the invocation of the SUM process to the following: SUM(ch1, 1).view() and run the script”\n>>> nextflow run training/nf-training/snippet.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `training/nf-training/snippet.nf` [jolly_archimedes] DSL2 - revision: 7e2661e10b\n2\n3\n4\n\n\n\n\n\nIn Nextflow, a process is the basic computing task to execute functions (i.e., custom scripts or tools).\nThe process definition starts with the keyword process, followed by the process name, commly written in upper case by convention, and finally the process body delimited by curly brackets.\nThe process body can contain many definition blocks:\nprocess < name > {\n [ directives ] \n\n input: \n < process inputs >\n\n output: \n < process outputs >\n\n [script|shell|exec]: \n \"\"\"\n < user script to be executed >\n \"\"\"\n}\n\nDirectives are optional declarations of settings such as cpus, time, executor, container.\nInput defines the expected names and qualifiers of variables into the process\nOutput defines the expected names and qualifiers of variables output from the process\nScript is a string statement that defines the command to be executed by the process\n\nInside the script block, all $ characters need to be escaped with a \\. This is true for both referencing Bash variables created inside the script block (ie. echo \\$z) as well as performing commands (ie. echo \\$(($x+$y))), but not when referencing Nextflow variables (ie. $x+$y).\nprocess SUM {\n debug true \n\n input:\n val x\n val y\n\n output:\n stdout\n\n script:\n \"\"\"\n z='SUM'\n echo \\$z\n echo \\$(($x+$y))\n \"\"\"\n}\nBy default, the process command is interpreted as a Bash script. However, any other scripting language can be used by simply starting the script with the corresponding Shebang declaration. To reference Python variables created inside the Python script, no $ is required. For example:\nprocess PYSTUFF {\n debug true \n\n script:\n \"\"\"\n #!/usr/bin/env python\n\n x = 'Hello'\n y = 'world!'\n print (\"%s - %s\" % (x, y))\n \"\"\"\n}\n\nworkflow {\n PYSTUFF()\n}\n\n\nThe val qualifier allows any data type to be received as input. In the example below, num queue channel is created from integers 1, 2 and 3, and input into the BASICEXAMPLE process, where it is declared with the qualifier val and assigned to the variable x. Within this process, the channel input is referred to and accessed locally by the specified variable name x, prepended with $.\nnum = Channel.of(1, 2, 3)\n\nprocess BASICEXAMPLE {\n debug true\n\n input:\n val x\n\n script:\n \"\"\"\n echo process job $x\n \"\"\"\n}\n\nworkflow {\n BASICEXAMPLE(num)\n}\nIn the above example the process is executed three times, for each element in the channel num. 
Thus, it results in an output similar to the one shown below:\nprocess job 1\nprocess job 2\nprocess job 3\nThe val qualifier can also be used to specify the process output. In this example, the Hello World! string is implicitly converted into a channel that is input to the FOO process. This process prints the input to a file named file.txt, and returns the same input value as the output.\nprocess FOO {\n input:\n val x\n\n output:\n val x\n\n script:\n \"\"\"\n echo $x > file.txt\n \"\"\"\n}\n\nworkflow {\n out_ch = FOO(\"Hello world!\")\n out_ch.view()\n}\nThe output from FOO is assigned to out_ch, and its contents printed using the view() channel operator.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [dreamy_turing] DSL2 - revision: 0d1a07970e\nexecutor > local (1)\n[a4/f710b3] process > FOO [100%] 1 of 1 ✔\nHello world!\n\n\n\n\nThe path qualifier allows the handling of files inside a process. When a new instance of a process is executed, a new process execution director will be created just for that process. When the path qualifier is specified as the input, Nextflow will stage the file inside the process execution directory, allowing it to be accessed by the script using the specified name in the input declaration.\nIn this example, the reads channel is created from multiple .fq files inside training/nf-training/data/ggal, and input into process FOO. In the input declaration of the process, the file is referred to as sample.fastq.\nThe training/nf-training/data/ggal folder contains multiple .fq files, along with a .fa file. The wildcard *is used to match only .fq to be used as input.\n>>> ls training/nf-training/data/ggal\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nSave the following code block as foo.nf.\nreads = Channel.fromPath('training/nf-training/data/ggal/*.fq')\n\nprocess FOO {\n debug true\n\n input:\n path 'sample.fastq'\n\n script:\n \"\"\"\n ls sample.fastq\n \"\"\"\n}\n\nworkflow {\n FOO(reads)\n}\nWhen the script is ran, the FOO process is executed six times and will print the name of the file sample.fastq six times, since this is the name assigned in the input declaration.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nasty_lamport] DSL2 - revision: b214838b82\n[78/a8a52d] process > FOO [100%] 6 of 6 ✔\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nsample.fastq\nInside the process execution directory (ie. work/78/a8a52d...), the input file has been staged (symbolically linked) under the input declaration name. This allows the script to access the file within the execution directory via the declaration name.\n>>> ll work/78/a8a52d...\nsample.fastq -> /.../training/nf-training/data/ggal/liver_1.fq\nSimilarly, the path qualifier can also be used to specify one or more files that will be output by the process. In this example, the RANDOMNUM process creates a file results.txt containing a random number. Note that the Bash function is escaped with a back-slash character (ie. \\$RANDOM).\nprocess RANDOMNUM {\n output:\n path \"*.txt\"\n\n script:\n \"\"\"\n echo \\$RANDOM > result.txt\n \"\"\"\n}\n\nworkflow {\n receiver_ch = RANDOMNUM()\n receiver_ch.view()\n}\nThe output file is declared with the path qualifier, and specified using the wildcard * that will output all files with .txt extension. 
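(A brief aside that goes beyond the training script: a process output can also be given a name with the emit option, which makes it easier to reference when a process declares several outputs. A minimal sketch, assuming the same RANDOMNUM process:
output:
path "*.txt", emit: report
The named output could then be accessed in the workflow block as RANDOMNUM.out.report rather than by position.)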
The output of the RANDOMNUM process is assigned to receiver_ch, which can be used for downstream processes.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [nostalgic_cajal] DSL2 - revision: 9e260eead5\nexecutor > local (1)\n[76/7e8e36] process > RANDOMNUM [100%] 1 of 1 ✔\n/...work/8c/792157d409524d06b89faf2c1e6d75/result.txt\n\n\n\n\nTo define paired/grouped input and output information, the tuple qualifier can be used. The input and output declarations for tuples must be declared with a tuple qualifier followed by the definition of each element in the tuple.\nIn the example below, reads_ch is a channel created using the fromFilePairs channel factory, which automatically creates a tuple from file pairs.\nreads_ch = Channel.fromFilePairs(\"training/nf-training/data/ggal/*_{1,2}.fq\")\nreads_ch.view()\nThe created tuple consists of two elements – the first element is always the grouping key of the matching pair (based on similarities in the file name), and the second is a list of paths to each file.\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nTo input a tuple into a process, the tuple qualifier must be used in the input block. Below, the first element of the tuple (ie. the grouping key) is declared with the val qualifier, and the second element of the tuple is declared with the path qualifier. The FOO process then prints the .fq file paths to a file called sample.txt, and returns it as a tuple containing the same grouping key, declared with val, and the output file created inside the process, declared with path.\nprocess FOO {\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path('sample.txt')\n\n script:\n \"\"\"\n echo $sample_id_paths > sample.txt\n \"\"\"\n}\n\nworkflow {\n sample_ch = FOO(reads_ch)\n sample_ch.view()\n}\nUpdate foo.nf to the above, and run the script.\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [sharp_becquerel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[65/54124a] process > FOO (3) [100%] 3 of 3 ✔\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\nIt’s worth noting that the FOO process is executed three times in parallel, so there’s no guarantee of a particular execution order. 
Therefore, if the script was ran again, the final result may be printed out in a different order:\n>>> nextflow run foo.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `foo.nf` [high_mendel] DSL2 - revision: cd652fc08b\nexecutor > local (3)\n[82/71a961] process > FOO (1) [100%] 3 of 3 ✔\n[gut, /.../work/ae/3cfc7cf0748a598c5e2da750b6bac6/sample.txt]\n[lung, /.../work/23/fe268295bab990a40b95b7091530b6/sample.txt]\n[liver, /.../work/32/656b96a01a460f27fa207e85995ead/sample.txt]\nThus, if the output of a process is being used as an input into another process, the use of the tuple qualifier that contains metadata information is especially important to ensure the correct inputs are being used for downstream processes.\n\n\n\n\n\n\nKey points\n\n\n\n\nThe contents of value channels can be consumed an unlimited amount of times, wheres queue channels cannot\nDifferent channel factories can be used to read different input types\n$ characters need to be escaped with \\ when referencing Bash variables and functions, while Nextflow variables do not\nThe scripting language within a process can be altered by starting the script with the desired Shebang declaration" - }, - { - "objectID": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", - "href": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", - "title": "Nextflow Development - Creating a Nextflow Workflow", - "section": "Creating an RNAseq Workflow", - "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/data/ggal/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is access with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. 
The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced to by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the comand standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. 
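(As an aside: Singularity support can also be enabled for a single run from the command line, for example nextflow run rnaseq.nf -with-singularity <image-path>, but declaring it once in a config file keeps runs reproducible, which is the approach taken below.)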
Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config files does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs, the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. 
grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been ran three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second variable is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. 
The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC has been ran three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. 
copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core\n^*Draft for Future Sessions" - }, - { - "objectID": "workshops/5.1_nf_core_template.html", - "href": "workshops/5.1_nf_core_template.html", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "", - "text": "Objectives\n\n\n\n\nDevelop a basic Nextflow workflow with nf-core templates\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes, and conditional scripts within a processs\nRead data of different types into a Nextflow workflow" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#environment-setup", - "href": "workshops/5.1_nf_core_template.html#environment-setup", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "Environment Setup", - "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found here.\nSet up a python virtual environment with nf-core/tools installed:\nmodule load python/3.11.2\npython -m venv /scratch/users/${USER}/nfcorevenv\n\nsource /scratch/users/${USER}/nfcorevenv/bin/activate\n\npip install nf-core==2.14.1" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core", - "href": "workshops/5.1_nf_core_template.html#nf-core", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5. Nf-core", - "text": "5. Nf-core\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. 
These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core-tools", - "href": "workshops/5.1_nf_core_template.html#nf-core-tools", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.1 Nf-core tools", - "text": "5.1 Nf-core tools\nnf-core-tools is a python package with helper tools for the nf-core community.\nThese helper tools can be used for both building and running nf-core workflows.\nToday we will be focusing on the developer commands to build a workflow using nf-core templates and structures.\nTake a look at what is within with nf-core-tools suite\nnf-core -h\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n \n Usage: nf-core [OPTIONS] COMMAND [ARGS]... \n \n nf-core/tools provides a set of helper tools for use with nf-core Nextflow pipelines. \n It is designed for both end-users running pipelines and also developers creating new pipelines. \n \n╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮\n│ --version Show the version and exit. │\n│ --verbose -v Print verbose output to the console. │\n│ --hide-progress Don't show progress bars. │\n│ --log-file -l <filename> Save a verbose log to a file. │\n│ --help -h Show this message and exit. │\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Commands for users ─────────────────────────────────────────────────────────────────────────────╮\n│ list List available nf-core pipelines with local info. │\n│ launch Launch a pipeline using a web GUI or command line prompts. │\n│ create-params-file Build a parameter file for a pipeline. │\n│ download Download a pipeline, nf-core/configs and pipeline singularity images. │\n│ licences List software licences for a given workflow (DSL1 only). │\n│ tui Open Textual TUI. 
│\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\n╭─ Commands for developers ────────────────────────────────────────────────────────────────────────╮\n│ create Create a new pipeline using the nf-core template. │\n│ lint Check pipeline code against nf-core guidelines. │\n│ modules Commands to manage Nextflow DSL2 modules (tool wrappers). │\n│ subworkflows Commands to manage Nextflow DSL2 subworkflows (tool wrappers). │\n│ schema Suite of tools for developers to manage pipeline schema. │\n│ create-logo Generate a logo with the nf-core logo template. │\n│ bump-version Update nf-core pipeline version number. │\n│ sync Sync a pipeline TEMPLATE branch with the nf-core template. │\n╰──────────────────────────────────────────────────────────────────────────────────────────────────╯\nToday we will be predominately focusing on most of the tools for developers." - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core-pipeline", - "href": "workshops/5.1_nf_core_template.html#nf-core-pipeline", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.2 Nf-core Pipeline", - "text": "5.2 Nf-core Pipeline\nLet’s review the structure of the nf-core/rnaseq pipeline.\nAlmost all of the structure provided here is from the nf-core templates. As we briefly covered last week in Developing Modularised Workflows, it is good practice to separate your workflow from subworkflows and modules. As this allows you to modularise your workflows and reuse modules.\nNf-core assists in enforcing this structure with the subfolders:\n\nworkflows - contains the main workflow\nsubworkflows - contains subworkflows either as written by the nf-core community or self-written\nmodules - contains modules either as written by the nf-core community or self-written\n\nIn our Introduction to Nextflow and running nf-core workflows workshop in Customising & running nf-core pipelines, we briefly touched on configuration files in the conf/ folder and nextflow.config.\nToday we will be working on files in these locations and expanding our use of the nf-core template to include:\n\nfiles in the assets folder\nnextflow_schema.json\n\n\n\n5.2.1 nf-core create\nThe create subcommand makes a new pipeline using the nf-core base template. With a given pipeline name, description and author, it makes a starter pipeline which follows nf-core best practices.\nAfter creating the files, the command initialises the folder as a git repository and makes an initial commit. This first “vanilla” commit which is identical to the output from the templating tool is important, as it allows us to keep your pipeline in sync with the base template in the future. See the nf-core syncing docs for more information.\nLet’s set up the nf-core template for today’s workshop:\nnf-core create\nAs we progress through the interactive prompts, we will use the following values below: \nRemember to swap out the Author name with your own!\nThe creates a pipeline called myrnaseq in the directory pmcc-myrnaseq (<prefix>-<name>) with mmyeung as the author. If selected exclude the following:\n\ngithub: removed all files required for GitHub hosting of the pipeline. Specifically, the .github folder and .gitignore file.\nci: removes the GitHub continuous integration tests from the pipeline. Specifically, the .github/workflows/ folder.\ngithub_badges: removes GitHub badges from the README.md file.\nigenomes: removes pipeline options related to iGenomes. 
Including the conf/igenomes.config file and all references to it.\nnf_core_configs: excludes nf_core/configs repository options, which make multiple config profiles for various institutional clusters available.\n\nTo run the pipeline creation silently (i.e. without any prompts) with the nf-core template, you can use the --plain option.\n\n\n\n\n\n\nAuthor name\n\n\n\nTypically, we would use your github username as the value here, this allows an extra layer of traceability.\n\n\n\n\n\n\n\n\nCustomised pipeline prefix\n\n\n\nRemember we are currently only making the most of the nf-core templates and not contributing back to nf-core. As such, we should not use the nf-core prefix to our pipeline.\n\n\n\n\n\n\n\n\nSkipped templates\n\n\n\nNote that the highlighted values under Skip template areas? are the sections that will be skipped. As this is a test pipeline we are skipping the set up of github CI and badges\n\n\nAs we have requested GitHub hosting, on completion of the command, you will note there are suggested github commands included in the output. Use these commands to push the commits from your computer. You can then continue to edit, commit and push normally as you build your pipeline.\n\n\n\nnf-core template\nLet’s see what has been minimally provided by nf-core create\nll pmcc-myrnaseq/\ntotal 47\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 assets\n-rw-rw-r-- 1 myeung myeung 372 Jun 11 15:00 CHANGELOG.md\n-rw-rw-r-- 1 myeung myeung 2729 Jun 11 15:00 CITATIONS.md\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 conf\ndrwxrwxr-x 3 myeung myeung 4096 Jun 11 15:00 docs\n-rw-rw-r-- 1 myeung myeung 1060 Jun 11 15:00 LICENSE\n-rw-rw-r-- 1 myeung myeung 3108 Jun 11 15:00 main.nf\ndrwxrwxr-x 3 myeung myeung 4096 Jun 11 15:00 modules\n-rw-rw-r-- 1 myeung myeung 1561 Jun 11 15:00 modules.json\n-rw-rw-r-- 1 myeung myeung 9982 Jun 11 15:00 nextflow.config\n-rw-rw-r-- 1 myeung myeung 16657 Jun 11 15:00 nextflow_schema.json\n-rw-rw-r-- 1 myeung myeung 3843 Jun 11 15:00 README.md\ndrwxrwxr-x 4 myeung myeung 4096 Jun 11 15:00 subworkflows\n-rw-rw-r-- 1 myeung myeung 165 Jun 11 15:00 tower.yml\ndrwxrwxr-x 2 myeung myeung 4096 Jun 11 15:00 workflows\nAs you take look through the files created you will see many comments through the files starting with // TODO nf-core. 
These are pointers from nf-core towards areas of the pipeline that you may be intersted in changing.\nThey are also the “key word” used by the nf-core lint.\n\nAlternative setups for nf-core create\nAside from the interactive setup we have just completed for nf-core create, there are two alternative methods.\n\nProvide the option using the optional flags from nf-core create\nProvide a template.yaml via the --template-yaml option\n\n\n\n\n\n\n\nChallenge\n\n\n\nCreate a second pipeline template using the optional flags with the name “myworkflow”, provide a description, author name and set the version to “0.0.1”\nWhat options are still you still prompted for?\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nRun the following:\nnf-core create --name myworkflow --description \"my workflow test\" --author \"@mmyeung\" --version \"0.0.1\"\nNote that you are still prompted for any additional customisations such as the pipeline prefix and steps to skip\n\n\n\n\n\n\n\n\n\nAdvanced Challange\n\n\n\nCreate another pipeline template using a yaml file called mytemplate.yaml\nHint: the key values in the yaml should be name, description, author, prefix and skip\nSet the pipeline to skip ci, igenomes and nf_core_configs\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nRun the following:\nvim mytemplate.yaml\nValues in mytemplate.yaml\nname: coolpipe\ndescription: A cool pipeline\nauthor: me\nprefix: myorg\nskip:\n - ci\n - igenomes\n - nf_core_configs\nnf-core create --template-yaml mytemplate.yaml" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#test-profile", - "href": "workshops/5.1_nf_core_template.html#test-profile", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.3 Test Profile", - "text": "5.3 Test Profile\nnf-core tries to encourage software engineering concepts such as minimal test sets, this can be set up using the conf/test.config and conf/test_full.config\nFor the duration of this workshop we will be making use of the conf/test.config, to test our pipeline.\nLet’s take a look at what is currently in the conf/test.config.\ncat pmcc-myrnaseq/conf/test.config\n/*\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n Nextflow config file for running minimal tests\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n Defines input files and everything required to run a fast and simple pipeline test.\n\n Use as follows:\n nextflow run pmcc/myrnaseq -profile test,<docker/singularity> --outdir <OUTDIR>\n\n----------------------------------------------------------------------------------------\n*/\n\nparams {\n config_profile_name = 'Test profile'\n config_profile_description = 'Minimal test dataset to check pipeline function'\n\n // Limit resources so that this can run on GitHub Actions\n max_cpus = 2\n max_memory = '6.GB'\n max_time = '6.h'\n\n // Input data\n // TODO nf-core: Specify the paths to your test data on nf-core/test-datasets\n // TODO nf-core: Give any required params for the test so that command line flags are not needed\n input = params.pipelines_testdata_base_path + 'viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'\n\n // Genome references\n genome = 'R64-1-1'\n}\nFrom this, we can see that this config uses the params scope to define:\n\nMaximal values for resources\nDirects the input parameter to a sample sheet hosted in the nf-core/testdata github\nSets the genome to “R64-1-1”\n\n\n\n\n\n\n\nHow does setting the parameter genome set all the genome 
references?\n\n\n\nThis is possible due to us using the igenomes configs from nf-core.\nYou can see in the conf/igenomes.config how nested within each genome definition are paths to various reference files.\nTo find out more about the igenomes project here\n\n\nFor the duration of this workshop we are going to use the data from nf-training that was cloned in the first workshop. We are also going to update our test.config to contain the igenomes_base parameter, as we have a local cache on the cluster.\ninput = \"/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/samplesheet.csv\"\noutdir = \"/scratch/users/${USER}/myrnaseqtest\"\n\n// genome references\ngenome = \"GRCh38\"\nigenomes_base = \"/data/janis/nextflow/references/genomes/ngi-igenomes\"\nAlso, we will need to change the value, custom_config_base to null, in nextflow.config\ncustom_config_base = null\nLet’s quickly check that our pipeline runs with the test profile.\ncd ..\nnextflow run ./pmcc-myrnaseq -profile test,singularity\n\n\n\n\n\n\nWhat’s the difference between the test.config and the test_full.config\n\n\n\nTypically the test.config contains the minimal test example, while the test_full.config contains at least one full sized example data." - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core-modules", - "href": "workshops/5.1_nf_core_template.html#nf-core-modules", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.4 Nf-core modules", - "text": "5.4 Nf-core modules\nYou can find all the nf-core modules that have been accepted and peer-tested by the community in nf-core modules.\nor with\nnf-core modules list remote\nyou can check which modules are installed localling in your pipeline by running nf-core modules list local, within the pipeline folder.\ncd pmcc-myrnaseq\n\nnf-core modules list local\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Modules installed in '.':\n┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Module Name ┃ Repository ┃ Version SHA ┃ Message ┃ Date ┃\n┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ fastqc │ nf-core/modules │ 285a505 │ Fix FastQC memory allocation (#5432) │ 2024-04-05 │\n│ multiqc │ nf-core/modules │ b7ebe95 │ Update MQC container (#5006) │ 2024-02-29 │\n└─────────────┴─────────────────┴─────────────┴──────────────────────────────────────┴────────────┘\n\n\n\n\n\n\nOverall Challenge\n\n\n\nWe are going to replicate sections of the workflow from last week.\nFASTQC -> Trimgalore -> FASTQC -> MULTIQC\n\n\n\n5.3.1 Installing nf-core modules\nThe general format for installing modules is as below.\nnf-core modules install <tool>/<subcommand>\n\n\n\n\n\n\nTip\n\n\n\nNote that if you search for the modules on the nf-core modules website, you can find the install command at the top of the tool\n\n\n\n\n\n\n\n\nTip\n\n\n\nRemember to run the install commands from within the nf-core pipeline folder (in this case the pmcc-myrnaseq folder)\nIf you are not in an nf-core folder you will see the following error\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nWARNING 'repository_type' not defined in 
.nf-core.yml\n? Is this repository an nf-core pipeline or a fork of nf-core/modules? (Use arrow keys)\n » Pipeline\n nf-core/modules\n\n\n\n\n\n\n\n\nChallenge\n\n\n\nInstall the following nf-core modules\n\ntrimgalore\nsalmon quant\nfastqc\n\nWhat happens when we try to install the fastqc module?\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUnfortunately, nf-core does not allow the installation of multiple modules in one line therefore we mush provide the commands separately for each module.\nnf-core modules install trimgalore\nnf-core modules install salmon/quant\nnf-core modules install fastqc\nNote that from above, when we checked which modules have been installed locally in our pipeline, fastqc was already installed. As such, we see the following output warning us that fastqc is installed and we can either force the reinstallation or we can update the module\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\nINFO Module 'fastqc' is already installed.\nINFO To update 'fastqc' run 'nf-core modules update fastqc'. To force reinstallation use '--force'. \n\n\n\n\n\n\n\n\n\nAdvanced Challenge\n\n\n\nCan you think of a way to streamline the installation of modules?\n\n\nfollowing the installation what files changed, check with\ngit status\nOn branch master\nChanges not staged for commit:\n (use \"git add <file>...\" to update what will be committed)\n (use \"git restore <file>...\" to discard changes in working directory)\n modified: modules.json\n\nUntracked files:\n (use \"git add <file>...\" to include in what will be committed)\n modules/nf-core/salmon/\n modules/nf-core/trimgalore/\n\nno changes added to commit (use \"git add\" and/or \"git commit -a\")\nmodules.json is a running record of the modules installed and should be included in your pipeline. 
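(For orientation, an entry in modules.json looks roughly like the sketch below; treat the exact fields as illustrative, since they can differ between nf-core/tools versions:
"fastqc": {
"branch": "master",
"git_sha": "285a50500f9e02578d90b3ce6382ea3c30216acd",
"installed_by": ["modules"]
}
)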
Note: you can find the github SHA for the exact “version” of the module installed.\nThis insulates your pipeline from when a module is deleted.\nrm -r modules/nf-core/salmon/quant\n\nnf-core modules list local\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Reinstalling modules found in 'modules.json' but missing from directory: 'modules/nf-core/salmon/quant'\nINFO Modules installed in '.':\n┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓\n┃ Module Name ┃ Repository ┃ Version SHA ┃ Message ┃ Date ┃\n┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩\n│ fastqc │ nf-core/modules │ 285a505 │ Fix FastQC memory allocation (#5432) │ 2024-04-05 │\n│ multiqc │ nf-core/modules │ b7ebe95 │ Update MQC container (#5006) │ 2024-02-29 │\n│ salmon/quant │ nf-core/modules │ cb6b2b9 │ fix stubs salmon (#5517) │ 2024-04-24 │\n│ trimgalore │ nf-core/modules │ a984184 │ run nf-core lint on trimgalore (#5129) │ 2024-03-15 │\n└──────────────┴─────────────────┴─────────────┴────────────────────────────────────────┴────────────┘\n\n\n\n\n\n\nAdvanced Challenge\n\n\n\nHow would you look up previous versions of the module?\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nThere are a few ways to approach this.\n\nYou could hop onto github and search throught the git history for the main.nf of the particular module, to identify the git SHA and provide it to the --sha flag.\nYou could run the install command with the --prompt flag, as seen below\n\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Module 'fastqc' is already installed.\n? Module fastqc is already installed.\nDo you want to force the reinstallation? Yes\n? Select 'fastqc' commit: (Use arrow keys)\n Fix FastQC memory allocation (#5432) 285a50500f9e02578d90b3ce6382ea3c30216acd (installed version)\n Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\n CHORES: update fasqc tests with new data organisation (#4760) c9488585ce7bd35ccd2a30faa2371454c8112fb9\n fix fastqc tests n snap (#4669) 617777a807a1770f73deb38c80004bac06807eef\n Update version strings (#4556) 65ad3e0b9a4099592e1102e92e10455dc661cf53\n Remove pytest-workflow tests for modules covered by nf-test (#4521) 3e8b0c1144ccf60b7848efbdc2be285ff20b49ee\n Add conda environment names (#4327) 3f5420aa22e00bd030a2556dfdffc9e164ec0ec5\n Fix conda declaration (#4252) 8fc1d24c710ebe1d5de0f2447ec9439fd3d9d66a\n Move conda environment to yml (#4079) 516189e968feb4ebdd9921806988b4c12b4ac2dc\n authors => maintainers (#4173) cfd937a668919d948f6fcbf4218e79de50c2f36f\n » older commits\n\n\n\n\n\n5.3.2 Updating nf-core modules\nAbove we got and error message for fastq because the module was already installed. 
As listed in the output, one of the suggested solutions is that we might be looking to update the module\nnf-core modules update fastqc\nAfter running the command you will find that you are prompted for whether you wish to view the differences between the current installation and the update.\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n? Do you want to view diffs of the proposed changes? (Use arrow keys)\n » No previews, just update everything\n Preview diff in terminal, choose whether to update files\n Just write diffs to a patch file\nFor the sake of this exercise, we are going to roll fastqc back by one commit.\nIf you select the 2nd option Preview diff in terminal, choose whether to update files\nnf-core modules update fastqc -p\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\n? Do you want to view diffs of the proposed changes? Preview diff in terminal, choose whether to update files\n? Select 'fastqc' commit: (Use arrow keys)\n Fix FastQC memory allocation (#5432) 285a50500f9e02578d90b3ce6382ea3c30216acd (installed version)\n » Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\n CHORES: update fasqc tests with new data organisation (#4760) c9488585ce7bd35ccd2a30faa2371454c8112fb9\n fix fastqc tests n snap (#4669) 617777a807a1770f73deb38c80004bac06807eef\n Update version strings (#4556) 65ad3e0b9a4099592e1102e92e10455dc661cf53\n Remove pytest-workflow tests for modules covered by nf-test (#4521) 3e8b0c1144ccf60b7848efbdc2be285ff20b49ee\n Add conda environment names (#4327) 3f5420aa22e00bd030a2556dfdffc9e164ec0ec5\n Fix conda declaration (#4252) 8fc1d24c710ebe1d5de0f2447ec9439fd3d9d66a\n Move conda environment to yml (#4079) 516189e968feb4ebdd9921806988b4c12b4ac2dc\n authors => maintainers (#4173) cfd937a668919d948f6fcbf4218e79de50c2f36f\n older commits\n? Select 'fastqc' commit: Update FASTQC to use unique names for snapshots (#4825) f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c\nINFO Changes in module 'nf-core/fastqc' between (285a50500f9e02578d90b3ce6382ea3c30216acd) and (f4ae1d942bd50c5c0b9bd2de1393ce38315ba57c)\nINFO Changes in 'fastqc/main.nf':\n --- modules/nf-core/fastqc/main.nf\n +++ modules/nf-core/fastqc/main.nf\n @@ -25,11 +25,6 @@\n def old_new_pairs = reads instanceof Path || reads.size() == 1 ? [[ reads, \"${prefix}.${reads.extension}\" ]] : reads.withIndex().collect { entry, index -> [ entry, \"${prefix}_${index + 1}.${entry.extension}\" ] }\n def rename_to = old_new_pairs*.join(' ').join(' ')\n def renamed_files = old_new_pairs.collect{ old_name, new_name -> new_name }.join(' ')\n -\n - def memory_in_mb = MemoryUnit.of(\"${task.memory}\").toUnit('MB')\n - // FastQC memory value allowed range (100 - 10000)\n - def fastqc_memory = memory_in_mb > 10000 ? 10000 : (memory_in_mb < 100 ? 
100 : memory_in_mb)\n -\n \"\"\"\n printf \"%s %s\\\\n\" $rename_to | while read old_name new_name; do\n [ -f \"\\${new_name}\" ] || ln -s \\$old_name \\$new_name\n @@ -38,7 +33,6 @@\n fastqc \\\\\n $args \\\\\n --threads $task.cpus \\\\\n - --memory $fastqc_memory \\\\\n $renamed_files\n\n cat <<-END_VERSIONS > versions.yml\nINFO 'modules/nf-core/fastqc/meta.yml' is unchanged\nINFO 'modules/nf-core/fastqc/environment.yml' is unchanged\nINFO 'modules/nf-core/fastqc/tests/main.nf.test.snap' is unchanged\nINFO 'modules/nf-core/fastqc/tests/tags.yml' is unchanged\nINFO 'modules/nf-core/fastqc/tests/main.nf.test' is unchanged\n? Update module 'fastqc'? No\nINFO Updates complete ✨ \n\n\n5.3.3 Removing nf-core modules\nAs mentioned above, if you decide that you don’t need a module anymore, you can’t just remove the folder with rm -r.\nFor nf-core to no longer register the module is to be distributed with your pipeline you need to use:\nnf-core modules remove\nAs an exercise, we are going to install the samtools/sort module\nnf-core modules install samtools/sort\nQuickly view the modules.json or use nf-core modules list local to view the changes from installing the module.\nNow remove the samtools/sort module\nnf-core modules remove samtools/sort\n\n\n\n\n\n\nOverall Challenge\n\n\n\nNow add the include module statements to the our workflows/myrnaseq.nf\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\ninclude { FASTQC as FASTQC_one } from '../modules/nf-core/fastq/main' \ninclude { FASTQC as FASTQC_two } from '../modules/nf-core/fastq/main' \n\ninclude { TRIMGALORE } from '../modules/nf-core/trimgalore/main'\n\n\n\n\n\n5.3.4 Writing modules with nf-core template\nFor this section we are going to refer to the nf-core guidelines for modules.\nWhile these are the full guidelines for contributing back to nf-core, there are still some general components that are good practice even if you are NOT planning to contribute.\n\n\n\n\n\n\nSummary of guidelines\n\n\n\n\nAll required and optional input files must be included in the input as a path variable\nThe command should run without any additional argument, any required flag values should be included as an input val variable\ntask.ext.args must be provided as a variable\nWhere possible all input and output files should be compressed (i.e. fastq.gz and .bam)\nA versions.yml file is output\nNaming conventions include using all lowercase without puntuation and follows the convention of software/tool (i.e. bwa/mem)\nAll outputs must include an emit definition\n\n\n\nWe are going to write up our own samtools/view module.\nnf-core modules create \n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO Repository type: pipeline\nINFO Press enter to use default values (shown in brackets) or type your own responses. ctrl+click underlined text to open links.\nName of tool/subtool: samtools/view\nINFO Using Bioconda package: 'bioconda::samtools=1.20'\nINFO Could not find a Docker/Singularity container (Unexpected response code `500` for https://api.biocontainers.pro/ga4gh/trs/v2/tools/samtools/versions/samtools-1.20) ## Cluster\nGitHub Username: (@author): @mmyeung\nINFO Provide an appropriate resource label for the process, taken from the nf-core pipeline template.\n For example: process_single, process_low, process_medium, process_high, process_long\n? 
Process resource label: process_low\nINFO Where applicable all sample-specific information e.g. 'id', 'single_end', 'read_group' MUST be provided as an input via a Groovy Map called\n 'meta'. This information may not be required in some instances, for example indexing reference genome files.\nWill the module require a meta map of sample information? [y/n] (y): y\nINFO Created component template: 'samtools/view'\nINFO Created following files:\n modules/local/samtools/view.nf\nAs we progressed through the interactive prompt, you will have noticed that nf-core always attempts to locate the corresponding bioconda package and singularity/Docker container.\n\n\n\n\n\n\nWhat happens when there is no bioconda package or container?\n\n\n\n\n\nnf-core modules create --author @mmyeung --label process_single --meta testscript\nThe command will indicate that the there is no bioconda package with the software name, and prompt you for a package name you might wish to use.\nINFO Repository type: pipeline\nINFO Press enter to use default values (shown in brackets) or type your own responses. ctrl+click underlined text to open links.\nWARNING Could not find Conda dependency using the Anaconda API: 'testscript'\nDo you want to enter a different Bioconda package name? [y/n]: n\nWARNING Could not find Conda dependency using the Anaconda API: 'testscript'\n Building module without tool software and meta, you will need to enter this information manually.\nINFO Created component template: 'testscript'\nINFO Created following files:\n modules/local/testscript.nf \nwithin the module .nf script you will note that the definitions for the conda and container are incomplete for the tool.\n conda \"${moduleDir}/environment.yml\"\n container \"${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?\n 'https://depot.galaxyproject.org/singularity/YOUR-TOOL-HERE':\n 'biocontainers/YOUR-TOOL-HERE' }\"\nnf-core has a large cache of containers here. Though you can also provide a simple path to docker hub.\n container \"mmyeung/trccustomunix:0.0.1\"\n\n\n\nThe resource labels, are those as defined in conf/base.config\n\n\n\n\n\n\nChallenge\n\n\n\nWrite up the inputs, outputs and script for samtools/view.\nAssume that all the inputs will be .bam and the outputs will also be .bam.\nFor reference look at the documentation for samtools/view\nAre there optional flags that take file inputs? What options need to set to ensure that the command runs without error?\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nprocess SAMTOOLS_VIEW {\n tag \"$meta.id\"\n label 'process_low'\n\n conda \"${moduleDir}/environment.yml\"\n container \"${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?\n 'https://depot.galaxyproject.org/singularity/samtools:1.20--h50ea8bc_0' :\n 'biocontainers/samtools:1.20--h50ea8bc_0' }\"\n\n input:\n tuple val(meta), path(input), path(index)\n tuple val(meta2), path(fasta)\n path bed\n path qname\n\n output:\n tuple val(meta), path(\"*.bam\"), emit: bam\n path \"versions.yml\", emit: versions\n\n when:\n task.ext.when == null || task.ext.when\n\n script:\n def args = task.ext.args ?: ''\n def args2 = task.ext.args2 ?: ''\n def prefix = task.ext.prefix ?: \"${meta.id}\"\n def reference = fasta ? \"--reference ${fasta}\" : \"\"\n def readnames = qname ? \"--qname-file ${qname}\": \"\"\n def regions = bed ? 
\"-L ${bed}\": \"\"\n if (\"$input\" == \"${prefix}.${file_type}\") error \"Input and output names are the same, use \\\"task.ext.prefix\\\" to disambiguate!\"\n \"\"\"\n samtools \\\\\n view \\\\\n -hb \\\\\n --threads ${task.cpus-1} \\\\\n ${reference} \\\\\n ${readnames} \\\\\n ${regions} \\\\\n $args \\\\\n -o ${prefix}.bam \\\\\n $input \\\\\n $args2\n\n cat <<-END_VERSIONS > versions.yml\n \"${task.process}\":\n samtools: \\$(echo \\$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\\$//')\n END_VERSIONS\n \"\"\"\n\n stub:\n def args = task.ext.args ?: ''\n def prefix = task.ext.prefix ?: \"${meta.id}\"\n def file_type = args.contains(\"--output-fmt sam\") ? \"sam\" :\n args.contains(\"--output-fmt bam\") ? \"bam\" :\n args.contains(\"--output-fmt cram\") ? \"cram\" :\n input.getExtension()\n if (\"$input\" == \"${prefix}.${file_type}\") error \"Input and output names are the same, use \\\"task.ext.prefix\\\" to disambiguate!\"\n\n def index = args.contains(\"--write-index\") ? \"touch ${prefix}.csi\" : \"\"\n\n \"\"\"\n touch ${prefix}.${file_type}\n ${index}\n\n cat <<-END_VERSIONS > versions.yml\n \"${task.process}\":\n samtools: \\$(echo \\$(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*\\$//')\n END_VERSIONS\n \"\"\"\n\n\n\nSimilar to nf-core create you can minimise a the number of prompts by using optional flags.\n\n\n\n\n\n\nOverall Challenge\n\n\n\nWrite up the short workflow as discussed above\nFASTQC -> trimgalore -> FASTQC -> MULTIQC" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core-subworkflow", - "href": "workshops/5.1_nf_core_template.html#nf-core-subworkflow", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.4 Nf-core subworkflow", - "text": "5.4 Nf-core subworkflow\nnf-core subworkflows\nor with\nnf-core subworkflows list remote\n\n5.4.1 Installing nf-core subworkflows\nSubworkflows can be updated/removed like modules\n\n\n\n\n\n\nChallenge\n\n\n\nInstall the subworkflow fastq_subsample_fq_salmon into the workflow\n\n\n\n\n\n\n\n\nCaution\n\n\n\n\n\nnf-core subworkflows install fastq_subsample_fq_salmon\n\n\n\n\n\n5.4.2 Writing subworkflows with nf-core template\n\n\n\n\n\n\nChallenge\n\n\n\nWrite up the QC_WF subworkflow from last week." - }, - { - "objectID": "workshops/5.1_nf_core_template.html#nf-core-schema-and-input-validation", - "href": "workshops/5.1_nf_core_template.html#nf-core-schema-and-input-validation", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "5.5 Nf-core schema and input validation", - "text": "5.5 Nf-core schema and input validation\nRelies on plugins written by nf-core community\nIn particular nf-validation\nnextflow_schmea.json is for pipeline parameters\nnf-core schema build\n\n ,--./,-.\n ___ __ __ __ ___ /,-._.--~\\\n |\\ | |__ __ / ` / \\ |__) |__ } {\n | \\| | \\__, \\__/ | \\ |___ \\`-._,-`-,\n `._,._,'\n\n nf-core/tools version 2.14.1 - https://nf-co.re\n\n\nINFO [✓] Default parameters match schema validation\nINFO [✓] Pipeline schema looks valid (found 32 params)\nINFO Writing schema with 32 params: 'nextflow_schema.json'\n🚀 Launch web builder for customisation and editing? [y/n]: y\nINFO Opening URL: https://nf-co.re/pipeline_schema_builder?id=1718112529_0841fa08f86f\nINFO Waiting for form to be completed in the browser. 
Remember to click Finished when you're done.\n⢿ Use ctrl+c to stop waiting and force exit.\nRecommend writing in web browser\njson format details additional reading\n\n\n\n\n\n\nChallenge\n\n\n\nWe are going add the input parameter for the transcript.fa\nThen install salmon/index and write up quant_wf subworkflow from last week.git\n\n\n\n5.5.2 Nf-core inputs\nnested in this schema is the input or samplesheet schema. unfortunately there isn’t a nice interface to help you write this schema yet.\n\nmeta: Allows you to predesignate the “key” with in the “meta”\nrequired: value must be included\ndependency: value is dependant on other value existing in samplesheet (i.e. fastq_2 must imply there is a fastq_1)\n\n\n\n5.6 Nf-core tools for launching\ncreate-params-file\n\n\n5.7 Nf-core for pipeline management\nbump-version ==> good software management to note down versions" - }, - { - "objectID": "workshops/5.1_nf_core_template.html#contributing-to-nf-core", - "href": "workshops/5.1_nf_core_template.html#contributing-to-nf-core", - "title": "Nextflow Development - Developing Modularised Workflows", - "section": "Contributing to nf-core", - "text": "Contributing to nf-core\nFull pipelines Please see the nf-core documentation for a full walkthrough of how to create a new nf-core workflow.\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, Nextflow Patterns materials from Nextflow, nf-core nf-core tools documentation and nf-validation" - }, - { - "objectID": "workshops/2.2_troubleshooting.html", - "href": "workshops/2.2_troubleshooting.html", - "title": "Troubleshooting Nextflow run", - "section": "", - "text": "2.2.1. Nextflow log\nIt is important to keep a record of the commands you have run to generate your results. Nextflow helps with this by creating and storing metadata and logs about the run in hidden files and folders in your current directory (unless otherwise specified). This data can be used by Nextflow to generate reports. It can also be queried using the Nextflow log command:\nnextflow log\nThe log command has multiple options to facilitate the queries and is especially useful while debugging a workflow and inspecting execution metadata. You can view all of the possible log options with -h flag:\nnextflow log -h\nTo query a specific execution you can use the RUN NAME or a SESSION ID:\nnextflow log <run name>\nTo get more information, you can use the -f option with named fields. For example:\nnextflow log <run name> -f process,hash,duration\nThere are many other fields you can query. You can view a full list of fields with the -l option:\nnextflow log -l\n\n\n\n\n\n\nChallenge\n\n\n\nUse the log command to view with process, hash, and script fields for your tasks from your most recent Nextflow execution.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the log command to get a list of you recent executions:\nnextflow log\nTIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND \n2023-11-21 22:43:14 14m 17s jovial_angela OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:05:49 1m 36s marvelous_shannon OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:10:00 1m 35s deadly_babbage OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\nQuery the process, hash, and script using the -f option for the most recent run:\nnextflow log marvelous_shannon -f process,hash,script\n\n[... 
truncated ...]\n\nNFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS 7c/f936d4 \n featureCounts \\\n -B -C -g gene_biotype -t exon \\\n -p \\\n -T 2 \\\n -a chr22_with_ERCC92.gtf \\\n -s 2 \\\n -o HBR_Rep1_ERCC.featureCounts.txt \\\n HBR_Rep1_ERCC.markdup.sorted.bam\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS\":\n subread: $( echo $(featureCounts -v 2>&1) | sed -e \"s/featureCounts v//g\")\n END_VERSIONS\n\n[... truncated ... ]\n\nNFCORE_RNASEQ:RNASEQ:MULTIQC 7a/8449d7 \n multiqc \\\n -f \\\n \\\n \\\n .\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:MULTIQC\":\n multiqc: $( multiqc --version | sed -e \"s/multiqc, version //g\" )\n END_VERSIONS\n \n\n\n\n\n\n2.2.2. Execution cache and resume\nTask execution caching is an essential feature of modern workflow managers. As such, Nextflow provides an automated caching mechanism for every execution. When using the Nextflow -resume option, successfully completed tasks from previous executions are skipped and the previously cached results are used in downstream tasks.\nNextflow caching mechanism works by assigning a unique ID to each task. The task unique ID is generated as a 128-bit hash value composing the the complete file path, file size, and last modified timestamp. These ID’s are used to create a separate execution directory where the tasks are executed and the outputs are stored. Nextflow will take care of the inputs and outputs in these folders for you.\nYou can re-launch the previously executed nf-core/rnaseq workflow again, but with a -resume flag, and observe the progress. Notice the time it takes to complete the workflow.\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \n\n[80/ec6ff8] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF2BED (chr22_with_ERCC92.gtf) [100%] 1 of 1, cached: 1 ✔\n[1a/7bec9c] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_GENE_FILTER (chr22_with_ERCC92.fa) [100%] 1 of 1, cached: 1 ✔\nExecuting this workflow will create a my_results directory with selected results files and add some further sub-directories into the work directory\nIn the schematic above, the hexadecimal numbers, such as 80/ec6ff8, identify the unique task execution. 
These numbers are also the prefix of the work directories where each task is executed.\nYou can inspect the files produced by a task by looking inside the work directory and using these numbers to find the task-specific execution path:\nls work/80/ec6ff8ba69a8b5b8eede3679e9f978/\nIf you look inside the work directory of a FASTQC task, you will find the files that were staged and created when this task was executed:\n>>> ls -la work/e9/60b2e80b2835a3e1ad595d55ac5bf5/ \n\ntotal 15895\ndrwxrwxr-x 2 rlupat rlupat 4096 Nov 22 03:39 .\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 03:38 ..\n-rw-rw-r-- 1 rlupat rlupat 0 Nov 22 03:39 .command.begin\n-rw-rw-r-- 1 rlupat rlupat 9509 Nov 22 03:39 .command.err\n-rw-rw-r-- 1 rlupat rlupat 9609 Nov 22 03:39 .command.log\n-rw-rw-r-- 1 rlupat rlupat 100 Nov 22 03:39 .command.out\n-rw-rw-r-- 1 rlupat rlupat 10914 Nov 22 03:39 .command.run\n-rw-rw-r-- 1 rlupat rlupat 671 Nov 22 03:39 .command.sh\n-rw-rw-r-- 1 rlupat rlupat 231 Nov 22 03:39 .command.trace\n-rw-rw-r-- 1 rlupat rlupat 1 Nov 22 03:39 .exitcode\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2368 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 697080 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 490526 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 6735205 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2688 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 695591 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 485732 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 7088948 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 109 Nov 22 03:39 versions.yml\nThe FASTQC process runs twice, executing in a different work directories for each set of inputs. Therefore, in the previous example, the work directory [e9/60b2e8] represents just one of the four sets of input data that was processed.\nIt’s very likely you will execute a workflow multiple times as you find the parameters that best suit your data. You can save a lot of spaces (and time) by resuming a workflow from the last step that was completed successfully and/or unmodified.\nIn practical terms, the workflow is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files. If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.\nNotably, the -resume functionality is very sensitive. 
Even touching a file in the work directory can invalidate the cache.\n\n\n\n\n\n\nChallenge\n\n\n\nInvalidate the cache by touching a .fastq.gz file in a FASTQC task work directory (you can use the touch command). Execute the workflow again with the -resume option to show that the cache has been invalidated.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nExecute the workflow for the first time (if you have not already).\nUse the task ID shown for the FASTQC process and use it to find and touch a the sample1_R1.fastq.gz file:\ntouch work/ff/21abfa87cc7cdec037ce4f36807d32/HBR_Rep1_ERCC_1.fastq.gz\nExecute the workflow again with the -resume command option:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nYou should see that some task were invalid and were executed again.\nWhy did this happen?\nIn this example, the cache of two FASTQC tasks were invalid. The fastq file we touch is used by in the pipeline in multiple places. Thus, touching the symlink for this file and changing the date of last modification disrupted two task executions.\n\n\n\n\n\n2.2.3. Troubleshoot warning and error messages\nWhile our previous workflow execution completed successfully, there were a couple of warning messages that may be cause for concern:\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 20-Nov-2023 00:29:04\nDuration : 10m 15s\nCPU hours : 0.3 \nSucceeded : 72\n\n\n\n\n\n\nHandling dodgy error messages 🤬\n\n\n\nThe first warning message isn’t very descriptive (see this pull request). You might come across issues like this when running nf-core pipelines, too. Bug reports and user feedback is very important to open source software communities like nf-core. If you come across any issues, submit a GitHub issue or start a discussion in the relevant nf-core Slack channel so others are aware and it can be addressed by the pipeline’s developers.\n\n\n➤ Take a look at the MultiQC report, as directed by the second message. You can find the MultiQC report in the lesson2.1/ directory:\nls -la lesson2.1/multiqc/star_salmon/\ntotal 1402\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 00:29 .\ndrwxrwxr-x 3 rlupat rlupat 4096 Nov 22 00:29 ..\ndrwxrwxr-x 2 rlupat rlupat 8192 Nov 22 00:29 multiqc_data\ndrwxrwxr-x 5 rlupat rlupat 4096 Nov 22 00:29 multiqc_plots\n-rw-rw-r-- 1 rlupat rlupat 1419998 Nov 22 00:29 multiqc_report.html\n➤ Download the multiqc_report.html the file navigator panel on the left side of your VS Code window by right-clicking on it and then selecting Download. Open the file on your computer.\nTake a look a the section labelled WARNING: Fail Strand Check\nThe warning we have received is indicating that the read strandedness we specified in our samplesheet.csv and inferred strandedness identified by the RSeqQC process in the pipeline do not match. 
It looks like the test samplesheet have incorrectly specified strandedness as forward in the samplesheet.csv when our raw reads actually show an equal distribution of sense and antisense reads.\nFor those who are not familiar with RNAseq data, incorrectly specified strandedness may negatively impact the read quantification step (process: Salmon quant) and give us inaccurate results. So, let’s clarify how the Salmon quant process is gathering strandedness information for our input files by default and find a way to address this with the parameters provided by the nf-core/rnaseq pipeline.\n\n\n\n2.2.4. Identify the run command for a process\nTo observe exactly what command is being run for a process, we can attempt to infer this information from the module’s main.nf script in the modules/ directory. However, given all the different parameters that may be applied at the process level, this may not be very clear.\n➤ Take a look at the Salmon quant main.nf file:\nnf-core-rnaseq-3.11.1/workflow/modules/nf-core/salmon/quant/main.nf\nUnless you are familiar with developing nf-core pipelines, it can be very hard to see what is actually happening in the code, given all the different variables and conditional arguments inside this script. Above the script block we can see strandedness is being applied using a few different conditional arguments. Instead of trying to infer how the $strandedness variable is being defined and applied to the process, let’s use the hidden command files saved for this task in the work/ directory.\n\n\n\n\n\n\nHidden files in the work directory!\n\n\n\nRemember that the pipeline’s results are cached in the work directory. In addition to the cached files, each task execution directories inside the work directory contains a number of hidden files:\n\n.command.sh: The command script run for the task.\n.command.run: The command wrapped used to run the task.\n.command.out: The task’s standard output log.\n.command.err: The task’s standard error log.\n.command.log: The wrapper execution output.\n.command.begin: A file created as soon as the job is launched.\n.exitcode: A file containing the task exit code (0 if successful)\n\n\n\nWith nextflow log command that we discussed previously, there are multiple options to facilitate the queries and is especially useful while debugging a pipeline and while inspecting pipeline execution metadata.\nTo understand how Salmon quant is interpreting strandedness, we’re going to use this command to track down the hidden .command.sh scripts for each Salmon quant task that was run. This will allow us to find out how Salmon quant handles strandedness and if there is a way for us to override this.\n➤ Use the Nextflow log command to get the unique run name information of the previously executed pipelines:\nnextflow log <run-name>\nThat command will list out all the work subdirectories for all processes run.\nAnd we now need to find the specific hidden.command.sh for Salmon tasks. But how to find them? 🤔\n➤ Let’s add some custom bash code to query a Nextflow run with the run name from the previous lesson. First, save your run name in a bash variable. 
For example:\nrun_name=marvelous_shannon\n➤ And let’s save the tool of interest (salmon) in another bash variable to pull it from a run command:\ntool=salmon\n➤ Next, run the following bash command:\nnextflow log ${run_name} | while read line;\n do\n cmd=$(ls ${line}/.command.sh 2>/dev/null);\n if grep -q $tool $cmd;\n then \n echo $cmd; \n fi; \n done \nThat will list all process .command.sh scripts containing ‘salmon’. There are a few different processes that run Salmon to perform other steps in the workflow. We are looking for Salmon quant which performs the read quantification:\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/57/fba8f9a2385dac5fa31688ba1afa9b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/30/0113a58c14ca8d3099df04ebf388f3/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/ec/95d6bd12d578c3bce22b5de4ed43fe/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/49/6fedcb09e666432ae6ddf8b1e8f488/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/b4/2ca8d05b049438262745cde92955e9/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/38/875d68dae270504138bb3d72d511a7/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/72/776810a99695b1c114cbb103f4a0e6/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/1c/dc3f54cc7952bf55e6742dd4783392/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/f3/5116a5b412bde7106645671e4c6ffb/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/17/fb0c791810f42a438e812d5c894ebf/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/4c/931a9b60b2f3cf770028854b1c673b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/91/e1c99d8acb5adf295b37fd3bbc86a5/.command.sh\nCompared with the salmon quant main.nf file, we get a lot more fine scale details from the .command.sh process scripts:\n>>> cat main.nf\nsalmon quant \\\\\n --geneMap $gtf \\\\\n --threads $task.cpus \\\\\n --libType=$strandedness \\\\\n $reference \\\\\n $input_reads \\\\\n $args \\\\\n -o $prefix\n>>> cat .command.sh\nsalmon quant \\\n --geneMap chr22_with_ERCC92.gtf \\\n --threads 2 \\\n --libType=ISF \\\n -t genome.transcripts.fa \\\n -a HBR_Rep1_ERCC.Aligned.toTranscriptome.out.bam \\\n \\\n -o HBR_Rep1_ERCC\nLooking at the nf-core/rnaseq Parameter documentation and Salmon documentation, we found that we can override this default using the --salmon_quant_libtype A parameter to indicate our data is unstranded and override samplesheet.csv input.\n\n\n\n\n\n\nHow do I get rid of the strandedness check warning message?\n\n\n\nIf we want to get rid of the warning message Please check MultiQC report: 2/2 samples failed strandedness check, we’ll have to change the strandedness fields in our samplesheet.csv. Keep in mind, doing this will invalidate the pipeline’s cache and cause the pipeline to run from the beginning.\n\n\n\n\n\n2.2.5. Write a parameter file\nFrom the previous section we learn that Nextflow accepts either yaml or json formats for parameter files. Any of the pipeline-specific parameters can be supplied to a Nextflow pipeline in this way.\n\n\n\n\n\n\nChallenge\n\n\n\nFill in the parameters file below and save as workshop-params.yaml. 
This time, include the --salmon_quant_libtype A parameter.\n💡 YAML formatting tips!\n\nStrings need to be inside double quotes\nBooleans (true/false) and numbers do not require quotes\n\ninput: \"\"\noutdir: \"lesson2.2\"\nfasta: \"\"\ngtf: \"\"\nstar_index: \"\"\nsalmon_index: \"\"\nskip_markduplicates: \nsave_trimmed: \nsave_unaligned: \nsalmon_quant_libtype: \"A\" \n\n\n\n\n2.2.6. Apply the parameter file\n➤ Once your params file has been saved, run:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nThe number of pipeline-specific parameters we’ve added to our run command has been significantly reduced. The only -- parameters we’ve provided to the run command relate to how the pipeline is executed on our interative job. These resource limits won’t be applicable to others who will run the pipeline on a different infrastructure.\nAs the workflow runs a second time, you will notice 4 things:\n\nThe command is much tidier thanks to offloading some parameters to the params file\nThe -resume flag. Nextflow has lots of run options including the ability to use cached output!\nSome processes will be pulled from the cache. These processes remain unaffected by our addition of a new parameter.\n\nThis run of the pipeline will complete in a much shorter time.\n\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 21-Apr-2023 05:58:06\nDuration : 1m 51s\nCPU hours : 0.3 (82.2% cached)\nSucceeded : 11\nCached : 55\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" - }, { "objectID": "workshops/1.2_intro_nf_core.html", "href": "workshops/1.2_intro_nf_core.html", diff --git a/sessions/1_intro_run_nf.html b/sessions/1_intro_run_nf.html index 77c4065..3811225 100644 --- a/sessions/1_intro_run_nf.html +++ b/sessions/1_intro_run_nf.html @@ -144,6 +144,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/sessions/2_nf_dev_intro.html b/sessions/2_nf_dev_intro.html index 73234e8..667f924 100644 --- a/sessions/2_nf_dev_intro.html +++ b/sessions/2_nf_dev_intro.html @@ -144,6 +144,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • @@ -275,8 +279,8 @@

    Workshop schedule

    12th Jun 2024 -Working with Nextflow Built-in Functions -Introduction to nextflow operators, metadata propagation, grouping, and splitting +Working with Nextflow Built-in Functions operators [metadata] output-scatter-gather +Introduction to nextflow operators, metadata propagation, scatter, and gather 19th Jun 2024 diff --git a/sitemap.xml b/sitemap.xml index f4c1d24..8d51feb 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,58 +2,62 @@ https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/2_nf_dev_intro.html - 2024-06-18T14:54:38.961Z + 2024-06-18T17:57:28.085Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/index.html - 2024-06-18T14:54:38.132Z + 2024-06-18T17:57:27.250Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.3_tips_and_tricks.html - 2024-06-18T14:54:36.432Z + 2024-06-18T17:57:25.584Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/6.1_operators.html - 2024-06-18T14:54:35.497Z + 2024-06-18T17:57:24.652Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_modules.html - 2024-06-18T14:54:33.932Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/8.1_scatter_gather_output.html + 2024-06-18T17:57:23.110Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html - 2024-06-18T14:54:32.284Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html + 2024-06-18T17:57:21.650Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html - 2024-06-18T14:54:30.435Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/5.1_nf_core_template.html + 2024-06-18T17:57:20.502Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/3.1_creating_a_workflow.html - 2024-06-18T14:54:29.766Z + 2024-06-18T17:57:18.440Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/5.1_nf_core_template.html - 2024-06-18T14:54:31.887Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html + 2024-06-18T17:57:19.115Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html - 2024-06-18T14:54:33.050Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html + 2024-06-18T17:57:20.937Z + + + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_modules.html + 2024-06-18T17:57:22.515Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html - 2024-06-18T14:54:34.745Z + 2024-06-18T17:57:23.910Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/00_setup.html - 2024-06-18T14:54:35.926Z + 2024-06-18T17:57:25.093Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.1_customise_and_run.html - 2024-06-18T14:54:37.766Z + 2024-06-18T17:57:26.880Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/1_intro_run_nf.html - 2024-06-18T14:54:38.554Z + 2024-06-18T17:57:27.669Z diff --git a/workshops/00_setup.html b/workshops/00_setup.html index 994494f..340a740 100644 --- a/workshops/00_setup.html +++ b/workshops/00_setup.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/1.1_intro_nextflow.html b/workshops/1.1_intro_nextflow.html index b45c230..df279c3 100644 --- a/workshops/1.1_intro_nextflow.html +++ b/workshops/1.1_intro_nextflow.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/1.2_intro_nf_core.html b/workshops/1.2_intro_nf_core.html index 5e59cf5..3f94a5b 100644 --- a/workshops/1.2_intro_nf_core.html +++ b/workshops/1.2_intro_nf_core.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/2.1_customise_and_run.html b/workshops/2.1_customise_and_run.html index 1674bf1..d5265e4 100644 --- a/workshops/2.1_customise_and_run.html +++ b/workshops/2.1_customise_and_run.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/2.2_troubleshooting.html b/workshops/2.2_troubleshooting.html index 092b37d..89ce41f 100644 --- a/workshops/2.2_troubleshooting.html +++ b/workshops/2.2_troubleshooting.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/2.3_tips_and_tricks.html b/workshops/2.3_tips_and_tricks.html index a2f2b6c..b2d3f5a 100644 --- a/workshops/2.3_tips_and_tricks.html +++ b/workshops/2.3_tips_and_tricks.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/3.1_creating_a_workflow.html b/workshops/3.1_creating_a_workflow.html index 54d4007..60cca6b 100644 --- a/workshops/3.1_creating_a_workflow.html +++ b/workshops/3.1_creating_a_workflow.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/4.1_draft_future_sess.html b/workshops/4.1_draft_future_sess.html index d9f53f1..cc36782 100644 --- a/workshops/4.1_draft_future_sess.html +++ b/workshops/4.1_draft_future_sess.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/4.1_modules.html b/workshops/4.1_modules.html index b51ddba..3898fc7 100644 --- a/workshops/4.1_modules.html +++ b/workshops/4.1_modules.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/5.1_nf_core_template.html b/workshops/5.1_nf_core_template.html index 4713c68..ad5ed72 100644 --- a/workshops/5.1_nf_core_template.html +++ b/workshops/5.1_nf_core_template.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/6.1_operators.html b/workshops/6.1_operators.html index 85260e4..d153f8b 100644 --- a/workshops/6.1_operators.html +++ b/workshops/6.1_operators.html @@ -178,6 +178,10 @@
  • Nextflow Operators +
  • +
  • + + Output, scatter, and Gather
  • diff --git a/workshops/8.1_scatter_gather_output.html b/workshops/8.1_scatter_gather_output.html new file mode 100644 index 0000000..6573e61 --- /dev/null +++ b/workshops/8.1_scatter_gather_output.html @@ -0,0 +1,1175 @@ + + + + + + + + + +Peter Mac Nextflow Workshop - Nextflow Development - Outputs, Scatter, and Gather + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    + +
    + + + + +
    + +
    +
    +

    Nextflow Development - Outputs, Scatter, and Gather

    +
    + + + +
    + + + + +
    + + +
    + +
    +
    +
    + +
    +
    +Objectives +
    +
    +
    +
      +
    • Gain an understanding of how to structure nextflow published outputs
    • +
    • Gain an understanding of how to do scatter & gather processes
    • +
    +
    +
    +
    +

    Environment Setup

    +

    Set up an interactive shell to run our Nextflow workflow:

    +
    srun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash
    +

    Load the required modules to run Nextflow:

    +
    module load nextflow/23.04.1
    +module load singularity/3.7.3
    +

    Set the singularity cache environment variable:

    +
    export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow
    +

    Singularity images downloaded by workflow executions will now be stored in this directory.

    +

You may want to include these, or other environment variables, in your .bashrc file (or equivalent) that is loaded when you log in, so you don’t need to export variables every session. A complete list of environment variables can be found here.

    +

    The training data can be cloned from:

    +
    git clone https://github.com/nextflow-io/training.git
    +
    +
    +

    RNA-seq Workflow and Module Files

    +

    Previously, we created three Nextflow files and one config file:

    +
    ├── nextflow.config
    +├── rnaseq.nf
    +├── modules.nf
    +└── modules
    +    └── trimgalore.nf
    +
      +
    • rnaseq.nf: the main workflow script, where parameters are defined and processes are called.
    • +
    +
    #!/usr/bin/env nextflow
    +
    +params.reads = "/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq"
    +params.transcriptome_file = "/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa"
    +
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { INDEX } from './modules.nf'
    +include { QUANTIFICATION as QT } from './modules.nf'
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +include { FASTQC as FASTQC_two } from './modules.nf'
    +include { MULTIQC } from './modules.nf'
    +include { TRIMGALORE } from './modules/trimgalore.nf'
    +
    +workflow {
    +  index_ch = INDEX(params.transcriptome_file)
    +  quant_ch = QT(index_ch, reads_ch)
    +  fastqc_ch = FASTQC_one(reads_ch)
    +  trimgalore_out_ch = TRIMGALORE(reads_ch).reads
    +  fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)
    +  multiqc_ch = MULTIQC(quant_ch, fastqc_ch)
    +}
    +
      +
    • modules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC
    • +
    +
    process INDEX {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img"
    +
    +    input:
    +    path transcriptome
    +
    +    output:
    +    path "salmon_idx"
    +
    +    script:
    +    """
    +    salmon index --threads $task.cpus -t $transcriptome -i salmon_idx
    +    """
    +}
    +
    +process QUANTIFICATION {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img"
    +
    +    input:
    +    path salmon_index
    +    tuple val(sample_id), path(reads)
    +
    +    output:
    +    path "$sample_id"
    +
    +    script:
    +    """
    +    salmon quant --threads $task.cpus --libType=U \
    +    -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    +    """
    +}
    +
    +process FASTQC {
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img"
    +
    +    input:
    +    tuple val(sample_id), path(reads)
    +
    +    output:
    +    path "fastqc_${sample_id}_logs"
    +
    +    script:
    +    """
    +    mkdir fastqc_${sample_id}_logs
    +    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    +    """
    +}
    +
    +process MULTIQC {
    +    publishDir params.outdir, mode:'copy'
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img"
    +
    +    input:
    +    path quantification
    +    path fastqc
    +
    +    output:
    +    path "*.html"
    +
    +    script:
    +    """
    +    multiqc . --filename $quantification
    +    """
    +}
    +
      +
    • modules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process
    • +
    +
    process TRIMGALORE {
    +  container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' 
    +
    +  input:
    +    tuple val(sample_id), path(reads)
    +  
    +  output:
    +    tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads
    +    tuple val(sample_id), path("*report.txt")                        , emit: log     , optional: true
    +    tuple val(sample_id), path("*unpaired*.fq.gz")                   , emit: unpaired, optional: true
    +    tuple val(sample_id), path("*.html")                             , emit: html    , optional: true
    +    tuple val(sample_id), path("*.zip")                              , emit: zip     , optional: true
    +
    +  script:
    +    """
    +    trim_galore \\
    +      --paired \\
    +      --gzip \\
    +      ${reads[0]} \\
    +      ${reads[1]}
    +    """
    +}
    +
      +
    • nextflow.config: config file that enables singularity
    • +
    +
    singularity {
    +  enabled = true
    +  autoMounts = true
    +  cacheDir = "/config/binaries/singularity/containers_devel/nextflow"
    +}
    +

    Run the pipeline, specifying --outdir:

    +
    >>> nextflow run rnaseq.nf --outdir output
    +N E X T F L O W  ~  version 23.04.1
    +Launching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d
    +executor >  local (16)
    +[93/d37ef0] process > INDEX          [100%] 1 of 1 ✔
    +[b3/4c4d9c] process > QT (1)         [100%] 3 of 3 ✔
    +[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔
    +[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔
    +[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔
    +[e0/bcf904] process > MULTIQC (3)    [100%] 3 of 3 ✔
    +
    +
    +

    8.1. Organise outputs

    +

    The output declaration block defines the channels used by the process to send out the results it produces. However, these outputs only remain in the work/ directory unless a publishDir directive is specified.

    +

    Given that each task is executed in a separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files to a results folder.

    +

    To store our workflow result files, you need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:

    +
    process MULTIQC {
    +    publishDir params.outdir, mode:'copy'
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img"
    +
    +    input:
    +    path quantification
    +    path fastqc
    +
    +    output:
    +    path "*.html"
    +
    +    script:
    +    """
    +    multiqc . --filename $quantification
    +    """
    +}
    +

    The above example will copy all html files created by the MULTIQC process into the directory path specified by params.outdir.
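
    The mode option controls how each published file is produced. A minimal sketch (the process name BAR and the file names are purely illustrative): the default mode 'symlink' only links results back into work/, while 'copy' writes an independent copy; 'link' and 'move' are other documented modes.

    +
    process BAR {
    +    // default mode is 'symlink': the published file is just a link into work/
    +    publishDir "results/links"
    +    // 'copy' writes an independent copy that survives cleaning of work/
    +    publishDir "results/copies", mode: 'copy'
    +
    +    output:
    +    path "*.txt"
    +
    +    script:
    +    """
    +    echo example > example.txt
    +    """
    +}
    +
    +workflow {
    +    BAR()
    +}
    +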

    +
    +
    +

    8.1.1. Store outputs matching a glob pattern

    +

    You can use more than one publishDir directive to keep different outputs in separate directories. For each directive, specify a different glob pattern using the pattern option so that each directory only receives the files matching that pattern.

    +

    For example:

    +
    reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')
    +
    +process FOO {
    +    publishDir "results/bam", pattern: "*.bam"
    +    publishDir "results/bai", pattern: "*.bai"
    +
    +    input:
    +    tuple val(sample_id), path(sample_id_paths)
    +
    +    output:
    +    tuple val(sample_id), path("*.bam")
    +    tuple val(sample_id), path("*.bai")
    +
    +    script:
    +    """
    +    echo your_command_here --sample $sample_id_paths > ${sample_id}.bam
    +    echo your_command_here --sample $sample_id_paths > ${sample_id}.bai
    +    """
    +}
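
    Note that the snippet above only declares the input channel and the process; to actually exercise the two publishDir directives you would also invoke FOO from a workflow block, for example:

    +
    workflow {
    +    FOO(reads_ch)
    +}
    +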
    +

    Exercise

    +

    Use publishDir and pattern to keep the outputs from trimgalore.nf in separate directories.

    +
    + +
    +
    +
    process TRIMGALORE {
    +  container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' 
    +  publishDir "$params.outdir/report", mode: "copy", pattern:"*report.txt"
    +  publishDir "$params.outdir/trimmed_fastq", mode: "copy", pattern:"*fq.gz"
    +
    +  input:
    +    tuple val(sample_id), path(reads)
    +  
    +  output:
    +    tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads
    +    tuple val(sample_id), path("*report.txt")                        , emit: log     , optional: true
    +    tuple val(sample_id), path("*unpaired*.fq.gz")                   , emit: unpaired, optional: true
    +    tuple val(sample_id), path("*.html")                             , emit: html    , optional: true
    +    tuple val(sample_id), path("*.zip")                              , emit: zip     , optional: true
    +
    +  script:
    +    """
    +    trim_galore \\
    +      --paired \\
    +      --gzip \\
    +      ${reads[0]} \\
    +      ${reads[1]}
    +    """
    +}
    +

    The output should now look like:

    +
    >>> tree ./output
    +./output
    +├── gut.html
    +├── liver.html
    +├── lung.html
    +├── report
    +│   ├── gut_1.fq_trimming_report.txt
    +│   ├── gut_2.fq_trimming_report.txt
    +│   ├── liver_1.fq_trimming_report.txt
    +│   ├── liver_2.fq_trimming_report.txt
    +│   ├── lung_1.fq_trimming_report.txt
    +│   └── lung_2.fq_trimming_report.txt
    +└── trimmed_fastq
    +    ├── gut_1_val_1.fq.gz
    +    ├── gut_2_val_2.fq.gz
    +    ├── liver_1_val_1.fq.gz
    +    ├── liver_2_val_2.fq.gz
    +    ├── lung_1_val_1.fq.gz
    +    └── lung_2_val_2.fq.gz
    +
    +2 directories, 15 files
    +
    +
    +
    +
    +
    +

    8.1.2. Store outputs renaming files or in a sub-directory

    +

    The publishDir directive also allows the use of the saveAs option to give each file a name of your choice, by providing a custom rule as a closure.

    +
    process foo {
    +  publishDir 'results', saveAs: { filename -> "foo_$filename" }
    +
    +  output: 
    +  path '*.txt'
    +
    +  '''
    +  touch this.txt
    +  touch that.txt
    +  '''
    +}
    +

    The same pattern can be used to store specific files in separate directories depending on their names.

    +
    process foo {
    +  publishDir 'results', saveAs: { filename -> filename.endsWith(".zip") ? "zips/$filename" : filename }
    +
    +  output: 
    +  path '*'
    +
    +  '''
    +  touch this.txt
    +  touch that.zip
    +  '''
    +}
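
    A related trick with the same option: if the saveAs closure returns null for a file, that file is not published at all and remains only in the work/ directory. A minimal sketch, where skip_me.txt is just an illustrative file name:

    +
    process foo {
    +  // returning null from saveAs skips publishing that particular file
    +  publishDir 'results', saveAs: { filename -> filename == 'skip_me.txt' ? null : filename }
    +
    +  output: 
    +  path '*'
    +
    +  '''
    +  touch keep_me.txt
    +  touch skip_me.txt
    +  '''
    +}
    +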
    +

    Exercise

    +

    Modify the MULTIQC output with saveAs such that the resulting folder is as follows:

    +
    ./output
    +├── MultiQC
    +│   ├── multiqc_gut.html
    +│   ├── multiqc_liver.html
    +│   └── multiqc_lung.html
    +├── report
    +│   ├── gut_1.fq_trimming_report.txt
    +│   ├── gut_2.fq_trimming_report.txt
    +│   ├── liver_1.fq_trimming_report.txt
    +│   ├── liver_2.fq_trimming_report.txt
    +│   ├── lung_1.fq_trimming_report.txt
    +│   └── lung_2.fq_trimming_report.txt
    +└── trimmed_fastq
    +    ├── gut_1_val_1.fq.gz
    +    ├── gut_2_val_2.fq.gz
    +    ├── liver_1_val_1.fq.gz
    +    ├── liver_2_val_2.fq.gz
    +    ├── lung_1_val_1.fq.gz
    +    └── lung_2_val_2.fq.gz
    +
    +3 directories, 15 files
    +
    +
    +
    + +
    +
    +Warning +
    +
    +
    +

    You need to remove the existing output folder/files if you want a clean output directory. By default, Nextflow will overwrite existing files and keep any remaining files in the same specified output directory.
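
    If you prefer to keep files that are already in the output directory, the directive also accepts an overwrite option. A minimal sketch (the process name report is only illustrative; during a normal run overwrite defaults to true):

    +
    process report {
    +  // with overwrite: false, a file already present in results/ is left untouched
    +  publishDir 'results', mode: 'copy', overwrite: false
    +
    +  output:
    +  path 'summary.txt'
    +
    +  '''
    +  echo "run summary" > summary.txt
    +  '''
    +}
    +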

    +
    +
    +
    + +
    +
    +
    process MULTIQC {
    +    publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(".html") ? "MultiQC/multiqc_$filename" : filename }
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img"
    +
    +    input:
    +    path quantification
    +    path fastqc
    +
    +    output:
    +    path "*.html"
    +
    +    script:
    +    """
    +    multiqc . --filename $quantification
    +    """
    +}
    +
    +
    +
    +

    Challenge

    +

    Modify all the processes in rnaseq.nf such that we end up with the following output structure:

    +
    ./output
    +├── gut
    +│   ├── QC
    +│   │   ├── fastqc_gut_logs
    +│   │   │   ├── gut_1_fastqc.html
    +│   │   │   ├── gut_1_fastqc.zip
    +│   │   │   ├── gut_2_fastqc.html
    +│   │   │   └── gut_2_fastqc.zip
    +│   │   └── gut.html
    +│   ├── report
    +│   │   ├── gut_1.fq_trimming_report.txt
    +│   │   └── gut_2.fq_trimming_report.txt
    +│   └── trimmed_fastq
    +│       ├── gut_1_val_1.fq.gz
    +│       └── gut_2_val_2.fq.gz
    +├── liver
    +│   ├── QC
    +│   │   ├── fastqc_liver_logs
    +│   │   │   ├── liver_1_fastqc.html
    +│   │   │   ├── liver_1_fastqc.zip
    +│   │   │   ├── liver_2_fastqc.html
    +│   │   │   └── liver_2_fastqc.zip
    +│   │   └── liver.html
    +│   ├── report
    +│   │   ├── liver_1.fq_trimming_report.txt
    +│   │   └── liver_2.fq_trimming_report.txt
    +│   └── trimmed_fastq
    +│       ├── liver_1_val_1.fq.gz
    +│       └── liver_2_val_2.fq.gz
    +└── lung
    +    ├── QC
    +    │   ├── fastqc_lung_logs
    +    │   │   ├── lung_1_fastqc.html
    +    │   │   ├── lung_1_fastqc.zip
    +    │   │   ├── lung_2_fastqc.html
    +    │   │   └── lung_2_fastqc.zip
    +    │   └── lung.html
    +    ├── report
    +    │   ├── lung_1.fq_trimming_report.txt
    +    │   └── lung_2.fq_trimming_report.txt
    +    └── trimmed_fastq
    +        ├── lung_1_val_1.fq.gz
    +        └── lung_2_val_2.fq.gz
    +
    +15 directories, 27 files
    +
    + +
    +
    +
    process FASTQC {
    +    publishDir "$params.outdir/$sample_id/QC", mode:'copy'
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img"
    +
    +    input:
    +    tuple val(sample_id), path(reads)
    +
    +    output:
    +    path "fastqc_${sample_id}_logs"
    +
    +    script:
    +    """
    +    mkdir fastqc_${sample_id}_logs
    +    fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}
    +    """
    +}
    +
    +process MULTIQC {
    +    //publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(".html") ? "MultiQC/multiqc_$filename" : filename }
    +    publishDir "$params.outdir/$quantification/QC", mode:'copy'
    +    container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img"
    +
    +    input:
    +    path quantification
    +    path fastqc
    +
    +    output:
    +    path "*.html"
    +
    +    script:
    +    """
    +    multiqc . --filename $quantification
    +    """
    +}
    +
    +process TRIMGALORE {
    +  container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img'
    +  publishDir "${params.outdir}/${sample_id}/report", mode: "copy", pattern:"*report.txt"
    +  publishDir "${params.outdir}/${sample_id}/trimmed_fastq", mode: "copy", pattern:"*fq.gz"
    +
    +  input:
    +    tuple val(sample_id), path(reads)
    +
    +  output:
    +    tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads
    +    tuple val(sample_id), path("*report.txt")                        , emit: log     , optional: true
    +    tuple val(sample_id), path("*unpaired*.fq.gz")                   , emit: unpaired, optional: true
    +    tuple val(sample_id), path("*.html")                             , emit: html    , optional: true
    +    tuple val(sample_id), path("*.zip")                              , emit: zip     , optional: true
    +
    +  script:
    +    """
    +    trim_galore \\
    +      --paired \\
    +      --gzip \\
    +      ${reads[0]} \\
    +      ${reads[1]}
    +    """
    +}
    +
    +
    +
    +
    +
    +

    8.2 Scatter

    +

    The scatter operation involves distributing large input data into smaller chunks that can be analysed across multiple processes in parallel.

    +

    One very simple example of native scatter is how Nextflow handles channel factories such as Channel.fromPath or Channel.fromFilePairs, where multiple input files are processed in parallel.

    +
    params.reads = "/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq"
    +reads_ch = Channel.fromFilePairs("$params.reads")
    +
    +include { FASTQC as FASTQC_one } from './modules.nf'
    +
    +workflow {
    +  fastqc_ch = FASTQC_one(reads_ch)
    +}
    +

    From the above snippet of our rnaseq.nf, we will get three executions of FASTQC_one, one for each pair of our input data.

    +

    Other than natively splitting execution by input data, Nextflow also provides operators to scatter existing input data for various benefits, such as faster processing. For example:

    + +
    +
    +

    8.2.1 Process per file chunk

    +

    Exercise

    +
    params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed"
    +params.size = 100000
    +
    +process count_line {
    +  debug true
    +  input: 
    +  file x
    +
    +  script:
    +  """
    +  wc -l $x 
    +  """
    +}
    +
    +workflow {
    +  // splitText chunks the BED file every params.size lines; file: true writes each
    +  // chunk to a file, and each chunk then becomes a separate count_line task
    +  Channel.fromPath(params.infile) \
    +    | splitText(by: params.size, file: true) \
    +    | count_line
    +}
    +

    Exercise

    +
    params.infile = "/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_{1,2}.fq"
    +params.size = 1000
    +
    +workflow {
    +  // splitFastq chunks each FASTQ pair every params.size reads; pe: true keeps the
    +  // mates of a pair in sync and file: true writes the chunks out as files
    +  Channel.fromFilePairs(params.infile, flat: true) \
    +    | splitFastq(by: params.size, pe: true, file: true) \
    +    | view()
    +}
    +
    +
    +
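    splitFasta, listed above, chunks FASTA records in the same way. Below is a minimal sketch rather than one of the workshop exercises; it reuses the training transcriptome FASTA, and the chunk size of 10 records is arbitrary:

```default
params.fasta = "/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa"

workflow {
  Channel.fromPath(params.fasta) \
    | splitFasta(by: 10, file: true) \
    | view()
}
```

    Each emitted item is a file containing up to 10 FASTA records, which could then be passed to a process for parallel analysis.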

    8.2.2 Process per file range

    +

    Exercise

    Here map turns a simple range channel (chromosomes 1 to 22) into one tuple per chromosome, each carrying a sample ID and the paths of that chromosome's VCF files:

    +
    Channel.from(1..22) \
    +   | map { chr -> ["sample${chr}", file("${chr}.indels.vcf"), file("${chr}.vcf")] } \
    +   | view()
    +
    >> nextflow run test_scatter.nf
    +
    +[sample1, /scratch/users/${users}/1.indels.vcf, /scratch/users/${users}/1.vcf]
    +[sample2, /scratch/users/${users}/2.indels.vcf, /scratch/users/${users}/2.vcf]
    +[sample3, /scratch/users/${users}/3.indels.vcf, /scratch/users/${users}/3.vcf]
    +[sample4, /scratch/users/${users}/4.indels.vcf, /scratch/users/${users}/4.vcf]
    +[sample5, /scratch/users/${users}/5.indels.vcf, /scratch/users/${users}/5.vcf]
    +[sample6, /scratch/users/${users}/6.indels.vcf, /scratch/users/${users}/6.vcf]
    +[sample7, /scratch/users/${users}/7.indels.vcf, /scratch/users/${users}/7.vcf]
    +[sample8, /scratch/users/${users}/8.indels.vcf, /scratch/users/${users}/8.vcf]
    +[sample9, /scratch/users/${users}/9.indels.vcf, /scratch/users/${users}/9.vcf]
    +[sample10, /scratch/users/${users}/10.indels.vcf, /scratch/users/${users}/10.vcf]
    +[sample11, /scratch/users/${users}/11.indels.vcf, /scratch/users/${users}/11.vcf]
    +[sample12, /scratch/users/${users}/12.indels.vcf, /scratch/users/${users}/12.vcf]
    +[sample13, /scratch/users/${users}/13.indels.vcf, /scratch/users/${users}/13.vcf]
    +[sample14, /scratch/users/${users}/14.indels.vcf, /scratch/users/${users}/14.vcf]
    +[sample15, /scratch/users/${users}/15.indels.vcf, /scratch/users/${users}/15.vcf]
    +[sample16, /scratch/users/${users}/16.indels.vcf, /scratch/users/${users}/16.vcf]
    +[sample17, /scratch/users/${users}/17.indels.vcf, /scratch/users/${users}/17.vcf]
    +[sample18, /scratch/users/${users}/18.indels.vcf, /scratch/users/${users}/18.vcf]
    +[sample19, /scratch/users/${users}/19.indels.vcf, /scratch/users/${users}/19.vcf]
    +[sample20, /scratch/users/${users}/20.indels.vcf, /scratch/users/${users}/20.vcf]
    +[sample21, /scratch/users/${users}/21.indels.vcf, /scratch/users/${users}/21.vcf]
    +[sample22, /scratch/users/${users}/22.indels.vcf, /scratch/users/${users}/22.vcf]
    +
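    The tuples above can then drive one task per chromosome. The sketch below is illustrative only: the process name is made up, and it assumes the per-chromosome VCF files actually exist so that Nextflow can stage them:

```default
process merge_chr_vcf {
  debug true

  input:
  tuple val(sample_id), path(indels), path(vcf)

  script:
  """
  echo processing ${indels} and ${vcf} for ${sample_id}
  """
}

workflow {
  Channel.from(1..22) \
    | map { chr -> ["sample${chr}", file("${chr}.indels.vcf"), file("${chr}.vcf")] } \
    | merge_chr_vcf
}
```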

    Exercise

    In the next example the BED file is supplied as a single value, while the chromosomes come from a queue channel, so split_bed_by_chr runs once per chromosome, with each task extracting one chromosome from the BED file:

    +
    params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed"
    +params.size = 100000
    +
    +process split_bed_by_chr {
    +  debug true
    +
    +  input:
    +  path bed
    +  val chr
    +
    +  output:
    +  path "*.bed"
    +
    +  script:
    +  """
    +  grep "^${chr}\t" ${bed} > ${chr}.bed
    +  """
    +}
    +
    +workflow {
    +    split_bed_by_chr(params.infile, Channel.from(1..22)) | view()
    +}
    +

    Challenge

    +

    How do we include chr X and Y in the above split by chromosome?

    +
    + +
    +
    +
    workflow {
    +    split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | view()
    +}
    +
    +
    +
    +
    +
    +

    8.3 Gather

    +

    The gather operation consolidates results from parallel computations (for example, from a scatter step) into a single process for aggregation and further processing.

    +

    Some of the Nextflow operators that facilitate this gather operation include collect, collectFile, and map combined with groupTuple. The exercises below demonstrate collect and collectFile; a minimal map + groupTuple sketch is included at the end of this section.
    +

    8.3.1. Process all outputs altogether

    +

    Exercise

    Adding collect after split_bed_by_chr gathers all of the per-chromosome BED files into a single list that is emitted as one item:

    +
    params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed"
    +params.size = 100000
    +
    +process split_bed_by_chr {
    +  debug true
    +
    +  input:
    +  path bed
    +  val chr
    +
    +  output:
    +  path "*.bed"
    +
    +  script:
    +  """
    +  grep "^${chr}\t" ${bed} > ${chr}.bed
    +  """
    +}
    +
    +workflow {
    +    split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collect | view()
    +}
    +
    +
    +

    8.3.2. Collect outputs into a file

    +

    Exercise

    Using collectFile instead concatenates the contents of all per-chromosome BED files into a single file named merged.bed:

    +
    params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed"
    +params.size = 100000
    +
    +process split_bed_by_chr {
    +  debug true
    +
    +  input:
    +  path bed
    +  val chr
    +
    +  output:
    +  path "*.bed"
    +
    +  script:
    +  """
    +  grep "^${chr}\t" ${bed} > ${chr}.bed
    +  """
    +}
    +
    +workflow {
    +    split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collectFile(name: 'merged.bed', newLine:true) | view()
    +}
    +

    Exercise

    +
    workflow {
    +  Channel.fromPath("/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_1.fq", checkIfExists: true) \
    +    | collectFile(name: 'combined_1.fq', newLine:true) \
    +    | view
    +}
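    The remaining gather pattern from the list above, map + groupTuple, regroups items that share a key. Below is a minimal sketch (the file names are illustrative, and since only the tuples are viewed the files do not need to exist): map extracts a sample key from each file name, and groupTuple collects all files belonging to the same sample back into one tuple.

```default
workflow {
  Channel.from('chr1.sampleA.vcf', 'chr2.sampleA.vcf', 'chr1.sampleB.vcf', 'chr2.sampleB.vcf') \
    | map { name -> [ name.tokenize('.')[1], name ] } \
    | groupTuple() \
    | view()
}
```

    By default groupTuple groups on the first element of each tuple, so this emits [sampleA, [chr1.sampleA.vcf, chr2.sampleA.vcf]] and [sampleB, [chr1.sampleB.vcf, chr2.sampleB.vcf]].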
    + + +
    + +
    + +
    + + + + \ No newline at end of file diff --git a/workshops/8.1_scatter_gather_output.qmd b/workshops/8.1_scatter_gather_output.qmd new file mode 100644 index 0000000..7aab0ad --- /dev/null +++ b/workshops/8.1_scatter_gather_output.qmd @@ -0,0 +1,755 @@ +--- +title: "**Nextflow Development - Outputs, Scatter, and Gather**" +output: + html_document: + toc: false + toc_float: false +from: markdown+emoji +--- + +::: callout-tip + +### Objectives{.unlisted} +- Gain an understanding of how to structure nextflow published outputs +- Gain an understanding of how to do scatter & gather processes + +::: + +## **Environment Setup** + +Set up an interactive shell to run our Nextflow workflow: + +``` default +srun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash +``` + +Load the required modules to run Nextflow: + +``` default +module load nextflow/23.04.1 +module load singularity/3.7.3 +``` + +Set the singularity cache environment variable: + +```default +export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow +``` + +Singularity images downloaded by workflow executions will now be stored in this directory. + +You may want to include these, or other environmental variables, in your `.bashrc` file (or alternate) that is loaded when you log in so you don’t need to export variables every session. A complete list of environment variables can be found [here](https://www.nextflow.io/docs/latest/config.html#environment-variables). + +The training data can be cloned from: +```default +git clone https://github.com/nextflow-io/training.git +``` + + +## **RNA-seq Workflow and Module Files ** + +Previously, we created three Nextflow files and one config file: + +```default +├── nextflow.config +├── rnaseq.nf +├── modules.nf +└── modules + └── trimgalore.nf +``` + +- `rnaseq.nf`: main workflow script where parameters are defined and processes were called. 
+ +```default +#!/usr/bin/env nextflow + +params.reads = "/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq" +params.transcriptome_file = "/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa" + +reads_ch = Channel.fromFilePairs("$params.reads") + +include { INDEX } from './modules.nf' +include { QUANTIFICATION as QT } from './modules.nf' +include { FASTQC as FASTQC_one } from './modules.nf' +include { FASTQC as FASTQC_two } from './modules.nf' +include { MULTIQC } from './modules.nf' +include { TRIMGALORE } from './modules/trimgalore.nf' + +workflow { + index_ch = INDEX(params.transcriptome_file) + quant_ch = QT(index_ch, reads_ch) + fastqc_ch = FASTQC_one(reads_ch) + trimgalore_out_ch = TRIMGALORE(reads_ch).reads + fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch) + multiqc_ch = MULTIQC(quant_ch, fastqc_ch) +} +``` +- `modules.nf`: script containing the majority of modules, including `INDEX`, `QUANTIFICATION`, `FASTQC`, and `MULTIQC` + +```default +process INDEX { + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img" + + input: + path transcriptome + + output: + path "salmon_idx" + + script: + """ + salmon index --threads $task.cpus -t $transcriptome -i salmon_idx + """ +} + +process QUANTIFICATION { + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img" + + input: + path salmon_index + tuple val(sample_id), path(reads) + + output: + path "$sample_id" + + script: + """ + salmon quant --threads $task.cpus --libType=U \ + -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id + """ +} + +process FASTQC { + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img" + + input: + tuple val(sample_id), path(reads) + + output: + path "fastqc_${sample_id}_logs" + + script: + """ + mkdir fastqc_${sample_id}_logs + fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads} + """ +} + +process MULTIQC { + publishDir params.outdir, mode:'copy' + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img" + + input: + path quantification + path fastqc + + output: + path "*.html" + + script: + """ + multiqc . 
--filename $quantification + """ +} +``` +- `modules/trimgalore.nf`: script inside a `modules` folder, containing only the `TRIMGALORE` process +```default +process TRIMGALORE { + container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' + + input: + tuple val(sample_id), path(reads) + + output: + tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads + tuple val(sample_id), path("*report.txt") , emit: log , optional: true + tuple val(sample_id), path("*unpaired*.fq.gz") , emit: unpaired, optional: true + tuple val(sample_id), path("*.html") , emit: html , optional: true + tuple val(sample_id), path("*.zip") , emit: zip , optional: true + + script: + """ + trim_galore \\ + --paired \\ + --gzip \\ + ${reads[0]} \\ + ${reads[1]} + """ +} +``` + +- `nextflow.config`: config file that enables singularity +```default +singularity { + enabled = true + autoMounts = true + cacheDir = "/config/binaries/singularity/containers_devel/nextflow" +} +``` + +Run the pipeline, specifying `--outdir`: + +```default +>>> nextflow run rnaseq.nf --outdir output +N E X T F L O W ~ version 23.04.1 +Launching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d +executor > local (16) +[93/d37ef0] process > INDEX [100%] 1 of 1 ✔ +[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔ +[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔ +[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔ +[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔ +[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔ +``` + + +## 8.1. Organise outputs + +The output declaration block defines the channels used by the process to send out the results produced. However, this output only stays in the `work/` directory if there is no `publishDir` directive specified. + +Given each task is being executed in separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files in a results folder. + +To store our workflow result files, you need to explicitly mark them using the directive `publishDir` in the process that’s creating the files. For example: + +```default +process MULTIQC { + publishDir params.outdir, mode:'copy' + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img" + + input: + path quantification + path fastqc + + output: + path "*.html" + + script: + """ + multiqc . --filename $quantification + """ +} +``` + +The above example will copy all `html` files created by the MULTIQC process into the directory path specified in the `params.outdir` + +## 8.1.1. Store outputs matching a glob pattern + +You can use more than one `publishDir` to keep different outputs in separate directories. For each directive specify a different glob `pattern` using the pattern option to store into each directory only the files that match the provided pattern. 
+ +For example: +```default +reads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq') + +process FOO { + publishDir "results/bam", pattern: "*.bam" + publishDir "results/bai", pattern: "*.bai" + + input: + tuple val(sample_id), path(sample_id_paths) + + output: + tuple val(sample_id), path("*.bam") + tuple val(sample_id), path("*.bai") + + script: + """ + echo your_command_here --sample $sample_id_paths > ${sample_id}.bam + echo your_command_here --sample $sample_id_paths > ${sample_id}.bai + """ +} +``` + +**Exercise** + +Use `publishDir` and `pattern` to keep the outputs from the `trimgalore.nf` into separate directories. + +::: {.callout-note appearance="simple" collapse="true"} +### Solution + +```default +process TRIMGALORE { + container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' + publishDir "$params.outdir/report", mode: "copy", pattern:"*report.txt" + publishDir "$params.outdir/trimmed_fastq", mode: "copy", pattern:"*fq.gz" + + input: + tuple val(sample_id), path(reads) + + output: + tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads + tuple val(sample_id), path("*report.txt") , emit: log , optional: true + tuple val(sample_id), path("*unpaired*.fq.gz") , emit: unpaired, optional: true + tuple val(sample_id), path("*.html") , emit: html , optional: true + tuple val(sample_id), path("*.zip") , emit: zip , optional: true + + script: + """ + trim_galore \\ + --paired \\ + --gzip \\ + ${reads[0]} \\ + ${reads[1]} + """ +} +``` + +Output should now look like +```default +>>> tree ./output +./output +├── gut.html +├── liver.html +├── lung.html +├── report +│   ├── gut_1.fq_trimming_report.txt +│   ├── gut_2.fq_trimming_report.txt +│   ├── liver_1.fq_trimming_report.txt +│   ├── liver_2.fq_trimming_report.txt +│   ├── lung_1.fq_trimming_report.txt +│   └── lung_2.fq_trimming_report.txt +└── trimmed_fastq + ├── gut_1_val_1.fq.gz + ├── gut_2_val_2.fq.gz + ├── liver_1_val_1.fq.gz + ├── liver_2_val_2.fq.gz + ├── lung_1_val_1.fq.gz + └── lung_2_val_2.fq.gz + +2 directories, 15 files +``` +::: + + + +## 8.1.2. Store outputs renaming files or in a sub-directory + +The `publishDir` directive also allow the use of `saveAs` option to give each file a name of your choice, providing a custom rule as a [closure](https://www.nextflow.io/docs/latest/script.html#closures). + +```default +process foo { + publishDir 'results', saveAs: { filename -> "foo_$filename" } + + output: + path '*.txt' + + ''' + touch this.txt + touch that.txt + ''' +} +``` + +The same pattern can be used to store specific files in separate directories depending on the actual name. + +```default +process foo { + publishDir 'results', saveAs: { filename -> filename.endsWith(".zip") ? 
"zips/$filename" : filename } + + output: + path '*' + + ''' + touch this.txt + touch that.zip + ''' +} +``` + +**Exercise** + +Modify the `MULTIQC` output with `saveAs` such that resulting folder is as follow: + +```default +./output +├── MultiQC +│   ├── multiqc_gut.html +│   ├── multiqc_liver.html +│   └── multiqc_lung.html +├── report +│   ├── gut_1.fq_trimming_report.txt +│   ├── gut_2.fq_trimming_report.txt +│   ├── liver_1.fq_trimming_report.txt +│   ├── liver_2.fq_trimming_report.txt +│   ├── lung_1.fq_trimming_report.txt +│   └── lung_2.fq_trimming_report.txt +└── trimmed_fastq + ├── gut_1_val_1.fq.gz + ├── gut_2_val_2.fq.gz + ├── liver_1_val_1.fq.gz + ├── liver_2_val_2.fq.gz + ├── lung_1_val_1.fq.gz + └── lung_2_val_2.fq.gz + +3 directories, 15 files +``` + +::: callout-warning +You need to remove existing output folder/files if you want to have a clean output. By default, nextflow will overwrite existing files, and keep all the remaining files in the same specified output directory. +::: + + +::: {.callout-note appearance="simple" collapse="true"} +### Solution + +```default +process MULTIQC { + publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(".html") ? "MultiQC/multiqc_$filename" : filename } + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img" + + input: + path quantification + path fastqc + + output: + path "*.html" + + script: + """ + multiqc . --filename $quantification + """ +} +``` +::: + +**Challenge** + +Modify all the processes in `rnaseq.nf` such that we will have the following output structure + +```default +./output +├── gut +│   ├── QC +│   │   ├── fastqc_gut_logs +│   │   │   ├── gut_1_fastqc.html +│   │   │   ├── gut_1_fastqc.zip +│   │   │   ├── gut_2_fastqc.html +│   │   │   └── gut_2_fastqc.zip +│   │   └── gut.html +│   ├── report +│   │   ├── gut_1.fq_trimming_report.txt +│   │   └── gut_2.fq_trimming_report.txt +│   └── trimmed_fastq +│   ├── gut_1_val_1.fq.gz +│   └── gut_2_val_2.fq.gz +├── liver +│   ├── QC +│   │   ├── fastqc_liver_logs +│   │   │   ├── liver_1_fastqc.html +│   │   │   ├── liver_1_fastqc.zip +│   │   │   ├── liver_2_fastqc.html +│   │   │   └── liver_2_fastqc.zip +│   │   └── liver.html +│   ├── report +│   │   ├── liver_1.fq_trimming_report.txt +│   │   └── liver_2.fq_trimming_report.txt +│   └── trimmed_fastq +│   ├── liver_1_val_1.fq.gz +│   └── liver_2_val_2.fq.gz +└── lung + ├── QC + │   ├── fastqc_lung_logs + │   │   ├── lung_1_fastqc.html + │   │   ├── lung_1_fastqc.zip + │   │   ├── lung_2_fastqc.html + │   │   └── lung_2_fastqc.zip + │   └── lung.html + ├── report + │   ├── lung_1.fq_trimming_report.txt + │   └── lung_2.fq_trimming_report.txt + └── trimmed_fastq + ├── lung_1_val_1.fq.gz + └── lung_2_val_2.fq.gz + +15 directories, 27 files +``` + +::: {.callout-note appearance="simple" collapse="true"} +### Solution + +```default +process FASTQC { + publishDir "$params.outdir/$sample_id/QC", mode:'copy' + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img" + + input: + tuple val(sample_id), path(reads) + + output: + path "fastqc_${sample_id}_logs" + + script: + """ + mkdir fastqc_${sample_id}_logs + fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads} + """ +} + +process MULTIQC { + //publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(".html") ? 
"MultiQC/multiqc_$filename" : filename } + publishDir "$params.outdir/$quantification/QC", mode:'copy' + container "/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img" + + input: + path quantification + path fastqc + + output: + path "*.html" + + script: + """ + multiqc . --filename $quantification + """ +} + +process TRIMGALORE { + container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' + publishDir "${params.outdir}/${sample_id}/report", mode: "copy", pattern:"*report.txt" + publishDir "${params.outdir}/${sample_id}/trimmed_fastq", mode: "copy", pattern:"*fq.gz" + + input: + tuple val(sample_id), path(reads) + + output: + tuple val(sample_id), path("*{3prime,5prime,trimmed,val}*.fq.gz"), emit: reads + tuple val(sample_id), path("*report.txt") , emit: log , optional: true + tuple val(sample_id), path("*unpaired*.fq.gz") , emit: unpaired, optional: true + tuple val(sample_id), path("*.html") , emit: html , optional: true + tuple val(sample_id), path("*.zip") , emit: zip , optional: true + + script: + """ + trim_galore \\ + --paired \\ + --gzip \\ + ${reads[0]} \\ + ${reads[1]} + """ +} +``` +::: + + +## **8.2 Scatter** + +The `scatter` operation involves distributing large input data into smaller chunks that can be analysed across multiple processes in parallel. + +One very simple example of native `scatter` is how nextflow handles Channel factories with the `Channel.fromPath` or `Channel.fromFilePairs` method, where multiple input data is processed in parallel. + +```default +params.reads = "/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq" +reads_ch = Channel.fromFilePairs("$params.reads") + +include { FASTQC as FASTQC_one } from './modules.nf' + +workflow { + fastqc_ch = FASTQC_one(reads_ch) +} +``` +From the above snippet from our `rnaseq.nf`, we will get three execution of FASTQC_one for each pairs of our input data. + +Other than natively splitting execution by input data, Nextflow also provides operators to scatter existing input data for various benefits, such as faster processing. 
For example: + +- [splitText](https://www.nextflow.io/docs/latest/operator.html#splittext) +- [splitFasta](https://www.nextflow.io/docs/latest/operator.html#splitfasta) +- [splitFastq](https://www.nextflow.io/docs/latest/operator.html#splitfastq) +- [map](https://www.nextflow.io/docs/latest/operator.html#map) with [from](https://www.nextflow.io/docs/latest/channel.html#from) or [fromList](https://www.nextflow.io/docs/latest/channel.html#fromlist) +- [flatten](https://www.nextflow.io/docs/latest/operator.html#flatten) + +## **8.2.1 Process per file chunk** + +**Exercise** + +```default +params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed" +params.size = 100000 + +process count_line { + debug true + input: + file x + + script: + """ + wc -l $x + """ +} + +workflow { + Channel.fromPath(params.infile) \ + | splitText(by: params.size, file: true) \ + | count_line +} +``` + +**Exercise** + +```default +params.infile = "/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_{1,2}.fq" +params.size = 1000 + +workflow { + Channel.fromFilePairs(params.infile, flat: true) \ + | splitFastq(by: params.size, pe: true, file: true) \ + | view() +} +``` + +## **8.2.1 Process per file range** + +**Exercise** + +```default +Channel.from(1..22) \ + | map { chr -> ["sample${chr}", file("${chr}.indels.vcf"), file("${chr}.vcf")] } \ + | view() +``` + +```default +>> nextflow run test_scatter.nf + +[sample1, /scratch/users/${users}/1.indels.vcf, /scratch/users/${users}/1.vcf] +[sample2, /scratch/users/${users}/2.indels.vcf, /scratch/users/${users}/2.vcf] +[sample3, /scratch/users/${users}/3.indels.vcf, /scratch/users/${users}/3.vcf] +[sample4, /scratch/users/${users}/4.indels.vcf, /scratch/users/${users}/4.vcf] +[sample5, /scratch/users/${users}/5.indels.vcf, /scratch/users/${users}/5.vcf] +[sample6, /scratch/users/${users}/6.indels.vcf, /scratch/users/${users}/6.vcf] +[sample7, /scratch/users/${users}/7.indels.vcf, /scratch/users/${users}/7.vcf] +[sample8, /scratch/users/${users}/8.indels.vcf, /scratch/users/${users}/8.vcf] +[sample9, /scratch/users/${users}/9.indels.vcf, /scratch/users/${users}/9.vcf] +[sample10, /scratch/users${users}/10.indels.vcf, /scratch/users${users}/10.vcf] +[sample11, /scratch/users${users}/11.indels.vcf, /scratch/users${users}/11.vcf] +[sample12, /scratch/users${users}/12.indels.vcf, /scratch/users${users}/12.vcf] +[sample13, /scratch/users${users}/13.indels.vcf, /scratch/users${users}/13.vcf] +[sample14, /scratch/users${users}/14.indels.vcf, /scratch/users${users}/14.vcf] +[sample15, /scratch/users${users}/15.indels.vcf, /scratch/users${users}/15.vcf] +[sample16, /scratch/users${users}/16.indels.vcf, /scratch/users${users}/16.vcf] +[sample17, /scratch/users${users}/17.indels.vcf, /scratch/users${users}/17.vcf] +[sample18, /scratch/users${users}/18.indels.vcf, /scratch/users${users}/18.vcf] +[sample19, /scratch/users${users}/19.indels.vcf, /scratch/users${users}/19.vcf] +[sample20, /scratch/users${users}/20.indels.vcf, /scratch/users${users}/20.vcf] +[sample21, /scratch/users${users}/21.indels.vcf, /scratch/users${users}/21.vcf] +[sample22, /scratch/users${users}/22.indels.vcf, /scratch/users${users}/22.vcf] +``` + +**Exercise** + +``` +params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed" +params.size = 100000 + +process split_bed_by_chr { + debug true + + input: + path bed + val chr + + output: + path "*.bed" + + script: + """ + grep ^${chr}\t ${bed} > ${chr}.bed + """ +} + +workflow { + 
split_bed_by_chr(params.infile, Channel.from(1..22)) | view() +} +``` + +**Challenge** + +How do we include chr `X` and `Y` into the above split by chromosome? + +::: {.callout-note appearance="simple" collapse="true"} +### Solution + +```default +workflow { + split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | view() +} +``` + +::: + + +## **8.3 Gather** + +The `gather` operation consolidates results from parallel computations (can be from `scatter`) into a centralized process for aggregation and further processing. + +Some of the Nextflow provided operators that facilitate this `gather` operation, include: + +- [collect](https://www.nextflow.io/docs/latest/operator.html#collect) +- [collectFile](https://www.nextflow.io/docs/latest/operator.html#collectfile) +- [map](https://www.nextflow.io/docs/latest/operator.html#map) + [groupTuple](https://www.nextflow.io/docs/latest/operator.html#grouptuple) + + +## **8.3.1. Process all outputs altogether** + +**Exercise** + +```default +params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed" +params.size = 100000 + +process split_bed_by_chr { + debug true + + input: + path bed + val chr + + output: + path "*.bed" + + script: + """ + grep ^${chr}\t ${bed} > ${chr}.bed + """ +} + +workflow { + split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collect | view() +} +``` + +## **8.3.2. Collect outputs into a file** + +**Exercise** + +```default +params.infile = "/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed" +params.size = 100000 + +process split_bed_by_chr { + debug true + + input: + path bed + val chr + + output: + path "*.bed" + + script: + """ + grep ^${chr}\t ${bed} > ${chr}.bed + """ +} + +workflow { + split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collectFile(name: 'merged.bed', newLine:true) | view() +} +``` + +**Exercise** + +```default +workflow { + Channel.fromPath("/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_1.fq", checkIfExists: true) \ + | collectFile(name: 'combined_1.fq', newLine:true) \ + | view +} +```