diff --git a/.nojekyll b/.nojekyll index e1c4602..0169212 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -09801df5 \ No newline at end of file +341bf5ac \ No newline at end of file diff --git a/index.html b/index.html index 5309688..cf4210c 100644 --- a/index.html +++ b/index.html @@ -147,7 +147,7 @@
  • - Nextflow Operators + Metadata Propagation
• diff --git a/search.json b/search.json index 0767562..de562b6 100644 --- a/search.json +++ b/search.json @@ -21,137 +21,53 @@ "text": "2.3.1. Running Nextflow Pipelines on a HPC \nNextflow, by default, spawns parallel task executions wherever it is running. You can use Nextflow’s executors feature to run these tasks using an HPC job scheduler such as SLURM or PBS Pro. Use a custom configuration file to send all processes to the job scheduler as separate jobs and define essential resource requests like cpus, time, memory, and queue inside a process {} scope.\n\nRun all workflow tasks as separate jobs on HPC\nIn this custom configuration file we have sent all tasks that a workflow is running to a SLURM job scheduler and specified jobs to be run on the prod_short queue, each running for a max time of 2 hours with 1 cpu and 4 GB of memory:\nprocess {\n executor = 'slurm'\n queue = 'prod_short'\n cpus = 1\n time = '2h'\n memory = '4.GB'\n}\n\n\nRun processes with different resource profiles as HPC jobs\nAdjusting the custom configuration file above, we can use the withName process selector to specify process-specific resource requirements:\nprocess {\n executor = 'slurm'\n \n withName: 'processONE' {\n queue = 'prod_short'\n cpus = 1\n time = '2h'\n memory = '4.GB'\n }\n\n withName: 'processTWO' {\n queue = 'prod_med'\n cpus = 2\n time = '10h'\n memory = '50.GB'\n }\n}\n\n\nSpecify infrastructure-specific directives for your jobs\nAdjusting the custom configuration file above, we can define any native configuration options using the clusterOptions directive, which can be used to specify non-standard resources such as the HPC project code to bill for all process jobs.\nYou can also set up a config tailored specifically to Peter Mac’s HPC partition setup:\nexecutor {\n queueSize = 100\n queueStatInterval = '1 min'\n pollInterval = '1 min'\n submitRateLimit = '20 min'\n}\n\nprocess {\n executor = 'slurm'\n cache = 'lenient'\n beforeScript = 'module load singularity'\n stageInMode = 'symlink'\n queue = { task.time < 2.h ? 'prod_short' : task.time < 24.h ? 'prod_med' : 'prod' } \n}\n\n\n\n\n\n\nChallenge\n\n\n\nRun the previous nf-core/rnaseq workflow using the process and executor scope above (in a config file), and send each task to slurm.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nCreate a nextflow.config file\nprocess.executor = 'slurm'\nRun the nf-core/rnaseq workflow again\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml \\\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nDid you get the following error?\nsbatch: error: Batch job submission failed: Access/permission denied\nTry running the same workflow on the login node and observe the difference\n>>> squeue -u rlupat -i 5\n\n 17429286 prod nf-NFCOR rlupat R 0:03 1 papr-res-compute01\n\n\n\n\n\n\n\n2.3.2. Things to note for Peter Mac Cluster\n\nBest not to launch nextflow on a login node\nEven though nextflow is not supposed to be doing any heavy computation, nextflow still consumes CPUs and memory for some of its operations. Our login node is not designed to handle multiple users running a Groovy application that spawns further operations.\nThat said, as our previous exercise showed, launching nextflow from a compute node is also not possible. So what is the solution?\nOur cluster prohibits compute nodes from spawning new jobs. There are only two partitions that are currently available to spawn new jobs: janis and janis-dev. 
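For illustration only, a minimal sketch of an sbatch launcher pointed at one of these partitions might look like the following (the resource requests, module versions, config file name, and pipeline here are assumptions for the sketch, not site recommendations):

```bash
#!/bin/bash
#SBATCH --job-name=nf-launcher
#SBATCH --partition=janis        # one of the partitions allowed to spawn new jobs
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00        # the launcher job must outlive the whole workflow

# Module versions and the config file name below are assumptions for this sketch
module load nextflow/23.04.1
module load singularity/3.7.3

nextflow run nf-core/rnaseq -r 3.11.1 \
    -profile singularity \
    -c custom_slurm.config \
    -resume
```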
Therefore, if you are submitting your nextflow pipeline in an sbatch file, it is probably good to point that to the janis partition.\n\n\nSet your working directory to a scratch space\nWhen we launch a nextflow workflow, by default it will use the current directory to create a work directory and all the intermediate files will be stored there, only to be cleaned at completion. This means that if you run a long-running workflow, chances are your intermediate files will be sent to tape for long-term archiving. There are also benefits to running in scratch, as it uses a faster spinning disk, resulting in faster execution.\n\n\n\n2.3.3. Clean your work directory\nYour work directory can get very big very quickly (especially if you are using full sized datasets). It is good practice to clean your work directory regularly. Rather than removing the work folder with all of its contents, the Nextflow clean function allows you to selectively remove data associated with specific runs.\nnextflow clean -help\nThe -after, -before, and -but options are all very useful to select specific runs to clean. The -dry-run option is also very useful to see which files will be removed if you were to -force the clean command.\n\n\n\n\n\n\nChallenge\n\n\n\nUse Nextflow to clean your work directory of staged files but keep your execution logs.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the Nextflow clean command with the -k and -f options:\nnextflow clean -k -f\n\n\n\n\n\n\n2.3.4. Change default Nextflow cache strategy\nWorkflow execution is sometimes not resumed as expected. The default behaviour of Nextflow cache keys is to index the input files meta-data information. Reducing the cache stringency to lenient means the files cache keys are based only on filesize and path, and can help to avoid unexpectedly re-running certain processes when -resume is in use.\nTo apply lenient cache strategy to all of your runs, you could add to a custom configuration file:\nprocess {\n cache = 'lenient'\n}\nYou can specify different cache strategies for different processes by using withName or withLabel. You can specify a particular cache strategy to be applied to certain profiles within your institutional config, or apply it to all profiles described within that config by placing the above process code block outside the profiles scope.\n\n\n2.3.5. Access private GitHub repositories\nTo interact with private repositories on GitHub, you can provide Nextflow with access to GitHub by specifying your GitHub user name and a Personal Access Token in the scm configuration file inside your specified .nextflow/ directory:\nproviders {\n\n github {\n user = 'rlupat'\n password = 'my-personal-access-token'\n }\n\n}\n\n\n2.3.6. Nextflow Tower\nBioCommons Tower Instance\n\n\n
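As a rough sketch of how a run can report to a Tower instance (the token and endpoint values below are placeholders — take the real ones from the Tower instance you are using, e.g. the BioCommons instance):

```bash
# Placeholders only - obtain these from your Tower instance
export TOWER_ACCESS_TOKEN=<your-access-token>
export TOWER_API_ENDPOINT=<tower-endpoint-url>

# -with-tower asks Nextflow to send run monitoring data to that instance
nextflow run nf-core/rnaseq -r 3.11.1 -profile singularity -with-tower
```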
2.3.7. Additional resources \nHere are some useful resources to help you get started with running nf-core pipelines and developing Nextflow pipelines:\n\nNextflow tutorials\nnf-core pipeline tutorials\nNextflow patterns\nHPC tips and tricks\nNextflow coding best practice recommendations\nThe Nextflow blog\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { - "objectID": "workshops/6.1_operators.html", - "href": "workshops/6.1_operators.html", - "title": "Nextflow Development - Channel Operators", + "objectID": "workshops/7.1_metadata_propagation.html", + "href": "workshops/7.1_metadata_propagation.html", + "title": "Nextflow Development - Metadata Propagation", "section": "", - "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channel operators" + "text": "Objectives\n\n\n\n\nGain an understanding of how to manipulate and propagate metadata" }, { - "objectID": "workshops/6.1_operators.html#environment-setup", - "href": "workshops/6.1_operators.html#environment-setup", - "title": "Nextflow Development - Channel Operators", + "objectID": "workshops/7.1_metadata_propagation.html#environment-setup", + "href": "workshops/7.1_metadata_propagation.html#environment-setup", + "title": "Nextflow Development - Metadata Propagation", "section": "Environment Setup", "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. 
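For example, appending a line like the one below to your ~/.bashrc makes the cache setting persistent across logins (the path shown is simply the workshop cache directory from above; adjust it for your own setup):

```bash
# Applied at every login, so there is no need to export it each session
export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow
```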
A complete list of environment variables can be found here.\nThe training data can be cloned from:\ngit clone https://github.com/nextflow-io/training.git" }, { - "objectID": "workshops/6.1_operators.html#rna-seq-workflow-and-module-files", - "href": "workshops/6.1_operators.html#rna-seq-workflow-and-module-files", - "title": "Nextflow Development - Channel Operators", - "section": "RNA-seq Workflow and Module Files ", - "text": "RNA-seq Workflow and Module Files \nPreviously, we created three Nextflow files and one config file:\n├── nextflow.config\n├── rnaseq.nf\n├── modules.nf\n└── modules\n └── trimgalore.nf\n\nrnaseq.nf: main workflow script where parameters are defined and processes were called.\n\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nmodules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nmodules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\n\nnextflow.config: config file that enables singularity\n\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nRun the pipeline, specifying --outdir:\n>>> nextflow run rnaseq.nf --outdir output\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d\nexecutor > local (16)\n[93/d37ef0] process > INDEX [100%] 1 of 1 ✔\n[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔\n[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔" - }, - { - "objectID": "workshops/6.1_operators.html#map", - "href": "workshops/6.1_operators.html#map", - "title": "Nextflow Development - Channel Operators", - "section": "6.1.1 map ", - "text": "6.1.1 map \nThe map operator applies a mapping function to each item in a channel. This function is expressed using the Groovy closure { }.\nChannel\n .of('hello', 'world')\n .map { word -> \n def word_size = word.size()\n [word, word_size] \n }\n .view()\nIn this example, a channel containing the strings hello and world is created.\nInside the map operator, the local variable word is declared, and used to represent each input value that is passed to the function, ie. each element in the channel, hello and world.\nThe map operator ‘loops’ through each element in the channel and assigns that element to the local varialbe word. A new local variable word_size is defined inside the map function, and calculates the length of the string using size(). 
Finally, a tuple is returned, where the first element is the string represented by the local word variable, and the second element is the length of the string, represented by the local word_size variable.\nOutput:\n[hello, 5]\n[world, 5]\nFor our RNA-seq pipeline, let’s first create separate transcriptome files for each organ: lung.transcriptome.fa, liver.transcriptome.fa, gut.transcriptome.fa\ncp \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/lung.transcriptome.fa\"\ncp \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/liver.transcriptome.fa\"\nmv \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/gut.transcriptome.fa\"\nEnsure transcriptome.fa no longer exists:\n>>> ls /scratch/users/.../training/nf-training/data/ggal/\ngut_1.fq\ngut_2.fq\ngut.transcriptome.fa\nliver_1.fq\nliver_2.fq\nliver.transcriptome.fa\nlung_1.fq\nlung_2.fq\nlung.transcriptome.fa\nExercise\nCurrently in the rnaseq.nf script, we define the transcriptome_file parameter to be a single file.\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\nSet the transcriptome_file parameter to match for all three .fa files using a glob path matcher.\nUse the fromPath channel factory to read in the transcriptome files, and the map operator to create a tuple where the first element is the sample (organ type) of the .fa, and the second element is the path of the .fa file. Assign the final output to be a channel called transcriptome_ch.\nThe getSimpleName() Groovy method can be used extract the sample name from our .fa file, for example:\ndef sample = fasta.getSimpleName()\nUse the view() channel operator to view the transcriptome_ch channel. The expected output:\n[lung, /scratch/users/.../training/nf-training/data/ggal/lung.transcriptome.fa]\n[liver, /scratch/users/.../training/nf-training/data/ggal/liver.transcriptome.fa]\n[gut, /scratch/users/.../training/nf-training/data/ggal/gut.transcriptome.fa]\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe transcriptome_file parameter is defined using *, using glob to match for all three .fa files. The fromPath channel factory is used to read the .fa files, and the map operator is used to create the tuple.\nIn the map function, the variable file was chosen to represent each element that is passed to the function. The function emits a tuple where the first element is the sample name, returned by the getSimpleName() method, and the second element is the .fa file path.\nparams.transcriptome_file = \"/scratch/users/.../nf-training/data/ggal/*.fa\"\n\ntranscriptome_ch = Channel.fromPath(\"$params.transcriptome_file\")\n .map { fasta -> \n def sample = fasta.getSimpleName()\n [sample, fasta]\n }\n .view()\n\n\n\n\nChallenge\nModify the INDEX process to match the input structure of transcriptome_ch. Modify the output of INDEX so that a tuple is emitted, where the first elememt is the value of the grouping key, and the second element is the path of the salmon_idx folder.\nIndex the transcriptome_ch using the INDEX process. 
Emit the output as index_ch.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe input is now defined to be a tuple of two elements, where the first element is the grouping key and the second element is the path of the transcriptome file.\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(transcriptome)\n\n output:\n tuple val(sample_id), path(\"salmon_idx\")\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nInside the workflow block, transcriptome_ch is used as input into the INDEX process. The process outputs are emitted as index_ch\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n index_ch.view()\n}\nThe index_ch channel is now a tuple where the first element is the grouping key, and the second element is the path to the salmon index folder.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [dreamy_linnaeus] DSL2 - revision: b4ec1d02bd\n[21/91088a] process > INDEX (3) [100%] 3 of 3\n[liver, /scratch/users/.../work/06/f0a54ba9191cce9f73f5a97bfb7bea/salmon_idx]\n[lung, /scratch/users/.../work/60/e84b1b1f06c43c8cf69a5c621d5a41/salmon_idx]\n[gut, /scratch/users/.../work/21/91088aafb553cb4b933bc2b3493f33/salmon_idx]\n\n\n\nCopy the new INDEX process into modules.nf. In the workflow block of rnaseq.nf, use transcriptome_ch as the input to the process INDEX." + "objectID": "workshops/7.1_metadata_propagation.html#metadata-parsing", + "href": "workshops/7.1_metadata_propagation.html#metadata-parsing", + "title": "Nextflow Development - Metadata Proprogation", + "section": "7.1 Metadata Parsing", + "text": "7.1 Metadata Parsing\nWe have covered a few different methods of metadata parsing.\n\n7.1.1 First Pass: .fromFilePairs\nA first pass attempt at pulling these files into Nextflow might use the fromFilePairs method:\nworkflow {\n Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .view\n}\nNextflow will pull out the first part of the fastq filename and returned us a channel of tuple elements where the first element is the filename-derived ID and the second element is a list of two fastq files.\nThe id is stored as a simple string. We’d like to move to using a map of key-value pairs because we have more than one piece of metadata to track. In this example, we have sample, replicate, tumor/normal, and treatment. We could add extra elements to the tuple, but this changes the ‘cardinality’ of the elements in the channel and adding extra elements would require updating all downstream processes. A map is a single object and is passed through Nextflow channels as one value, so adding extra metadata fields will not require us to change the cardinality of the downstream processes.\nThere are a couple of different ways we can pull out the metadata\nWe can use the tokenize method to split our id. 
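As a standalone illustration of what tokenize returns (the ID string below is made up to mimic the filename-derived IDs above):

```groovy
// Hypothetical filename-derived ID in the same style as the reads above
def id = "sampleA_rep1_tumor"
println id.tokenize("_")   // prints [sampleA, rep1, tumor]
```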
To sanity-check, I just pipe the result directly into the view operator.\nworkflow {\n Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .map { id, reads ->\n tokens = id.tokenize(\"_\")\n }\n .view\n}\nIf we are confident about the stability of the naming scheme, we can destructure the list returned by tokenize and assign them to variables directly:\nmap { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\n\n\n\n\n\n\nNote\n\n\n\nMake sure that you're using a tuple with parentheses e.g. (one, two) rather than a List e.g. [one, two]\n\n\nIf we move back to the previous method, but decided that the ‘rep’ prefix on the replicate should be removed, we can use regular expressions to simply “subtract” pieces of a string. Here we remove a ‘rep’ prefix from the replicate variable if the prefix is present:\nmap { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n replicate -= ~/^rep/\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\nBy setting up our the “meta”, in our tuple with the format above, allows us to access the values in “sample” throughout our modules/configs as ${meta.sample}." }, { - "objectID": "workshops/6.1_operators.html#combine", - "href": "workshops/6.1_operators.html#combine", - "title": "Nextflow Development - Channel Operators", - "section": "6.1.2 combine ", - "text": "6.1.2 combine \nThe combine operator produces the cross product (ie. outer product) combinations of two source channels.\nFor example: The words channel is combined with the numbers channel, emitting a channel where each element of numbers is paired with each element of words.\nnumbers = Channel.of(1, 2, 3)\nwords = Channel.of('hello', 'ciao')\n\nnumbers.combine(words).view()\nOutput:\n[1, hello]\n[2, hello]\n[3, hello]\n[1, ciao]\n[2, ciao]\n[3, ciao]\nThe by option can be used to combine items that share a matching key. This value is zero-based, and represents the index or list of indices for the grouping key. The emitted tuple will consist of multiple elements.\nFor example: source and target are channels consisting of multiple tuples, where the first element of each tuple represents the grouping key. Since indexing is zero-based, by is set to 0 to represent the first element of the tuple.\nsource = Channel.of( [1, 'alpha'], [2, 'beta'] )\ntarget = Channel.of( [1, 'x'], [1, 'y'], [1, 'z'], [2, 'p'], [2, 'q'], [2, 't'] )\n\nsource.combine(target, by: 0).view()\nEach value within the source and target channels are separate elements, resulting in the emitted tuple each containing 3 elements:\n[1, alpha, x]\n[1, alpha, y]\n[1, alpha, z]\n[2, beta, p]\n[2, beta, q]\n[2, beta, t]\nExercise\nIn our RNA-seq pipeline, create a channel quant_inputs_ch that contains the reads_ch combined with the index_ch via a matching key. 
The emitted channel should contain three elements, where the first element is the grouping key, the second element is the path to the salmon index folder, and the third element is a list of the .fq pairs.\nThe expected output:\n[liver, /scratch/users/.../work/cf/42458b80e050a466d62baf99d0c1cf/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/liver_1.fq, /scratch/users/.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, /scratch/users/.../work/64/90a77a5f1ed5a0000f6620fd1fab9a/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/lung_1.fq, /scratch/users/.../training/nf-training/data/ggal/lung_2.fq]]\n[gut, /scratch/users/.../work/37/352b00bfb71156a9250150428ddf1d/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/gut_1.fq, /scratch/users/.../training/nf-training/data/ggal/gut_2.fq]]\nUse quant_inputs_ch as the input for the QT process within the workflow block.\nModify the process such that the input will be a tuple consisting of three elements, where the first element is the grouping key, the second element is the salmon index and the third element is the list of .fq reads. Also modify the output of the QT process to emit a tuple of two elements, where the first element is the grouping key and the second element is the $sample_id folder. Emit the process output as quant_ch in the workflow block.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe reads_ch is combined with the index_ch using the combine channel operator with by: 0, and is assigned to the channel quant_inputs_ch. The new quant_inputs_ch channel is input into the QT process.\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n\n quant_inputs_ch = index_ch.combine(reads_ch, by: 0)\n quant_ch = QT(quant_inputs_ch)\n}\nIn te QT process, the input has been modified to be a tuple of three elements - the first element is the grouping key, the second element is the path to the salmon index, and the third element is the list of .fq reads.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(salmon_index), path(reads)\n\n output:\n tuple val(sample_id), path(\"$sample_id\")\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}" + "objectID": "workshops/7.1_metadata_propagation.html#second-parse-.splitcsv", + "href": "workshops/7.1_metadata_propagation.html#second-parse-.splitcsv", + "title": "Nextflow Development - Metadata Proprogation", + "section": "Second Parse: .splitCsv", + "text": "Second Parse: .splitCsv\nWe have briefly touched on .splitCsv in the first week.\nAs a quick overview\nAssuming we have the samplesheet\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nWe can set up a workflow to read in these files as:\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\n\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\n\n\n\n\n\n\nChallenge\n\n\n\nUsing .splitCsv and .map read in the samplesheet below: /home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv\nSet the meta to contain 
the following keys from the header id, repeat and type\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nparams.input = \"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv\"\n\nch_sheet = Channel.fromPath(params.input)\n\nch_sheet.splitCsv(header:true)\n .map {\n it ->\n [[it.id, it.repeat, it.type], it.fastq_1, it.fastq_2]\n }.view()" }, { - "objectID": "workshops/6.1_operators.html#grouptuple", - "href": "workshops/6.1_operators.html#grouptuple", - "title": "Nextflow Development - Channel Operators", - "section": "6.1.3 groupTuple ", - "text": "6.1.3 groupTuple \nThe groupTuple operator collects tuples into groups based on a similar grouping key, emitting a new tuple for each distinct key. The groupTuple differs from the combine operator in that it is performed on one input channel, and the matching values are emitted as a list.\nChannel.of( [1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'] )\n .groupTuple()\n .view()\nOutput:\n[1, [A, B, C]]\n[2, [C, A]]\n[3, [B, D]]\nBy default, the first element of each tuple is used as the grouping key. The by option can be used to specify a different index. For example, to group by the second element of each tuple:\nChannel.of( [1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'] )\n .groupTuple(by: 1)\n .view()\n[[1, 2], A]\n[[1, 3], B]\n[[2, 1], C]\n[[3], D]\n\nIn the workflow script rnaseq.nf we defined the reads parameter to be multiple paired .fq files that are created into a channel using the fromFilePairs channel factory. This created a tuple where the first element is a unique grouping key, created automatically based on similarities in file name, and the second element contains the list of paired files.\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nNow, move the /scratch/users/.../nf-training/data/ggal/lung_2.fq file into another directory so the folder contains one lung .fq file:\n>>> mv /scratch/users/.../training/nf-training/data/ggal/lung_2.fq .\n>>> ls /scratch/users/.../training/nf-training/data/ggal\ngut_1.fq\ngut_2.fq\ngut.transcriptome.fa\nliver_1.fq\nliver_2.fq\nliver.transcriptome.fa\nlung_1.fq\nlung.transcriptome.fa\nExercise\nUse the fromPath channel factory to read all .fq files as separate elements.\nThen, use map to create a mapping function that returns a tuple, where the first element is the grouping key, and the second element is the .fq file(s).\nThen, use groupTuple() to create channels containing both single and paired .fq files. Within the groupTuple() operator, set sort: true, which orders the groups numerically, ensuring the first .fq is first.\nExpected output:\n[lung, [/scratch/users/.../training/nf-training/data/ggal/lung_1.fq]]\n[gut, [/scratch/users/.../training/nf-training/data/ggal/gut_1.fq, /scratch/users/.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/scratch/users/.../training/nf-training/data/ggal/liver_1.fq, /scratch/users/.../training/nf-training/data/ggal/liver_2.fq]]\nInside the map function, the following can be used to extract the sample name from the .fq files. file is the local variable defined inside the function that represents each .fq file. 
The getName() method will return the file name without the full path, and replaceAll is used to remove the _2.fq and _1.fq file suffixes.\ndef group_key = file.getName().replaceAll(/_2.fq/,'').replaceAll(/_1.fq/,'')\nFor a full list of Nextflow file attributes, see here.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe fromPath channel is used to read all .fq files separately. The map function is then used to create a two-element tuple where the first element is a grouping key and the second element is the list of .fq file(s).\nreads_ch = Channel.fromPath(\"/home/sli/nextflow_training/training/nf-training/data/ggal/*.fq\")\n .map { file ->\n def group_key = file.getName().replaceAll(/_2.fq/,'').replaceAll(/_1.fq/,'')\n [group_key, file]\n }\n .groupTuple(sort: true)\n .view()\n\n\n\nNow, run the workflow up to the combine step. The quant_inputs_ch should now consist of:\n[liver, /scratch/users/.../work/cf/42458b80e050a466d62baf99d0c1cf/salmon_idx, [/scratch/users/.../nf-training/data/ggal/liver_1.fq, /scratch/users/.../nf-training/data/ggal/liver_2.fq]]\n[lung, /scratch/users/.../work/64/90a77a5f1ed5a0000f6620fd1fab9a/salmon_idx, [/scratch/users/.../nf-training/data/ggal/lung_1.fq]]\n[gut, /scratch/users/.../work/37/352b00bfb71156a9250150428ddf1d/salmon_idx, [/scratch/users/.../nf-training/data/ggal/gut_1.fq, /scratch/users/.../nf-training/data/ggal/gut_2.fq]]" + "objectID": "workshops/7.1_metadata_propagation.html#manipulating-metadata-and-channels", + "href": "workshops/7.1_metadata_propagation.html#manipulating-metadata-and-channels", + "title": "Nextflow Development - Metadata Proprogation", + "section": "7.2 Manipulating Metadata and Channels", + "text": "7.2 Manipulating Metadata and Channels\nThere are a number of use cases where we will be interested in manipulating our metadata and channels.\nHere we will look at 2 use cases.\n\n7.2.1 Matching input channels\nAs we have seen in examples/challenges in the operators section, it is important to ensure that the format of the channels that you provide as inputs match the process definition.\nparams.reads = \"/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/*_{1,2}.fq\"\n\nprocess printNumLines {\n input:\n path(reads)\n\n output:\n path(\"*txt\")\n\n script:\n \"\"\"\n wc -l ${reads}\n \"\"\"\n}\n\nworkflow {\n ch_input = Channel.fromFilePairs(\"$params.reads\")\n printNumLines( ch_input )\n}\nAs if the format does not match you will see and error similar to below:\n[myeung@papr-res-compute204 lesson7.1test]$ nextflow run test.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [agitated_faggin] DSL2 - revision: c210080493\n[- ] process > printNumLines -\nor if using nf-core template\nERROR ~ Error executing process > 'PMCCCGTRC_UMIHYBCAP:UMIHYBCAP:PREPARE_GENOME:BEDTOOLS_SLOP'\n\nCaused by:\n Not a valid path value type: java.util.LinkedHashMap ([id:genome_size])\n\n\nTip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`\n\n -- Check '.nextflow.log' file for details\nWhen encountering these errors there are two methods to correct this:\n\nChange the input definition in the process\nUse variations of the channel operators to correct the format of your channel\n\nThere are cases where changing the input definition is impractical (i.e. 
when using nf-core modules/subworkflows).\nLet’s take a look at some select modules.\nBEDTOOLS_SLOP\nBEDTOOLS_INTERSECT\n\n\n\n\n\n\nChallenge\n\n\n\nAssuming that you have the following inputs\nch_target = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed\")\nch_bait = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed\").map { fn -> [ [id: fn.baseName ], fn ] }\nch_sizes = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes\")\nWrite a mini workflow that:\n\nTakes the ch_target bedfile and extends the bed by 20bp on both sides using BEDTOOLS_SLOP (You can use the config definition below as a helper, or write your own as an additional challenge)\nTake the output from BEDTOOLS_SLOP and input this output with the ch_baits to BEDTOOLS_INTERSECT\n\nHINT: The modules can be imported from this location: /home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools\nHINT: You will need need the following operators to achieve this .map and .combine\n\n\n\n\n\n\n\n\nConfig\n\n\n\n\n\n\nprocess {\n withName: 'BEDTOOLS_SLOP' {\n ext.args = \"-b 20\"\n ext.prefix = \"extended.bed\"\n }\n\n withName: 'BEDTOOLS_INTERSECT' {\n ext.prefix = \"intersect.bed\"\n }\n}\n:::\n\n:::{.callout-caution collapse=\"true\"}\n## **Solution**\n```default\ninclude { BEDTOOLS_SLOP } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/slop/main'\ninclude { BEDTOOLS_INTERSECT } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/intersect/main'\n\n\nch_target = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed\")\nch_bait = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed\").map { fn -> [ [id: fn.baseName ], fn ] }\nch_sizes = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes\")\n\nworkflow {\n BEDTOOLS_SLOP ( ch_target.map{ fn -> [ [id:fn.baseName], fn ]}, ch_sizes)\n\n target_bait_bed = BEDTOOLS_SLOP.out.bed.combine( ch_bait )\n BEDTOOLS_INTERSECT( target_bait_bed, ch_sizes.map{ fn -> [ [id: fn.baseName], fn]} )\n}\nnextflow run nfcoretest.nf -profile singularity -c test2.config --outdir nfcoretest" }, { - "objectID": "workshops/6.1_operators.html#flatten", - "href": "workshops/6.1_operators.html#flatten", - "title": "Nextflow Development - Channel Operators", - "section": "6.1.4 flatten ", - "text": "6.1.4 flatten \nThe flatten operator flattens each item from a source channel and emits the elements separately. 
Deeply nested inputs are also flattened.\nChannel.of( [1, [2, 3]], 4, [5, [6]] )\n .flatten()\n .view()\nOutput:\n1\n2\n3\n4\n5\n6\n\nWithin the script block of the QUANTIFICATION process in the RNA-seq pipeline, we are assuming the reads are paired, and specify -1 ${reads[0]} -2 ${reads[1]} as inputs to salmon quant.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(salmon_index), path(reads)\n\n output:\n tuple val(sample_id) path(\"$sample_id\")\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nNow that the input reads can be either single or paired, the QUANTIFICATION process needs to be modified to allow for either input type. This can be done using the flatten() operator, and conditional script statements. Additionally, the size() method can be used to calculate the size of a list.\nThe script block can be changed to the following:\n script:\n def input_reads = [reads]\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -r $reads -o $sample_id\n \"\"\"\n else \n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\nFirst, a new variable input_reads is defined, which consists of the reads input being converted into a list. This has to be done since Nextflow will automatically convert a list of length 1 into a path within process. If the size() method was used on a path type input, it will return the size of the file in bytes, and not the list size. Therefore, all inputs must first be converted into a list in order to correctly caculate the number of files.\ndef input_reads = [reads]\nFor reads that are already in a list (ie. paired reads), this will nest the input into another list, for example:\n[ [ file1, file2 ] ]\nIf the size() operator is used on this input, it will always return 1 since the encompassing list only contains one element. Therefore, the flatten() operator has to be used to emit the files as separate elements.\nThe final definition to obtain the number of files in reads becomes:\ninput_reads.flatten().size()\nFor single reads, the input to salmon quant becomes -r $reads\n\nExercise\nCurrently the TRIMGALORE process only accounts for paired reads.\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nModify the process such that both single and paired reads can be used. 
For single reads, the following script block can be used:\n\"\"\"\ntrim_galore \\\\\n --gzip \\\\\n $reads\n\"\"\"\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n def input_reads = [reads]\n\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n trim_galore \\\\\n --gzip \\\\\n $reads\n \"\"\"\n else\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n\n}\n\n\n\nExtension\nModify the FASTQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the fastqc logs.\nModify the MULTIQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the generated html file.\nFinally, run the entire workflow, specifying an --outdir. The workflow block should look like this:\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n\n quant_inputs_ch = index_ch.combine(reads_ch, by: 0)\n quant_ch = QT(quant_inputs_ch)\n\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n\n fastqc_ch = FASTQC_one(reads_ch)\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe output block of both processes have been modified to be tuples containing a grouping key.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"fastqc_${sample_id}_logs\")\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(quantification)\n tuple val(sample_id), path(fastqc)\n\n output:\n tuple val(sample_id), path(\"*.html\")\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, Nextflow Patterns materials from Nextflow, nf-core nf-core tools documentation and nf-validation" + "objectID": "workshops/7.1_metadata_propagation.html#grouping-with-metadata", + "href": "workshops/7.1_metadata_propagation.html#grouping-with-metadata", + "title": "Nextflow Development - Metadata Proprogation", + "section": "7.3 Grouping with Metadata", + "text": "7.3 Grouping with Metadata\nEarlier we introduced the function groupTuple\n\nch_reads = Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .map { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n replicate -= ~/^rep/\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\n\n## Assume that we want to drop replicate from the meta and combine fastqs\n\nch_reads.map {\n meta, reads -> \n [ meta - meta.subMap('replicate') + [data_type: 'fastq'], reads ]\n }\n .groupTuple().view()" }, { - "objectID": "workshops/8.1_scatter_gather_output.html", - "href": "workshops/8.1_scatter_gather_output.html", - "title": "Nextflow Development - Outputs, Scatter, and Gather", + "objectID": "workshops/1.2_intro_nf_core.html", + "href": "workshops/1.2_intro_nf_core.html", + "title": "Introduction to nf-core", "section": "", - "text": "Objectives\n\n\n\n\nGain an understanding of how to structure nextflow published outputs\nGain an understanding of how to do scatter & gather processes" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#environment-setup", - "href": "workshops/8.1_scatter_gather_output.html#environment-setup", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "Environment Setup", - "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. 
A complete list of environment variables can be found here.\nThe training data can be cloned from:\ngit clone https://github.com/nextflow-io/training.git" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", - "href": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "RNA-seq Workflow and Module Files ", - "text": "RNA-seq Workflow and Module Files \nPreviously, we created three Nextflow files and one config file:\n├── nextflow.config\n├── rnaseq.nf\n├── modules.nf\n└── modules\n └── trimgalore.nf\n\nrnaseq.nf: main workflow script where parameters are defined and processes were called.\n\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nmodules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nmodules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\n\nnextflow.config: config file that enables singularity\n\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nRun the pipeline, specifying --outdir:\n>>> nextflow run rnaseq.nf --outdir output\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d\nexecutor > local (16)\n[93/d37ef0] process > INDEX [100%] 1 of 1 ✔\n[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔\n[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#organise-outputs", - "href": "workshops/8.1_scatter_gather_output.html#organise-outputs", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.1. Organise outputs", - "text": "8.1. Organise outputs\nThe output declaration block defines the channels used by the process to send out the results produced. However, this output only stays in the work/ directory if there is no publishDir directive specified.\nGiven each task is being executed in separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files in a results folder.\nTo store our workflow result files, you need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nThe above example will copy all html files created by the MULTIQC process into the directory path specified in the params.outdir" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", - "href": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.1.1. Store outputs matching a glob pattern", - "text": "8.1.1. Store outputs matching a glob pattern\nYou can use more than one publishDir to keep different outputs in separate directories. 
For each directive specify a different glob pattern using the pattern option to store into each directory only the files that match the provided pattern.\nFor example:\nreads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')\n\nprocess FOO {\n publishDir \"results/bam\", pattern: \"*.bam\"\n publishDir \"results/bai\", pattern: \"*.bai\"\n\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path(\"*.bam\")\n tuple val(sample_id), path(\"*.bai\")\n\n script:\n \"\"\"\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bam\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bai\n \"\"\"\n}\nExercise\nUse publishDir and pattern to keep the outputs from the trimgalore.nf into separate directories.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n publishDir \"$params.outdir/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"$params.outdir/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nOutput should now look like\n>>> tree ./output\n./output\n├── gut.html\n├── liver.html\n├── lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n2 directories, 15 files" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", - "href": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.1.2. Store outputs renaming files or in a sub-directory", - "text": "8.1.2. Store outputs renaming files or in a sub-directory\nThe publishDir directive also allow the use of saveAs option to give each file a name of your choice, providing a custom rule as a closure.\nprocess foo {\n publishDir 'results', saveAs: { filename -> \"foo_$filename\" }\n\n output: \n path '*.txt'\n\n '''\n touch this.txt\n touch that.txt\n '''\n}\nThe same pattern can be used to store specific files in separate directories depending on the actual name.\nprocess foo {\n publishDir 'results', saveAs: { filename -> filename.endsWith(\".zip\") ? 
\"zips/$filename\" : filename }\n\n output: \n path '*'\n\n '''\n touch this.txt\n touch that.zip\n '''\n}\nExercise\nModify the MULTIQC output with saveAs such that resulting folder is as follow:\n./output\n├── MultiQC\n│   ├── multiqc_gut.html\n│   ├── multiqc_liver.html\n│   └── multiqc_lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n3 directories, 15 files\n\n\n\n\n\n\nWarning\n\n\n\nYou need to remove existing output folder/files if you want to have a clean output. By default, nextflow will overwrite existing files, and keep all the remaining files in the same specified output directory.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? \"MultiQC/multiqc_$filename\" : filename }\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\n\n\n\nChallenge\nModify all the processes in rnaseq.nf such that we will have the following output structure\n./output\n├── gut\n│   ├── QC\n│   │   ├── fastqc_gut_logs\n│   │   │   ├── gut_1_fastqc.html\n│   │   │   ├── gut_1_fastqc.zip\n│   │   │   ├── gut_2_fastqc.html\n│   │   │   └── gut_2_fastqc.zip\n│   │   └── gut.html\n│   ├── report\n│   │   ├── gut_1.fq_trimming_report.txt\n│   │   └── gut_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── gut_1_val_1.fq.gz\n│   └── gut_2_val_2.fq.gz\n├── liver\n│   ├── QC\n│   │   ├── fastqc_liver_logs\n│   │   │   ├── liver_1_fastqc.html\n│   │   │   ├── liver_1_fastqc.zip\n│   │   │   ├── liver_2_fastqc.html\n│   │   │   └── liver_2_fastqc.zip\n│   │   └── liver.html\n│   ├── report\n│   │   ├── liver_1.fq_trimming_report.txt\n│   │   └── liver_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── liver_1_val_1.fq.gz\n│   └── liver_2_val_2.fq.gz\n└── lung\n ├── QC\n │   ├── fastqc_lung_logs\n │   │   ├── lung_1_fastqc.html\n │   │   ├── lung_1_fastqc.zip\n │   │   ├── lung_2_fastqc.html\n │   │   └── lung_2_fastqc.zip\n │   └── lung.html\n ├── report\n │   ├── lung_1.fq_trimming_report.txt\n │   └── lung_2.fq_trimming_report.txt\n └── trimmed_fastq\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n15 directories, 27 files\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess FASTQC {\n publishDir \"$params.outdir/$sample_id/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n //publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? 
\"MultiQC/multiqc_$filename\" : filename }\n publishDir \"$params.outdir/$quantification/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img'\n publishDir \"${params.outdir}/${sample_id}/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"${params.outdir}/${sample_id}/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#scatter", - "href": "workshops/8.1_scatter_gather_output.html#scatter", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.2 Scatter", - "text": "8.2 Scatter\nThe scatter operation involves distributing large input data into smaller chunks that can be analysed across multiple processes in parallel.\nOne very simple example of native scatter is how nextflow handles Channel factories with the Channel.fromPath or Channel.fromFilePairs method, where multiple input data is processed in parallel.\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { FASTQC as FASTQC_one } from './modules.nf'\n\nworkflow {\n fastqc_ch = FASTQC_one(reads_ch)\n}\nFrom the above snippet from our rnaseq.nf, we will get three execution of FASTQC_one for each pairs of our input data.\nOther than natively splitting execution by input data, Nextflow also provides operators to scatter existing input data for various benefits, such as faster processing. 
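One such operator is splitFasta, which can break a single FASTA file into fixed-size chunks that downstream processes then receive in parallel. A minimal sketch (the input path and chunk size here are hypothetical):
workflow {
    Channel.fromPath('data/sequences.fa')      // hypothetical multi-record FASTA
        | splitFasta(by: 10, file: true)       // emit chunks of 10 records, each written to its own file
        | view()
}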
For example:\n\nsplitText\nsplitFasta\nsplitFastq\nmap with from or fromList\nflatten" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", - "href": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.2.1 Process per file chunk", - "text": "8.2.1 Process per file chunk\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess count_line {\n debug true\n input: \n file x\n\n script:\n \"\"\"\n wc -l $x \n \"\"\"\n}\n\nworkflow {\n Channel.fromPath(params.infile) \\\n | splitText(by: params.size, file: true) \\\n | count_line\n}\nExercise\nparams.infile = \"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.size = 1000\n\nworkflow {\n Channel.fromFilePairs(params.infile, flat: true) \\\n | splitFastq(by: params.size, pe: true, file: true) \\\n | view()\n}" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-range", - "href": "workshops/8.1_scatter_gather_output.html#process-per-file-range", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.2.1 Process per file range", - "text": "8.2.1 Process per file range\nExercise\nChannel.from(1..22) \\\n | map { chr -> [\"sample${chr}\", file(\"${chr}.indels.vcf\"), file(\"${chr}.vcf\")] } \\\n | view()\n>> nextflow run test_scatter.nf\n\n[sample1, /scratch/users/${users}/1.indels.vcf, /scratch/users/${users}/1.vcf]\n[sample2, /scratch/users/${users}/2.indels.vcf, /scratch/users/${users}/2.vcf]\n[sample3, /scratch/users/${users}/3.indels.vcf, /scratch/users/${users}/3.vcf]\n[sample4, /scratch/users/${users}/4.indels.vcf, /scratch/users/${users}/4.vcf]\n[sample5, /scratch/users/${users}/5.indels.vcf, /scratch/users/${users}/5.vcf]\n[sample6, /scratch/users/${users}/6.indels.vcf, /scratch/users/${users}/6.vcf]\n[sample7, /scratch/users/${users}/7.indels.vcf, /scratch/users/${users}/7.vcf]\n[sample8, /scratch/users/${users}/8.indels.vcf, /scratch/users/${users}/8.vcf]\n[sample9, /scratch/users/${users}/9.indels.vcf, /scratch/users/${users}/9.vcf]\n[sample10, /scratch/users${users}/10.indels.vcf, /scratch/users${users}/10.vcf]\n[sample11, /scratch/users${users}/11.indels.vcf, /scratch/users${users}/11.vcf]\n[sample12, /scratch/users${users}/12.indels.vcf, /scratch/users${users}/12.vcf]\n[sample13, /scratch/users${users}/13.indels.vcf, /scratch/users${users}/13.vcf]\n[sample14, /scratch/users${users}/14.indels.vcf, /scratch/users${users}/14.vcf]\n[sample15, /scratch/users${users}/15.indels.vcf, /scratch/users${users}/15.vcf]\n[sample16, /scratch/users${users}/16.indels.vcf, /scratch/users${users}/16.vcf]\n[sample17, /scratch/users${users}/17.indels.vcf, /scratch/users${users}/17.vcf]\n[sample18, /scratch/users${users}/18.indels.vcf, /scratch/users${users}/18.vcf]\n[sample19, /scratch/users${users}/19.indels.vcf, /scratch/users${users}/19.vcf]\n[sample20, /scratch/users${users}/20.indels.vcf, /scratch/users${users}/20.vcf]\n[sample21, /scratch/users${users}/21.indels.vcf, /scratch/users${users}/21.vcf]\n[sample22, /scratch/users${users}/22.indels.vcf, /scratch/users${users}/22.vcf]\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > 
${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22)) | view()\n}\nChallenge\nHow do we include chr X and Y into the above split by chromosome?\n\n\n\n\n\n\nSolution\n\n\n\n\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | view()\n}" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#gather", - "href": "workshops/8.1_scatter_gather_output.html#gather", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.3 Gather", - "text": "8.3 Gather\nThe gather operation consolidates results from parallel computations (can be from scatter) into a centralized process for aggregation and further processing.\nSome of the Nextflow provided operators that facilitate this gather operation, include:\n\ncollect\ncollectFile\nmap + groupTuple" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", - "href": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.3.1. Process all outputs altogether", - "text": "8.3.1. Process all outputs altogether\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collect | view()\n}" - }, - { - "objectID": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", - "href": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", - "title": "Nextflow Development - Outputs, Scatter, and Gather", - "section": "8.3.2. Collect outputs into a file", - "text": "8.3.2. Collect outputs into a file\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collectFile(name: 'merged.bed', newLine:true) | view()\n}\nExercise\nworkflow {\n Channel.fromPath(\"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_1.fq\", checkIfExists: true) \\\n | collectFile(name: 'combined_1.fq', newLine:true) \\\n | view\n}" + "text": "Objectives\n\n\n\n\nLearn about the core features of nf-core.\nLearn the terminology used by nf-core.\nUse Nextflow to pull and run the nf-core/testpipeline workflow\n\n\n\nIntroduction to nf-core: Introduce nf-core features and concepts, structures, tools, and example nf-core pipelines\n\n1.2.1. What is nf-core?\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. 
These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS\n\n\n\n\n1.2.2. Executing an nf-core workflow\nThe nf-core website has a full list of workflows and asssociated documentation tno be explored.\nEach workflow has a dedicated page that includes expansive documentation that is split into 7 sections:\n\nIntroduction\n\nAn introduction and overview of the workflow\n\nResults\n\nExample output files generated from the full test dataset\n\nUsage docs\n\nDescriptions of how to execute the workflow\n\nParameters\n\nGrouped workflow parameters with descriptions\n\nOutput docs\n\nDescriptions and examples of the expected output files\n\nReleases & Statistics\n\nWorkflow version history and statistics\n\n\nAs nf-core is a community development project the code for a pipeline can be changed at any time. To ensure that you have locked in a specific version of a pipeline you can use Nextflow’s built-in functionality to pull a workflow. The Nextflow pull command can download and cache workflows from GitHub repositories:\nnextflow pull nf-core/<pipeline>\nNextflow run will also automatically pull the workflow if it was not already available locally:\nnextflow run nf-core/<pipeline>\nNextflow will pull the default git branch if a workflow version is not specified. This will be the master branch for nf-core workflows with a stable release. nf-core workflows use GitHub releases to tag stable versions of the code and software. 
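For example, a tagged release can be requested explicitly when pulling or running a workflow (the version shown here is only illustrative):
nextflow pull nf-core/rnaseq -r 3.11.1
nextflow run nf-core/rnaseq -r 3.11.1 -profile test,singularity --outdir my_results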
You will always be able to execute a previous version of a workflow once it is released using the -revision or -r flag.\nFor this section of the workshop we will be using the nf-core/testpipeline as an example.\nAs we will be running some bioinformatics tools, we will need to make sure of the following:\n\nWe are not running on login node\nsingularity module is loaded (module load singularity/3.7.3)\n\n\n\n\n\n\n\nSetup an interactive session\n\n\n\nsrun --pty -p prod_short --mem 20GB --cpus-per-task 2 -t 0-2:00 /bin/bash\n\nEnsure the required modules are loaded\nmodule list\nCurrently Loaded Modulefiles:\n 1) java/jdk-17.0.6 2) nextflow/23.04.1 3) squashfs-tools/4.5 4) singularity/3.7.3\n\n\n\nWe will also create a separate output directory for this section.\ncd /scratch/users/<your-username>/nfWorkshop; mkdir ./lesson1.2 && cd $_\nThe base command we will be using for this section is:\nnextflow run nf-core/testpipeline -profile test,singularity --outdir my_results\n\n\n1.2.3. Workflow structure\nnf-core workflows start from a common template and follow the same structure. Although you won’t need to edit code in the workflow project directory, having a basic understanding of the project structure and some core terminology will help you understand how to configure its execution.\nLet’s take a look at the code for the nf-core/rnaseq pipeline.\nNextflow DSL2 workflows are built up of subworkflows and modules that are stored as separate .nf files.\nMost nf-core workflows consist of a single workflow file (there are a few exceptions). This is the main <workflow>.nf file that is used to bring everything else together. Instead of having one large monolithic script, it is broken up into a combination of subworkflows and modules.\nA subworkflow is a groups of modules that are used in combination with each other and have a common purpose. Subworkflows improve workflow readability and help with the reuse of modules within a workflow. The nf-core community also shares subworkflows in the nf-core subworkflows GitHub repository. Local subworkflows are workflow specific that are not shared in the nf-core subworkflows repository.\nLet’s take a look at the BAM_STATS_SAMTOOLS subworkflow.\nThis subworkflow is comprised of the following modules: - SAMTOOLS_STATS - SAMTOOLS_IDXSTATS, and - SAMTOOLS_FLAGSTAT\nA module is a wrapper for a process, most modules will execute a single tool and contain the following definitions: - inputs - outputs, and - script block.\nLike subworkflows, modules can also be shared in the nf-core modules GitHub repository or stored as a local module. All modules from the nf-core repository are version controlled and tested to ensure reproducibility. Local modules are workflow specific that are not shared in the nf-core modules repository.\n\n\n1.2.4. Viewing parameters\nEvery nf-core workflow has a full list of parameters on the nf-core website. When viewing these parameters online, you will also be shown a description and the type of the parameter. 
Some parameters will have additional text to help you understand when and how a parameter should be used.\n\n\n\n\n\nParameters and their descriptions can also be viewed in the command line using the run command with the --help parameter:\nnextflow run nf-core/<workflow> --help\n\n\n\n\n\n\nChallenge\n\n\n\nView the parameters for the nf-core/testpipeline workflow using the command line:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe nf-core/testpipeline workflow parameters can be printed using the run command and the --help option:\nnextflow run nf-core/testpipeline --help\n\n\n\n\n\n1.2.5. Parameters in the command line\nParameters can be customized using the command line. Any parameter can be configured on the command line by prefixing the parameter name with a double dash (--):\nnextflow run nf-core/<workflow> --<parameter>\n\n\n\n\n\n\nTip\n\n\n\nNextflow options are prefixed with a single dash (-) and workflow parameters are prefixed with a double dash (--).\n\n\nDepending on the parameter type, you may be required to add additional information after your parameter flag. For example, for a string parameter, you would add the string after the parameter flag:\nnextflow run nf-core/<workflow> --<parameter> string\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite animal using the multiqc_title parameter using a command line flag:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nAdd the --multiqc_title flag to your command and execute it. Use the -resume option to save time:\nnextflow run nf-core/testpipeline -profile test,singularity --multiqc_title koala --outdir my_results -resume\n\n\n\nIn this example, you can check your parameter has been applied by listing the files created in the results folder (my_results):\nls my_results/multiqc/\n\n\n1.2.6. Configuration files\nConfiguration files are .config files that can contain various workflow properties. Custom paths passed in the command-line using the -c option:\nnextflow run nf-core/<workflow> -profile test,docker -c <path/to/custom.config>\nMultiple custom .config files can be included at execution by separating them with a comma (,).\nCustom configuration files follow the same structure as the configuration file included in the workflow directory. Configuration properties are organized into scopes by grouping the properties in the same scope using the curly brackets notation. For example:\nalpha {\n x = 1\n y = 'string value..'\n}\nScopes allow you to quickly configure settings required to deploy a workflow on different infrastructure using different software management. For example, the executor scope can be used to provide settings for the deployment of a workflow on a HPC cluster. Similarly, the singularity scope controls how Singularity containers are executed by Nextflow. Multiple scopes can be included in the same .config file using a mix of dot prefixes and curly brackets. 
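For instance, the alpha scope shown above could equally be written with dot prefixes, and the two styles can be mixed in one custom file (the values are placeholders):
// the alpha scope, written with the dot prefix notation
alpha.x = 1
alpha.y = 'string value..'

// both styles can sit in the same custom file
singularity.enabled = true

process {
    executor = 'slurm'
}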
A full list of scopes is described in detail here.\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite color using the multiqc_title parameter in a custom my_custom.config file:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nCreate a custom my_custom.config file that contains your favourite colour, e.g., blue:\nparams {\n multiqc_title = \"blue\"\n}\nInclude the custom .config file in your execution command with the -c option:\nnextflow run nf-core/testpipeline --outdir my_results -profile test,singularity -resume -c my_custom.config\nCheck that it has been applied:\nls my_results/multiqc/\nWhy did this fail?\nYou can not use the params scope in custom configuration files. Parameters can only be configured using the -params-file option and the command line. While parameter is listed as a parameter on the STDOUT, it was not applied to the executed command.\nWe will revisit this at the end of the module\n\n\n\n\n\n1.2.7 Parameter files\nParameter files are used to define params options for a pipeline, generally written in the YAML format. They are added to a pipeline with the flag --params-file\nExample YAML:\n\"<parameter1_name>\": 1,\n\"<parameter2_name>\": \"<string>\",\n\"<parameter3_name>\": true\n\n\n\n\n\n\nChallenge\n\n\n\nBased on the failed application of the parameter multiqc_title create a my_params.yml setting multiqc_title to your favourite colour. Then re-run the pipeline with the your my_params.yml\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nSet up my_params.yml\nmultiqc_title: \"black\"\nnextflow run nf-core/testpipeline -profile test,singularity --params-file my_params.yml --outdir Lesson1_2\n\n\n\n\n\n1.2.8. Default configuration files\nAll parameters will have a default setting that is defined using the nextflow.config file in the workflow project directory. By default, most parameters are set to null or false and are only activated by a profile or configuration file.\nThere are also several includeConfig statements in the nextflow.config file that are used to load additional .config files from the conf/ folder. Each additional .config file contains categorized configuration information for your workflow execution, some of which can be optionally included:\n\nbase.config\n\nIncluded by the workflow by default.\nGenerous resource allocations using labels.\nDoes not specify any method for software management and expects software to be available (or specified elsewhere).\n\nigenomes.config\n\nIncluded by the workflow by default.\nDefault configuration to access reference files stored on AWS iGenomes.\n\nmodules.config\n\nIncluded by the workflow by default.\nModule-specific configuration options (both mandatory and optional).\n\n\nNotably, configuration files can also contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a workflow by using the -profile command option:\nnextflow run nf-core/<workflow> -profile <profile>\nProfiles used by nf-core workflows include:\n\nSoftware management profiles\n\nProfiles for the management of software using software management tools, e.g., docker, singularity, and conda.\n\nTest profiles\n\nProfiles to execute the workflow with a standardized set of test data and parameters, e.g., test and test_full.\n\n\nMultiple profiles can be specified in a comma-separated (,) list when you execute your command. 
The order of profiles is important as they will be read from left to right:\nnextflow run nf-core/<workflow> -profile test,singularity\nnf-core workflows are required to define software containers and conda environments that can be activated using profiles.\n\n\n\n\n\n\nTip\n\n\n\nIf you’re computer has internet access and one of Conda, Singularity, or Docker installed, you should be able to run any nf-core workflow with the test profile and the respective software management profile ‘out of the box’. The test data profile will pull small test files directly from the nf-core/test-data GitHub repository and run it on your local system. The test profile is an important control to check the workflow is working as expected and is a great way to trial a workflow. Some workflows have multiple test profiles for you to test.\n\n\n\n\n\n\n\n\nKey points\n\n\n\n\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nNextflow can be used to pull nf-core workflows.\nnf-core workflows follow similar structures\nnf-core workflows are configured using parameters and profiles\n\n\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { "objectID": "workshops/4.1_modules.html", @@ -455,53 +371,137 @@ "text": "2.2.1. Nextflow log\nIt is important to keep a record of the commands you have run to generate your results. Nextflow helps with this by creating and storing metadata and logs about the run in hidden files and folders in your current directory (unless otherwise specified). This data can be used by Nextflow to generate reports. It can also be queried using the Nextflow log command:\nnextflow log\nThe log command has multiple options to facilitate the queries and is especially useful while debugging a workflow and inspecting execution metadata. You can view all of the possible log options with -h flag:\nnextflow log -h\nTo query a specific execution you can use the RUN NAME or a SESSION ID:\nnextflow log <run name>\nTo get more information, you can use the -f option with named fields. For example:\nnextflow log <run name> -f process,hash,duration\nThere are many other fields you can query. You can view a full list of fields with the -l option:\nnextflow log -l\n\n\n\n\n\n\nChallenge\n\n\n\nUse the log command to view with process, hash, and script fields for your tasks from your most recent Nextflow execution.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nUse the log command to get a list of you recent executions:\nnextflow log\nTIMESTAMP DURATION RUN NAME STATUS REVISION ID SESSION ID COMMAND \n2023-11-21 22:43:14 14m 17s jovial_angela OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:05:49 1m 36s marvelous_shannon OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\n2023-11-21 23:10:00 1m 35s deadly_babbage OK 3bec2331ca 319751c3-25a6-4085-845c-6da28cd771df nextflow run nf-core/rnaseq\nQuery the process, hash, and script using the -f option for the most recent run:\nnextflow log marvelous_shannon -f process,hash,script\n\n[... 
truncated ...]\n\nNFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS 7c/f936d4 \n featureCounts \\\n -B -C -g gene_biotype -t exon \\\n -p \\\n -T 2 \\\n -a chr22_with_ERCC92.gtf \\\n -s 2 \\\n -o HBR_Rep1_ERCC.featureCounts.txt \\\n HBR_Rep1_ERCC.markdup.sorted.bam\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:SUBREAD_FEATURECOUNTS\":\n subread: $( echo $(featureCounts -v 2>&1) | sed -e \"s/featureCounts v//g\")\n END_VERSIONS\n\n[... truncated ... ]\n\nNFCORE_RNASEQ:RNASEQ:MULTIQC 7a/8449d7 \n multiqc \\\n -f \\\n \\\n \\\n .\n\n cat <<-END_VERSIONS > versions.yml\n \"NFCORE_RNASEQ:RNASEQ:MULTIQC\":\n multiqc: $( multiqc --version | sed -e \"s/multiqc, version //g\" )\n END_VERSIONS\n \n\n\n\n\n\n2.2.2. Execution cache and resume\nTask execution caching is an essential feature of modern workflow managers. As such, Nextflow provides an automated caching mechanism for every execution. When using the Nextflow -resume option, successfully completed tasks from previous executions are skipped and the previously cached results are used in downstream tasks.\nNextflow caching mechanism works by assigning a unique ID to each task. The task unique ID is generated as a 128-bit hash value composing the the complete file path, file size, and last modified timestamp. These ID’s are used to create a separate execution directory where the tasks are executed and the outputs are stored. Nextflow will take care of the inputs and outputs in these folders for you.\nYou can re-launch the previously executed nf-core/rnaseq workflow again, but with a -resume flag, and observe the progress. Notice the time it takes to complete the workflow.\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \n\n[80/ec6ff8] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF2BED (chr22_with_ERCC92.gtf) [100%] 1 of 1, cached: 1 ✔\n[1a/7bec9c] process > NFCORE_RNASEQ:RNASEQ:PREPARE_GENOME:GTF_GENE_FILTER (chr22_with_ERCC92.fa) [100%] 1 of 1, cached: 1 ✔\nExecuting this workflow will create a my_results directory with selected results files and add some further sub-directories into the work directory\nIn the schematic above, the hexadecimal numbers, such as 80/ec6ff8, identify the unique task execution. 
These numbers are also the prefix of the work directories where each task is executed.\nYou can inspect the files produced by a task by looking inside the work directory and using these numbers to find the task-specific execution path:\nls work/80/ec6ff8ba69a8b5b8eede3679e9f978/\nIf you look inside the work directory of a FASTQC task, you will find the files that were staged and created when this task was executed:\n>>> ls -la work/e9/60b2e80b2835a3e1ad595d55ac5bf5/ \n\ntotal 15895\ndrwxrwxr-x 2 rlupat rlupat 4096 Nov 22 03:39 .\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 03:38 ..\n-rw-rw-r-- 1 rlupat rlupat 0 Nov 22 03:39 .command.begin\n-rw-rw-r-- 1 rlupat rlupat 9509 Nov 22 03:39 .command.err\n-rw-rw-r-- 1 rlupat rlupat 9609 Nov 22 03:39 .command.log\n-rw-rw-r-- 1 rlupat rlupat 100 Nov 22 03:39 .command.out\n-rw-rw-r-- 1 rlupat rlupat 10914 Nov 22 03:39 .command.run\n-rw-rw-r-- 1 rlupat rlupat 671 Nov 22 03:39 .command.sh\n-rw-rw-r-- 1 rlupat rlupat 231 Nov 22 03:39 .command.trace\n-rw-rw-r-- 1 rlupat rlupat 1 Nov 22 03:39 .exitcode\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2368 Nov 22 03:39 HBR_Rep1_ERCC_1.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 697080 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 490526 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 6735205 Nov 22 03:39 HBR_Rep1_ERCC_1_val_1.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 63 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz -> HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 2688 Nov 22 03:39 HBR_Rep1_ERCC_2.fastq.gz_trimming_report.txt\n-rw-rw-r-- 1 rlupat rlupat 695591 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.html\n-rw-rw-r-- 1 rlupat rlupat 485732 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2_fastqc.zip\n-rw-rw-r-- 1 rlupat rlupat 7088948 Nov 22 03:39 HBR_Rep1_ERCC_2_val_2.fq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read1.fastq.gz\nlrwxrwxrwx 1 rlupat rlupat 102 Nov 22 03:39 HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz -> /data/seqliner/test-data/rna-seq/fastq/HBR_Rep1_ERCC-Mix2_Build37-ErccTranscripts-chr22.read2.fastq.gz\n-rw-rw-r-- 1 rlupat rlupat 109 Nov 22 03:39 versions.yml\nThe FASTQC process runs twice, executing in a different work directories for each set of inputs. Therefore, in the previous example, the work directory [e9/60b2e8] represents just one of the four sets of input data that was processed.\nIt’s very likely you will execute a workflow multiple times as you find the parameters that best suit your data. You can save a lot of spaces (and time) by resuming a workflow from the last step that was completed successfully and/or unmodified.\nIn practical terms, the workflow is executed from the beginning. However, before launching the execution of a process, Nextflow uses the task unique ID to check if the work directory already exists and that it contains a valid command exit state with the expected output files. If this condition is satisfied, the task execution is skipped and previously computed results are used as the process results.\nNotably, the -resume functionality is very sensitive. 
Even touching a file in the work directory can invalidate the cache.\n\n\n\n\n\n\nChallenge\n\n\n\nInvalidate the cache by touching a .fastq.gz file in a FASTQC task work directory (you can use the touch command). Execute the workflow again with the -resume option to show that the cache has been invalidated.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nExecute the workflow for the first time (if you have not already).\nUse the task ID shown for the FASTQC process and use it to find and touch a the sample1_R1.fastq.gz file:\ntouch work/ff/21abfa87cc7cdec037ce4f36807d32/HBR_Rep1_ERCC_1.fastq.gz\nExecute the workflow again with the -resume command option:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n --input samplesheet.csv \\\n --outdir ./my_results \\\n --fasta $materials/ref/chr22_with_ERCC92.fa \\\n --gtf $materials/ref/chr22_with_ERCC92.gtf \\\n -profile singularity \\\n --skip_markduplicates true \\\n --save_trimmed true \\\n --save_unaligned true \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nYou should see that some task were invalid and were executed again.\nWhy did this happen?\nIn this example, the cache of two FASTQC tasks were invalid. The fastq file we touch is used by in the pipeline in multiple places. Thus, touching the symlink for this file and changing the date of last modification disrupted two task executions.\n\n\n\n\n\n2.2.3. Troubleshoot warning and error messages\nWhile our previous workflow execution completed successfully, there were a couple of warning messages that may be cause for concern:\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 20-Nov-2023 00:29:04\nDuration : 10m 15s\nCPU hours : 0.3 \nSucceeded : 72\n\n\n\n\n\n\nHandling dodgy error messages 🤬\n\n\n\nThe first warning message isn’t very descriptive (see this pull request). You might come across issues like this when running nf-core pipelines, too. Bug reports and user feedback is very important to open source software communities like nf-core. If you come across any issues, submit a GitHub issue or start a discussion in the relevant nf-core Slack channel so others are aware and it can be addressed by the pipeline’s developers.\n\n\n➤ Take a look at the MultiQC report, as directed by the second message. You can find the MultiQC report in the lesson2.1/ directory:\nls -la lesson2.1/multiqc/star_salmon/\ntotal 1402\ndrwxrwxr-x 4 rlupat rlupat 4096 Nov 22 00:29 .\ndrwxrwxr-x 3 rlupat rlupat 4096 Nov 22 00:29 ..\ndrwxrwxr-x 2 rlupat rlupat 8192 Nov 22 00:29 multiqc_data\ndrwxrwxr-x 5 rlupat rlupat 4096 Nov 22 00:29 multiqc_plots\n-rw-rw-r-- 1 rlupat rlupat 1419998 Nov 22 00:29 multiqc_report.html\n➤ Download the multiqc_report.html the file navigator panel on the left side of your VS Code window by right-clicking on it and then selecting Download. Open the file on your computer.\nTake a look a the section labelled WARNING: Fail Strand Check\nThe warning we have received is indicating that the read strandedness we specified in our samplesheet.csv and inferred strandedness identified by the RSeqQC process in the pipeline do not match. 
It looks like the test samplesheet have incorrectly specified strandedness as forward in the samplesheet.csv when our raw reads actually show an equal distribution of sense and antisense reads.\nFor those who are not familiar with RNAseq data, incorrectly specified strandedness may negatively impact the read quantification step (process: Salmon quant) and give us inaccurate results. So, let’s clarify how the Salmon quant process is gathering strandedness information for our input files by default and find a way to address this with the parameters provided by the nf-core/rnaseq pipeline.\n\n\n\n2.2.4. Identify the run command for a process\nTo observe exactly what command is being run for a process, we can attempt to infer this information from the module’s main.nf script in the modules/ directory. However, given all the different parameters that may be applied at the process level, this may not be very clear.\n➤ Take a look at the Salmon quant main.nf file:\nnf-core-rnaseq-3.11.1/workflow/modules/nf-core/salmon/quant/main.nf\nUnless you are familiar with developing nf-core pipelines, it can be very hard to see what is actually happening in the code, given all the different variables and conditional arguments inside this script. Above the script block we can see strandedness is being applied using a few different conditional arguments. Instead of trying to infer how the $strandedness variable is being defined and applied to the process, let’s use the hidden command files saved for this task in the work/ directory.\n\n\n\n\n\n\nHidden files in the work directory!\n\n\n\nRemember that the pipeline’s results are cached in the work directory. In addition to the cached files, each task execution directories inside the work directory contains a number of hidden files:\n\n.command.sh: The command script run for the task.\n.command.run: The command wrapped used to run the task.\n.command.out: The task’s standard output log.\n.command.err: The task’s standard error log.\n.command.log: The wrapper execution output.\n.command.begin: A file created as soon as the job is launched.\n.exitcode: A file containing the task exit code (0 if successful)\n\n\n\nWith nextflow log command that we discussed previously, there are multiple options to facilitate the queries and is especially useful while debugging a pipeline and while inspecting pipeline execution metadata.\nTo understand how Salmon quant is interpreting strandedness, we’re going to use this command to track down the hidden .command.sh scripts for each Salmon quant task that was run. This will allow us to find out how Salmon quant handles strandedness and if there is a way for us to override this.\n➤ Use the Nextflow log command to get the unique run name information of the previously executed pipelines:\nnextflow log <run-name>\nThat command will list out all the work subdirectories for all processes run.\nAnd we now need to find the specific hidden.command.sh for Salmon tasks. But how to find them? 🤔\n➤ Let’s add some custom bash code to query a Nextflow run with the run name from the previous lesson. First, save your run name in a bash variable. 
For example:\nrun_name=marvelous_shannon\n➤ And let’s save the tool of interest (salmon) in another bash variable to pull it from a run command:\ntool=salmon\n➤ Next, run the following bash command:\nnextflow log ${run_name} | while read line;\n do\n cmd=$(ls ${line}/.command.sh 2>/dev/null);\n if grep -q $tool $cmd;\n then \n echo $cmd; \n fi; \n done \nThat will list all process .command.sh scripts containing ‘salmon’. There are a few different processes that run Salmon to perform other steps in the workflow. We are looking for Salmon quant which performs the read quantification:\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/57/fba8f9a2385dac5fa31688ba1afa9b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/30/0113a58c14ca8d3099df04ebf388f3/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/ec/95d6bd12d578c3bce22b5de4ed43fe/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/49/6fedcb09e666432ae6ddf8b1e8f488/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/b4/2ca8d05b049438262745cde92955e9/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/38/875d68dae270504138bb3d72d511a7/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/72/776810a99695b1c114cbb103f4a0e6/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/1c/dc3f54cc7952bf55e6742dd4783392/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/f3/5116a5b412bde7106645671e4c6ffb/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/17/fb0c791810f42a438e812d5c894ebf/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/4c/931a9b60b2f3cf770028854b1c673b/.command.sh\n/scratch/users/rlupat/nfWorkshop/lesson2.1/work/91/e1c99d8acb5adf295b37fd3bbc86a5/.command.sh\nCompared with the salmon quant main.nf file, we get a lot more fine scale details from the .command.sh process scripts:\n>>> cat main.nf\nsalmon quant \\\\\n --geneMap $gtf \\\\\n --threads $task.cpus \\\\\n --libType=$strandedness \\\\\n $reference \\\\\n $input_reads \\\\\n $args \\\\\n -o $prefix\n>>> cat .command.sh\nsalmon quant \\\n --geneMap chr22_with_ERCC92.gtf \\\n --threads 2 \\\n --libType=ISF \\\n -t genome.transcripts.fa \\\n -a HBR_Rep1_ERCC.Aligned.toTranscriptome.out.bam \\\n \\\n -o HBR_Rep1_ERCC\nLooking at the nf-core/rnaseq Parameter documentation and Salmon documentation, we found that we can override this default using the --salmon_quant_libtype A parameter to indicate our data is unstranded and override samplesheet.csv input.\n\n\n\n\n\n\nHow do I get rid of the strandedness check warning message?\n\n\n\nIf we want to get rid of the warning message Please check MultiQC report: 2/2 samples failed strandedness check, we’ll have to change the strandedness fields in our samplesheet.csv. Keep in mind, doing this will invalidate the pipeline’s cache and cause the pipeline to run from the beginning.\n\n\n\n\n\n2.2.5. Write a parameter file\nFrom the previous section we learn that Nextflow accepts either yaml or json formats for parameter files. Any of the pipeline-specific parameters can be supplied to a Nextflow pipeline in this way.\n\n\n\n\n\n\nChallenge\n\n\n\nFill in the parameters file below and save as workshop-params.yaml. 
This time, include the --salmon_quant_libtype A parameter.\n💡 YAML formatting tips!\n\nStrings need to be inside double quotes\nBooleans (true/false) and numbers do not require quotes\n\ninput: \"\"\noutdir: \"lesson2.2\"\nfasta: \"\"\ngtf: \"\"\nstar_index: \"\"\nsalmon_index: \"\"\nskip_markduplicates: \nsave_trimmed: \nsave_unaligned: \nsalmon_quant_libtype: \"A\" \n\n\n\n\n2.2.6. Apply the parameter file\n➤ Once your params file has been saved, run:\nnextflow run nf-core/rnaseq -r 3.11.1 \\\n -params-file workshop-params.yaml\n -profile singularity \\\n --max_memory '6.GB' \\\n --max_cpus 2 \\\n -resume \nThe number of pipeline-specific parameters we’ve added to our run command has been significantly reduced. The only -- parameters we’ve provided to the run command relate to how the pipeline is executed on our interative job. These resource limits won’t be applicable to others who will run the pipeline on a different infrastructure.\nAs the workflow runs a second time, you will notice 4 things:\n\nThe command is much tidier thanks to offloading some parameters to the params file\nThe -resume flag. Nextflow has lots of run options including the ability to use cached output!\nSome processes will be pulled from the cache. These processes remain unaffected by our addition of a new parameter.\n\nThis run of the pipeline will complete in a much shorter time.\n\n-[nf-core/rnaseq] Pipeline completed successfully with skipped sampl(es)-\n-[nf-core/rnaseq] Please check MultiQC report: 2/2 samples failed strandedness check.-\nCompleted at: 21-Apr-2023 05:58:06\nDuration : 1m 51s\nCPU hours : 0.3 (82.2% cached)\nSucceeded : 11\nCached : 55\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { - "objectID": "workshops/7.1_metadata_proprogation.html", - "href": "workshops/7.1_metadata_proprogation.html", - "title": "Nextflow Development - Metadata Proprogation", + "objectID": "workshops/8.1_scatter_gather_output.html", + "href": "workshops/8.1_scatter_gather_output.html", + "title": "Nextflow Development - Outputs, Scatter, and Gather", "section": "", - "text": "Objectives\n\n\n\n\nGain and understanding of how to manipulate and proprogate metadata" + "text": "Objectives\n\n\n\n\nGain an understanding of how to structure nextflow published outputs\nGain an understanding of how to do scatter & gather processes" }, { - "objectID": "workshops/7.1_metadata_proprogation.html#environment-setup", - "href": "workshops/7.1_metadata_proprogation.html#environment-setup", - "title": "Nextflow Development - Metadata Proprogation", + "objectID": "workshops/8.1_scatter_gather_output.html#environment-setup", + "href": "workshops/8.1_scatter_gather_output.html#environment-setup", + "title": "Nextflow Development - Outputs, Scatter, and Gather", "section": "Environment Setup", "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environmental variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. 
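For example, the cache setting used above could simply be appended to that file (same path as in this workshop setup):
echo 'export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow' >> ~/.bashrc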
A complete list of environment variables can be found here.\nThe training data can be cloned from:\ngit clone https://github.com/nextflow-io/training.git" }, { - "objectID": "workshops/7.1_metadata_proprogation.html#metadata-parsing", - "href": "workshops/7.1_metadata_proprogation.html#metadata-parsing", - "title": "Nextflow Development - Metadata Proprogation", - "section": "7.1 Metadata Parsing", - "text": "7.1 Metadata Parsing\nWe have covered a few different methods of metadata parsing.\n\n7.1.1 First Pass: .fromFilePairs\nA first pass attempt at pulling these files into Nextflow might use the fromFilePairs method:\nworkflow {\n Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .view\n}\nNextflow will pull out the first part of the fastq filename and returned us a channel of tuple elements where the first element is the filename-derived ID and the second element is a list of two fastq files.\nThe id is stored as a simple string. We’d like to move to using a map of key-value pairs because we have more than one piece of metadata to track. In this example, we have sample, replicate, tumor/normal, and treatment. We could add extra elements to the tuple, but this changes the ‘cardinality’ of the elements in the channel and adding extra elements would require updating all downstream processes. A map is a single object and is passed through Nextflow channels as one value, so adding extra metadata fields will not require us to change the cardinality of the downstream processes.\nThere are a couple of different ways we can pull out the metadata\nWe can use the tokenize method to split our id. To sanity-check, I just pipe the result directly into the view operator.\nworkflow {\n Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .map { id, reads ->\n tokens = id.tokenize(\"_\")\n }\n .view\n}\nIf we are confident about the stability of the naming scheme, we can destructure the list returned by tokenize and assign them to variables directly:\nmap { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\n\n\n\n\n\n\nNote\n\n\n\nMake sure that you're using a tuple with parentheses e.g. (one, two) rather than a List e.g. [one, two]\n\n\nIf we move back to the previous method, but decided that the ‘rep’ prefix on the replicate should be removed, we can use regular expressions to simply “subtract” pieces of a string. Here we remove a ‘rep’ prefix from the replicate variable if the prefix is present:\nmap { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n replicate -= ~/^rep/\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\nBy setting up our the “meta”, in our tuple with the format above, allows us to access the values in “sample” throughout our modules/configs as ${meta.sample}." 
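As a quick sanity check before launching anything, you can confirm the loaded tools respond (assuming the modules listed above have been loaded):
nextflow -version
singularity --version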
+ "objectID": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", + "href": "workshops/8.1_scatter_gather_output.html#rna-seq-workflow-and-module-files", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "RNA-seq Workflow and Module Files ", + "text": "RNA-seq Workflow and Module Files \nPreviously, we created three Nextflow files and one config file:\n├── nextflow.config\n├── rnaseq.nf\n├── modules.nf\n└── modules\n └── trimgalore.nf\n\nrnaseq.nf: main workflow script where parameters are defined and processes were called.\n\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nmodules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nmodules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\n\nnextflow.config: config file that enables singularity\n\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nRun the pipeline, specifying --outdir:\n>>> nextflow run rnaseq.nf --outdir output\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d\nexecutor > local (16)\n[93/d37ef0] process > INDEX [100%] 1 of 1 ✔\n[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔\n[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔" }, { - "objectID": "workshops/7.1_metadata_proprogation.html#second-parse-.splitcsv", - "href": "workshops/7.1_metadata_proprogation.html#second-parse-.splitcsv", - "title": "Nextflow Development - Metadata Proprogation", - "section": "Second Parse: .splitCsv", - "text": "Second Parse: .splitCsv\nWe have briefly touched on .splitCsv in the first week.\nAs a quick overview\nAssuming we have the samplesheet\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nWe can set up a workflow to read in these files as:\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\n\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\n\n\n\n\n\n\nChallenge\n\n\n\nUsing .splitCsv and .map read in the samplesheet below: /home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv\nSet the meta to contain the following keys from the header id, repeat and type\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nparams.input = \"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv\"\n\nch_sheet = Channel.fromPath(params.input)\n\nch_sheet.splitCsv(header:true)\n .map {\n it ->\n [[it.id, it.repeat, it.type], it.fastq_1, it.fastq_2]\n }.view()" + "objectID": "workshops/8.1_scatter_gather_output.html#organise-outputs", + "href": "workshops/8.1_scatter_gather_output.html#organise-outputs", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1. Organise outputs", + "text": "8.1. Organise outputs\nThe output declaration block defines the channels used by the process to send out the results produced. 
However, this output only stays in the work/ directory if there is no publishDir directive specified.\nGiven each task is being executed in separate temporary work/ folder (e.g., work/f1/850698…), you may want to save important, non-intermediary, and/or final files in a results folder.\nTo store our workflow result files, you need to explicitly mark them using the directive publishDir in the process that’s creating the files. For example:\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nThe above example will copy all html files created by the MULTIQC process into the directory path specified in the params.outdir" }, { - "objectID": "workshops/7.1_metadata_proprogation.html#manipulating-metadata-and-channels", - "href": "workshops/7.1_metadata_proprogation.html#manipulating-metadata-and-channels", - "title": "Nextflow Development - Metadata Proprogation", - "section": "7.2 Manipulating Metadata and Channels", - "text": "7.2 Manipulating Metadata and Channels\nThere are a number of use cases where we will be interested in manipulating our metadata and channels.\nHere we will look at 2 use cases.\n\n7.2.1 Matching input channels\nAs we have seen in examples/challenges in the operators section, it is important to ensure that the format of the channels that you provide as inputs match the process definition.\nparams.reads = \"/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/*_{1,2}.fq\"\n\nprocess printNumLines {\n input:\n path(reads)\n\n output:\n path(\"*txt\")\n\n script:\n \"\"\"\n wc -l ${reads}\n \"\"\"\n}\n\nworkflow {\n ch_input = Channel.fromFilePairs(\"$params.reads\")\n printNumLines( ch_input )\n}\nAs if the format does not match you will see and error similar to below:\n[myeung@papr-res-compute204 lesson7.1test]$ nextflow run test.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `test.nf` [agitated_faggin] DSL2 - revision: c210080493\n[- ] process > printNumLines -\nor if using nf-core template\nERROR ~ Error executing process > 'PMCCCGTRC_UMIHYBCAP:UMIHYBCAP:PREPARE_GENOME:BEDTOOLS_SLOP'\n\nCaused by:\n Not a valid path value type: java.util.LinkedHashMap ([id:genome_size])\n\n\nTip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`\n\n -- Check '.nextflow.log' file for details\nWhen encountering these errors there are two methods to correct this:\n\nChange the input definition in the process\nUse variations of the channel operators to correct the format of your channel\n\nThere are cases where changing the input definition is impractical (i.e. 
when using nf-core modules/subworkflows).\nLet’s take a look at some select modules.\nBEDTOOLS_SLOP\nBEDTOOLS_INTERSECT\n\n\n\n\n\n\nChallenge\n\n\n\nAssuming that you have the following inputs\nch_target = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed\")\nch_bait = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed\").map { fn -> [ [id: fn.baseName ], fn ] }\nch_sizes = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes\")\nWrite a mini workflow that:\n\nTakes the ch_target bedfile and extends the bed by 20bp on both sides using BEDTOOLS_SLOP (You can use the config definition below as a helper, or write your own as an additional challenge)\nTake the output from BEDTOOLS_SLOP and input this output with the ch_baits to BEDTOOLS_INTERSECT\n\nHINT: The modules can be imported from this location: /home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools\nHINT: You will need need the following operators to achieve this .map and .combine\n\n\n\n\n\n\n\n\nConfig\n\n\n\n\n\n\nprocess {\n withName: 'BEDTOOLS_SLOP' {\n ext.args = \"-b 20\"\n ext.prefix = \"extended.bed\"\n }\n\n withName: 'BEDTOOLS_INTERSECT' {\n ext.prefix = \"intersect.bed\"\n }\n}\n:::\n\n:::{.callout-caution collapse=\"true\"}\n## **Solution**\n```default\ninclude { BEDTOOLS_SLOP } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/slop/main'\ninclude { BEDTOOLS_INTERSECT } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/intersect/main'\n\n\nch_target = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed\")\nch_bait = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed\").map { fn -> [ [id: fn.baseName ], fn ] }\nch_sizes = Channel.fromPath(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes\")\n\nworkflow {\n BEDTOOLS_SLOP ( ch_target.map{ fn -> [ [id:fn.baseName], fn ]}, ch_sizes)\n\n target_bait_bed = BEDTOOLS_SLOP.out.bed.combine( ch_bait )\n BEDTOOLS_INTERSECT( target_bait_bed, ch_sizes.map{ fn -> [ [id: fn.baseName], fn]} )\n}\nnextflow run nfcoretest.nf -profile singularity -c test2.config --outdir nfcoretest" + "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", + "href": "workshops/8.1_scatter_gather_output.html#store-outputs-matching-a-glob-pattern", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1.1. Store outputs matching a glob pattern", + "text": "8.1.1. Store outputs matching a glob pattern\nYou can use more than one publishDir to keep different outputs in separate directories. 
For each directive specify a different glob pattern using the pattern option to store into each directory only the files that match the provided pattern.\nFor example:\nreads_ch = Channel.fromFilePairs('data/ggal/*_{1,2}.fq')\n\nprocess FOO {\n publishDir \"results/bam\", pattern: \"*.bam\"\n publishDir \"results/bai\", pattern: \"*.bai\"\n\n input:\n tuple val(sample_id), path(sample_id_paths)\n\n output:\n tuple val(sample_id), path(\"*.bam\")\n tuple val(sample_id), path(\"*.bai\")\n\n script:\n \"\"\"\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bam\n echo your_command_here --sample $sample_id_paths > ${sample_id}.bai\n \"\"\"\n}\nExercise\nUse publishDir and pattern to keep the outputs from the trimgalore.nf into separate directories.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n publishDir \"$params.outdir/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"$params.outdir/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nOutput should now look like\n>>> tree ./output\n./output\n├── gut.html\n├── liver.html\n├── lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n2 directories, 15 files" }, { - "objectID": "workshops/7.1_metadata_proprogation.html#grouping-with-metadata", - "href": "workshops/7.1_metadata_proprogation.html#grouping-with-metadata", - "title": "Nextflow Development - Metadata Proprogation", - "section": "7.3 Grouping with Metadata", - "text": "7.3 Grouping with Metadata\nEarlier we introduced the function groupTuple\n\nch_reads = Channel.fromFilePairs(\"/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz\")\n .map { id, reads ->\n (sample, replicate, type) = id.tokenize(\"_\")\n replicate -= ~/^rep/\n meta = [sample:sample, replicate:replicate, type:type]\n [meta, reads]\n}\n\n## Assume that we want to drop replicate from the meta and combine fastqs\n\nch_reads.map {\n meta, reads -> \n [ meta - meta.subMap('replicate') + [data_type: 'fastq'], reads ]\n }\n .groupTuple().view()" + "objectID": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", + "href": "workshops/8.1_scatter_gather_output.html#store-outputs-renaming-files-or-in-a-sub-directory", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.1.2. Store outputs renaming files or in a sub-directory", + "text": "8.1.2. 
Store outputs renaming files or in a sub-directory\nThe publishDir directive also allows the use of the saveAs option to give each file a name of your choice, providing a custom rule as a closure.\nprocess foo {\n publishDir 'results', saveAs: { filename -> \"foo_$filename\" }\n\n output: \n path '*.txt'\n\n '''\n touch this.txt\n touch that.txt\n '''\n}\nThe same pattern can be used to store specific files in separate directories depending on the actual name.\nprocess foo {\n publishDir 'results', saveAs: { filename -> filename.endsWith(\".zip\") ? \"zips/$filename\" : filename }\n\n output: \n path '*'\n\n '''\n touch this.txt\n touch that.zip\n '''\n}\nExercise\nModify the MULTIQC output with saveAs such that the resulting folder is as follows:\n./output\n├── MultiQC\n│   ├── multiqc_gut.html\n│   ├── multiqc_liver.html\n│   └── multiqc_lung.html\n├── report\n│   ├── gut_1.fq_trimming_report.txt\n│   ├── gut_2.fq_trimming_report.txt\n│   ├── liver_1.fq_trimming_report.txt\n│   ├── liver_2.fq_trimming_report.txt\n│   ├── lung_1.fq_trimming_report.txt\n│   └── lung_2.fq_trimming_report.txt\n└── trimmed_fastq\n ├── gut_1_val_1.fq.gz\n ├── gut_2_val_2.fq.gz\n ├── liver_1_val_1.fq.gz\n ├── liver_2_val_2.fq.gz\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n3 directories, 15 files\n\n\n\n\n\n\nWarning\n\n\n\nYou need to remove the existing output folder/files if you want to have a clean output. By default, Nextflow will overwrite existing files, and keep all the remaining files in the same specified output directory.\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? \"MultiQC/multiqc_$filename\" : filename }\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\n\n\nChallenge\nModify all the processes in rnaseq.nf such that we will have the following output structure\n./output\n├── gut\n│   ├── QC\n│   │   ├── fastqc_gut_logs\n│   │   │   ├── gut_1_fastqc.html\n│   │   │   ├── gut_1_fastqc.zip\n│   │   │   ├── gut_2_fastqc.html\n│   │   │   └── gut_2_fastqc.zip\n│   │   └── gut.html\n│   ├── report\n│   │   ├── gut_1.fq_trimming_report.txt\n│   │   └── gut_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── gut_1_val_1.fq.gz\n│   └── gut_2_val_2.fq.gz\n├── liver\n│   ├── QC\n│   │   ├── fastqc_liver_logs\n│   │   │   ├── liver_1_fastqc.html\n│   │   │   ├── liver_1_fastqc.zip\n│   │   │   ├── liver_2_fastqc.html\n│   │   │   └── liver_2_fastqc.zip\n│   │   └── liver.html\n│   ├── report\n│   │   ├── liver_1.fq_trimming_report.txt\n│   │   └── liver_2.fq_trimming_report.txt\n│   └── trimmed_fastq\n│   ├── liver_1_val_1.fq.gz\n│   └── liver_2_val_2.fq.gz\n└── lung\n ├── QC\n │   ├── fastqc_lung_logs\n │   │   ├── lung_1_fastqc.html\n │   │   ├── lung_1_fastqc.zip\n │   │   ├── lung_2_fastqc.html\n │   │   └── lung_2_fastqc.zip\n │   └── lung.html\n ├── report\n │   ├── lung_1.fq_trimming_report.txt\n │   └── lung_2.fq_trimming_report.txt\n └── trimmed_fastq\n ├── lung_1_val_1.fq.gz\n └── lung_2_val_2.fq.gz\n\n15 directories, 27 files\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess FASTQC {\n publishDir \"$params.outdir/$sample_id/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n //publishDir params.outdir, mode:'copy', saveAs: { filename -> filename.endsWith(\".html\") ? \"MultiQC/multiqc_$filename\" : filename }\n publishDir \"$params.outdir/$quantification/QC\", mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img'\n publishDir \"${params.outdir}/${sample_id}/report\", mode: \"copy\", pattern:\"*report.txt\"\n publishDir \"${params.outdir}/${sample_id}/trimmed_fastq\", mode: \"copy\", pattern:\"*fq.gz\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}"
  },
  {
    "objectID": "workshops/8.1_scatter_gather_output.html#scatter",
    "href": "workshops/8.1_scatter_gather_output.html#scatter",
    "title": "Nextflow Development - Outputs, Scatter, and Gather",
    "section": "8.2 Scatter",
    "text": "8.2 Scatter\nThe scatter operation involves distributing large input data into smaller chunks that can be analysed across multiple processes in parallel.\nOne very simple example of native scatter is how Nextflow handles channel factories with the Channel.fromPath or Channel.fromFilePairs method, where multiple input files are processed in parallel.\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { FASTQC as FASTQC_one } from './modules.nf'\n\nworkflow {\n fastqc_ch = FASTQC_one(reads_ch)\n}\nFrom the above snippet from our rnaseq.nf, we will get three executions of FASTQC_one, one for each pair of our input data.\nOther than natively splitting execution by input data, Nextflow also provides operators to scatter existing input data for various benefits, such as faster processing. 
For example:\n\nsplitText\nsplitFasta\nsplitFastq\nmap with from or fromList\nflatten" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", + "href": "workshops/8.1_scatter_gather_output.html#process-per-file-chunk", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.2.1 Process per file chunk", + "text": "8.2.1 Process per file chunk\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess count_line {\n debug true\n input: \n file x\n\n script:\n \"\"\"\n wc -l $x \n \"\"\"\n}\n\nworkflow {\n Channel.fromPath(params.infile) \\\n | splitText(by: params.size, file: true) \\\n | count_line\n}\nExercise\nparams.infile = \"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.size = 1000\n\nworkflow {\n Channel.fromFilePairs(params.infile, flat: true) \\\n | splitFastq(by: params.size, pe: true, file: true) \\\n | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-per-file-range", + "href": "workshops/8.1_scatter_gather_output.html#process-per-file-range", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.2.1 Process per file range", + "text": "8.2.1 Process per file range\nExercise\nChannel.from(1..22) \\\n | map { chr -> [\"sample${chr}\", file(\"${chr}.indels.vcf\"), file(\"${chr}.vcf\")] } \\\n | view()\n>> nextflow run test_scatter.nf\n\n[sample1, /scratch/users/${users}/1.indels.vcf, /scratch/users/${users}/1.vcf]\n[sample2, /scratch/users/${users}/2.indels.vcf, /scratch/users/${users}/2.vcf]\n[sample3, /scratch/users/${users}/3.indels.vcf, /scratch/users/${users}/3.vcf]\n[sample4, /scratch/users/${users}/4.indels.vcf, /scratch/users/${users}/4.vcf]\n[sample5, /scratch/users/${users}/5.indels.vcf, /scratch/users/${users}/5.vcf]\n[sample6, /scratch/users/${users}/6.indels.vcf, /scratch/users/${users}/6.vcf]\n[sample7, /scratch/users/${users}/7.indels.vcf, /scratch/users/${users}/7.vcf]\n[sample8, /scratch/users/${users}/8.indels.vcf, /scratch/users/${users}/8.vcf]\n[sample9, /scratch/users/${users}/9.indels.vcf, /scratch/users/${users}/9.vcf]\n[sample10, /scratch/users${users}/10.indels.vcf, /scratch/users${users}/10.vcf]\n[sample11, /scratch/users${users}/11.indels.vcf, /scratch/users${users}/11.vcf]\n[sample12, /scratch/users${users}/12.indels.vcf, /scratch/users${users}/12.vcf]\n[sample13, /scratch/users${users}/13.indels.vcf, /scratch/users${users}/13.vcf]\n[sample14, /scratch/users${users}/14.indels.vcf, /scratch/users${users}/14.vcf]\n[sample15, /scratch/users${users}/15.indels.vcf, /scratch/users${users}/15.vcf]\n[sample16, /scratch/users${users}/16.indels.vcf, /scratch/users${users}/16.vcf]\n[sample17, /scratch/users${users}/17.indels.vcf, /scratch/users${users}/17.vcf]\n[sample18, /scratch/users${users}/18.indels.vcf, /scratch/users${users}/18.vcf]\n[sample19, /scratch/users${users}/19.indels.vcf, /scratch/users${users}/19.vcf]\n[sample20, /scratch/users${users}/20.indels.vcf, /scratch/users${users}/20.vcf]\n[sample21, /scratch/users${users}/21.indels.vcf, /scratch/users${users}/21.vcf]\n[sample22, /scratch/users${users}/22.indels.vcf, /scratch/users${users}/22.vcf]\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > 
${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22)) | view()\n}\nChallenge\nHow do we include chr X and Y into the above split by chromosome?\n\n\n\n\n\n\nSolution\n\n\n\n\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#gather", + "href": "workshops/8.1_scatter_gather_output.html#gather", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3 Gather", + "text": "8.3 Gather\nThe gather operation consolidates results from parallel computations (can be from scatter) into a centralized process for aggregation and further processing.\nSome of the Nextflow provided operators that facilitate this gather operation, include:\n\ncollect\ncollectFile\nmap + groupTuple" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", + "href": "workshops/8.1_scatter_gather_output.html#process-all-outputs-altogether", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3.1. Process all outputs altogether", + "text": "8.3.1. Process all outputs altogether\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collect | view()\n}" + }, + { + "objectID": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", + "href": "workshops/8.1_scatter_gather_output.html#collect-outputs-into-a-file", + "title": "Nextflow Development - Outputs, Scatter, and Gather", + "section": "8.3.2. Collect outputs into a file", + "text": "8.3.2. Collect outputs into a file\nExercise\nparams.infile = \"/data/reference/bed_files/Agilent_CRE_v2/S30409818_Covered_MERGED.bed\"\nparams.size = 100000\n\nprocess split_bed_by_chr {\n debug true\n\n input:\n path bed\n val chr\n\n output:\n path \"*.bed\"\n\n script:\n \"\"\"\n grep ^${chr}\\t ${bed} > ${chr}.bed\n \"\"\"\n}\n\nworkflow {\n split_bed_by_chr(params.infile, Channel.from(1..22,'X','Y').flatten()) | collectFile(name: 'merged.bed', newLine:true) | view()\n}\nExercise\nworkflow {\n Channel.fromPath(\"/scratch/users/rlupat/nfWorkshop/dev1/training/nf-training/data/ggal/*_1.fq\", checkIfExists: true) \\\n | collectFile(name: 'combined_1.fq', newLine:true) \\\n | view\n}" + }, + { + "objectID": "workshops/6.1_operators.html", + "href": "workshops/6.1_operators.html", + "title": "Nextflow Development - Channel Operators", "section": "", - "text": "Objectives\n\n\n\n\nLearn about the core features of nf-core.\nLearn the terminology used by nf-core.\nUse Nextflow to pull and run the nf-core/testpipeline workflow\n\n\n\nIntroduction to nf-core: Introduce nf-core features and concepts, structures, tools, and example nf-core pipelines\n\n1.2.1. What is nf-core?\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. 
These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS\n\n\n\n\n1.2.2. Executing an nf-core workflow\nThe nf-core website has a full list of workflows and associated documentation to be explored.\nEach workflow has a dedicated page that includes expansive documentation that is split into 7 sections:\n\nIntroduction\n\nAn introduction and overview of the workflow\n\nResults\n\nExample output files generated from the full test dataset\n\nUsage docs\n\nDescriptions of how to execute the workflow\n\nParameters\n\nGrouped workflow parameters with descriptions\n\nOutput docs\n\nDescriptions and examples of the expected output files\n\nReleases & Statistics\n\nWorkflow version history and statistics\n\n\nAs nf-core is a community development project, the code for a pipeline can be changed at any time. To ensure that you have locked in a specific version of a pipeline, you can use Nextflow’s built-in functionality to pull a workflow. The Nextflow pull command can download and cache workflows from GitHub repositories:\nnextflow pull nf-core/<pipeline>\nNextflow run will also automatically pull the workflow if it was not already available locally:\nnextflow run nf-core/<pipeline>\nNextflow will pull the default git branch if a workflow version is not specified. This will be the master branch for nf-core workflows with a stable release. nf-core workflows use GitHub releases to tag stable versions of the code and software. 
You will always be able to execute a previous version of a workflow once it is released using the -revision or -r flag.\nFor this section of the workshop we will be using the nf-core/testpipeline as an example.\nAs we will be running some bioinformatics tools, we will need to make sure of the following:\n\nWe are not running on a login node\nThe singularity module is loaded (module load singularity/3.7.3)\n\n\n\n\n\n\n\nSetup an interactive session\n\n\n\nsrun --pty -p prod_short --mem 20GB --cpus-per-task 2 -t 0-2:00 /bin/bash\n\nEnsure the required modules are loaded\nmodule list\nCurrently Loaded Modulefiles:\n 1) java/jdk-17.0.6 2) nextflow/23.04.1 3) squashfs-tools/4.5 4) singularity/3.7.3\n\n\n\nWe will also create a separate output directory for this section.\ncd /scratch/users/<your-username>/nfWorkshop; mkdir ./lesson1.2 && cd $_\nThe base command we will be using for this section is:\nnextflow run nf-core/testpipeline -profile test,singularity --outdir my_results\n\n\n1.2.3. Workflow structure\nnf-core workflows start from a common template and follow the same structure. Although you won’t need to edit code in the workflow project directory, having a basic understanding of the project structure and some core terminology will help you understand how to configure its execution.\nLet’s take a look at the code for the nf-core/rnaseq pipeline.\nNextflow DSL2 workflows are built up of subworkflows and modules that are stored as separate .nf files.\nMost nf-core workflows consist of a single workflow file (there are a few exceptions). This is the main <workflow>.nf file that is used to bring everything else together. Instead of having one large monolithic script, it is broken up into a combination of subworkflows and modules.\nA subworkflow is a group of modules that are used in combination with each other and have a common purpose. Subworkflows improve workflow readability and help with the reuse of modules within a workflow. The nf-core community also shares subworkflows in the nf-core subworkflows GitHub repository. Local subworkflows are workflow-specific and are not shared in the nf-core subworkflows repository.\nLet’s take a look at the BAM_STATS_SAMTOOLS subworkflow.\nThis subworkflow is comprised of the following modules: - SAMTOOLS_STATS - SAMTOOLS_IDXSTATS, and - SAMTOOLS_FLAGSTAT\nA module is a wrapper for a process; most modules will execute a single tool and contain the following definitions: - inputs - outputs, and - script block.\nLike subworkflows, modules can also be shared in the nf-core modules GitHub repository or stored as a local module. All modules from the nf-core repository are version controlled and tested to ensure reproducibility. Local modules are workflow-specific and are not shared in the nf-core modules repository.\n\n\n1.2.4. Viewing parameters\nEvery nf-core workflow has a full list of parameters on the nf-core website. When viewing these parameters online, you will also be shown a description and the type of the parameter. 
Some parameters will have additional text to help you understand when and how a parameter should be used.\n\n\n\n\n\nParameters and their descriptions can also be viewed in the command line using the run command with the --help parameter:\nnextflow run nf-core/<workflow> --help\n\n\n\n\n\n\nChallenge\n\n\n\nView the parameters for the nf-core/testpipeline workflow using the command line:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe nf-core/testpipeline workflow parameters can be printed using the run command and the --help option:\nnextflow run nf-core/testpipeline --help\n\n\n\n\n\n1.2.5. Parameters in the command line\nParameters can be customized using the command line. Any parameter can be configured on the command line by prefixing the parameter name with a double dash (--):\nnextflow run nf-core/<workflow> --<parameter>\n\n\n\n\n\n\nTip\n\n\n\nNextflow options are prefixed with a single dash (-) and workflow parameters are prefixed with a double dash (--).\n\n\nDepending on the parameter type, you may be required to add additional information after your parameter flag. For example, for a string parameter, you would add the string after the parameter flag:\nnextflow run nf-core/<workflow> --<parameter> string\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite animal using the multiqc_title parameter using a command line flag:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nAdd the --multiqc_title flag to your command and execute it. Use the -resume option to save time:\nnextflow run nf-core/testpipeline -profile test,singularity --multiqc_title koala --outdir my_results -resume\n\n\n\nIn this example, you can check your parameter has been applied by listing the files created in the results folder (my_results):\nls my_results/multiqc/\n\n\n1.2.6. Configuration files\nConfiguration files are .config files that can contain various workflow properties. Custom paths passed in the command-line using the -c option:\nnextflow run nf-core/<workflow> -profile test,docker -c <path/to/custom.config>\nMultiple custom .config files can be included at execution by separating them with a comma (,).\nCustom configuration files follow the same structure as the configuration file included in the workflow directory. Configuration properties are organized into scopes by grouping the properties in the same scope using the curly brackets notation. For example:\nalpha {\n x = 1\n y = 'string value..'\n}\nScopes allow you to quickly configure settings required to deploy a workflow on different infrastructure using different software management. For example, the executor scope can be used to provide settings for the deployment of a workflow on a HPC cluster. Similarly, the singularity scope controls how Singularity containers are executed by Nextflow. Multiple scopes can be included in the same .config file using a mix of dot prefixes and curly brackets. 
A full list of scopes is described in detail here.\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite color using the multiqc_title parameter in a custom my_custom.config file:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nCreate a custom my_custom.config file that contains your favourite colour, e.g., blue:\nparams {\n multiqc_title = \"blue\"\n}\nInclude the custom .config file in your execution command with the -c option:\nnextflow run nf-core/testpipeline --outdir my_results -profile test,singularity -resume -c my_custom.config\nCheck that it has been applied:\nls my_results/multiqc/\nWhy did this fail?\nYou can not use the params scope in custom configuration files. Parameters can only be configured using the -params-file option and the command line. While parameter is listed as a parameter on the STDOUT, it was not applied to the executed command.\nWe will revisit this at the end of the module\n\n\n\n\n\n1.2.7 Parameter files\nParameter files are used to define params options for a pipeline, generally written in the YAML format. They are added to a pipeline with the flag --params-file\nExample YAML:\n\"<parameter1_name>\": 1,\n\"<parameter2_name>\": \"<string>\",\n\"<parameter3_name>\": true\n\n\n\n\n\n\nChallenge\n\n\n\nBased on the failed application of the parameter multiqc_title create a my_params.yml setting multiqc_title to your favourite colour. Then re-run the pipeline with the your my_params.yml\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nSet up my_params.yml\nmultiqc_title: \"black\"\nnextflow run nf-core/testpipeline -profile test,singularity --params-file my_params.yml --outdir Lesson1_2\n\n\n\n\n\n1.2.8. Default configuration files\nAll parameters will have a default setting that is defined using the nextflow.config file in the workflow project directory. By default, most parameters are set to null or false and are only activated by a profile or configuration file.\nThere are also several includeConfig statements in the nextflow.config file that are used to load additional .config files from the conf/ folder. Each additional .config file contains categorized configuration information for your workflow execution, some of which can be optionally included:\n\nbase.config\n\nIncluded by the workflow by default.\nGenerous resource allocations using labels.\nDoes not specify any method for software management and expects software to be available (or specified elsewhere).\n\nigenomes.config\n\nIncluded by the workflow by default.\nDefault configuration to access reference files stored on AWS iGenomes.\n\nmodules.config\n\nIncluded by the workflow by default.\nModule-specific configuration options (both mandatory and optional).\n\n\nNotably, configuration files can also contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a workflow by using the -profile command option:\nnextflow run nf-core/<workflow> -profile <profile>\nProfiles used by nf-core workflows include:\n\nSoftware management profiles\n\nProfiles for the management of software using software management tools, e.g., docker, singularity, and conda.\n\nTest profiles\n\nProfiles to execute the workflow with a standardized set of test data and parameters, e.g., test and test_full.\n\n\nMultiple profiles can be specified in a comma-separated (,) list when you execute your command. 
The order of profiles is important as they will be read from left to right:\nnextflow run nf-core/<workflow> -profile test,singularity\nnf-core workflows are required to define software containers and conda environments that can be activated using profiles.\n\n\n\n\n\n\nTip\n\n\n\nIf your computer has internet access and one of Conda, Singularity, or Docker installed, you should be able to run any nf-core workflow with the test profile and the respective software management profile ‘out of the box’. The test data profile will pull small test files directly from the nf-core/test-data GitHub repository and run it on your local system. The test profile is an important control to check the workflow is working as expected and is a great way to trial a workflow. Some workflows have multiple test profiles for you to test.\n\n\n\n\n\n\n\n\nKey points\n\n\n\n\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nNextflow can be used to pull nf-core workflows.\nnf-core workflows follow similar structures\nnf-core workflows are configured using parameters and profiles\n\n\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub"
  },
  {
    "objectID": "workshops/6.1_operators.html",
    "href": "workshops/6.1_operators.html",
    "title": "Nextflow Development - Channel Operators",
    "section": "",
    "text": "Objectives\n\n\n\n\nGain an understanding of Nextflow channel operators"
  },
  {
    "objectID": "workshops/6.1_operators.html#environment-setup",
    "href": "workshops/6.1_operators.html#environment-setup",
    "title": "Nextflow Development - Channel Operators",
    "section": "Environment Setup",
    "text": "Environment Setup\nSet up an interactive shell to run our Nextflow workflow:\nsrun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash\nLoad the required modules to run Nextflow:\nmodule load nextflow/23.04.1\nmodule load singularity/3.7.3\nSet the singularity cache environment variable:\nexport NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow\nSingularity images downloaded by workflow executions will now be stored in this directory.\nYou may want to include these, or other environment variables, in your .bashrc file (or alternate) that is loaded when you log in so you don’t need to export variables every session. 
A complete list of environment variables can be found here.\nThe training data can be cloned from:\ngit clone https://github.com/nextflow-io/training.git" + }, + { + "objectID": "workshops/6.1_operators.html#rna-seq-workflow-and-module-files", + "href": "workshops/6.1_operators.html#rna-seq-workflow-and-module-files", + "title": "Nextflow Development - Channel Operators", + "section": "RNA-seq Workflow and Module Files ", + "text": "RNA-seq Workflow and Module Files \nPreviously, we created three Nextflow files and one config file:\n├── nextflow.config\n├── rnaseq.nf\n├── modules.nf\n└── modules\n └── trimgalore.nf\n\nrnaseq.nf: main workflow script where parameters are defined and processes were called.\n\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\n\ninclude { INDEX } from './modules.nf'\ninclude { QUANTIFICATION as QT } from './modules.nf'\ninclude { FASTQC as FASTQC_one } from './modules.nf'\ninclude { FASTQC as FASTQC_two } from './modules.nf'\ninclude { MULTIQC } from './modules.nf'\ninclude { TRIMGALORE } from './modules/trimgalore.nf'\n\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n quant_ch = QT(index_ch, reads_ch)\n fastqc_ch = FASTQC_one(reads_ch)\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\nmodules.nf: script containing the majority of modules, including INDEX, QUANTIFICATION, FASTQC, and MULTIQC\n\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\n\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\n\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . 
--filename $quantification\n \"\"\"\n}\n\nmodules/trimgalore.nf: script inside a modules folder, containing only the TRIMGALORE process\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\n\nnextflow.config: config file that enables singularity\n\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nRun the pipeline, specifying --outdir:\n>>> nextflow run rnaseq.nf --outdir output\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [soggy_jennings] DSL2 - revision: 87afc1d98d\nexecutor > local (16)\n[93/d37ef0] process > INDEX [100%] 1 of 1 ✔\n[b3/4c4d9c] process > QT (1) [100%] 3 of 3 ✔\n[d0/173a6e] process > FASTQC_one (3) [100%] 3 of 3 ✔\n[58/0b8af2] process > TRIMGALORE (3) [100%] 3 of 3 ✔\n[c6/def175] process > FASTQC_two (3) [100%] 3 of 3 ✔\n[e0/bcf904] process > MULTIQC (3) [100%] 3 of 3 ✔"
  },
  {
    "objectID": "workshops/6.1_operators.html#map",
    "href": "workshops/6.1_operators.html#map",
    "title": "Nextflow Development - Channel Operators",
    "section": "6.1.1 map ",
    "text": "6.1.1 map \nThe map operator applies a mapping function to each item in a channel. This function is expressed using the Groovy closure { }.\nChannel\n .of('hello', 'world')\n .map { word -> \n def word_size = word.size()\n [word, word_size] \n }\n .view()\nIn this example, a channel containing the strings hello and world is created.\nInside the map operator, the local variable word is declared, and used to represent each input value that is passed to the function, i.e. each element in the channel, hello and world.\nThe map operator ‘loops’ through each element in the channel and assigns that element to the local variable word. A new local variable word_size is defined inside the map function, and calculates the length of the string using size(). 
Finally, a tuple is returned, where the first element is the string represented by the local word variable, and the second element is the length of the string, represented by the local word_size variable.\nOutput:\n[hello, 5]\n[world, 5]\nFor our RNA-seq pipeline, let’s first create separate transcriptome files for each organ: lung.transcriptome.fa, liver.transcriptome.fa, gut.transcriptome.fa\ncp \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/lung.transcriptome.fa\"\ncp \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/liver.transcriptome.fa\"\nmv \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\" \"/scratch/users/.../training/nf-training/data/ggal/gut.transcriptome.fa\"\nEnsure transcriptome.fa no longer exists:\n>>> ls /scratch/users/.../training/nf-training/data/ggal/\ngut_1.fq\ngut_2.fq\ngut.transcriptome.fa\nliver_1.fq\nliver_2.fq\nliver.transcriptome.fa\nlung_1.fq\nlung_2.fq\nlung.transcriptome.fa\nExercise\nCurrently in the rnaseq.nf script, we define the transcriptome_file parameter to be a single file.\nparams.transcriptome_file = \"/scratch/users/.../training/nf-training/data/ggal/transcriptome.fa\"\nSet the transcriptome_file parameter to match all three .fa files using a glob path matcher.\nUse the fromPath channel factory to read in the transcriptome files, and the map operator to create a tuple where the first element is the sample (organ type) of the .fa, and the second element is the path of the .fa file. Assign the final output to be a channel called transcriptome_ch.\nThe getSimpleName() Groovy method can be used to extract the sample name from our .fa file, for example:\ndef sample = fasta.getSimpleName()\nUse the view() channel operator to view the transcriptome_ch channel. The expected output:\n[lung, /scratch/users/.../training/nf-training/data/ggal/lung.transcriptome.fa]\n[liver, /scratch/users/.../training/nf-training/data/ggal/liver.transcriptome.fa]\n[gut, /scratch/users/.../training/nf-training/data/ggal/gut.transcriptome.fa]\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe transcriptome_file parameter is defined using *, using glob to match all three .fa files. The fromPath channel factory is used to read the .fa files, and the map operator is used to create the tuple.\nIn the map function, the variable file was chosen to represent each element that is passed to the function. The function emits a tuple where the first element is the sample name, returned by the getSimpleName() method, and the second element is the .fa file path.\nparams.transcriptome_file = \"/scratch/users/.../nf-training/data/ggal/*.fa\"\n\ntranscriptome_ch = Channel.fromPath(\"$params.transcriptome_file\")\n .map { fasta -> \n def sample = fasta.getSimpleName()\n [sample, fasta]\n }\n .view()\n\n\n\n\nChallenge\nModify the INDEX process to match the input structure of transcriptome_ch. Modify the output of INDEX so that a tuple is emitted, where the first element is the value of the grouping key, and the second element is the path of the salmon_idx folder.\nIndex the transcriptome_ch using the INDEX process. 
Emit the output as index_ch.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe input is now defined to be a tuple of two elements, where the first element is the grouping key and the second element is the path of the transcriptome file.\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(transcriptome)\n\n output:\n tuple val(sample_id), path(\"salmon_idx\")\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nInside the workflow block, transcriptome_ch is used as input into the INDEX process. The process outputs are emitted as index_ch\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n index_ch.view()\n}\nThe index_ch channel is now a tuple where the first element is the grouping key, and the second element is the path to the salmon index folder.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [dreamy_linnaeus] DSL2 - revision: b4ec1d02bd\n[21/91088a] process > INDEX (3) [100%] 3 of 3\n[liver, /scratch/users/.../work/06/f0a54ba9191cce9f73f5a97bfb7bea/salmon_idx]\n[lung, /scratch/users/.../work/60/e84b1b1f06c43c8cf69a5c621d5a41/salmon_idx]\n[gut, /scratch/users/.../work/21/91088aafb553cb4b933bc2b3493f33/salmon_idx]\n\n\n\nCopy the new INDEX process into modules.nf. In the workflow block of rnaseq.nf, use transcriptome_ch as the input to the process INDEX." + }, + { + "objectID": "workshops/6.1_operators.html#combine", + "href": "workshops/6.1_operators.html#combine", + "title": "Nextflow Development - Channel Operators", + "section": "6.1.2 combine ", + "text": "6.1.2 combine \nThe combine operator produces the cross product (ie. outer product) combinations of two source channels.\nFor example: The words channel is combined with the numbers channel, emitting a channel where each element of numbers is paired with each element of words.\nnumbers = Channel.of(1, 2, 3)\nwords = Channel.of('hello', 'ciao')\n\nnumbers.combine(words).view()\nOutput:\n[1, hello]\n[2, hello]\n[3, hello]\n[1, ciao]\n[2, ciao]\n[3, ciao]\nThe by option can be used to combine items that share a matching key. This value is zero-based, and represents the index or list of indices for the grouping key. The emitted tuple will consist of multiple elements.\nFor example: source and target are channels consisting of multiple tuples, where the first element of each tuple represents the grouping key. Since indexing is zero-based, by is set to 0 to represent the first element of the tuple.\nsource = Channel.of( [1, 'alpha'], [2, 'beta'] )\ntarget = Channel.of( [1, 'x'], [1, 'y'], [1, 'z'], [2, 'p'], [2, 'q'], [2, 't'] )\n\nsource.combine(target, by: 0).view()\nEach value within the source and target channels are separate elements, resulting in the emitted tuple each containing 3 elements:\n[1, alpha, x]\n[1, alpha, y]\n[1, alpha, z]\n[2, beta, p]\n[2, beta, q]\n[2, beta, t]\nExercise\nIn our RNA-seq pipeline, create a channel quant_inputs_ch that contains the reads_ch combined with the index_ch via a matching key. 
The emitted channel should contain three elements, where the first element is the grouping key, the second element is the path to the salmon index folder, and the third element is a list of the .fq pairs.\nThe expected output:\n[liver, /scratch/users/.../work/cf/42458b80e050a466d62baf99d0c1cf/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/liver_1.fq, /scratch/users/.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, /scratch/users/.../work/64/90a77a5f1ed5a0000f6620fd1fab9a/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/lung_1.fq, /scratch/users/.../training/nf-training/data/ggal/lung_2.fq]]\n[gut, /scratch/users/.../work/37/352b00bfb71156a9250150428ddf1d/salmon_idx, [/scratch/users/.../training/nf-training/data/ggal/gut_1.fq, /scratch/users/.../training/nf-training/data/ggal/gut_2.fq]]\nUse quant_inputs_ch as the input for the QT process within the workflow block.\nModify the process such that the input will be a tuple consisting of three elements, where the first element is the grouping key, the second element is the salmon index, and the third element is the list of .fq reads. Also modify the output of the QT process to emit a tuple of two elements, where the first element is the grouping key and the second element is the $sample_id folder. Emit the process output as quant_ch in the workflow block.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe reads_ch is combined with the index_ch using the combine channel operator with by: 0, and is assigned to the channel quant_inputs_ch. The new quant_inputs_ch channel is input into the QT process.\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n\n quant_inputs_ch = index_ch.combine(reads_ch, by: 0)\n quant_ch = QT(quant_inputs_ch)\n}\nIn the QT process, the input has been modified to be a tuple of three elements - the first element is the grouping key, the second element is the path to the salmon index, and the third element is the list of .fq reads.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(salmon_index), path(reads)\n\n output:\n tuple val(sample_id), path(\"$sample_id\")\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}"
  },
  {
    "objectID": "workshops/6.1_operators.html#grouptuple",
    "href": "workshops/6.1_operators.html#grouptuple",
    "title": "Nextflow Development - Channel Operators",
    "section": "6.1.3 groupTuple ",
    "text": "6.1.3 groupTuple \nThe groupTuple operator collects tuples into groups based on a similar grouping key, emitting a new tuple for each distinct key. The groupTuple differs from the combine operator in that it is performed on one input channel, and the matching values are emitted as a list.\nChannel.of( [1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'] )\n .groupTuple()\n .view()\nOutput:\n[1, [A, B, C]]\n[2, [C, A]]\n[3, [B, D]]\nBy default, the first element of each tuple is used as the grouping key. The by option can be used to specify a different index. 
For example, to group by the second element of each tuple:\nChannel.of( [1, 'A'], [1, 'B'], [2, 'C'], [3, 'B'], [1, 'C'], [2, 'A'], [3, 'D'] )\n .groupTuple(by: 1)\n .view()\n[[1, 2], A]\n[[1, 3], B]\n[[2, 1], C]\n[[3], D]\n\nIn the workflow script rnaseq.nf we defined the reads parameter to be multiple paired .fq files that are created into a channel using the fromFilePairs channel factory. This created a tuple where the first element is a unique grouping key, created automatically based on similarities in file name, and the second element contains the list of paired files.\n#!/usr/bin/env nextflow\n\nparams.reads = \"/scratch/users/.../nf-training/data/ggal/*_{1,2}.fq\"\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nNow, move the /scratch/users/.../nf-training/data/ggal/lung_2.fq file into another directory so the folder contains one lung .fq file:\n>>> mv /scratch/users/.../training/nf-training/data/ggal/lung_2.fq .\n>>> ls /scratch/users/.../training/nf-training/data/ggal\ngut_1.fq\ngut_2.fq\ngut.transcriptome.fa\nliver_1.fq\nliver_2.fq\nliver.transcriptome.fa\nlung_1.fq\nlung.transcriptome.fa\nExercise\nUse the fromPath channel factory to read all .fq files as separate elements.\nThen, use map to create a mapping function that returns a tuple, where the first element is the grouping key, and the second element is the .fq file(s).\nThen, use groupTuple() to create channels containing both single and paired .fq files. Within the groupTuple() operator, set sort: true, which orders the groups numerically, ensuring the first .fq is first.\nExpected output:\n[lung, [/scratch/users/.../training/nf-training/data/ggal/lung_1.fq]]\n[gut, [/scratch/users/.../training/nf-training/data/ggal/gut_1.fq, /scratch/users/.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/scratch/users/.../training/nf-training/data/ggal/liver_1.fq, /scratch/users/.../training/nf-training/data/ggal/liver_2.fq]]\nInside the map function, the following can be used to extract the sample name from the .fq files. file is the local variable defined inside the function that represents each .fq file. The getName() method will return the file name without the full path, and replaceAll is used to remove the _2.fq and _1.fq file suffixes.\ndef group_key = file.getName().replaceAll(/_2.fq/,'').replaceAll(/_1.fq/,'')\nFor a full list of Nextflow file attributes, see here.\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe fromPath channel is used to read all .fq files separately. The map function is then used to create a two-element tuple where the first element is a grouping key and the second element is the list of .fq file(s).\nreads_ch = Channel.fromPath(\"/home/sli/nextflow_training/training/nf-training/data/ggal/*.fq\")\n .map { file ->\n def group_key = file.getName().replaceAll(/_2.fq/,'').replaceAll(/_1.fq/,'')\n [group_key, file]\n }\n .groupTuple(sort: true)\n .view()\n\n\n\nNow, run the workflow up to the combine step. 
The quant_inputs_ch should now consist of:\n[liver, /scratch/users/.../work/cf/42458b80e050a466d62baf99d0c1cf/salmon_idx, [/scratch/users/.../nf-training/data/ggal/liver_1.fq, /scratch/users/.../nf-training/data/ggal/liver_2.fq]]\n[lung, /scratch/users/.../work/64/90a77a5f1ed5a0000f6620fd1fab9a/salmon_idx, [/scratch/users/.../nf-training/data/ggal/lung_1.fq]]\n[gut, /scratch/users/.../work/37/352b00bfb71156a9250150428ddf1d/salmon_idx, [/scratch/users/.../nf-training/data/ggal/gut_1.fq, /scratch/users/.../nf-training/data/ggal/gut_2.fq]]"
  },
  {
    "objectID": "workshops/6.1_operators.html#flatten",
    "href": "workshops/6.1_operators.html#flatten",
    "title": "Nextflow Development - Channel Operators",
    "section": "6.1.4 flatten ",
    "text": "6.1.4 flatten \nThe flatten operator flattens each item from a source channel and emits the elements separately. Deeply nested inputs are also flattened.\nChannel.of( [1, [2, 3]], 4, [5, [6]] )\n .flatten()\n .view()\nOutput:\n1\n2\n3\n4\n5\n6\n\nWithin the script block of the QUANTIFICATION process in the RNA-seq pipeline, we are assuming the reads are paired, and specify -1 ${reads[0]} -2 ${reads[1]} as inputs to salmon quant.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n tuple val(sample_id), path(salmon_index), path(reads)\n\n output:\n tuple val(sample_id), path(\"$sample_id\")\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nNow that the input reads can be either single or paired, the QUANTIFICATION process needs to be modified to allow for either input type. This can be done using the flatten() operator, and conditional script statements. Additionally, the size() method can be used to calculate the size of a list.\nThe script block can be changed to the following:\n script:\n def input_reads = [reads]\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -r $reads -o $sample_id\n \"\"\"\n else \n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\nFirst, a new variable input_reads is defined, which consists of the reads input being converted into a list. This has to be done since Nextflow will automatically convert a list of length 1 into a path within a process. If the size() method was used on a path type input, it will return the size of the file in bytes, and not the list size. Therefore, all inputs must first be converted into a list in order to correctly calculate the number of files.\ndef input_reads = [reads]\nFor reads that are already in a list (i.e. paired reads), this will nest the input into another list, for example:\n[ [ file1, file2 ] ]\nIf the size() operator is used on this input, it will always return 1 since the encompassing list only contains one element. 
Therefore, the flatten() operator has to be used to emit the files as separate elements.\nThe final definition to obtain the number of files in reads becomes:\ninput_reads.flatten().size()\nFor single reads, the input to salmon quant becomes -r $reads\n\nExercise\nCurrently the TRIMGALORE process only accounts for paired reads.\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n}\nModify the process such that both single and paired reads can be used. For single reads, the following script block can be used:\n\"\"\"\ntrim_galore \\\\\n --gzip \\\\\n $reads\n\"\"\"\n\n\n\n\n\n\nSolution\n\n\n\n\n\nprocess TRIMGALORE {\n container '/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-trim-galore-0.6.6--0.img' \n\n input:\n tuple val(sample_id), path(reads)\n \n output:\n tuple val(sample_id), path(\"*{3prime,5prime,trimmed,val}*.fq.gz\"), emit: reads\n tuple val(sample_id), path(\"*report.txt\") , emit: log , optional: true\n tuple val(sample_id), path(\"*unpaired*.fq.gz\") , emit: unpaired, optional: true\n tuple val(sample_id), path(\"*.html\") , emit: html , optional: true\n tuple val(sample_id), path(\"*.zip\") , emit: zip , optional: true\n\n script:\n def input_reads = [reads]\n\n if( input_reads.flatten().size() == 1 )\n \"\"\"\n trim_galore \\\\\n --gzip \\\\\n $reads\n \"\"\"\n else\n \"\"\"\n trim_galore \\\\\n --paired \\\\\n --gzip \\\\\n ${reads[0]} \\\\\n ${reads[1]}\n \"\"\"\n\n}\n\n\n\nExtension\nModify the FASTQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the fastqc logs.\nModify the MULTIQC process such that the output is a tuple where the first element is the grouping key, and the second element is the path to the generated html file.\nFinally, run the entire workflow, specifying an --outdir. 
The workflow block should look like this:\nworkflow {\n index_ch = INDEX(transcriptome_ch)\n\n quant_inputs_ch = index_ch.combine(reads_ch, by: 0)\n quant_ch = QT(quant_inputs_ch)\n\n trimgalore_out_ch = TRIMGALORE(reads_ch).reads\n\n fastqc_ch = FASTQC_one(reads_ch)\n fastqc_cleaned_ch = FASTQC_two(trimgalore_out_ch)\n multiqc_ch = MULTIQC(quant_ch, fastqc_ch)\n}\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe output block of both processes have been modified to be tuples containing a grouping key.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n tuple val(sample_id), path(\"fastqc_${sample_id}_logs\")\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\n\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(quantification)\n tuple val(sample_id), path(fastqc)\n\n output:\n tuple val(sample_id), path(\"*.html\")\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, Nextflow Patterns materials from Nextflow, nf-core nf-core tools documentation and nf-validation" }, { "objectID": "workshops/00_setup.html", diff --git a/sessions/1_intro_run_nf.html b/sessions/1_intro_run_nf.html index dad19d1..afd6901 100644 --- a/sessions/1_intro_run_nf.html +++ b/sessions/1_intro_run_nf.html @@ -147,7 +147,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/sessions/2_nf_dev_intro.html b/sessions/2_nf_dev_intro.html index 1c15bca..735ed29 100644 --- a/sessions/2_nf_dev_intro.html +++ b/sessions/2_nf_dev_intro.html @@ -147,7 +147,7 @@
  • - Nextflow Operators + Metadata Propagation
  • @@ -283,7 +283,7 @@

    Workshop schedule

    12th Jun 2024 -Working with Nextflow Built-in Functions |
    operators | metadata | output-scatter-gather +Working with Nextflow Built-in Functions | operators | metadata | output-scatter-gather Introduction to nextflow operators, metadata propagation, scatter, and gather 19th Jun 2024 diff --git a/sitemap.xml b/sitemap.xml index 76ab052..bacc303 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,66 +2,66 @@ https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/2_nf_dev_intro.html - 2024-06-19T00:29:31.263Z + 2024-06-19T00:35:05.156Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/index.html - 2024-06-19T00:29:30.522Z + 2024-06-19T00:35:04.378Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.3_tips_and_tricks.html - 2024-06-19T00:29:29.031Z + 2024-06-19T00:35:02.763Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/6.1_operators.html - 2024-06-19T00:29:28.234Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/7.1_metadata_propagation.html + 2024-06-19T00:35:01.925Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/8.1_scatter_gather_output.html - 2024-06-19T00:29:26.854Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html + 2024-06-19T00:35:00.714Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_modules.html - 2024-06-19T00:29:25.902Z + 2024-06-19T00:34:59.391Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html - 2024-06-19T00:29:24.491Z + 2024-06-19T00:34:57.819Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html - 2024-06-19T00:29:22.901Z + 2024-06-19T00:34:56.004Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/3.1_creating_a_workflow.html - 2024-06-19T00:29:22.299Z + 2024-06-19T00:34:55.381Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/5.1_nf_core_template.html - 2024-06-19T00:29:24.138Z + 2024-06-19T00:34:57.423Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html - 2024-06-19T00:29:25.142Z + 2024-06-19T00:34:58.500Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/7.1_metadata_proprogation.html - 2024-06-19T00:29:26.344Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/8.1_scatter_gather_output.html + 2024-06-19T00:34:59.907Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html - 2024-06-19T00:29:27.566Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/6.1_operators.html + 2024-06-19T00:35:01.462Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/00_setup.html - 2024-06-19T00:29:28.597Z + 2024-06-19T00:35:02.318Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.1_customise_and_run.html - 2024-06-19T00:29:30.184Z + 2024-06-19T00:35:04.033Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/1_intro_run_nf.html - 2024-06-19T00:29:30.889Z + 2024-06-19T00:35:04.765Z diff --git a/workshops/00_setup.html b/workshops/00_setup.html index d1afc48..cd4c1c4 100644 --- a/workshops/00_setup.html +++ b/workshops/00_setup.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/1.1_intro_nextflow.html b/workshops/1.1_intro_nextflow.html index 099feb5..d7add76 100644 --- a/workshops/1.1_intro_nextflow.html +++ b/workshops/1.1_intro_nextflow.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/1.2_intro_nf_core.html b/workshops/1.2_intro_nf_core.html index 4a656ad..97d166b 100644 --- a/workshops/1.2_intro_nf_core.html +++ b/workshops/1.2_intro_nf_core.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/2.1_customise_and_run.html b/workshops/2.1_customise_and_run.html index eb05031..874fee7 100644 --- a/workshops/2.1_customise_and_run.html +++ b/workshops/2.1_customise_and_run.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/2.2_troubleshooting.html b/workshops/2.2_troubleshooting.html index 69a7fc9..b153936 100644 --- a/workshops/2.2_troubleshooting.html +++ b/workshops/2.2_troubleshooting.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/2.3_tips_and_tricks.html b/workshops/2.3_tips_and_tricks.html index fafbe0f..9f3ce6b 100644 --- a/workshops/2.3_tips_and_tricks.html +++ b/workshops/2.3_tips_and_tricks.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/3.1_creating_a_workflow.html b/workshops/3.1_creating_a_workflow.html index 0bc2614..e277b0a 100644 --- a/workshops/3.1_creating_a_workflow.html +++ b/workshops/3.1_creating_a_workflow.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/4.1_draft_future_sess.html b/workshops/4.1_draft_future_sess.html index b3fa049..60b11fb 100644 --- a/workshops/4.1_draft_future_sess.html +++ b/workshops/4.1_draft_future_sess.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/4.1_modules.html b/workshops/4.1_modules.html index fb346e8..0b79fd4 100644 --- a/workshops/4.1_modules.html +++ b/workshops/4.1_modules.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/5.1_nf_core_template.html b/workshops/5.1_nf_core_template.html index a218fbd..916d23c 100644 --- a/workshops/5.1_nf_core_template.html +++ b/workshops/5.1_nf_core_template.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/6.1_operators.html b/workshops/6.1_operators.html index 682048a..c9ee667 100644 --- a/workshops/6.1_operators.html +++ b/workshops/6.1_operators.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation
  • diff --git a/workshops/7.1_metadata_propagation.html b/workshops/7.1_metadata_propagation.html new file mode 100644 index 0000000..d3d05fd --- /dev/null +++ b/workshops/7.1_metadata_propagation.html @@ -0,0 +1,759 @@ + + + + + + + + + +Peter Mac Nextflow Workshop - Nextflow Development - Metadata Proprogation + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    +
    + +
    + +
    + + + + +
    + +
    +
    +

    Nextflow Development - Metadata Propagation

    +
    + + + +
    + + + + +
    + + +
    + +
    +
    +
    + +
    +
    +Objectives +
    +
    +
    +
      +
    • Gain an understanding of how to manipulate and propagate metadata
    • +
    +
    +
    +
    +

    Environment Setup

    +

    Set up an interactive shell to run our Nextflow workflow:

    +
    srun --pty -p prod_short --mem 8GB --mincpus 2 -t 0-2:00 bash
    +

    Load the required modules to run Nextflow:

    +
    module load nextflow/23.04.1
    +module load singularity/3.7.3
    +

    Set the singularity cache environment variable:

    +
    export NXF_SINGULARITY_CACHEDIR=/config/binaries/singularity/containers_devel/nextflow
    +

    Singularity images downloaded by workflow executions will now be stored in this directory.

    +

    You may want to include these, or other environment variables, in your .bashrc file (or equivalent) that is loaded when you log in, so you don’t need to export variables every session. A complete list of environment variables can be found here.

    +

    The training data can be cloned from:

    +
    git clone https://github.com/nextflow-io/training.git
    +
    +
    +

    7.1 Metadata Parsing

    +

    We have covered a few different methods of metadata parsing.

    +
    +

    7.1.1 First Pass: .fromFilePairs

    +

    A first pass attempt at pulling these files into Nextflow might use the fromFilePairs method:

    +
    workflow {
    +    Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
    +    .view
    +}
    +

    Nextflow will pull out the first part of the fastq filename and return a channel of tuple elements, where the first element is the filename-derived ID and the second element is a list of two fastq files.
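    For example (with hypothetical file names that follow the sample_replicate_type pattern used below), the view output might look something like:
    +
    [sampleA_rep1_tumor, [/.../reads/sampleA/sampleA_rep1_tumor_R1.fastq.gz, /.../reads/sampleA/sampleA_rep1_tumor_R2.fastq.gz]]
    +[sampleB_rep1_normal, [/.../reads/sampleB/sampleB_rep1_normal_R1.fastq.gz, /.../reads/sampleB/sampleB_rep1_normal_R2.fastq.gz]]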

    +

    The id is stored as a simple string. We’d like to move to using a map of key-value pairs because we have more than one piece of metadata to track. In this example, we have sample, replicate, tumor/normal, and treatment. We could add extra elements to the tuple, but this changes the ‘cardinality’ of the elements in the channel and adding extra elements would require updating all downstream processes. A map is a single object and is passed through Nextflow channels as one value, so adding extra metadata fields will not require us to change the cardinality of the downstream processes.
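    To make the cardinality point concrete (hypothetical values), compare the two shapes a channel element could take:
    +
    // extra tuple elements: every new metadata field changes the cardinality downstream processes must declare
    +['sampleA', 'rep1', 'tumor', ['sampleA_rep1_tumor_R1.fastq.gz', 'sampleA_rep1_tumor_R2.fastq.gz']]
    +
    +// meta map: the element stays a two-item tuple no matter how many keys we add
    +[[sample:'sampleA', replicate:'rep1', type:'tumor'], ['sampleA_rep1_tumor_R1.fastq.gz', 'sampleA_rep1_tumor_R2.fastq.gz']]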

    +

    There are a couple of different ways we can pull out the metadata.

    +

    We can use the tokenize method to split our id. To sanity-check, I just pipe the result directly into the view operator.

    +
    workflow {
    +    Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
    +    .map { id, reads ->
    +        tokens = id.tokenize("_")
    +    }
    +    .view
    +}
    +

    If we are confident about the stability of the naming scheme, we can destructure the list returned by tokenize and assign its elements to variables directly:

    +
    map { id, reads ->
    +    (sample, replicate, type) = id.tokenize("_")
    +    meta = [sample:sample, replicate:replicate, type:type]
    +    [meta, reads]
    +}
    +
    +
    +
    + +
    +
    +Note +
    +
    +
    +
    Make sure that you're using a tuple with parentheses e.g. (one, two) rather than a List e.g. [one, two]
    +
    +
    +
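    As a minimal illustration of the note above (hypothetical id string):
    +
    def id = "sampleA_rep1_tumor"
    +
    +// parentheses on the left-hand side perform Groovy multiple assignment
    +def (sample, replicate, type) = id.tokenize("_")
    +
    +// a List literal on the left, e.g. [sample, replicate, type] = id.tokenize("_"),
    +// is not multiple assignment and will not populate the three variables as intended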

    If we move back to the previous method, but decide that the ‘rep’ prefix on the replicate should be removed, we can use regular expressions to simply “subtract” pieces of a string. Here we remove a ‘rep’ prefix from the replicate variable if the prefix is present:

    +
    map { id, reads ->
    +    (sample, replicate, type) = id.tokenize("_")
    +    replicate -= ~/^rep/
    +    meta = [sample:sample, replicate:replicate, type:type]
    +    [meta, reads]
    +}
    +

    Setting up the “meta” in our tuple with the format above allows us to access the value of “sample” throughout our modules/configs as ${meta.sample}.
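    As a sketch of what this enables (a hypothetical process, not one of the workshop modules), the meta map travels with the reads and its keys can be referenced wherever the tuple is available:
    +
    process EXAMPLE_PROCESS {
    +    tag "${meta.sample}_rep${meta.replicate}"
    +
    +    input:
    +    tuple val(meta), path(reads)
    +
    +    output:
    +    tuple val(meta), path("${meta.sample}.summary.txt")
    +
    +    script:
    +    """
    +    echo "sample=${meta.sample} replicate=${meta.replicate} type=${meta.type} reads=${reads}" > ${meta.sample}.summary.txt
    +    """
    +}
    +
    The same keys can also be used in configuration, for example in nf-core-style modules where a withName block can set a per-sample prefix with a closure such as ext.prefix = { "${meta.sample}" }.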

    +
    +
    +
    +

    7.1.2 Second Pass: .splitCsv

    +

    We have briefly touched on .splitCsv in the first week.

    +

    As a quick overview

    +

    Assuming we have the samplesheet

    +
    sample_name,fastq1,fastq2
    +gut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq
    +liver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq
    +lung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq
    +

    We can set up a workflow to read in these files as:

    +
    params.reads = "/.../rnaseq_samplesheet.csv"
    +
    +reads_ch = Channel.fromPath(params.reads)
    +reads_ch.view()
    +reads_ch = reads_ch.splitCsv(header:true)
    +reads_ch.view()
    +
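    With the samplesheet above, the first view prints the path to the csv file itself, while the second view prints one map per row, keyed by the header fields, along the lines of:
    +
    [sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]
    +[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.fq]
    +[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]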
    +
    +
    + +
    +
    +Challenge +
    +
    +
    +

    Using .splitCsv and .map, read in the samplesheet below: /home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv

    +

    Set the meta to contain the following keys from the header: id, repeat and type

    +
    +
    +
    + +
    +
    +
    params.input = "/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/samplesheet.csv"
    +
    +ch_sheet = Channel.fromPath(params.input)
    +
    +ch_sheet.splitCsv(header:true)
    +    .map { row ->
    +        meta = [id: row.id, repeat: row.repeat, type: row.type]
    +        [meta, row.fastq_1, row.fastq_2]
    +    }.view()
    +
    +
    +
    +
    +
    +
    +

    7.2 Manipulating Metadata and Channels

    +

    There are a number of use cases where we will be interested in manipulating our metadata and channels.

    +

    Here we will look at 2 use cases.

    +
    +

    7.2.1 Matching input channels

    +

    As we have seen in examples/challenges in the operators section, it is important to ensure that the format of the channels that you provide as inputs matches the process definition.

    +
    params.reads = "/home/Shared/For_NF_Workshop/training/nf-training/data/ggal/*_{1,2}.fq"
    +
    +process printNumLines {
    +    input:
    +    path(reads)
    +
    +    output:
    +    path("*txt")
    +
    +    script:
    +    """
    +    wc -l ${reads}
    +    """
    +}
    +
    +workflow {
    +    ch_input = Channel.fromFilePairs("$params.reads")
    +    printNumLines( ch_input )
    +}
    +

    If the format does not match, you will see an error similar to the one below:

    +
    [myeung@papr-res-compute204 lesson7.1test]$ nextflow run test.nf 
    +N E X T F L O W  ~  version 23.04.1
    +Launching `test.nf` [agitated_faggin] DSL2 - revision: c210080493
    +[-        ] process > printNumLines -
    +

    or, if using the nf-core template:

    +
    ERROR ~ Error executing process > 'PMCCCGTRC_UMIHYBCAP:UMIHYBCAP:PREPARE_GENOME:BEDTOOLS_SLOP'
    +
    +Caused by:
    +  Not a valid path value type: java.util.LinkedHashMap ([id:genome_size])
    +
    +
    +Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
    +
    + -- Check '.nextflow.log' file for details
    +

    When encountering these errors there are two methods to correct this:

    +
      +
    1. Change the input definition in the process
    2. Use variations of the channel operators to correct the format of your channel
    +

    There are cases where changing the input definition is impractical (e.g. when using nf-core modules/subworkflows).
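    For the printNumLines example above, the second option is a small change in the workflow block: reshape the fromFilePairs tuples so that only the list of files reaches the path(reads) input. A minimal sketch of that fix:
    +
    workflow {
    +    ch_input = Channel.fromFilePairs("$params.reads")
    +
    +    // drop the grouping id so each element matches the process input `path(reads)`
    +    printNumLines( ch_input.map { id, reads -> reads } )
    +}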

    +

    Let’s take a look at some select modules.

    +

    BEDTOOLS_SLOP

    +

    BEDTOOLS_INTERSECT

    +
    +
    +
    + +
    +
    +Challenge +
    +
    +
    +

    Assuming that you have the following inputs

    +
    ch_target = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed")
    +ch_bait = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed").map { fn -> [ [id: fn.baseName ], fn ] }
    +ch_sizes = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes")
    +

    Write a mini workflow that:

    +
      +
    1. Takes the ch_target bedfile and extends the bed by 20bp on both sides using BEDTOOLS_SLOP (You can use the config definition below as a helper, or write your own as an additional challenge)
    2. Takes the output from BEDTOOLS_SLOP and provides it, together with ch_bait, as input to BEDTOOLS_INTERSECT
    +

    HINT: The modules can be imported from this location: /home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools

    +

    HINT: You will need the following operators to achieve this: .map and .combine

    +
    +
    +
    + +
    +
    +
    
    +process {
    +    withName: 'BEDTOOLS_SLOP' {
    +        ext.args = "-b 20"
    +        ext.prefix = "extended.bed"
    +    }
    +
    +    withName: 'BEDTOOLS_INTERSECT' {
    +        ext.prefix = "intersect.bed"
    +    }
    +}
    +
    +Solution
    +
    +include { BEDTOOLS_SLOP } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/slop/main'
    +include { BEDTOOLS_INTERSECT } from '/home/Shared/For_NF_Workshop/training/pmcc-test/modules/nf-core/bedtools/intersect/main'
    +
    +
    +ch_target = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals.bed")
    +ch_bait = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/intervals2.bed").map { fn -> [ [id: fn.baseName ], fn ] }
    +ch_sizes = Channel.fromPath("/home/Shared/For_NF_Workshop/training/nf-training-advanced/grouping/data/genome.sizes")
    +
    +workflow {
    +    BEDTOOLS_SLOP ( ch_target.map{ fn -> [ [id:fn.baseName], fn ]}, ch_sizes)
    +
    +    target_bait_bed = BEDTOOLS_SLOP.out.bed.combine( ch_bait )
    +    BEDTOOLS_INTERSECT( target_bait_bed, ch_sizes.map{ fn -> [ [id: fn.baseName], fn]} )
    +}
    +
    nextflow run nfcoretest.nf -profile singularity -c test2.config --outdir nfcoretest
    +
    +
    +
    +
    +
    +
    +

    7.3 Grouping with Metadata

    +

    Earlier, we introduced the groupTuple operator:

    +
    
    +ch_reads = Channel.fromFilePairs("/home/Shared/For_NF_Workshop/training/nf-training-advanced/metadata/data/reads/*/*_R{1,2}.fastq.gz")
    +    .map { id, reads ->
    +        (sample, replicate, type) = id.tokenize("_")
    +        replicate -= ~/^rep/
    +        meta = [sample:sample, replicate:replicate, type:type]
    +        [meta, reads]
    +    }
    +
    +// Assume that we want to drop replicate from the meta and combine fastqs
    +
    +ch_reads.map {
    +    meta, reads -> 
    +        [ meta - meta.subMap('replicate') + [data_type: 'fastq'], reads ]
    +    }
    +    .groupTuple().view()
    + + +
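    With read pairs named as in the earlier examples (hypothetical values), each element emitted by groupTuple now carries one meta map, minus the replicate key, and a list of read pairs gathered across replicates:
    +
    [[sample:sampleA, type:tumor, data_type:fastq], [[sampleA_rep1_tumor_R1.fastq.gz, sampleA_rep1_tumor_R2.fastq.gz], [sampleA_rep2_tumor_R1.fastq.gz, sampleA_rep2_tumor_R2.fastq.gz]]]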
    + +
    + +
    + + + + \ No newline at end of file diff --git a/workshops/8.1_scatter_gather_output.html b/workshops/8.1_scatter_gather_output.html index b52d3cb..b1466ea 100644 --- a/workshops/8.1_scatter_gather_output.html +++ b/workshops/8.1_scatter_gather_output.html @@ -181,7 +181,7 @@
  • - Nextflow Operators + Metadata Propagation