diff --git a/.nojekyll b/.nojekyll index 5f37abd..a49e755 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -9e9c49e2 \ No newline at end of file +338f3e2a \ No newline at end of file diff --git a/search.json b/search.json index 3251abb..41f5fe4 100644 --- a/search.json +++ b/search.json @@ -28,18 +28,18 @@ "text": "Objectives\n\n\n\n\nLearn about the core features of nf-core.\nLearn the terminology used by nf-core.\nUse Nextflow to pull and run the nf-core/testpipeline workflow\n\n\n\nIntroduction to nf-core: Introduce nf-core features and concepts, structures, tools, and example nf-core pipelines\n\n1.2.1. What is nf-core?\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nnf-core provides a standardized set of best practices, guidelines, and templates for building and sharing bioinformatics workflows. These workflows are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources.\nThe community is a diverse group of bioinformaticians, developers, and researchers from around the world who collaborate on developing and maintaining a growing collection of high-quality workflows. These workflows cover a range of applications, including transcriptomics, proteomics, and metagenomics.\nOne of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the workflows are robust, well-documented, and validated against real-world datasets. This helps to increase the reliability and reproducibility of bioinformatics analyses and ultimately enables researchers to accelerate their scientific discoveries.\nnf-core is published in Nature Biotechnology: Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology\nKey Features of nf-core workflows\n\nDocumentation\n\nnf-core workflows have extensive documentation covering installation, usage, and description of output files to ensure that you won’t be left in the dark.\n\nStable Releases\n\nnf-core workflows use GitHub releases to tag stable versions of the code and software, making workflow runs totally reproducible.\n\nPackaged software\n\nPipeline dependencies are automatically downloaded and handled using Docker, Singularity, Conda, or other software management tools. There is no need for any software installations.\n\nPortable and reproducible\n\nnf-core workflows follow best practices to ensure maximum portability and reproducibility. The large community makes the workflows exceptionally well-tested and easy to execute.\n\nCloud-ready\n\nnf-core workflows are tested on AWS.\n\n\n\n\n1.2.2. Executing an nf-core workflow\nThe nf-core website has a full list of workflows and associated documentation to be explored.\nEach workflow has a dedicated page that includes expansive documentation that is split into 7 sections:\n\nIntroduction\n\nAn introduction and overview of the workflow\n\nResults\n\nExample output files generated from the full test dataset\n\nUsage docs\n\nDescriptions of how to execute the workflow\n\nParameters\n\nGrouped workflow parameters with descriptions\n\nOutput docs\n\nDescriptions and examples of the expected output files\n\nReleases & Statistics\n\nWorkflow version history and statistics\n\n\nAs nf-core is a community development project, the code for a pipeline can be changed at any time. To ensure that you have locked in a specific version of a pipeline, you can use Nextflow’s built-in functionality to pull a workflow.
The Nextflow pull command can download and cache workflows from GitHub repositories:\nnextflow pull nf-core/<pipeline>\nNextflow run will also automatically pull the workflow if it was not already available locally:\nnextflow run nf-core/<pipeline>\nNextflow will pull the default git branch if a workflow version is not specified. This will be the master branch for nf-core workflows with a stable release. nf-core workflows use GitHub releases to tag stable versions of the code and software. Once a version is released, you will always be able to execute that version of a workflow using the -revision or -r flag.\nFor this section of the workshop, we will be using nf-core/testpipeline as an example.\nAs we will be running some bioinformatics tools, we will need to make sure of the following:\n\nWe are not running on the login node\nThe singularity module is loaded (module load singularity/3.7.3)\n\n\n\n\n\n\n\nSet up an interactive session\n\n\n\nsrun --pty -p prod_short --mem 20GB --cpus-per-task 2 -t 0-2:00 /bin/bash\n\nEnsure the required modules are loaded\nmodule list\nCurrently Loaded Modulefiles:\n 1) java/jdk-17.0.6 2) nextflow/23.04.1 3) squashfs-tools/4.5 4) singularity/3.7.3\n\n\n\nWe will also create a separate output directory for this section.\ncd /scratch/users/<your-username>/nfWorkshop; mkdir ./lesson1.2 && cd $_\nThe base command we will be using for this section is:\nnextflow run nf-core/testpipeline -profile test,singularity --outdir my_results\n\n\n1.2.3. Workflow structure\nnf-core workflows start from a common template and follow the same structure. Although you won’t need to edit code in the workflow project directory, having a basic understanding of the project structure and some core terminology will help you understand how to configure its execution.\nLet’s take a look at the code for the nf-core/rnaseq pipeline.\nNextflow DSL2 workflows are built up of subworkflows and modules that are stored as separate .nf files.\nMost nf-core workflows consist of a single workflow file (there are a few exceptions). This is the main <workflow>.nf file that is used to bring everything else together. Instead of having one large monolithic script, it is broken up into a combination of subworkflows and modules.\nA subworkflow is a group of modules that are used in combination with each other and have a common purpose. Subworkflows improve workflow readability and help with the reuse of modules within a workflow. The nf-core community also shares subworkflows in the nf-core subworkflows GitHub repository. Local subworkflows are workflow-specific and are not shared in the nf-core subworkflows repository.\nLet’s take a look at the BAM_STATS_SAMTOOLS subworkflow.\nThis subworkflow is composed of the following modules: - SAMTOOLS_STATS - SAMTOOLS_IDXSTATS, and - SAMTOOLS_FLAGSTAT\nA module is a wrapper for a process; most modules will execute a single tool and contain the following definitions: - inputs - outputs, and - script block.\nLike subworkflows, modules can also be shared in the nf-core modules GitHub repository or stored as a local module. All modules from the nf-core repository are version controlled and tested to ensure reproducibility. Local modules are workflow-specific and are not shared in the nf-core modules repository.\n\n\n1.2.4. Viewing parameters\nEvery nf-core workflow has a full list of parameters on the nf-core website. When viewing these parameters online, you will also be shown a description and the type of the parameter.
Some parameters will have additional text to help you understand when and how a parameter should be used.\n\n\n\n\n\nParameters and their descriptions can also be viewed in the command line using the run command with the --help parameter:\nnextflow run nf-core/<workflow> --help\n\n\n\n\n\n\nChallenge\n\n\n\nView the parameters for the nf-core/testpipeline workflow using the command line:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nThe nf-core/testpipeline workflow parameters can be printed using the run command and the --help option:\nnextflow run nf-core/testpipeline --help\n\n\n\n\n\n1.2.5. Parameters in the command line\nParameters can be customized using the command line. Any parameter can be configured on the command line by prefixing the parameter name with a double dash (--):\nnextflow run nf-core/<workflow> --<parameter>\n\n\n\n\n\n\nTip\n\n\n\nNextflow options are prefixed with a single dash (-) and workflow parameters are prefixed with a double dash (--).\n\n\nDepending on the parameter type, you may be required to add additional information after your parameter flag. For example, for a string parameter, you would add the string after the parameter flag:\nnextflow run nf-core/<workflow> --<parameter> string\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite animal using the multiqc_title parameter and a command line flag:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nAdd the --multiqc_title flag to your command and execute it. Use the -resume option to save time:\nnextflow run nf-core/testpipeline -profile test,singularity --multiqc_title koala --outdir my_results -resume\n\n\n\nIn this example, you can check your parameter has been applied by listing the files created in the results folder (my_results):\nls my_results/multiqc/\n\n\n1.2.6. Configuration files\nConfiguration files are .config files that can contain various workflow properties. Custom configuration files can be passed to Nextflow on the command line using the -c option:\nnextflow run nf-core/<workflow> -profile test,docker -c <path/to/custom.config>\nMultiple custom .config files can be included at execution by separating them with a comma (,).\nCustom configuration files follow the same structure as the configuration file included in the workflow directory. Configuration properties are organized into scopes by grouping the properties in the same scope using the curly brackets notation. For example:\nalpha {\n x = 1\n y = 'string value..'\n}\nScopes allow you to quickly configure settings required to deploy a workflow on different infrastructure using different software management tools. For example, the executor scope can be used to provide settings for the deployment of a workflow on an HPC cluster. Similarly, the singularity scope controls how Singularity containers are executed by Nextflow. Multiple scopes can be included in the same .config file using a mix of dot prefixes and curly brackets.
A full list of scopes is described in detail here.\n\n\n\n\n\n\nChallenge\n\n\n\nGive the MultiQC report for the nf-core/testpipeline workflow the name of your favorite color using the multiqc_title parameter in a custom my_custom.config file:\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nCreate a custom my_custom.config file that contains your favourite colour, e.g., blue:\nparams {\n multiqc_title = \"blue\"\n}\nInclude the custom .config file in your execution command with the -c option:\nnextflow run nf-core/testpipeline --outdir my_results -profile test,singularity -resume -c my_custom.config\nCheck that it has been applied:\nls my_results/multiqc/\nWhy did this fail?\nYou cannot use the params scope in custom configuration files. Parameters can only be configured using the -params-file option or the command line. While the parameter is listed on the STDOUT, it was not applied to the executed command.\nWe will revisit this at the end of the module.\n\n\n\n\n\n1.2.7. Parameter files\nParameter files are used to define params options for a pipeline and are generally written in YAML format. They are added to a pipeline with the -params-file flag.\nExample YAML:\n\"<parameter1_name>\": 1\n\"<parameter2_name>\": \"<string>\"\n\"<parameter3_name>\": true\n\n\n\n\n\n\nChallenge\n\n\n\nBased on the failed application of the parameter multiqc_title, create a my_params.yml setting multiqc_title to your favourite colour. Then re-run the pipeline with your my_params.yml\n\n\n\n\n\n\n\n\nSolution\n\n\n\n\n\nSet up my_params.yml\nmultiqc_title: \"black\"\nnextflow run nf-core/testpipeline -profile test,singularity -params-file my_params.yml --outdir Lesson1_2\n\n\n\n\n\n1.2.8. Default configuration files\nAll parameters will have a default setting that is defined using the nextflow.config file in the workflow project directory. By default, most parameters are set to null or false and are only activated by a profile or configuration file.\nThere are also several includeConfig statements in the nextflow.config file that are used to load additional .config files from the conf/ folder. Each additional .config file contains categorized configuration information for your workflow execution, some of which can be optionally included:\n\nbase.config\n\nIncluded by the workflow by default.\nGenerous resource allocations using labels.\nDoes not specify any method for software management and expects software to be available (or specified elsewhere).\n\nigenomes.config\n\nIncluded by the workflow by default.\nDefault configuration to access reference files stored on AWS iGenomes.\n\nmodules.config\n\nIncluded by the workflow by default.\nModule-specific configuration options (both mandatory and optional).\n\n\nNotably, configuration files can also contain the definition of one or more profiles. A profile is a set of configuration attributes that can be activated when launching a workflow by using the -profile command option:\nnextflow run nf-core/<workflow> -profile <profile>\nProfiles used by nf-core workflows include:\n\nSoftware management profiles\n\nProfiles for the management of software using software management tools, e.g., docker, singularity, and conda.\n\nTest profiles\n\nProfiles to execute the workflow with a standardized set of test data and parameters, e.g., test and test_full.\n\n\nMultiple profiles can be specified in a comma-separated (,) list when you execute your command.
The order of profiles is important as they will be read from left to right:\nnextflow run nf-core/<workflow> -profile test,singularity\nnf-core workflows are required to define software containers and conda environments that can be activated using profiles.\n\n\n\n\n\n\nTip\n\n\n\nIf your computer has internet access and one of Conda, Singularity, or Docker installed, you should be able to run any nf-core workflow with the test profile and the respective software management profile ‘out of the box’. The test data profile will pull small test files directly from the nf-core/test-data GitHub repository and run the workflow on your local system. The test profile is an important control to check the workflow is working as expected and is a great way to trial a workflow. Some workflows have multiple test profiles for you to test.\n\n\n\n\n\n\n\n\nKey points\n\n\n\n\nnf-core is a community effort to collect a curated set of analysis workflows built using Nextflow.\nNextflow can be used to pull nf-core workflows.\nnf-core workflows follow similar structures\nnf-core workflows are configured using parameters and profiles\n\n\n\n\nThese materials are adapted from Customising Nf-Core Workshop by Sydney Informatics Hub" }, { "objectID": "workshops/4.1_draft_future_sess.html", "href": "workshops/4.1_draft_future_sess.html", "title": "Nextflow Development - Metadata Parsing", "section": "", "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.)
MUST be wrapped in single quotes to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string pointing to a .csv file that contains the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row.
We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample." }, { "objectID": "workshops/4.1_draft_future_sess.html#metadata-parsing", "href": "workshops/4.1_draft_future_sess.html#metadata-parsing", "title": "Nextflow Development - Metadata Parsing", "section": "", "text": "Currently, we have defined the reads parameter as a string:\nparams.reads = \"/.../training/nf-training/data/ggal/gut_{1,2}.fq\"\nTo group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\nGlob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.\n>>> ls \"/.../training/nf-training/data/ggal/\"\n\ngut_1.fq gut_2.fq liver_1.fq liver_2.fq lung_1.fq lung_2.fq transcriptome.fa\nRun the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:\nnextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'\nFile paths that include one or more wildcards (ie. *, ?, etc.)
MUST be wrapped in single quotes to avoid Bash expanding the glob on the command line.\nThe reads_ch now contains three tuple elements with unique grouping keys:\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe grouping key metadata can also be explicitly created without having to rely on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, and fill in a custom sample_name, along with the paths to the .fq files.\nsample_name,fastq1,fastq2\ngut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq\nliver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq\nlung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq\nLet’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.\nparams.reads = \"/.../rnaseq_samplesheet.csv\"\nPreviously, the reads parameter consisted of a string of the .fq files directly. Now, it is a string pointing to a .csv file that contains the .fq files. Therefore, the channel factory method that reads the input file also needs to be changed. Since the parameter is now a single file path, the fromPath method can first be used, which creates a channel of Path type object. The splitCsv channel operator can then be used to parse the contents of the channel.\nreads_ch = Channel.fromPath(params.reads)\nreads_ch.view()\n\nreads_ch = reads_ch.splitCsv(header:true)\nreads_ch.view()\nWhen using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline containing the new input parameter.\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\nexecutor > local (1)\n[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔\n/.../rnaseq_samplesheet.csv\n[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]\n[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]\n[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]\nThe /.../rnaseq_samplesheet.csv is the output of reads_ch directly after the fromPath channel factory method was used. Here, the channel is a Path type object. After invoking the splitCsv channel operator, the reads_ch is now replaced with a channel consisting of three elements, where each element is a row in the .csv file, returned as a list. Since header was set to true, each element in the list is also mapped to the column names. This can be used when creating the custom grouping key.\nTo create grouping key metadata from the list output by splitCsv, the map channel operator can be used.\n reads_ch = reads_ch.map { row -> \n grp_meta = \"$row.sample_name\"\n [grp_meta, [row.fastq1, row.fastq2]]\n }\n reads_ch.view()\nHere, for each list in reads_ch, we assign it to a variable row.
We then create custom grouping key metadata grp_meta based on the sample_name column from the .csv, which can be accessed via the row variable by . separation. After the custom metadata key is assigned, a tuple is created by assigning grp_meta as the first element, and the two .fq files as the second element, accessed via the row variable by . separation.\nLet’s run the pipeline containing the custom grouping key:\n>>> nextflow run rnaseq.nf\n\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97\nreads: rnaseq_samplesheet.csv\nreads: $params.reads\n[- ] process > INDEX -\n[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver_sample, [/home/sli/test/training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\nThe custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and fastq1 file names. The samplesheet can also be created to include multiple sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample." }, { "objectID": "workshops/3.1_creating_a_workflow.html", "href": "workshops/3.1_creating_a_workflow.html", "title": "Nextflow Development - Creating a Nextflow Workflow", "section": "", "text": "Objectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory" }, { "objectID": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", "href": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", "title": "Nextflow Development - Creating a Nextflow Workflow", "section": "Creating an RNAseq Workflow", "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is accessed with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2.
Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the command standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest.
Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config file does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to Singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs: the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie.
grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been run three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second element is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads.
The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC process has been run three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg.
copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of the workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core" }, { "objectID": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", "href": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", "title": "Nextflow Development - Creating a Nextflow Workflow", "section": "Creating an RNAseq Workflow", "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is accessed with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment.
As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the command standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config file does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to Singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line.
If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs: the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch.
View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been run three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second element is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input.
The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC process has been run three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg.
copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of the workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core\n^*Draft for Future Sessions" + }, + { + "objectID": "workshops/1.1_intro_nextflow.html", + "href": "workshops/1.1_intro_nextflow.html", + "title": "Introduction to Nextflow", + "section": "", + "text": "Objectives\n\n\n\n\nLearn about the benefits of a workflow manager.\nLearn Nextflow terminology.\nLearn basic commands and options to run a Nextflow workflow" + }, + { + "objectID": "workshops/1.1_intro_nextflow.html#footnotes", + "href": "workshops/1.1_intro_nextflow.html#footnotes", + "title": "Introduction to Nextflow", + "section": "Footnotes", + "text": "Footnotes\n\n\nhttps://www.lexico.com/definition/workflow↩︎" }, { "objectID": "workshops/2.2_troubleshooting.html", diff --git a/sessions/2_nf_dev_intro.html b/sessions/2_nf_dev_intro.html index 6f84e84..b599231 100644 --- a/sessions/2_nf_dev_intro.html +++ b/sessions/2_nf_dev_intro.html @@ -223,7 +223,7 @@

Learning Objectives:

Set up requirements

-

Please complete the Setup Instructions before the course.

+

Please complete the Setup Instructions before the course.

If you have any trouble, please get in contact with us ASAP via Slack/Teams.

@@ -238,7 +238,7 @@

Workshop schedule

-Setup +Setup Follow these instructions to install VS Code and setup your workspace Prior to workshop diff --git a/sitemap.xml b/sitemap.xml index 50896b8..a2b23ea 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,42 +2,46 @@ https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/2_nf_dev_intro.html - 2024-05-29T01:04:34.157Z + 2024-05-29T01:20:43.219Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/index.html - 2024-05-29T01:04:33.255Z + 2024-05-29T01:20:42.278Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.3_tips_and_tricks.html - 2024-05-29T01:04:31.453Z + 2024-05-29T01:20:40.418Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html - 2024-05-29T01:04:30.469Z + 2024-05-29T01:20:39.398Z - https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html - 2024-05-29T01:04:28.781Z + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/4.1_draft_future_sess.html + 2024-05-29T01:20:37.648Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/3.1_creating_a_workflow.html - 2024-05-29T01:04:28.073Z + 2024-05-29T01:20:36.497Z + + + https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html + 2024-05-29T01:20:37.202Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html - 2024-05-29T01:04:29.577Z + 2024-05-29T01:20:38.501Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/00_setup.html - 2024-05-29T01:04:30.910Z + 2024-05-29T01:20:39.844Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.1_customise_and_run.html - 2024-05-29T01:04:32.863Z + 2024-05-29T01:20:41.860Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/1_intro_run_nf.html - 2024-05-29T01:04:33.703Z + 2024-05-29T01:20:42.751Z diff --git a/workshops/3.1_creating_a_workflow.html b/workshops/3.1_creating_a_workflow.html index 38e0cb2..962a1b5 100644 --- a/workshops/3.1_creating_a_workflow.html +++ b/workshops/3.1_creating_a_workflow.html @@ -826,7 +826,7 @@

4.1.5. Quality Control

Now, let’s implement a FASTQC quality control process for the input fastq reads.

Challenge:

-

Create a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second variable is assigned the varible reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:

+

Create a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second element is assigned the variable reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:

mkdir fastqc_${sample_id}_logs
 fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}

Take fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:

@@ -922,6 +922,7 @@

4.1.6. MultiQC Report

This workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core

+

^*Draft for Future Sessions

diff --git a/workshops/4.1_draft_future_sess.html b/workshops/4.1_draft_future_sess.html new file mode 100644 index 0000000..70a6c0b --- /dev/null +++ b/workshops/4.1_draft_future_sess.html @@ -0,0 +1,520 @@ + + + + + + + + + +Peter Mac Nextflow Workshop - Nextflow Development - Metadata Parsing + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+ + + + +
+ +
+
+

Nextflow Development - Metadata Parsing

+
+ + + +
+ + + + +
+ + +
+ +
+

Metadata Parsing

+

Currently, we have defined the reads parameter as a string:

+
params.reads = "/.../training/nf-training/data/ggal/gut_{1,2}.fq"
+

To group the reads parameter, the fromFilePairs channel factory can be used. Add the following to the workflow block and run the workflow:

+
reads_ch = Channel.fromFilePairs("$params.reads")
+reads_ch.view()
+

The reads parameter is being converted into a file pair group using fromFilePairs, and is assigned to reads_ch. The reads_ch consists of a tuple of two items – the first is the grouping key of the matching pair (gut), and the second is a list of paths to each file:

+
[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
+
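
As an aside, fromFilePairs also accepts optional arguments that make it stricter. A minimal sketch (these are standard fromFilePairs options, though not used in the workshop script): checkIfExists makes the run fail immediately if no files match the pattern, and size sets how many files are expected per grouping key:

reads_ch = Channel.fromFilePairs("$params.reads", checkIfExists: true, size: 2)
reads_ch.view()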

Glob patterns can also be used to create channels of file pair groups. Inside the data directory, we have pairs of gut, liver, and lung files that can all be read into reads_ch.

+
>>> ls "/.../training/nf-training/data/ggal/"
+
+gut_1.fq  gut_2.fq  liver_1.fq  liver_2.fq  lung_1.fq  lung_2.fq  transcriptome.fa
+

Run the rnaseq.nf workflow specifying all .fq files inside /.../training/nf-training/data/ggal/ as the reads parameter via the command line:

+
nextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'
+

File paths that include one or more wildcards (i.e. *, ?, etc.) MUST be wrapped in single quotes to prevent Bash from expanding the glob on the command line, as illustrated below.

+
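
To illustrate the difference (a sketch, assuming a Bash shell and that matching files exist relative to the working directory):

# Unquoted: Bash expands the glob before Nextflow starts, so the
# pattern never reaches the workflow intact
nextflow run rnaseq.nf --reads /.../training/nf-training/data/ggal/*_{1,2}.fq

# Quoted: the pattern is passed through to Nextflow unchanged
nextflow run rnaseq.nf --reads '/.../training/nf-training/data/ggal/*_{1,2}.fq'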

The reads_ch now contains three tuple elements with unique grouping keys:

+
[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
+[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]
+[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]
+

The grouping key metadata can also be created explicitly, without relying on file names, using the map channel operator. Let’s start by creating a samplesheet rnaseq_samplesheet.csv with column headings sample_name, fastq1, and fastq2, filling in a custom sample_name along with the paths to the .fq files.

+
sample_name,fastq1,fastq2
+gut_sample,/.../training/nf-training/data/ggal/gut_1.fq,/.../training/nf-training/data/ggal/gut_2.fq
+liver_sample,/.../training/nf-training/data/ggal/liver_1.fq,/.../training/nf-training/data/ggal/liver_2.fq
+lung_sample,/.../training/nf-training/data/ggal/lung_1.fq,/.../training/nf-training/data/ggal/lung_2.fq
+

Let’s now supply the path to rnaseq_samplesheet.csv to the reads parameter in rnaseq.nf.

+
params.reads = "/.../rnaseq_samplesheet.csv"
+

Previously, the reads parameter was a glob pattern matching the .fq files directly. Now, it is the path to a .csv file that lists the .fq files. Therefore, the channel factory method that reads the input also needs to be changed. Since the parameter is now a single file path, the fromPath method can be used first, which creates a channel of Path objects. The splitCsv channel operator can then be used to parse the contents of the file.

+
reads_ch = Channel.fromPath(params.reads)
+reads_ch.view()
+
+reads_ch = reads_ch.splitCsv(header:true)
+reads_ch.view()
+

When using splitCsv in the above example, header is set to true. This will use the first line of the .csv file as the column names. Let’s run the pipeline with the new samplesheet input:

+
>>> nextflow run rnaseq.nf
+
+N E X T F L O W  ~  version 23.04.1
+Launching `rnaseq.nf` [distraught_avogadro] DSL2 - revision: 525e081ba2
+reads: rnaseq_samplesheet.csv
+reads: $params.reads
+executor >  local (1)
+[4e/eeae2a] process > INDEX [100%] 1 of 1 ✔
+/.../rnaseq_samplesheet.csv
+[sample_name:gut_sample, fastq1:/.../training/nf-training/data/ggal/gut_1.fq, fastq2:/.../training/nf-training/data/ggal/gut_2.fq]
+[sample_name:liver_sample, fastq1:/.../training/nf-training/data/ggal/liver_1.fq, fastq2:/.../training/nf-training/data/ggal/liver_2.f]
+[sample_name:lung_sample, fastq1:/.../training/nf-training/data/ggal/lung_1.fq, fastq2:/.../training/nf-training/data/ggal/lung_2.fq]
+

The /.../rnaseq_samplesheet.csv line is the output of reads_ch directly after the fromPath channel factory method was used; here, the channel contains a single Path object. After invoking the splitCsv channel operator, reads_ch is replaced with a channel of three elements, where each element is a row of the .csv file. Since header was set to true, each row is returned as a map keyed by the column names, which can be used when creating the custom grouping key.

+
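
As a brief aside (a sketch, not required for this workshop): the header option of splitCsv can alternatively be given a list of column names, which is useful when a samplesheet has no header row of its own:

// Assumes a headerless samplesheet with the same three columns
reads_ch = Channel.fromPath(params.reads)
    .splitCsv(header: ['sample_name', 'fastq1', 'fastq2'])
reads_ch.view()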

To create grouping key metadata from the rows output by splitCsv, the map channel operator can be used.

+
  reads_ch = reads_ch.map { row -> 
+      grp_meta = "$row.sample_name"
+      [grp_meta, [row.fastq1, row.fastq2]]
+      }
+  reads_ch.view()
+

Here, each element of reads_ch (a row from the .csv) is assigned to the variable row inside the map closure. We then create the custom grouping key metadata grp_meta from the sample_name column of the .csv, accessed via the row variable using dot notation. After the custom metadata key is assigned, a tuple is created with grp_meta as the first element and the list of the two .fq files, also accessed via dot notation, as the second element.

+
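
One practical refinement worth noting (a sketch only, not applied in the runs shown here): the values parsed from the samplesheet are plain strings, so they can be converted into Path objects with the file() function as the tuple is built, which is useful when a downstream process declares its reads as path inputs:

reads_ch = reads_ch.map { row ->
    grp_meta = "$row.sample_name"
    // file() converts the string paths from the .csv into Path objects
    [grp_meta, [file(row.fastq1), file(row.fastq2)]]
}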

Let’s run the pipeline with the custom grouping key:

+
>>> nextflow run rnaseq.nf
+
+N E X T F L O W  ~  version 23.04.1
+Launching `rnaseq.nf` [happy_torricelli] DSL2 - revision: e9e1499a97
+reads: rnaseq_samplesheet.csv
+reads: $params.reads
+[-        ] process > INDEX -
+[gut_sample, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]
+[liver_sample, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]
+[lung_sample, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]
+

The custom grouping key can be created from multiple values in the samplesheet. For example, grp_meta = [sample : row.sample_name , file : row.fastq1] will create the metadata key using both the sample_name and the fastq1 file name. The samplesheet can also be extended to include additional sample characteristics, such as lane, data_type, etc. Each of these characteristics can be used to ensure an adequate grouping key is created for that sample, as sketched below.
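
For example, a map-style grouping key could be built as follows (a sketch; the lane column is an assumed extra column, not part of the samplesheet created above):

reads_ch = Channel.fromPath(params.reads)
    .splitCsv(header: true)
    .map { row ->
        // grp_meta is now a map, so downstream code can access
        // meta.sample, meta.lane, etc. individually
        grp_meta = [sample: row.sample_name, lane: row.lane]
        [grp_meta, [row.fastq1, row.fastq2]]
    }
reads_ch.view()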

+ + +
+ +
+ +
+ + + + \ No newline at end of file