Update parameters #12

Merged
merged 12 commits into from
Jun 11, 2024
85 changes: 48 additions & 37 deletions README.md
@@ -1,23 +1,59 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)

# Example Pipeline for IRIDA Next
# Genomic Address Service Nomenclature Workflow

This is an example pipeline to be used for integration with IRIDA Next.
This workflow takes JSON-formatted MLST allelic profiles as input and assigns cluster addresses to samples based on existing cluster designations. It is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

A brief overview of the usage of this pipeline is given below. Detailed documentation can be found in the [docs/](docs/) directory.

# Input

The input to the pipeline is a standard sample sheet (passed as `--input samplesheet.csv`) that looks like:

| sample | fastq_1 | fastq_2 |
| ------- | --------------- | --------------- |
| SampleA | file_1.fastq.gz | file_2.fastq.gz |
| sample | mlst_alleles | address |
| ------- | ----------------- | ------- |
| sampleA | sampleA.mlst.json | 1.1.1 |
| sampleQ | sampleQ.mlst.json | |
| sampleF | sampleF.mlst.json | |

The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).

Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.

# Parameters

The main parameters are `--input` as defined above and `--outdir` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of Singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.

## Profile dists

The following can be used to adjust parameters for the [profile_dists][] tool.

- `--pd_outfmt`: The output format for distances. For this pipeline the only valid value is _pairwise_ (required by [gas call][]).
- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0 and 1.
- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
- `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. This is the same file as produced from the _profile dists_ step (the [allele_map.json](docs/output.md#profile-dists) file). Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
- `--pd_skip`: Skip QA/QC steps. Can be used as a flag, `--pd_skip`, or passing a boolean, `--pd_skip true` or `--pd_skip false`.
- `--pd_columns`: Defines the loci to keep within the analysis (default when unset is to keep all loci). Formatted as a single column file with one locus name per line. For example:
- **Single column format**
```
loci1
loci2
loci3
```
- `--pd_count_missing`: Count missing alleles as different. Can be used as a flag, `--pd_count_missing`, or passing a boolean, `--pd_count_missing true` or `--pd_count_missing false`. If true, will consider missing allele calls for the same locus between samples as a difference, increasing the distance counts.
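As a toy illustration of the two distance units and the effect of `--pd_count_missing` (a sketch only — the real profile_dists implementation may treat missing data differently, and the helper function and profiles below are invented):

```python
# Sketch: hamming vs. scaled distances between two allele profiles,
# with and without counting missing calls ("0") as differences.
# Illustrative only; not the actual profile_dists logic.

def profile_distance(a, b, distm="hamming", count_missing=False):
    diffs = 0
    compared = 0
    for allele_a, allele_b in zip(a, b):
        if ("0" in (allele_a, allele_b)) and not count_missing:
            continue  # drop loci with missing data from the comparison
        compared += 1
        if allele_a != allele_b:
            diffs += 1
    if distm == "hamming":
        return diffs  # non-negative integer
    return diffs / compared if compared else 0.0  # scaled: between 0 and 1

profile_a = ["1", "2", "3", "0"]  # "0" marks a missing allele call
profile_b = ["1", "2", "4", "5"]

print(profile_distance(profile_a, profile_b))                      # 1
print(profile_distance(profile_a, profile_b, count_missing=True))  # 2
print(profile_distance(profile_a, profile_b, distm="scaled"))      # 1/3
```

With `count_missing` enabled, the missing call at the fourth locus is treated as a difference, increasing the distance from 1 to 2.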

## GAS CALL

The following can be used to adjust parameters for the [gas call][] tool.

- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_).
- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.
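As a rough sketch of how a set of thresholds could translate into a multi-level nomenclature code joined by the delimiter (single linkage only; the distances, function names, and logic here are invented for illustration and are not the gas call internals):

```python
# Sketch: derive hierarchical cluster addresses from pairwise distances
# and a descending list of thresholds (cf. --gm_thresholds "10,5,0").
# Single-linkage via union-find; purely illustrative.

def clusters_at(dists, n, threshold):
    """Label samples by single-linkage connected components at a threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), d in dists.items():
        if d <= threshold:
            parent[find(i)] = find(j)  # merge components within threshold

    labels, out = {}, []
    for i in range(n):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)  # number clusters 1, 2, ...
        out.append(labels[root])
    return out

def addresses(dists, n, thresholds, delimiter="."):
    levels = [clusters_at(dists, n, t) for t in thresholds]
    return [delimiter.join(str(level[i]) for level in levels) for i in range(n)]

# Three samples: 0 and 1 are identical, 2 is 7 differences from both.
dists = {(0, 1): 0, (0, 2): 7, (1, 2): 7}
print(addresses(dists, 3, [10, 5, 0]))  # ['1.1.1', '1.1.1', '1.2.2']
```

A real implementation would also honour `--gm_method` (_single_, _average_, or _complete_ linkage); only single linkage is sketched here.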

## Other

Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).

# Running
@@ -39,51 +75,26 @@ An example of what the contents of the IRIDA Next JSON file look like for t
```
{
  "files": {
    "global": [],
    "samples": {
      "sampleF": [
        {
          "path": "input/sampleF_error_report.csv"
        }
      ]
    }
  },
  "metadata": {
    "samples": {
      "sampleQ": {
        "address": "1.1.3"
      }
    }
  }
}
```

Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "input/sampleF_error_report.csv"` refers to a file located within `outdir/input/sampleF_error_report.csv`. This file is generated only if a sample fails the input check during samplesheet assessment.
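As a sketch of how a downstream script might consume this report (the function name is invented; only the report layout follows the example above):

```python
# Sketch: read the gzipped IRIDA Next JSON report and resolve each
# per-sample file path against the results directory (outdir).
import gzip
import json
from pathlib import Path

def load_iridanext(outdir):
    report_path = Path(outdir) / "iridanext.output.json.gz"
    with gzip.open(report_path, "rt") as handle:  # text-mode gzip read
        report = json.load(handle)
    # Resolve every relative sample path against outdir
    sample_files = {
        sample: [Path(outdir) / entry["path"] for entry in entries]
        for sample, entries in report["files"]["samples"].items()
    }
    return report, sample_files
```

Given the example report above, `sample_files["sampleF"]` would contain the resolved path `outdir/input/sampleF_error_report.csv`.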

## Test profile

@@ -95,7 +106,7 @@ nextflow run phac-nml/gasnomenclature -profile docker,test -r main -latest --out

# Legal

Copyright 2023 Government of Canada
Copyright 2024 Government of Canada

Licensed under the MIT License (the "License"); you may not use
this work except in compliance with the License. You may obtain a copy of the
4 changes: 2 additions & 2 deletions assets/schema_input.json
@@ -17,8 +17,8 @@
"mlst_alleles": {
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.mlst\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json' or '.mlst.json.gz'"
"pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or '.mlst.subtyping.json.gz'"
},
"address": {
"type": "string",
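The updated filename pattern above can be checked directly; a quick sketch (the filenames are made-up examples):

```python
# Sketch: check filenames against the mlst_alleles pattern from
# assets/schema_input.json.
import re

MLST_PATTERN = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")

for name in [
    "sampleA.mlst.json",               # valid
    "sampleQ.mlst.subtyping.json.gz",  # valid (new subtyping variant)
    "sampleF.mlst.txt",                # wrong extension
    "bad name.mlst.json",              # contains a space
]:
    print(name, "->", bool(MLST_PATTERN.match(name)))
```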
69 changes: 45 additions & 24 deletions docs/output.md
@@ -6,72 +6,93 @@ This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

- assembly: very small mock assembly files for each sample
- generate: intermediate files used in generating the IRIDA Next JSON output
- pipeline_info: information about the pipeline's execution
- simplify: simplified intermediate files used in generating the IRIDA Next JSON output
- summary: summary report about the pipeline's execution and results
- call: The cluster addresses from the [genomic_address_service](https://github.com/phac-nml/genomic_address_service).
- cluster: The cluster file required by GAS_call.
- distances: Distances between genomes from [profile_dists](https://github.com/phac-nml/profile_dists).
- filter: The cluster addresses from only the query samples.
- input: An error report that is only generated when sample IDs and MLST JSON files do not match.
- locidex: The merged MLST JSON files for reference and query samples.
- pipeline_info: Information about the pipeline's execution.

The IRIDA Next-compliant JSON output file will be named `iridanext.output.json.gz` and will be written to the top-level of the results directory. This file is compressed using GZIP and conforms to the [IRIDA Next JSON output specifications](https://github.com/phac-nml/pipeline-standards#42-irida-next-json).

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Assembly stub](#assembly-stub) - Performs a stub assembly by generating a mock assembly
- [Generate sample JSON](#generate-sample-json) - Generates a JSON file for each sample
- [Generate summary](#generate-summary) - Generates a summary text file describing the samples and assemblies
- [Simplify IRIDA JSON](#simplify-irida-json) - Simplifies the sample JSONs by limiting nesting depth
- [Input check](#input-check) - Performs a validation check on the samplesheet inputs to ensure that the sampleID precisely matches the MLST JSON key.
- [Locidex merge](#locidex-merge) - Merges MLST profile JSON files into a single profiles file for reference and query samples.
- [Profile dists](#profile-dists) - Computes pairwise distances between genomes using MLST allele differences.
- [Cluster file](#cluster-file) - Generates the expected_clusters.txt file from reference sample addresses for use in GAS_call.
- [GAS call](#gas-call) - Generates hierarchical cluster addresses.
- [Filter query](#filter-query) - Filters and generates a csv file containing only the cluster addresses for query samples.
- [IRIDA Next Output](#irida-next-output) - Generates a JSON output file that is compliant with IRIDA Next.
- [Pipeline information](#pipeline-information) - Reports metrics generated during the workflow execution.

### Assembly stub
### Input Check

<details markdown="1">
<summary>Output files</summary>

- `assembly/`
- Mock assembly files: `ID.assembly.fa.gz`
- `input/`
- `sampleID_error_report.csv`

</details>

### Generate sample JSON
### Locidex merge

<details markdown="1">
<summary>Output files</summary>

- `generate/`
- JSON files: `ID.json.gz`
- `locidex/merge/`
- reference samples: `reference/merged_ref/merged_profiles_ref.tsv`
- query samples: `query/merged_query/merged_profiles_query.tsv`

</details>

### Generate summary
### Profile Dists

<details markdown="1">
<summary>Output files</summary>

- `summary/`
- Text summary describing samples and assemblies: `summary.txt.gz`
- `distances/`
- Mapping allele identifiers to integers: `allele_map.json`
- The query MLST profiles: `query_profile.text`
- The reference MLST profiles: `ref_profile.text`
- The computed distances based on MLST allele differences: `results.text`
- Information on the profile_dists run: `run.json`

</details>

### Simplify IRIDA JSON
### Cluster File

<details markdown="1">
<summary>Output files</summary>

- `simplify/`
- Simplified JSON files: `ID.simple.json.gz`
- `cluster/`
- `expected_clusters.txt`

</details>

### IRIDA Next Output
### GAS call

<details markdown="1">
<summary>Output files</summary>

- `/`
- IRIDA Next-compliant JSON output: `iridanext.output.json.gz`
- `call/`
- The computed cluster addresses: `clusters.text`
- Information on the GAS mcluster run: `run.json`
- Thresholds used to compute cluster addresses: `thresholds.json`

</details>

### Filter Query

<details markdown="1">
<summary>Output files</summary>

- `filter/`
- `new_addresses.csv`

</details>

24 changes: 12 additions & 12 deletions docs/usage.md
@@ -2,7 +2,7 @@

## Introduction

This pipeline is an example that illustrates running a nf-core-compliant pipeline on IRIDA Next.
This workflow takes JSON-formatted MLST allelic profiles as input and assigns cluster addresses to samples based on existing cluster designations. It is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

## Samplesheet input

@@ -14,22 +14,22 @@ You will need to create a samplesheet with information about the samples you wou

### Full samplesheet

The input samplesheet must contain three columns: `ID`, `fastq_1`, `fastq_2`. The IDs within a samplesheet should be unique. All other columns will be ignored.
The input samplesheet must contain three columns: `sample`, `mlst_alleles`, `address`. The sample names within a samplesheet should be unique. All other columns will be ignored.

A final samplesheet file consisting of both single- and paired-end data may look something like the one below.
A final samplesheet file consisting of mlst_alleles and addresses may look something like the one below:

```csv title="samplesheet.csv"
sample,fastq_1,fastq_2
SAMPLE1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
SAMPLE2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
SAMPLE3,sample1_R1.fastq.gz,
sample,mlst_alleles,address
sampleA,sampleA.mlst.json.gz,1.1.1
sampleQ,sampleQ.mlst.json.gz,2.2.2
sampleF,sampleF.mlst.json,
```

| Column | Description |
| --------- | -------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| Column | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `mlst_alleles` | Full path to an MLST JSON file describing the loci/alleles for the sample against some MLST scheme. A way to generate this file is via [locidex](https://github.com/phac-nml/locidex). File can optionally be gzipped and must have the extension ".mlst.json", ".mlst.subtyping.json" (or with an additional ".gz" if gzipped). |
| `address` | Hierarchical clustering address. If left empty for a sample, the pipeline will perform de novo clustering based on the provided cluster designations and thresholds. |
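The column rules above can be sketched as a small validation routine (illustrative only — the pipeline's real validation is performed by nf-validation against the schema in assets/schema_input.json, and this helper is invented):

```python
# Sketch of samplesheet validation: unique sample names, a correctly
# named MLST JSON file, and an optional address column.
import csv
import io
import re

MLST_RE = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")

def validate(samplesheet_text):
    errors = []
    seen = set()
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        if row["sample"] in seen:
            errors.append(f"duplicate sample: {row['sample']}")
        seen.add(row["sample"])
        if not MLST_RE.match(row["mlst_alleles"]):
            errors.append(f"bad mlst_alleles: {row['mlst_alleles']}")
        # an empty address is allowed: such samples are clustered de novo
    return errors

sheet = """sample,mlst_alleles,address
sampleA,sampleA.mlst.json.gz,1.1.1
sampleQ,sampleQ.mlst.json.gz,2.2.2
sampleF,sampleF.mlst.json,
"""
print(validate(sheet))  # []
```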

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

2 changes: 1 addition & 1 deletion modules/local/gas/call/main.nf
@@ -20,12 +20,12 @@
path "versions.yml", emit: versions

script:
// Need to add more args for gas call below
prefix = "Called"
"""
gas call --dists $distances \\
--rclusters $reference_clusters \\
--outdir ${prefix} \\
--method ${params.gm_method} \\
--threshold ${params.gm_thresholds} \\
--delimeter ${params.gm_delimiter}

3 changes: 0 additions & 3 deletions modules/local/profile_dists/main.nf
@@ -32,9 +32,6 @@
if(columns){
args = args + " --columns $columns"
}
if(params.pd_force){
args = args + " --force"
}
if(params.pd_skip){
args = args + " --skip"
}
11 changes: 3 additions & 8 deletions nextflow.config
@@ -11,9 +11,6 @@ params {

// Input options
input = null
project_name = 'assembly'
assembler = 'stub'
random_seed = 1

// Boilerplate options
outdir = null
@@ -51,19 +48,17 @@ params {
pd_distm = "hamming"
pd_missing_threshold = 1.0
pd_sample_quality_threshold = 1.0
pd_match_threshold = -1.0
pd_file_type = "text"
pd_mapping_file = null // default is no file
pd_force = false
pd_skip = false
pd_columns = null
pd_count_missing = true
pd_count_missing = false


// GAS Call
gm_thresholds = "10,5,0"
gm_delimiter = "'.'" // note the single quotes surrounding the delimiter
ref_clusters = ""
gm_method = "average"
gm_delimiter = "."

}
