Update parameters #12

Merged
merged 12 commits into from
Jun 11, 2024
85 changes: 48 additions & 37 deletions README.md
@@ -1,23 +1,59 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A523.04.3-brightgreen.svg)](https://www.nextflow.io/)

# Example Pipeline for IRIDA Next
# Genomic Address Service Nomenclature Workflow

This is an example pipeline to be used for integration with IRIDA Next.
This workflow takes JSON-formatted MLST allelic profiles as input and assigns cluster addresses to samples based on existing cluster designations. It is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

A brief overview of the usage of this pipeline is given below. Detailed documentation can be found in the [docs/](docs/) directory.

# Input

The input to the pipeline is a standard sample sheet (passed as `--input samplesheet.csv`) that looks like:

| sample | fastq_1 | fastq_2 |
| ------- | --------------- | --------------- |
| SampleA | file_1.fastq.gz | file_2.fastq.gz |
| sample | mlst_alleles | address |
| ------- | ----------------- | ------- |
| sampleA | sampleA.mlst.json | 1.1.1 |
| sampleQ | sampleQ.mlst.json | |
| sampleF | sampleF.mlst.json | |

The structure of this file is defined in [assets/schema_input.json](assets/schema_input.json). Validation of the sample sheet is performed by [nf-validation](https://nextflow-io.github.io/nf-validation/).

Details on the columns can be found in the [Full samplesheet](docs/usage.md#full-samplesheet) documentation.

# Parameters

The main parameters are `--input` as defined above and `--outdir` for specifying the output results directory. You may wish to provide `-profile singularity` to specify the use of Singularity containers and `-r [branch]` to specify which GitHub branch you would like to run.

## Profile dists

The following can be used to adjust parameters for the [profile_dists][] tool.

- `--pd_outfmt`: The output format for distances. For this pipeline the only valid value is _pairwise_ (required by [gas call][]).
- `--pd_distm`: The distance method/unit, either _hamming_ or _scaled_. For _hamming_ distances, the distance values will be a non-negative integer. For _scaled_ distances, the distance values are between 0 and 1.
- `--pd_missing_threshold`: The maximum proportion of missing data per locus for a locus to be kept in the analysis. Values from 0 to 1.
- `--pd_sample_quality_threshold`: The maximum proportion of missing data per sample for a sample to be kept in the analysis. Values from 0 to 1.
- `--pd_file_type`: Output format file type. One of _text_ or _parquet_.
- `--pd_mapping_file`: A file used to map allele codes to integers for internal distance calculations. This is the same file as produced from the _profile dists_ step (the [allele_map.json](docs/output.md#profile-dists) file). Normally, this is unneeded unless you wish to override the automated process of mapping alleles to integers.
- `--pd_skip`: Skip QA/QC steps. Can be used as a flag, `--pd_skip`, or passing a boolean, `--pd_skip true` or `--pd_skip false`.
- `--pd_columns`: Defines the loci to keep within the analysis (default when unset is to keep all loci). Formatted as a single column file with one locus name per line. For example:
- **Single column format**
```
loci1
loci2
loci3
```
- `--pd_count_missing`: Count missing alleles as different. Can be used as a flag, `--pd_count_missing`, or passing a boolean, `--pd_count_missing true` or `--pd_count_missing false`. If true, will consider missing allele calls for the same locus between samples as a difference, increasing the distance counts.
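As a toy illustration of the two distance units and the effect of `--pd_count_missing` (a sketch only — the real profile_dists implementation may treat missing data differently, and the helper function and profiles below are invented):

```python
# Sketch: hamming vs. scaled distances between two allele profiles,
# with and without counting missing calls ("0") as differences.
# Illustrative only; not the actual profile_dists logic.

def profile_distance(a, b, distm="hamming", count_missing=False):
    diffs = 0
    compared = 0
    for allele_a, allele_b in zip(a, b):
        if ("0" in (allele_a, allele_b)) and not count_missing:
            continue  # drop loci with missing data from the comparison
        compared += 1
        if allele_a != allele_b:
            diffs += 1
    if distm == "hamming":
        return diffs  # non-negative integer
    return diffs / compared if compared else 0.0  # scaled: between 0 and 1

profile_a = ["1", "2", "3", "0"]  # "0" marks a missing allele call
profile_b = ["1", "2", "4", "5"]

print(profile_distance(profile_a, profile_b))                      # 1
print(profile_distance(profile_a, profile_b, count_missing=True))  # 2
print(profile_distance(profile_a, profile_b, distm="scaled"))      # 1/3
```

With `count_missing` enabled, the missing call at the fourth locus is treated as a difference, increasing the distance from 1 to 2.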

## GAS CALL

The following can be used to adjust parameters for the [gas call][] tool.

- `--gm_thresholds`: Thresholds delimited by `,`. Values should match units from `--pd_distm` (either _hamming_ or _scaled_).
- `--gm_method`: The linkage method to use for clustering. Value should be one of _single_, _average_, or _complete_.
- `--gm_delimiter`: Delimiter desired for nomenclature code. Must be alphanumeric or one of `._-`.
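As a rough sketch of how a set of thresholds could translate into a multi-level nomenclature code joined by the delimiter (single linkage only; the distances, function names, and logic here are invented for illustration and are not the gas call internals):

```python
# Sketch: derive hierarchical cluster addresses from pairwise distances
# and a descending list of thresholds (cf. --gm_thresholds "10,5,0").
# Single-linkage via union-find; purely illustrative.

def clusters_at(dists, n, threshold):
    """Label samples by single-linkage connected components at a threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), d in dists.items():
        if d <= threshold:
            parent[find(i)] = find(j)  # merge components within threshold

    labels, out = {}, []
    for i in range(n):
        root = find(i)
        labels.setdefault(root, len(labels) + 1)  # number clusters 1, 2, ...
        out.append(labels[root])
    return out

def addresses(dists, n, thresholds, delimiter="."):
    levels = [clusters_at(dists, n, t) for t in thresholds]
    return [delimiter.join(str(level[i]) for level in levels) for i in range(n)]

# Three samples: 0 and 1 are identical, 2 is 7 differences from both.
dists = {(0, 1): 0, (0, 2): 7, (1, 2): 7}
print(addresses(dists, 3, [10, 5, 0]))  # ['1.1.1', '1.1.1', '1.2.2']
```

A real implementation would also honour `--gm_method` (_single_, _average_, or _complete_ linkage); only single linkage is sketched here.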

## Other

Other parameters (defaults from nf-core) are defined in [nextflow_schema.json](nextflow_schema.json).

# Running
@@ -39,51 +75,26 @@ An example of what the contents of the IRIDA Next JSON file look like for t
```
{
  "files": {
    "global": [],
    "samples": {
      "sampleF": [
        {
          "path": "input/sampleF_error_report.csv"
        }
      ]
    }
  },
  "metadata": {
    "samples": {
      "sampleQ": {
        "address": "1.1.3"
      }
    }
  }
}
```

Within the `files` section of this JSON file, all of the output paths are relative to the `outdir`. Therefore, `"path": "input/sampleF_error_report.csv"` refers to a file located within `outdir/input/sampleF_error_report.csv`. This file is generated only if a sample fails the input check during samplesheet assessment.
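As a sketch of how a downstream script might consume this report (the function name is invented; only the report layout follows the example above):

```python
# Sketch: read the gzipped IRIDA Next JSON report and resolve each
# per-sample file path against the results directory (outdir).
import gzip
import json
from pathlib import Path

def load_iridanext(outdir):
    report_path = Path(outdir) / "iridanext.output.json.gz"
    with gzip.open(report_path, "rt") as handle:  # text-mode gzip read
        report = json.load(handle)
    # Resolve every relative sample path against outdir
    sample_files = {
        sample: [Path(outdir) / entry["path"] for entry in entries]
        for sample, entries in report["files"]["samples"].items()
    }
    return report, sample_files
```

Given the example report above, `sample_files["sampleF"]` would contain the resolved path `outdir/input/sampleF_error_report.csv`.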

## Test profile

@@ -95,7 +106,7 @@ nextflow run phac-nml/gasnomenclature -profile docker,test -r main -latest --out

# Legal

Copyright 2023 Government of Canada
Copyright 2024 Government of Canada

Licensed under the MIT License (the "License"); you may not use
this work except in compliance with the License. You may obtain a copy of the
4 changes: 2 additions & 2 deletions assets/schema_input.json
@@ -17,8 +17,8 @@
"mlst_alleles": {
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.mlst\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json' or '.mlst.json.gz'"
"pattern": "^\\S+\\.mlst(\\.subtyping)?\\.json(\\.gz)?$",
"errorMessage": "MLST JSON file from locidex report, cannot contain spaces and must have the extension: '.mlst.json', '.mlst.json.gz', '.mlst.subtyping.json', or '.mlst.subtyping.json.gz'"
},
"address": {
"type": "string",
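The updated filename pattern above can be checked directly; a quick sketch (the filenames are made-up examples):

```python
# Sketch: check filenames against the mlst_alleles pattern from
# assets/schema_input.json.
import re

MLST_PATTERN = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")

for name in [
    "sampleA.mlst.json",               # valid
    "sampleQ.mlst.subtyping.json.gz",  # valid (new subtyping variant)
    "sampleF.mlst.txt",                # wrong extension
    "bad name.mlst.json",              # contains a space
]:
    print(name, "->", bool(MLST_PATTERN.match(name)))
```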
69 changes: 45 additions & 24 deletions docs/output.md
@@ -6,72 +6,93 @@ This document describes the output produced by the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

- assembly: very small mock assembly files for each sample
- generate: intermediate files used in generating the IRIDA Next JSON output
- pipeline_info: information about the pipeline's execution
- simplify: simplified intermediate files used in generating the IRIDA Next JSON output
- summary: summary report about the pipeline's execution and results
- call: The cluster addresses from the [genomic_address_service](https://github.com/phac-nml/genomic_address_service).
- cluster: The cluster file required by GAS_call.
- distances: Distances between genomes from [profile_dists](https://github.com/phac-nml/profile_dists).
- filter: The cluster addresses from only the query samples.
- input: An error report that is only generated when sample IDs and MLST JSON files do not match.
- locidex: The merged MLST JSON files for reference and query samples.
- pipeline_info: Information about the pipeline's execution.

The IRIDA Next-compliant JSON output file will be named `iridanext.output.json.gz` and will be written to the top-level of the results directory. This file is compressed using GZIP and conforms to the [IRIDA Next JSON output specifications](https://github.com/phac-nml/pipeline-standards#42-irida-next-json).

## Pipeline overview

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [Assembly stub](#assembly-stub) - Performs a stub assembly by generating a mock assembly
- [Generate sample JSON](#generate-sample-json) - Generates a JSON file for each sample
- [Generate summary](#generate-summary) - Generates a summary text file describing the samples and assemblies
- [Simplify IRIDA JSON](#simplify-irida-json) - Simplifies the sample JSONs by limiting nesting depth
- [Input check](#input-check) - Performs a validation check on the samplesheet inputs to ensure that the sampleID precisely matches the MLST JSON key.
- [Locidex merge](#locidex-merge) - Merges MLST profile JSON files into a single profiles file for reference and query samples.
- [Profile dists](#profile-dists) - Computes pairwise distances between genomes using MLST allele differences.
- [Cluster file](#cluster-file) - Generates the expected_clusters.txt file from reference sample addresses for use in GAS_call.
- [GAS call](#gas-call) - Generates hierarchical cluster addresses.
- [Filter query](#filter-query) - Filters and generates a csv file containing only the cluster addresses for query samples.
- [IRIDA Next Output](#irida-next-output) - Generates a JSON output file that is compliant with IRIDA Next.
- [Pipeline information](#pipeline-information) - Reports metrics generated during the workflow execution.

### Assembly stub
### Input Check

<details markdown="1">
<summary>Output files</summary>

- `assembly/`
- Mock assembly files: `ID.assembly.fa.gz`
- `input/`
- `sampleID_error_report.csv`

</details>

### Generate sample JSON
### Locidex merge

<details markdown="1">
<summary>Output files</summary>

- `generate/`
- JSON files: `ID.json.gz`
- `locidex/merge/`
- reference samples: `reference/merged_ref/merged_profiles_ref.tsv`
- query samples: `query/merged_query/merged_profiles_query.tsv`

</details>

### Generate summary
### Profile Dists

<details markdown="1">
<summary>Output files</summary>

- `summary/`
- Text summary describing samples and assemblies: `summary.txt.gz`
- `distances/`
- Mapping allele identifiers to integers: `allele_map.json`
- The query MLST profiles: `query_profile.text`
- The reference MLST profiles: `ref_profile.text`
- The computed distances based on MLST allele differences: `results.text`
- Information on the profile_dists run: `run.json`

</details>

### Simplify IRIDA JSON
### Cluster File

<details markdown="1">
<summary>Output files</summary>

- `simplify/`
- Simplified JSON files: `ID.simple.json.gz`
- `cluster/`
- `expected_clusters.txt`

</details>

### IRIDA Next Output
### GAS call

<details markdown="1">
<summary>Output files</summary>

- `/`
- IRIDA Next-compliant JSON output: `iridanext.output.json.gz`
- `call/`
- The computed cluster addresses: `clusters.text`
- Information on the GAS mcluster run: `run.json`
- Thresholds used to compute cluster addresses: `thresholds.json`

</details>

### Filter Query

<details markdown="1">
<summary>Output files</summary>

- `filter/`
- `new_addresses.csv`

</details>

24 changes: 12 additions & 12 deletions docs/usage.md
@@ -2,7 +2,7 @@

## Introduction

This pipeline is an example that illustrates running a nf-core-compliant pipeline on IRIDA Next.
This workflow takes JSON-formatted MLST allelic profiles as input and assigns cluster addresses to samples based on existing cluster designations. It is designed to be integrated into IRIDA Next, but it may also be run as a stand-alone pipeline.

## Samplesheet input

@@ -14,22 +14,22 @@ You will need to create a samplesheet with information about the samples you wou

### Full samplesheet

The input samplesheet must contain three columns: `ID`, `fastq_1`, `fastq_2`. The IDs within a samplesheet should be unique. All other columns will be ignored.
The input samplesheet must contain three columns: `sample`, `mlst_alleles`, `address`. The sample names within a samplesheet should be unique. All other columns will be ignored.

A final samplesheet file consisting of both single- and paired-end data may look something like the one below.
A final samplesheet file consisting of mlst_alleles and addresses may look something like the one below:

```csv title="samplesheet.csv"
sample,fastq_1,fastq_2
SAMPLE1,sample1_R1.fastq.gz,sample1_R2.fastq.gz
SAMPLE2,sample2_R1.fastq.gz,sample2_R2.fastq.gz
SAMPLE3,sample1_R1.fastq.gz,
sample,mlst_alleles,address
sampleA,sampleA.mlst.json.gz,1.1.1
sampleQ,sampleQ.mlst.json.gz,2.2.2
sampleF,sampleF.mlst.json,
```

| Column | Description |
| --------- | -------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `fastq_1` | Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| `fastq_2` | Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz". |
| Column | Description |
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sample` | Custom sample name. Samples should be unique within a samplesheet. |
| `mlst_alleles` | Full path to an MLST JSON file describing the loci/alleles for the sample against some MLST scheme. A way to generate this file is via [locidex](https://github.com/phac-nml/locidex). File can optionally be gzipped and must have the extension ".mlst.json", ".mlst.subtyping.json" (or with an additional ".gz" if gzipped). |
| `address` | Hierarchical clustering address. If left empty for a sample, the pipeline will perform de novo clustering based on the provided cluster designations and thresholds. |
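The column rules above can be sketched as a small validation routine (illustrative only — the pipeline's real validation is performed by nf-validation against the schema in assets/schema_input.json, and this helper is invented):

```python
# Sketch of samplesheet validation: unique sample names, a correctly
# named MLST JSON file, and an optional address column.
import csv
import io
import re

MLST_RE = re.compile(r"^\S+\.mlst(\.subtyping)?\.json(\.gz)?$")

def validate(samplesheet_text):
    errors = []
    seen = set()
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        if row["sample"] in seen:
            errors.append(f"duplicate sample: {row['sample']}")
        seen.add(row["sample"])
        if not MLST_RE.match(row["mlst_alleles"]):
            errors.append(f"bad mlst_alleles: {row['mlst_alleles']}")
        # an empty address is allowed: such samples are clustered de novo
    return errors

sheet = """sample,mlst_alleles,address
sampleA,sampleA.mlst.json.gz,1.1.1
sampleQ,sampleQ.mlst.json.gz,2.2.2
sampleF,sampleF.mlst.json,
"""
print(validate(sheet))  # []
```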

An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline.

2 changes: 1 addition & 1 deletion modules/local/gas/call/main.nf
@@ -20,12 +20,12 @@
path "versions.yml", emit: versions

script:
// Need to add more args for gas call below
prefix = "Called"
"""
gas call --dists $distances \\
--rclusters $reference_clusters \\
--outdir ${prefix} \\
--method ${params.gm_method} \\
--threshold ${params.gm_thresholds} \\
--delimeter ${params.gm_delimiter}

3 changes: 0 additions & 3 deletions modules/local/profile_dists/main.nf
@@ -32,9 +32,6 @@
if(columns){
args = args + " --columns $columns"
}
if(params.pd_force){
args = args + " --force"
}
if(params.pd_skip){
args = args + " --skip"
}
11 changes: 3 additions & 8 deletions nextflow.config
@@ -11,9 +11,6 @@ params {

// Input options
input = null
project_name = 'assembly'
assembler = 'stub'
random_seed = 1

// Boilerplate options
outdir = null
@@ -51,19 +48,17 @@ params {
pd_distm = "hamming"
pd_missing_threshold = 1.0
pd_sample_quality_threshold = 1.0
pd_match_threshold = -1.0
pd_file_type = "text"
pd_mapping_file = null // default is no file
pd_force = false
pd_skip = false
pd_columns = null
pd_count_missing = true
pd_count_missing = false


// GAS Call
gm_thresholds = "10,5,0"
gm_delimiter = "'.'" // note the single quotes surrounding the delimiter
ref_clusters = ""
gm_method = "average"
gm_delimiter = "."

}
