From fb1f2c43c8a598a80ca83d54993e6d0b850f2f52 Mon Sep 17 00:00:00 2001 From: Logan Blair Date: Tue, 23 Jul 2024 11:11:06 -0700 Subject: [PATCH] Tutorial updates --- docs/search.json | 4 ++-- docs/tutorial.html | 11 ++++++----- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/search.json b/docs/search.json index bc3944c..5ca8330 100644 --- a/docs/search.json +++ b/docs/search.json @@ -239,7 +239,7 @@ "href": "tutorial.html#example-1-standard-run", "title": "Tutorial", "section": "Example 1: Standard Run", - "text": "Example 1: Standard Run\nThis example uses sequencing reads from an 2022 outbreak of Xanthomonas hortorum across several plant nurseries. Using whole-genome sequencing, researchers determined a shared genetic basis between strains at different locations. With this information, they traced the origin of the outbreak to a single supplier that sold infected cuttings. You can read more about the study here. \nWe’ll be treating the pathogen as an unknown and using the pathogensurveillance pipeline to determine what we know already (that these samples come from Xanthomonas hortorum). We’ll also see the high degree of shared DNA sequence between samples, which is seen from several plots that the pathogensurveillance pipeline generates automatically. \n\nSample input\nThe pipeline is designed to work with a wide variety of existing metadata sheets without extensive changes. Here’s a look at “xanthomonas.csv”, which serves as the only unique input file within the command to run the pipeline:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\npath_1\npath_2\nsequence_type\nreference\nreference_id\nreport_group\ncolor_by\ndate_isolated\ndate_received\nyear\nhost\ncv_key\nnursery\nX\nX.1\n\n\n\n\n22-299\ntest/data/reads/22-299_R1.fastq.gz\ntest/data/reads/22-299_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/29/22\n2022\nPelargonium x hortorum\nCV-1\nMD\n\n\n22-300\ntest/data/reads/22-300_R1.fastq.gz\ntest/data/reads/22-300_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/30/22\n2022\nPelargonium x hortorum\nCV-2\nMD\n\n\n22-301\ntest/data/reads/22-301_R1.fastq.gz\ntest/data/reads/22-301_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/31/22\n2022\nPelargonium x hortorum\nCV-3\nMD\n\n\n22-302\ntest/data/reads/22-302_R1.fastq.gz\ntest/data/reads/22-302_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n4/1/22\n2022\nPelargonium x hortorum\nCV-4\nMD\n\n\n22-303\ntest/data/reads/22-303_R1.fastq.gz\ntest/data/reads/22-303_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n4/2/22\n2022\nPelargonium x hortorum\nCV-5\nMD\n\n\n22-304\ntest/data/reads/22-304_R1.fastq.gz\ntest/data/reads/22-304_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/7/22\n4/3/22\n2022\nPelargonium x hortorum\nCV-6\nMD\n\n\n\n\n\n\nThere is quite a bit of information in this file, but only a few columns are essential (and can be in any order). The input csv needs show the pipeline where to find the sequencing reads. These can either be present locally or they can be downloaded from the NCBI.\nUsing local reads: Columns “path_1” and “path_2” specify the path to forward and reverse reads. Each row corresponds to one individual sample. Reads for this tutorial are hosted on the pathogensurveilance github repo. . 
If your reads are single-ended, “path_2” should be left blank.\nShortread/Longread sequences*: Information in the column “sequencing_type” tells the pipeline these are derived from illumina shortreads. Other options for this column are “nanopore” and “pacbio”.\nDownloading reads: Sequence files may instead be hosted on the NCBI. In that case, the “shortread_1/shortread_2” columns should be substituted with a single “SRA” column, and they will be downloaded right after the pipeline checks the sample sheet. These downloads will show up in the folder path_surveil_data/reads. See test/data/metadata/xanthomonas.csv for an example using this input format.\nSpecifying a reference genome (optional): The “reference_refseq” column may be useful when you are relatively confident as to the identity of your samples and would like to include one particular reference for comparison. See Example 2 for an explanation of how to designate mandatory and optional references.\nAssigning sample groups (optional): The optional column “color_by” is used for data visualization. It will assign one or more columns to serve as grouping factors for the output report. Here, samples will be grouped by the values of the “year” and “nursery” columns. Note that multiple factors need to be separated by semicolons within the color_by column. \n\n\nRunning the pipeline\nHere is the full command used execute this example, using a docker container:\nnextflow run nf-core/pathogensurveillance --input https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --outdir xanthomonas --download_bakta_db true -profile docker -resume --max_cpus 8 --max_memory 30GB -resume\nWhen running your own analysis, you will need to provide your own path to the input CSV file.\nBy default, the pipeline will run on 128 GB of RAM and 16 threads. This is more resources than is strictly necessary and beyond the capacity of most desktop computers. We can scale this back a bit for this lightweight test run. This analysis will work with 8 cpus and 30 GB of RAM (albeit more slowly), which is specified by the –max_cpus and –max_memory settings.\nThe setting -resume is only necessary when resuming a previous analysis. However, it doesn’t hurt to include it at the start. If the pipeline is interrupted, this setting allows progress to pick up where it left off – as long as the previous command is executed from the same working directory.\nIf the pipeline begins successfully, you should see a screen tracking your progress:\n[25/63dcee] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (xanthomonas.csv)[100%] 1 of 1\n[- ] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP -\n[- ] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES -\n[- ] process > PATHOGENSURVEILLANCE:SEQKIT_SLIDING -\n[- ] process > PATHOGENSURVEILLANCE:FASTQC -\n[- ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH -\nThe input and output of each process can be accessed from the work/ directory. The subdirectory within work/ is designated by the string to left of each step. Note that this location will be different each time the pipeline is run, and only the first part of the name of the subdirectory is shown. For this run, we could navigate to work/25/63dcee(etc) to access the input csv that is used for the next step. 
\n\n\nReport\nYou should see a message similar to this if the pipeline finishes successfully:\n-[nf-core/plantpathsurveil] Pipeline completed successfully-\n\nTo clean the cache, enter the command: \nnextflow clean evil_boyd -f \n\nCompleted at: 20-May-2024 12:44:40\nDuration : 3h 29m 2s\nCPU hours : 15.2\nSucceeded : 253\nThe final report can be viewed as either a .pdf or .html file. It can be accessed inside the reports folder of the output directory (here: xanthomonas/reports). This report shows several key pieces of information about your samples.\nA note on storage management - pathogensurveillance creates a large number of intermediate files. For most users we recommend clearing these files after each run. To do so, run the script shown after the completion message (nextflow clean -f). You would not want to do this if: (1) You still need to use the caching system. For example, imagine you would like to compare a new sample to 10 samples from a previous run. In that case, some files could be reused to make the pipeline work more quickly. (2) You would like to use intermediate files for your own analysis. By default, these files are saved in the output directory as symlinks to their location in the work/ directory, so you would need to retrieve these before clearing the cache. You could use alternatively use the option –copymode high to save all intermediate files to the published directory, though in the short term this doubles the storage footprint of each run.\nThis particular report has been included as an example \n\nSummary:\n\nPipeline Status Report: error messages for samples or sample groups\nInput Data: Data read from the input .csv file\n\n\nIdentification:\n\nInitial identification: Coarse identification from the bbmap sendsketch step. The first tab shows best species ID for each sample. The second tab shows similarity metrics between sample sequences and other reference genomes: %ANI (average nucleotide identity), %WKID (weighted kmer identity), and %completeness.\n\nFor more information about each metric, click the About this table tab underneath.\n\n\n\n\nMost similar organisms: Shows relationships between samples and references using % ani and % pocp (percentage of conserved proteins). For better resolution, you can interactively zoom in/out of plots.\n\n\nCore gene phylogeny: A core gene phylogeny uses the sequences of all gene shared by all of the genomes included in the tree to infer evolutionary relationships. It is the most robust identification provided by this pipeline, but its precision is still limited by the availability of similar reference sequences. Methods to generate this tree differ between prokaryotes and eukaryotes. Our input to the pipeline was prokaryotic DNA sequences, and the method to build this tree is based upon many different core genes shared between samples and references (for eukaryotes, this is constrained to BUSCO genes). This tree is built with iqtree and based upon shared core genes analyzed using the program pirate. You can highlight branches by hovering over and clicking on nodes.\n\n\n\nSNP trees: This tree is better suited for visualizing the genetic diversity among samples. However, the core gene phylogeny provides a much better source of information for evolutionary differences among samples and other known references.\n\n\nMinimum spanning network\n\nMinimum spanning network: The nodes represent unique multilocus genotypes, and the size of nodes is proportional to the # number of samples that share the same genotype. 
The edges represent the SNP differences between two given genotypes, and the darker the color of the edges, the fewer SNP differences between the two.", +    "text": "Example 1: Standard Run\nThis example uses sequencing reads from a 2022 outbreak of Xanthomonas hortorum across several plant nurseries. Using whole-genome sequencing, researchers determined that strains at different locations shared a common genetic basis. With this information, they traced the origin of the outbreak to a single supplier that sold infected cuttings. You can read more about the study here. \nWe’ll treat the pathogen as an unknown and use the pathogensurveillance pipeline to confirm what we already know (that these samples come from Xanthomonas hortorum). We’ll also see the high degree of shared DNA sequence between samples, which is apparent in several plots that the pathogensurveillance pipeline generates automatically. \n\nSample input\nThe pipeline is designed to work with a wide variety of existing metadata sheets without extensive changes. Here’s a look at “xanthomonas.csv”, which serves as the only unique input file within the command to run the pipeline:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\npath_1\npath_2\nsequence_type\nreference\nreference_id\nreport_group\ncolor_by\ndate_isolated\ndate_received\nyear\nhost\ncv_key\nnursery\nX\nX.1\n\n\n\n\n22-299\ntest/data/reads/22-299_R1.fastq.gz\ntest/data/reads/22-299_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/29/22\n2022\nPelargonium x hortorum\nCV-1\nMD\n\n\n22-300\ntest/data/reads/22-300_R1.fastq.gz\ntest/data/reads/22-300_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/30/22\n2022\nPelargonium x hortorum\nCV-2\nMD\n\n\n22-301\ntest/data/reads/22-301_R1.fastq.gz\ntest/data/reads/22-301_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n3/31/22\n2022\nPelargonium x hortorum\nCV-3\nMD\n\n\n22-302\ntest/data/reads/22-302_R1.fastq.gz\ntest/data/reads/22-302_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n4/1/22\n2022\nPelargonium x hortorum\nCV-4\nMD\n\n\n22-303\ntest/data/reads/22-303_R1.fastq.gz\ntest/data/reads/22-303_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/2/22\n4/2/22\n2022\nPelargonium x hortorum\nCV-5\nMD\n\n\n22-304\ntest/data/reads/22-304_R1.fastq.gz\ntest/data/reads/22-304_R2.fastq.gz\nIllumina\n\n\nxan_test\nsubgroup\nyear\nnursery\n3/7/22\n4/3/22\n2022\nPelargonium x hortorum\nCV-6\nMD\n\n\n\n\n\n\nThere is quite a bit of information in this file, but only a few columns are essential (and they can be in any order). The input csv needs to show the pipeline where to find the sequencing reads. These can either be present locally or they can be downloaded from the NCBI.\nSample ID: The “sample_id” column is used to name your samples. This information will be used in graphs, so it is recommended to keep names short but informative. If you do not include this column, sample IDs will be generated from the names of your fastq files.\nUsing local reads: Columns “path_1” and “path_2” specify the path to forward and reverse reads. Each row corresponds to one individual sample. Reads for this tutorial are hosted on the pathogensurveillance GitHub repo. If your reads are single-ended, “path_2” should be left blank.\nShortread/longread sequences: Information in the column “sequencing_type” tells the pipeline that these are Illumina short reads. 
Other options for this column are “nanopore” and “pacbio”.\nDownloading reads: Sequence files may instead be hosted on the NCBI. In that case, the “shortread_1/shortread_2” columns should be substituted with a single “SRA” column, and the reads will be downloaded right after the pipeline checks the sample sheet. These downloads will show up in the folder path_surveil_data/reads. See test/data/metadata/xanthomonas.csv for an example using this input format.\nSpecifying a reference genome (optional): The “reference_refseq” column may be useful when you are relatively confident as to the identity of your samples and would like to include one particular reference for comparison. See Example 2 for an explanation of how to designate mandatory and optional references.\nAssigning sample groups (optional): The optional column “color_by” is used for data visualization. It will assign one or more columns to serve as grouping factors for the output report. Here, samples will be grouped by the values of the “year” and “nursery” columns. Note that multiple factors need to be separated by semicolons within the color_by column. \n\n\nRunning the pipeline\nHere is the full command used to execute this example, using a Docker container:\nnextflow run nf-core/pathogensurveillance --sample_data https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --out_dir xanthomonas --download_bakta_db true -profile docker -resume --max_cpus 8 --max_memory 30GB\nWhen running your own analysis, you will need to provide your own path to the input CSV file.\nBy default, the pipeline will run on 128 GB of RAM and 16 threads. This is more than is strictly necessary and beyond the capacity of most desktop computers, so we can scale it back a bit for this lightweight test run. This analysis will work with 8 CPUs and 30 GB of RAM (albeit more slowly), as specified by the --max_cpus and --max_memory settings.\nThe setting -resume is only necessary when resuming a previous analysis. However, it doesn’t hurt to include it at the start. If the pipeline is interrupted, this setting allows progress to pick up where it left off – as long as the previous command is executed from the same working directory.\nIf the pipeline begins successfully, you should see a screen tracking your progress:\n[25/63dcee] process > PATHOGENSURVEILLANCE:INPUT_CHECK:SAMPLESHEET_CHECK (xanthomonas.csv)[100%] 1 of 1\n[-        ] process > PATHOGENSURVEILLANCE:SRATOOLS_FASTERQDUMP -\n[-        ] process > PATHOGENSURVEILLANCE:DOWNLOAD_ASSEMBLIES -\n[-        ] process > PATHOGENSURVEILLANCE:SEQKIT_SLIDING -\n[-        ] process > PATHOGENSURVEILLANCE:FASTQC -\n[-        ] process > PATHOGENSURVEILLANCE:COARSE_SAMPLE_TAXONOMY:BBMAP_SENDSKETCH -\nThe input and output of each process can be accessed from the work/ directory. The subdirectory within work/ is designated by the string to the left of each step. Note that this location will be different each time the pipeline is run, and only the first part of the name of the subdirectory is shown. For this run, we could navigate to work/25/63dcee(etc) to access the input csv that is used for the next step. \n\n\nReport\nYou should see a message similar to this if the pipeline finishes successfully:\n-[nf-core/plantpathsurveil] Pipeline completed successfully-\n\nTo clean the cache, enter the command: \nnextflow clean evil_boyd -f \n\nCompleted at: 20-May-2024 12:44:40\nDuration : 3h 29m 2s\nCPU hours : 15.2\nSucceeded : 253\nThe final report can be viewed as either a .pdf or .html file. 
It can be accessed inside the reports folder of the output directory (here: xanthomonas/reports). This report shows several key pieces of information about your samples.\nA note on storage management: pathogensurveillance creates a large number of intermediate files. For most users we recommend clearing these files after each run. To do so, run the command shown after the completion message (nextflow clean -f). You would not want to do this if: (1) You still need to use the caching system. For example, imagine you would like to compare a new sample to 10 samples from a previous run. In that case, some files could be reused to make the pipeline run more quickly. (2) You would like to use intermediate files for your own analysis. By default, these files are saved in the output directory as symlinks to their location in the work/ directory, so you would need to retrieve these before clearing the cache. You could alternatively use the option --copymode high to save all intermediate files to the published directory, though in the short term this doubles the storage footprint of each run.\nThis particular report has been included as an example. \n\nSummary:\n\nPipeline Status Report: error messages for samples or sample groups\nInput Data: Data read from the input .csv file\n\n\nIdentification:\n\nInitial identification: Coarse identification from the bbmap sendsketch step. The first tab shows the best species ID for each sample. The second tab shows similarity metrics between sample sequences and other reference genomes: %ANI (average nucleotide identity), %WKID (weighted kmer identity), and %completeness.\n\nFor more information about each metric, click the About this table tab underneath.\n\n\n\n\nMost similar organisms: Shows relationships between samples and references using %ANI and %POCP (percentage of conserved proteins). For better resolution, you can interactively zoom in/out of plots.\n\n\nCore gene phylogeny: A core gene phylogeny uses the sequences of all genes shared by all of the genomes included in the tree to infer evolutionary relationships. It is the most robust identification provided by this pipeline, but its precision is still limited by the availability of similar reference sequences. Methods to generate this tree differ between prokaryotes and eukaryotes. Our input to the pipeline was prokaryotic DNA sequences, and the method to build this tree is based upon many different core genes shared between samples and references (for eukaryotes, this is constrained to BUSCO genes). This tree is built with IQ-TREE and based upon shared core genes analyzed using the program PIRATE.\n\n\n\nSNP trees: These trees are better suited for visualizing the genetic diversity among samples. However, the core gene phylogeny provides a much better source of information for evolutionary differences among samples and other known references.\n\n\nMinimum spanning network\n\nMinimum spanning network: The nodes represent unique multilocus genotypes, and the size of nodes is proportional to the number of samples that share the same genotype. 
The edges represent the SNP differences between two given genotypes, and the darker the color of the edges, the fewer SNP differences between the two.", "crumbs": [ "Tutorial" ] @@ -249,7 +249,7 @@ "href": "tutorial.html#example-2-defining-references", "title": "Tutorial", "section": "Example 2: Defining References", - "text": "Example 2: Defining References\nIf you know what your samples are already, you may want to tell the pipeline to use a “standard” reference genome instead of picking one that is more obscure (even if pathogensurveillance deems it to be a better fit). Other users may have a few different organisms of interest that they want to use as a points of comparison. For example, maybe there is a particularly nasty strain of V. cholerae that you want to see in relation to your other samples. There are a few options to select (or not select) reference genomes for these cases.\nPathogensurveillance uses two categories of reference genomes. Primary references are used for alignment and will always be displayed in phylogenetic trees. In contrast, contextual references are selected before the primary reference is known and they may or may not be used later. Some contextual references are chosen because they are really close matches to your samples, and these may be selected to become primary references. However, pathogensurveillance will select a few distantly related contextual references too. Some of these are used to “fill out” the phylogeny, and you may want a higher or lower number of contextual references depending on how you want your phylogenetic trees to look.\n\nChosing primary references\nTake this sample list containing three Mycobacterium abscessus samples and three Mycobacterium leprae samples:\n\n\n\n\n\n\nsample_id\nncbi.accession\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\n\n\nmycobacterium_abscessus2\nERR7253669\n\n\nmycobacterium_abscessus3\nERR7253671\n\n\nmycobacterium_leperae1\nSRR6241707\n\n\nmycobacterium_leperae2\nSRR6241708\n\n\nmycobacterium_leperae3\nSRR6241709\n\n\n\n\n\n\nTo force the pipeline to use the NCBI specified Mycobacterium abscessus reference genome for the three Mycobacterium abscessus samples, and likewise make the three Mycobacterium leprae samples use the NCBI specified Mycobacterium leprae genome, we need to tell pathogensurveillance both where to find these reference sequences and how to use them. We can either specify a local path to the reads, or this can instead be specified through the ref_ncbi_accession column. Here, how the references are used here is controlled by the ref_primary_usage column:\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_accession\nref_primary_usage\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus2\nERR7253669\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus3\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_leprae1\nSRR6241707\nGCF_003253775.1\nrequired\n\n\nmycobacterium_leprae2\nSRR6241708\nGCF_003253775.1\nrequired\n\n\nmycobacterium_leprae3\nSRR6241709\nGCF_003253775.1\nrequired\n\n\n\n\n\n\n\n\n\nSpecifying contextual references\nTaking the previous Mycobacterium abscessus/leprae example, imagine we would like to see the comparison between Mycobacterium abscessus and Mycobacterium tuberculosis. 
We can do this by including Mycobacterium tuberculosis as a mandatory contextual reference:\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_accession\nref_contextual_usage\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus2\nERR7253669\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus3\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_leprae1\nSRR6241707\n\n\n\n\nmycobacterium_leprae2\nSRR6241708\n\n\n\n\nmycobacterium_leprae3\nSRR6241709\n\n\n\n\n\n\n\n\n\n\n\nSelecting references from an NCBI query\nIt is also possible to submit a valid NCBI query to the pipeline with reference genomes selected from query hits. For example, you could test how your Mycobacterium leprae samples compared to a bunch of different other Mycobacterium leprae genomes:\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_query\nref_ncbi_query_max\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\n\nNA\n\n\nmycobacterium_abscessus2\nERR7253669\n\nNA\n\n\nmycobacterium_abscessus3\nERR7253671\n\nNA\n\n\nmycobacterium_leprae1\nSRR6241707\nmycobacterium leprae\n100\n\n\nmycobacterium_leprae2\nSRR6241708\nmycobacterium leprae\n100\n\n\nmycobacterium_leprae3\nSRR6241709\nmycobacterium leprae\n100\n\n\n\n\n\n\nSome things to keep in mind:\n\nDepending on your organism, this may a massive amount of data. Make sure you have queried NCBI beforehand to get a good handle on how many references you are downloading.\nThe optional parameter ref_ncbi_query_max is a good way of limiting this number when you are sampling from a densely populated clade, such as Mycobacterium leprae. This parameter can either be a set number (like shown here) or a percentage.\nThe NCBI API will fail if there are too many requests. See ncbi support for more detail.\n\n\n\n\nMultiple references per sample\nIf we would like to add multiple references per sample, we can enter this information through a separate reference csv. 
In this example, we specify one primary reference each for Mycobacterium abscessus and Mycobacterium leprae, then three additional contextual references for Mycobacterium leprae:\n\n\n\n\n\n\n\n\n\n\n\nref_group_ids\nref_path\nRef.primary.usage\nRef.contextual.Usage\n\n\n\n\nabscessus\ntest/data/refs/mycobacterium_abscessus_reference1.fna\nrequired\n\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference1.fna\nrequired\n\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference2.fna\n\noptional\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference3.fna\n\noptional\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference4.fna\n\noptional\n\n\n\n\n\n\nNote that the “ref_group_ids” column in the sample input csv needs to match the sample csv:\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_group_ids\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nabscessus\n\n\nmycobacterium_abscessus2\nERR7253669\nabscessus\n\n\nmycobacterium_abscessus3\nERR7253671\nabscessus\n\n\nmycobacterium_leprae1\nSRR6241707\nleprae\n\n\nmycobacterium_leprae2\nSRR6241708\nleprae\n\n\nmycobacterium_leprae3\nSRR6241709\nleprae\n\n\n\n\n\n\nThe path to this reference csv needs to be specified in the command to run the pipeline:\nnextflow run nf-core/pathogensurveillance --sample_inut mycobacterium_samples.csv --reference_input mycobacterium_references.csv --output_dir mycobacterium_test --download_bakta_db true -profile docker", +    "text": "Example 2: Defining References\nIf you already know what your samples are, you may want to tell the pipeline to use a “standard” reference genome instead of picking one that is more obscure – even if pathogensurveillance deems the alternative to be a better fit. Other users may have a few different organisms of interest that they want to use as points of comparison. For example, maybe there is a particularly nasty strain of V. cholerae that you want to see in relation to your other samples. There are a few options to select (or not select) reference genomes for these cases.\nPathogensurveillance uses two categories of reference genomes. Primary references are used for alignment and will always be displayed in phylogenetic trees. In contrast, contextual references are selected before the primary reference is known and they may or may not be used later. Some contextual references are chosen because they are very close matches to your samples, and these may be selected to become primary references. However, pathogensurveillance will select a few distantly related contextual references too. Some of these are used to “fill out” the phylogeny, and you may want a higher or lower number of contextual references depending on how you want your phylogenetic trees to look.\n\nChoosing primary references\nTake this sample list containing three Mycobacterium abscessus samples and three Mycobacterium leprae samples:\n\n\n\n\n\n\nsample_id\nncbi.accession\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\n\n\nmycobacterium_abscessus2\nERR7253669\n\n\nmycobacterium_abscessus3\nERR7253671\n\n\nmycobacterium_leprae1\nSRR6241707\n\n\nmycobacterium_leprae2\nSRR6241708\n\n\nmycobacterium_leprae3\nSRR6241709\n\n\n\n\n\n\nTo force the pipeline to use the NCBI-specified Mycobacterium abscessus reference genome for the three Mycobacterium abscessus samples, and likewise make the three Mycobacterium leprae samples use the NCBI-specified Mycobacterium leprae genome, we need to tell pathogensurveillance both where to find these reference sequences and how to use them. 
We can either specify a local path to the reference sequences, or they can instead be specified through the ref_ncbi_accession column. How the references are used is controlled by the ref_primary_usage column:\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_accession\nref_primary_usage\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus2\nERR7253669\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus3\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_leprae1\nSRR6241707\nGCF_003253775.1\nrequired\n\n\nmycobacterium_leprae2\nSRR6241708\nGCF_003253775.1\nrequired\n\n\nmycobacterium_leprae3\nSRR6241709\nGCF_003253775.1\nrequired\n\n\n\n\n\n\n\n\n\nSpecifying contextual references\nTaking the previous Mycobacterium abscessus/leprae example, imagine we would like to see how Mycobacterium abscessus compares with Mycobacterium tuberculosis. We can do this by including Mycobacterium tuberculosis as a mandatory contextual reference:\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_accession\nref_contextual_usage\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus2\nERR7253669\nGCF_001632805.1\nrequired\n\n\nmycobacterium_abscessus3\nERR7253671\nGCF_001632805.1\nrequired\n\n\nmycobacterium_leprae1\nSRR6241707\n\n\n\n\nmycobacterium_leprae2\nSRR6241708\n\n\n\n\nmycobacterium_leprae3\nSRR6241709\n\n\n\n\n\n\n\n\n\n\n\nSelecting references from an NCBI query\nIt is also possible to submit a valid NCBI query to the pipeline, with reference genomes selected from the query hits. For example, you could see how your Mycobacterium leprae samples compare to a range of other Mycobacterium leprae genomes:\n\n\n\n\n\n\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_ncbi_query\nref_ncbi_query_max\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\n\nNA\n\n\nmycobacterium_abscessus2\nERR7253669\n\nNA\n\n\nmycobacterium_abscessus3\nERR7253671\n\nNA\n\n\nmycobacterium_leprae1\nSRR6241707\nmycobacterium leprae\n100\n\n\nmycobacterium_leprae2\nSRR6241708\nmycobacterium leprae\n100\n\n\nmycobacterium_leprae3\nSRR6241709\nmycobacterium leprae\n100\n\n\n\n\n\n\nSome things to keep in mind:\n\nDepending on your organism, this may be a massive amount of data. Make sure you have queried NCBI beforehand to get a good handle on how many references you are downloading.\nThe optional parameter ref_ncbi_query_max is a good way of limiting this number when you are sampling from a densely populated clade, such as Mycobacterium leprae. This parameter can either be a set number (as shown here) or a percentage.\nThe NCBI API will fail if there are too many requests. See NCBI support for more detail.\n\n\n\n\nMultiple references per sample\nIf we would like to add multiple references per sample, we can enter this information through a separate reference csv. 
In this example, we specify one primary reference each for Mycobacterium abscessus and Mycobacterium leprae, then three additional contextual references for Mycobacterium leprae:\n\n\n\n\n\n\n\n\n\n\n\nref_group_ids\nref_path\nref_primary_usage\nref_contextual_usage\n\n\n\n\nabscessus\ntest/data/refs/mycobacterium_abscessus_reference1.fna\nrequired\n\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference1.fna\nrequired\n\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference2.fna\n\noptional\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference3.fna\n\noptional\n\n\nleprae\ntest/data/refs/mycobacterium_leprae_reference4.fna\n\noptional\n\n\n\n\n\n\nNote that the values in the “ref_group_ids” column of the sample input csv need to match the “ref_group_ids” values in the reference csv:\n\n\n\n\n\n\nsample_id\nncbi.accession\nref_group_ids\n\n\n\n\nmycobacterium_abscessus1\nERR7253671\nabscessus\n\n\nmycobacterium_abscessus2\nERR7253669\nabscessus\n\n\nmycobacterium_abscessus3\nERR7253671\nabscessus\n\n\nmycobacterium_leprae1\nSRR6241707\nleprae\n\n\nmycobacterium_leprae2\nSRR6241708\nleprae\n\n\nmycobacterium_leprae3\nSRR6241709\nleprae\n\n\n\n\n\n\nThe path to this reference csv needs to be specified in the command to run the pipeline:\nnextflow run nf-core/pathogensurveillance --sample_data mycobacterium_samples.csv --reference_input mycobacterium_references.csv --out_dir mycobacterium_test --download_bakta_db true -profile docker", "crumbs": [ "Tutorial" ] diff --git a/docs/tutorial.html b/docs/tutorial.html index 8738be3..ab02035 100644 --- a/docs/tutorial.html +++ b/docs/tutorial.html @@ -390,6 +390,7 @@

Sample input


There is quite a bit of information in this file, but only a few columns are essential (and can be in any order). The input csv needs show the pipeline where to find the sequencing reads. These can either be present locally or they can be downloaded from the NCBI.

+

Sample ID: The “sample_id” column is used to name your samples. This information will be used in graphs, so it is recommended to keep names short but informative. If you do not include this column, sample IDs will be generated from the names of your fastq files.

Using local reads: Columns “path_1” and “path_2” specify the path to forward and reverse reads. Each row corresponds to one individual sample. Reads for this tutorial are hosted on the pathogensurveilance github repo. . If your reads are single-ended, “path_2” should be left blank.

Shortread/Longread sequences*: Information in the column “sequencing_type” tells the pipeline these are derived from illumina shortreads. Other options for this column are “nanopore” and “pacbio”.

Downloading reads: Sequence files may instead be hosted on the NCBI. In that case, the “shortread_1/shortread_2” columns should be substituted with a single “SRA” column, and they will be downloaded right after the pipeline checks the sample sheet. These downloads will show up in the folder path_surveil_data/reads. See test/data/metadata/xanthomonas.csv for an example using this input format.
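To make the two input styles concrete, here is a minimal sketch of each kind of sample sheet, written from the shell. The column names follow the examples above, but the lowercase “sra” header, the report_group value, and the SRR accessions are illustrative placeholders rather than tested input, so check them against the pipeline documentation before running.
# Sketch: sample sheet with local paired-end reads (leave path_2 empty for single-ended reads).
cat > local_samples.csv <<'EOF'
sample_id,path_1,path_2,sequence_type,report_group
sample01,reads/sample01_R1.fastq.gz,reads/sample01_R2.fastq.gz,Illumina,demo
sample02,reads/sample02_R1.fastq.gz,,Illumina,demo
EOF
# Sketch: the same samples fetched from NCBI through a single SRA column instead of path columns.
cat > sra_samples.csv <<'EOF'
sample_id,sra,sequence_type,report_group
sample01,SRR0000001,Illumina,demo
sample02,SRR0000002,Illumina,demo
EOF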

@@ -399,7 +400,7 @@

Sample input

Running the pipeline

Here is the full command used execute this example, using a docker container:

-
nextflow run nf-core/pathogensurveillance --input https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --outdir xanthomonas --download_bakta_db true -profile docker -resume --max_cpus 8 --max_memory 30GB -resume
+
nextflow run nf-core/pathogensurveillance --sample_data https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --out_dir xanthomonas --download_bakta_db true -profile docker -resume --max_cpus 8 --max_memory 30GB

When running your own analysis, you will need to provide your own path to the input CSV file.

By default, the pipeline will run on 128 GB of RAM and 16 threads. This is more resources than is strictly necessary and beyond the capacity of most desktop computers. We can scale this back a bit for this lightweight test run. This analysis will work with 8 cpus and 30 GB of RAM (albeit more slowly), which is specified by the –max_cpus and –max_memory settings.
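If you would rather not repeat the resource limits on the command line, nf-core pipelines of this generation generally also accept them from a small custom config passed with Nextflow's -c option. A minimal sketch, assuming the standard nf-core max_cpus/max_memory parameters used above (the config file name is arbitrary):
# Write a config that caps the requested resources, then pass it with -c in
# place of the --max_cpus/--max_memory flags.
cat > resources.config <<'EOF'
params {
    max_cpus   = 8
    max_memory = '30.GB'
}
EOF
nextflow run nf-core/pathogensurveillance --sample_data https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --out_dir xanthomonas --download_bakta_db true -profile docker -resume -c resources.config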

The setting -resume is only necessary when resuming a previous analysis. However, it doesn’t hurt to include it at the start. If the pipeline is interrupted, this setting allows progress to pick up where it left off – as long as the previous command is executed from the same working directory.
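To illustrate resuming (a sketch only; the command is simply the one above re-issued): if a run is interrupted, change back into the original working directory and repeat the identical command with -resume, and Nextflow will pull already-completed tasks from its cache instead of recomputing them. The directory path below is a placeholder.
# Hypothetical example: the first invocation was interrupted partway through.
cd ~/analyses/xanthomonas_run   # must be the same working directory as the original run
nextflow run nf-core/pathogensurveillance --sample_data https://raw.githubusercontent.com/grunwaldlab/pathogensurveillance/master/test/data/metadata/xanthomonas.csv --out_dir xanthomonas --download_bakta_db true -profile docker --max_cpus 8 --max_memory 30GB -resume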

@@ -436,7 +437,7 @@

Report


Identification:

Example 2: Defining References

-

If you know what your samples are already, you may want to tell the pipeline to use a “standard” reference genome instead of picking one that is more obscure (even if pathogensurveillance deems it to be a better fit). Other users may have a few different organisms of interest that they want to use as a points of comparison. For example, maybe there is a particularly nasty strain of V. cholerae that you want to see in relation to your other samples. There are a few options to select (or not select) reference genomes for these cases.

+

If you already know what your samples are, you may want to tell the pipeline to use a “standard” reference genome instead of picking one that is more obscure – even if pathogensurveillance deems the alternative to be a better fit. Other users may have a few different organisms of interest that they want to use as points of comparison. For example, maybe there is a particularly nasty strain of V. cholerae that you want to see in relation to your other samples. There are a few options to select (or not select) reference genomes for these cases.

Pathogensurveillance uses two categories of reference genomes. Primary references are used for alignment and will always be displayed in phylogenetic trees. In contrast, contextual references are selected before the primary reference is known and they may or may not be used later. Some contextual references are chosen because they are really close matches to your samples, and these may be selected to become primary references. However, pathogensurveillance will select a few distantly related contextual references too. Some of these are used to “fill out” the phylogeny, and you may want a higher or lower number of contextual references depending on how you want your phylogenetic trees to look.
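As a sketch of how a primary reference is pinned in practice, the sample sheet can carry the reference columns directly. The accessions below are the ones used in the Mycobacterium tables later in this example; the accession column is written here as ncbi_accession even though the rendered tables display it as “ncbi.accession”, so confirm the exact header against the pipeline documentation. Contextual references are requested the same way through a ref_contextual_usage column, as shown in the next subsection.
# Sketch: pin one required primary reference per species (illustrative values).
cat > mycobacterium_samples.csv <<'EOF'
sample_id,ncbi_accession,ref_ncbi_accession,ref_primary_usage
mycobacterium_abscessus1,ERR7253671,GCF_001632805.1,required
mycobacterium_leprae1,SRR6241707,GCF_003253775.1,required
EOF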

Chosing primary references

@@ -811,7 +812,7 @@

Multiple re

The path to this reference csv needs to be specified in the command to run the pipeline:

-
nextflow run nf-core/pathogensurveillance --sample_inut mycobacterium_samples.csv --reference_input mycobacterium_references.csv --output_dir mycobacterium_test --download_bakta_db true -profile docker 
+
nextflow run nf-core/pathogensurveillance --sample_data mycobacterium_samples.csv --reference_input mycobacterium_references.csv --out_dir mycobacterium_test --download_bakta_db true -profile docker
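For completeness, here is a sketch of what the two CSV files named in that command could contain, written from the shell. The paths and values are taken from the tables in this example; the reference-csv headers are written in the ref_primary_usage/ref_contextual_usage style used elsewhere in the tutorial (the rendered table shows them in a slightly different form), so verify them against the pipeline documentation.
# Reference CSV: one required primary reference per group, plus optional
# contextual references for the leprae group.
cat > mycobacterium_references.csv <<'EOF'
ref_group_ids,ref_path,ref_primary_usage,ref_contextual_usage
abscessus,test/data/refs/mycobacterium_abscessus_reference1.fna,required,
leprae,test/data/refs/mycobacterium_leprae_reference1.fna,required,
leprae,test/data/refs/mycobacterium_leprae_reference2.fna,,optional
leprae,test/data/refs/mycobacterium_leprae_reference3.fna,,optional
leprae,test/data/refs/mycobacterium_leprae_reference4.fna,,optional
EOF
# Sample CSV: the ref_group_ids values link each sample to a group of
# references defined above (accession header shown as ncbi.accession in the
# rendered table; written here as ncbi_accession).
cat > mycobacterium_samples.csv <<'EOF'
sample_id,ncbi_accession,ref_group_ids
mycobacterium_abscessus1,ERR7253671,abscessus
mycobacterium_abscessus2,ERR7253669,abscessus
mycobacterium_leprae1,SRR6241707,leprae
EOF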