diff --git a/topics/assembly/images/vgp_assembly/busco_hap1hap2.svg b/topics/assembly/images/vgp_assembly/busco_hap1hap2.svg new file mode 100644 index 00000000000000..94ef4fb42aa08d --- /dev/null +++ b/topics/assembly/images/vgp_assembly/busco_hap1hap2.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/topics/assembly/images/vgp_assembly/busco_pri_alt_solo.svg b/topics/assembly/images/vgp_assembly/busco_pri_alt_solo.svg new file mode 100644 index 00000000000000..5700dcaccfbeb4 --- /dev/null +++ b/topics/assembly/images/vgp_assembly/busco_pri_alt_solo.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/topics/assembly/images/vgp_assembly/merqury_cn_plot.png b/topics/assembly/images/vgp_assembly/merqury_cn_plot.png index 1f8e1ff87c97f6..51f2a4b31b20e4 100644 Binary files a/topics/assembly/images/vgp_assembly/merqury_cn_plot.png and b/topics/assembly/images/vgp_assembly/merqury_cn_plot.png differ diff --git a/topics/assembly/images/vgp_assembly/merqury_prialt_asm_prepurge.png b/topics/assembly/images/vgp_assembly/merqury_prialt_asm_prepurge.png index 92f22b8d85cdb8..f9991805390696 100644 Binary files a/topics/assembly/images/vgp_assembly/merqury_prialt_asm_prepurge.png and b/topics/assembly/images/vgp_assembly/merqury_prialt_asm_prepurge.png differ diff --git a/topics/assembly/images/vgp_assembly/merqury_prialt_priCN_prepurge.png b/topics/assembly/images/vgp_assembly/merqury_prialt_priCN_prepurge.png index 3773f6b6ae0830..ada70dc6dc9fcc 100644 Binary files a/topics/assembly/images/vgp_assembly/merqury_prialt_priCN_prepurge.png and b/topics/assembly/images/vgp_assembly/merqury_prialt_priCN_prepurge.png differ diff --git a/topics/assembly/images/vgp_assembly/yahs.png b/topics/assembly/images/vgp_assembly/yahs.png new file mode 100644 index 00000000000000..77e09b665c8583 Binary files /dev/null and b/topics/assembly/images/vgp_assembly/yahs.png differ diff --git a/topics/assembly/tutorials/vgp_genome_assembly/tutorial.bib b/topics/assembly/tutorials/vgp_genome_assembly/tutorial.bib index 1b2eaaa756913b..e3addcdd87b5b6 100644 --- a/topics/assembly/tutorials/vgp_genome_assembly/tutorial.bib +++ b/topics/assembly/tutorials/vgp_genome_assembly/tutorial.bib @@ -706,3 +706,20 @@ @article{Korbel2013 title = {Genome assembly and haplotyping with Hi-C}, journal = {Nature Biotechnology} } + +@article{Zhou2022, + author = {Zhou, Chenxi and McCarthy, Shane A and Durbin, Richard}, + title = "{YaHS: yet another Hi-C scaffolding tool}", + journal = {Bioinformatics}, + doi = {10.1093/bioinformatics/btac808}, + volume = {39}, + number = {1}, + pages = {btac808}, + year = {2022}, + month = {12}, + abstract = "{We present YaHS, a user-friendly command-line tool for the construction of chromosome-scale scaffolds from Hi-C data. It can be run with a single-line command, requires minimal input from users (an assembly file and an alignment file) which is compatible with similar tools and provides assembly results in multiple formats, thereby enabling rapid, robust and scalable construction of high-quality genome assemblies with high accuracy and contiguity.YaHS is implemented in C and licensed under the MIT License. 
The source code, documentation and tutorial are available at https://github.com/sanger-tol/yahs. Supplementary data are available at Bioinformatics online.}",
+  issn = {1367-4811},
+  url = {https://doi.org/10.1093/bioinformatics/btac808},
+  eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/1/btac808/48763581/btac808.pdf}
+  }
\ No newline at end of file
diff --git a/topics/assembly/tutorials/vgp_genome_assembly/tutorial.md b/topics/assembly/tutorials/vgp_genome_assembly/tutorial.md
index a9559c58bb5bf1..28907b169fa712 100644
--- a/topics/assembly/tutorials/vgp_genome_assembly/tutorial.md
+++ b/topics/assembly/tutorials/vgp_genome_assembly/tutorial.md
@@ -1,6 +1,6 @@
---
layout: tutorial_hands_on
-title: VGP assembly pipeline
+title: "VGP assembly pipeline: Step by Step"
zenodo_link: 'https://zenodo.org/record/5887339'
level: Intermediate
tags:
@@ -16,7 +16,7 @@ objectives:
time_estimation: '5h'
key_points:
- "The VGP pipeline allows users to generate error-free, near gapless reference-quality genome assemblies"
-- "The assembly can be divided in four main stages: genome profile analysis, HiFi long read phased assembly with hifiasm, Bionano hybrid scaffolding and Hi-C scaffolding"
+- "The assembly can be divided into four main stages: genome profile analysis, HiFi long read phased assembly with hifiasm, Bionano hybrid scaffolding and Hi-C scaffolding"
contributors:
- delphine-l
- astrovsky01
@@ -56,12 +56,10 @@ The {G10K} launched the Vertebrate Genome Project ({VGP}), whose goal is generat
> Your results may differ!
>
-> Some of your results may slightly differ from results shown in this tutorial, depending on the versions of the tools used, since algorithms can change between versions.
+> Some of your results may slightly differ from the results shown in this tutorial, depending on the versions of the tools used, since algorithms can change between versions.
>
{: .warning}
-
-
>
>
> In this tutorial, we will cover:
@@ -75,7 +73,7 @@ The {G10K} launched the Vertebrate Genome Project ({VGP}), whose goal is generat

Before getting into the thick of things, let's go over some terms you will often hear when learning about genome assembly. These concepts will be used often throughout this tutorial as well, so please refer to this section as necessary to help your understanding.

-**Pseudohaplotype assembly**: A genome assembly that consists of long phased haplotype blocks separated by regions where the haplotype cannot be distinguished (often homozygous regions). This can result in "switch errors", when the parental haplotypes alternate along the same sequence. These types of assemblies are usually represented by a _primary assembly_ and an _alternate assembly_. (This definition largely taken from the [NCBI's Genome Assembly Model](https://www.ncbi.nlm.nih.gov/assembly/model/#asmb_def).)
+**Pseudohaplotype assembly**: A genome assembly that consists of long phased haplotype blocks separated by regions where the haplotype cannot be distinguished (often homozygous regions). This can result in "switch errors", when the parental haplotypes alternate along the same sequence. These types of assemblies are usually represented by a _primary assembly_ and an _alternate assembly_. (This definition is largely taken from the [NCBI's Genome Assembly Model](https://www.ncbi.nlm.nih.gov/assembly/model/#asmb_def).)
**Primary assembly**: The primary assembly is traditionally the more complete representation of an individual's genome and consists of homozygous regions and one set of loci for heterozygous regions. Because the primary assembly contains both homo- and heterozygous regions, it is more complete than the _alternate assembly_, which often reports only the other set of loci for heterozygous regions. Thus, the primary assembly is usually what one would use for downstream analyses.

@@ -89,7 +87,7 @@ Before getting into the thick of things, let's go over some terms you will often

**Contig**: A contiguous (*i.e.*, gapless) sequence in an assembly, usually inferred algorithmically from the unitig graph.

-**False duplications**: Assembly errors that result in one region of the genome being represented twice in the same assembly as two separate regions. Not to be confused with optical or technical duplicates from PCR from short read sequencing. False duplications can further be classified as either _haplotypic duplications_ or _overlaps_.
+**False duplications**: Assembly errors that result in one region of the genome being represented twice in the same assembly as two separate regions. Not to be confused with optical or technical duplicates arising from PCR in short-read sequencing. False duplications can further be classified as either _haplotypic duplications_ or _overlaps_.

**Haplotypic duplication** can happen when a region that is heterozygous in the individual has the two haplotypes showing enough divergence that the assembler fails to interpret them as homologous. For example, say an individual is heterozygous in the region Chr1[1:100] and has Haplotype A from their mother and Haplotype B from their father; a false duplication can arise when the haplotypes are not recognized as being from the same region, and the assembler ends up placing both haplotypes in the same assembly, resulting in Chr1[1:100] being represented twice in one assembly. Ideally, a properly phased assembly would have Haplotype A in one assembly, *e.g.*, the primary, while Haplotype B is in the alternate.

@@ -103,7 +101,7 @@ False duplications via **overlaps** result from unresolved overlaps in the assem

For more about the specific scaffolding technologies used in the VGP pipeline (currently Bionano optical maps and Hi-C chromatin conformation data), please refer to those specific sections within this tutorial.

-**HiFi reads**: PacBio {HiFi} reads are the focus of this tutorial. First described in 2019, they have revolutionized genome assembly by combining long (about 10-20 kbp) read lengths with high accuracy (>Q20) typically associated with short read sequencing ({% cite Wenger2019 %}). These higher read lengths enable HiFi reads to traverse some repeat regions that are problematic to assemble with short reads.
+**HiFi reads**: PacBio {HiFi} reads are the focus of this tutorial. First described in 2019, they have revolutionized genome assembly by combining long (about 10-20 kbp) read lengths with the high accuracy (>Q20) typically associated with short-read sequencing ({% cite Wenger2019 %}). These higher read lengths enable HiFi reads to traverse some repeat regions that are problematic to assemble with short reads.

**Ultra-long reads**: Ultra-long reads are typically defined as reads of over 100 kbp, and are usually generated using Oxford Nanopore Technology.
Read quality is often lower than HiFi or Illumina (*i.e.*, these reads have a higher error rate), but they are often significantly longer than any other current sequencing technology, and can help assembly algorithms walk complex repeat regions in the assembly graphs.

@@ -115,14 +113,31 @@ For more about the specific scaffolding technologies used in the VGP pipeline (c

**Telomere-to-telomere assembly**: Often abbreviated as "T2T", this term refers to an assembly where each chromosome is completely gapless from telomere to telomere. The term usually refers to the recently completed CHM13 human genome ({% cite Nurk2022 %}), though there is an increasing number of efforts to generate T2T genomes for other species.

-
# VGP assembly pipeline overview

-This training is an adaptation of the VGP assembly pipeline 2.0 (Fig. 2).
+The {VGP} assembly pipeline has a modular organization, consisting of ten workflows (Fig. 1). It can be used with the following types of input data:

-![VGP pipeline schematic showing steps from genome profiling, to hifiasm contigging, to bionano scaffolding, to HiC scaffolding.](../../images/vgp_assembly/vgp_pipeline_2.0_current.png "VGP Pipeline 2.0. The pipeline starts with reference-free genome profiling using k-mers. Then HiFi reads are assembled into contigs using hifiasm, along with additional phasing data if available. If false duplicates are detected in QC, then the contigs undergo purging. Afterwards, scaffolding takes place with optical maps (if available) and Hi-C data. A final decontamination steps is performed before curation.")
+| Input data | Assembly quality | Analysis trajectory
([Fig. 2](#figure-2))|
+|------|---------------|-----|
+| HiFi | The minimum requirement | A |
+| HiFi + Hi-C | Better continuity | B |
+| HiFi + Bionano | Better continuity | C |
+| HiFi + Hi-C + Bionano | Even better continuity | D |
+| HiFi + parental data | Better haplotype resolution | E |
+| HiFi + parental data + Hi-C | Better haplotype resolution and improved continuity | F |
+| HiFi + parental data + Bionano | Better haplotype resolution and improved continuity | G |
+| HiFi + parental data + Hi-C + Bionano | Better haplotype resolution and ultimate continuity | H |

-This training has been organized into four main sections: genome profile analysis, assembly of {HiFi} reads with hifiasm, scaffolding with Bionano optical maps, and scaffolding with {Hi-C} data. Additionally, the **assembly with hifiasm** section has two possible paths in this tutorial: solo contigging or solo w/HiC contigging.
+In this table, "HiFi" and "Hi-C" are data derived from the individual whose genome is being assembled. "Parental data" is high-coverage Illumina data derived from the parents of that individual. Datasets containing parental data are also called "*trios*". Each combination of input datasets is supported by an *analysis trajectory*: a set of workflows designed to generate an assembly from that particular mix of inputs. These trajectories are listed in the table above and shown in the figure below.

+![The nine workflows of the Galaxy assembly pipeline](../../images/vgp_assembly/VGP_workflow_modules.svg "Eight analysis trajectories are possible depending on the combination of input data. A decision on whether or not to invoke Workflow 6 is based on the analysis of the QC output of workflows 3, 4, or 5. Thicker lines connecting Workflows 7, 8, and 9 represent the fact that these workflows are invoked separately for each phased assembly (once for maternal and once for paternal).")
+
The first stage of the pipeline is the generation of *k*-mer profiles of the raw reads to estimate genome size, heterozygosity, repetitiveness, and error rate, all of which are necessary for parameterizing downstream workflows. The generation of *k*-mer counts can be done from HiFi data only (Workflow 1) or include data from parental reads for trio-based phasing (Workflow 2; a trio is the combination of parental sequencing data with that from the offspring being assembled). The second stage is the phased contig assembly. In addition to using only {HiFi} reads (Workflow 3), the contig building (contiging) step can leverage {Hi-C} (Workflow 4) or parental read data (Workflow 5) to produce fully phased haplotypes (hap1/hap2 or paternal/maternal assigned haplotypes), using [`hifiasm`](https://github.com/chhylp123/hifiasm). The contiging workflows also produce a number of critical quality control (QC) metrics such as *k*-mer multiplicity profiles. Inspection of these profiles provides the information needed to decide whether the third stage—purging of false duplications—is required. Purging (Workflow 6), performed with [`purge_dups`](https://github.com/dfguan/purge_dups), identifies and resolves haplotype-specific assembly segments incorrectly labeled as primary contigs, as well as heterozygous contig overlaps. This increases the continuity and quality of the final assembly. The purging stage is generally unnecessary for trio data, for which reliable haplotype resolution is performed using *k*-mer profiles obtained from the parental reads. The fourth stage, scaffolding, produces chromosome-level scaffolds using information provided by Bionano optical maps (Workflow 7, optional), with [`Bionano Solve`](https://bionano.com/software-downloads/), and Hi-C data (Workflow 8), with the [`YaHS`](https://github.com/c-zhou/yahs) scaffolding algorithm. A final stage of decontamination (Workflow 9) removes exogenous sequences (e.g., viral and bacterial sequences) from the scaffolded assembly. A separate workflow (WF0) is used for mitochondrial assembly.

> A note on data quality
> We suggest at least 30✕ PacBio HiFi coverage and 30✕ Hi-C coverage per haplotype (parental genome), and up to 60✕ coverage to accurately assemble highly repetitive regions.
{: .comment}

This training has been organized into four main sections: genome profile analysis, assembly of {HiFi} reads with hifiasm, scaffolding with Bionano optical maps, and scaffolding with {Hi-C} data. Additionally, the **assembly with hifiasm** section has two possible paths in this tutorial: solo contiging or solo w/HiC contiging.

Throughout this tutorial, there will be **detail boxes** with additional background information on the science behind the sequencing technologies and software we use in the pipeline. These boxes are minimized by default, but please expand them to learn more about the data we utilize in this pipeline.

@@ -145,57 +160,88 @@ In order to reduce computation time, we will assemble samples from the yeast _Sa
> {: .details}

-The first step is to get the datasets from Zenodo. The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.
+The first step is to get the datasets from Zenodo. Specifically, we will be uploading two datasets:

-> Data upload
+1. A set of PacBio {HiFi} reads in `fasta` format
+2. 
A set of Illumina {Hi-C} reads in `fastqsanger.gz` format
+
+## Uploading `fasta` datasets from Zenodo
+
+The following three steps demonstrate how to upload three PacBio {HiFi} datasets into your Galaxy history.
+
+> Uploading FASTA datasets from Zenodo
+>
+>**Step 1**: Create a new history for this tutorial
+>
+> {% snippet faqs/galaxy/histories_create_new.md %}
+>
+>**Step 2**: Copy the following URLs into the clipboard.
+> - you can do this by clicking on the {% icon copy %} button in the upper right corner of the box below. It will appear if you mouse over the box.
>
-> 1. Create a new history for this tutorial
-> 2. Import the files from [Zenodo]({{ page.zenodo_link }})
>
-> - Open the file {% icon galaxy-upload %} __upload__ menu
-> - Click on **Rule-based** tab
-> - *"Upload data as"*: `Datasets`
-> - Copy the tabular data, paste it into the textbox and press Build
+> ```
+> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta
+> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta
+> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta
+> ```
>
-> ```
-> Hi-C_dataset_F https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz?download=1 fastqsanger.gz Hi-C
-> Hi-C_dataset_R https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz?download=1 fastqsanger.gz Hi-C
-> Bionano_dataset https://zenodo.org/record/5550653/files/bionano.cmap?download=1 cmap Bionano
-> ```
+>**Step 3**: Upload the datasets into Galaxy.
+> - Set the datatype to `fasta`
>
-> - From **Rules** menu select `Add / Modify Column Definitions`
-> - Click `Add Definition` button and select `Name`: column `A`
-> - Click `Add Definition` button and select `URL`: column `B`
-> - Click `Add Definition` button and select `Type`: column `C`
-> - Click `Add Definition` button and select `Name Tag`: column `D`
-> - Click `Apply` and press Upload
+> {% snippet faqs/galaxy/datasets_import_via_link.md format="fasta" %}
>
-> 3. Import the remaining datasets from [Zenodo]({{ page.zenodo_link }})
+> {% snippet topics/assembly/tutorials/vgp_genome_assembly/faqs/dataset_upload_fasta_via_urls.md %}
>
{: .hands_on}
+
+## Uploading `fastqsanger.gz` datasets from Zenodo
+
+Illumina {Hi-C} data is uploaded in essentially the same way, as shown in the following two steps.
+
-> - Open the file {% icon galaxy-upload %} __upload__ menu
-> - Click on **Rule-based** tab
-> - *"Upload data as"*: `Collections`
-> - Copy the tabular data, paste it into the textbox and press Build
+> DANGER: Make sure you choose the correct format!
+> When selecting the datatype in the "**Type (set all)**" drop-down, make sure you select `fastqsanger` or `fastqsanger.gz` BUT NOT `fastqcssanger` or anything else!
+{: .warning}
+
+> Uploading fastqsanger.gz datasets from Zenodo
>
+>**Step 1**: Copy the following URLs into the clipboard. You can do this by clicking on the {% icon copy %} button in the upper right corner of the box below. It will appear if you mouse over the box.
>
> ```
> https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz
> https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz
> ```
>
-> ```
-> dataset_01 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta?download=1 fasta HiFi HiFi_collection
-> dataset_02 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta?download=1 fasta HiFi HiFi_collection
-> dataset_03 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta?download=1 fasta HiFi HiFi_collection
-> ```
+>**Step 2**: Upload the datasets into Galaxy and set the datatype to `fastqsanger.gz`
+>
+> {% snippet faqs/galaxy/datasets_import_via_link.md format="fastqsanger.gz" %}
>
-> - From **Rules** menu select `Add / Modify Column Definitions`
-> - Click `Add Definition` button and select `List Identifier(s)`: column `A`
-> - Click `Add Definition` button and select `URL`: column `B`
-> - Click `Add Definition` button and select `Type`: column `C`
-> - Click `Add Definition` button and select `Group Tag`: column `D`
-> - Click `Add Definition` button and select `Collection Name`: column `E`
-> - Click `Apply` and press Upload
+> {% snippet topics/assembly/tutorials/vgp_genome_assembly/faqs/dataset_upload_fastqsanger_via_urls.md %}
>
{: .hands_on}

+> These datasets are large!
+> Hi-C datasets are large. It will take some time (~15 min) for them to be fully uploaded. Please be patient.
+{: .warning}
+
+## Organizing the data
+
+If everything goes smoothly, your history will look like the one shown in the figure below. The three {HiFi} fasta files are better represented as a collection: {collection}. Also, importantly,
+the workflow we will be using for the analysis of our data takes a collection as input (it does not access individual datasets). So let's create a collection using the steps outlined in the Tip {% icon tip %} "Creating a dataset collection":
+
+{% snippet faqs/galaxy/collections_build_list.md %}

-Adapter trimming usually means trimming the adapter sequence off the ends of reads, which is where the adapter sequence is usually located in {NGS} reads. However, due to the nature of {SMRT} sequencing technology, adapters do not have a specific, predictable location in {HiFi} reads. Additionally, the reads containing adapter sequence could be of generally lower quality compared to the rest of the reads. Thus, we will use **cutadapt** not to trim, but to remove the entire read if a read is found to have an adapter inside of it.
+The view of your history should transition from what is shown in the left pane below to what looks like the right pane:
+
+![AfterUpload](../../images/vgp_assembly/making_list.svg "History after uploading HiFi and HiC data (left). Creation of a list (collection) combines all HiFi datasets into a single history item called 'HiFi data' (right). See below for instructions on how to make this collection.")
+
+> Other ways to upload the data
+> You can, of course, upload your own datasets via URLs as illustrated above or from your own computer. In addition, you can upload data from a major repository called [GenomeArk](https://genomeark.org). GenomeArk is integrated directly into Galaxy Upload.
To use GenomeArk follow the steps in the Tip {% icon tip %} below: +> +> {% snippet faqs/galaxy/datasets_upload_from_genomeark.md %} +{: .details} + +# HiFi reads preprocessing with **cutadapt** + +Adapter trimming usually means trimming the adapter sequence off the ends of reads, which is where the adapter sequence is usually located in {NGS} reads. However, due to the nature of {SMRT} sequencing technology, adapters do not have a specific, predictable location in {HiFi} reads. Additionally, the reads containing adapter sequences could be of generally lower quality compared to the rest of the reads. Thus, we will use **cutadapt** not to trim, but to remove the entire read if a read is found to have an adapter inside of it. > Background on PacBio HiFi reads > @@ -209,51 +255,52 @@ Adapter trimming usually means trimming the adapter sequence off the ends of rea > Primer removal with Cutadapt > -> 1. {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4) %} with the following parameters: -> - *"Single-end or Paired-end reads?"*: `Single-end` -> - {% icon param-collection %} *"FASTQ/A file"*: `HiFi_collection` -> - In *"Read 1 Options"*: -> - In *"5' or 3' (Anywhere) Adapters"*: -> - {% icon param-repeat %} *"Insert 5' or 3' (Anywhere) Adapters"* -> - *"Source"*: `Enter custom sequence` -> - *"Enter custom 5' or 3' adapter name"*: `First adapter` -> - *"Enter custom 5' or 3' adapter sequence"*: `ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT` -> - {% icon param-repeat %} *"Insert 5' or 3' (Anywhere) Adapters"* -> - *"Source"*: `Enter custom sequence` -> - *"Enter custom 5' or 3' adapter name"*: `Second adapter` -> - *"Enter custom 5' or 3' adapter sequence"*: `ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT` -> - In *"Adapter Options"*: -> - *"Maximum error rate"*: `0.1` -> - *"Minimum overlap length"*: `35` -> - *"Look for adapters in the reverse complement"*: `Yes` -> - In *"Filter Options"*: -> - *"Discard Trimmed Reads"*: `Yes` -> -> > Select collection dataset -> > -> > 1. Click on {% icon param-collection %} **Dataset collection** in front of the input parameter you want to supply the collection to. -> > 2. Select the collection you want to use from the list -> > -> {: .tip} +>**Step 1**: Run {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/4.4+galaxy0) %} with the following parameters: +> 1. *"Single-end or Paired-end reads?"*: `Single-end` +> 2. {% icon param-collection %} *"FASTQ/A file"*: `HiFi_collection` +> 3. In *"Read 1 Options"*: +> - In *"5' or 3' (Anywhere) Adapters"*: +> - {% icon param-repeat %} *"Insert 5' or 3' (Anywhere) Adapters"* +> - *"Source"*: `Enter custom sequence` +> - *"Enter custom 5' or 3' adapter name"*: `First adapter` +> - *"Enter custom 5' or 3' adapter sequence"*: `ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT` +> - {% icon param-repeat %} *"Insert 5' or 3' (Anywhere) Adapters"* +> - *"Source"*: `Enter custom sequence` +> - *"Enter custom 5' or 3' adapter name"*: `Second adapter` +> - *"Enter custom 5' or 3' adapter sequence"*: `ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT` +> 4. In *"Adapter Options"*: +> - *"Maximum error rate"*: `0.1` +> - *"Minimum overlap length"*: `35` +> - *"Look for adapters in the reverse complement"*: `Yes` +> 5. In *"Filter Options"*: +> - *"Discard Trimmed Reads"*: `Yes` +> +> > Select collection dataset +> > +> > 1. Click on {% icon param-collection %} **Dataset collection** in front of the input parameter you want to supply the collection to. +> > 2. 
Select the collection you want to use from the list +> > +> {: .tip} +> +>
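>
> > Command-line equivalent
> > For reference, outside Galaxy the step above corresponds roughly to the following `cutadapt` call (a sketch only; the adapter names and the input/output file names are placeholders we chose for illustration):
> >
> > ```
> > cutadapt -e 0.1 -O 35 --revcomp --discard-trimmed \
> >     -b "first_adapter=ATCTCTCTCAACAACAACAACGGAGGAGGAGGAAAAGAGAGAGAT" \
> >     -b "second_adapter=ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT" \
> >     -o HiFi_01.trimmed.fasta HiFi_01.fasta
> > ```
> >
> > Here `-b` searches for each adapter anywhere in the read, `--revcomp` also checks the reverse complement, and `--discard-trimmed` drops any read in which an adapter was found, mirroring the Galaxy settings above.
> {: .comment}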
>
-> 2. Rename the output file as `HiFi_collection (trim)`.
+>**Step 2**: Rename the output file as `HiFi_collection (trim)`.
>
> {% snippet faqs/galaxy/datasets_rename.md %}
>
{: .hands_on}

-
# Genome profile analysis

Before starting a *de novo* genome assembly project, it is useful to collect metrics on the properties of the genome under consideration, such as the expected genome size, so that you know what to expect from your assembly. Traditionally, DNA flow cytometry was considered the golden standard for estimating the genome size. Nowadays, experimental methods have been replaced by computational approaches ({% cite wang2020estimation %}). One of the widely used genome profiling methods is based on the analysis of *k*-mer frequencies. It allows one to provide information not only about the genomic complexity, such as the genome size and levels of heterozygosity and repeat content, but also about the data quality.

> K-mer size, sequencing coverage and genome size
>
->*K*-mers are unique substrings of length *k* contained within a DNA sequence. For example, the DNA sequence *TCGATCACA* can be decomposed into five unique *k*-mers that have five bases long: *TCGAT*, *CGATC*, *GATCA*, *ATCAC* and *TCACA*. A sequence of length L will have L-k+1 *k*-mers. On the other hand, the number of possible *k*-mers can be calculated as nk, where n is the number of possible monomers and k is the k-mer size.
+>*K*-mers are unique substrings of length *k* contained within a DNA sequence. For example, the DNA sequence *TCGATCACA* can be decomposed into five unique *k*-mers that are five bases long: *TCGAT*, *CGATC*, *GATCA*, *ATCAC* and *TCACA*. A sequence of length L will have L-k+1 *k*-mers. On the other hand, the number of possible *k*-mers can be calculated as n<sup>k</sup>, where n is the number of possible monomers and k is the *k*-mer size.
>
>
>---------| -------------|-----------------------
> Bases | *k*-mer size | Total possible *k*-mers
>---------| -------------|-----------------------
> 4 | 1 | 4
> 4 | 2 | 16
@@ -262,9 +309,9 @@
> 4 | 10 | 1,048,576
>---------|--------------|-----------------------
>
-> Thus, the k-mer size is a key parameter, which must be large enough to map uniquely to the genome, but not too large, since it can lead to wasting computational resources. In the case of the human genome, *k*-mers of 31 bases in length lead to 96.96% of unique *k*-mers.
+> Thus, the *k*-mer size is a key parameter, which must be large enough to map uniquely to the genome, but not too large, since it can lead to wasting computational resources. In the case of the human genome, *k*-mers of 31 bases in length lead to 96.96% of unique *k*-mers.
>
-> Each unique k-mer can be assigned a value for coverage based on the number of times it occurs in a sequence, whose distribution will approximate a Poisson distribution, with the peak corresponding to the average genome sequencing depth. From the genome coverage, the genome size can be easily computed.
+> Each unique *k*-mer can be assigned a value for coverage based on the number of times it occurs in a sequence, whose distribution will approximate a Poisson distribution, with the peak corresponding to the average genome sequencing depth. From the genome coverage, the genome size can be easily computed.
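>
> To make this concrete, here is a toy Python sketch of the idea (our illustration, not part of the VGP workflows; the error and heterozygosity modeling performed by GenomeScope2 is deliberately omitted). It counts *k*-mers, histograms their multiplicities, and divides the total *k*-mer count by the peak coverage to approximate the genome size:
>
> ```python
> from collections import Counter
>
> def kmer_count_histogram(reads, k=31):
>     """Map each k-mer multiplicity to the number of distinct k-mers with that multiplicity."""
>     counts = Counter()
>     for read in reads:
>         for i in range(len(read) - k + 1):
>             counts[read[i:i + k]] += 1
>     return Counter(counts.values())
>
> def estimate_genome_size(hist, min_mult=5):
>     """Total k-mer occurrences divided by peak coverage; multiplicities below min_mult are treated as sequencing errors."""
>     peak = max((m for m in hist if m >= min_mult), key=lambda m: hist[m])
>     total = sum(m * n for m, n in hist.items() if m >= min_mult)
>     return total // peak
> ```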
{: .details} ## Generation of _k_-mer spectra with **Meryl** @@ -273,81 +320,77 @@ Meryl will allow us to generate the *k*-mer profile by decomposing the sequencin > k-mer size estimation > -> Given an estimated genome size (G) and a tolerable collision rate (p), an appropriate k can be computed as k = log4 (G(1 − p)/p). +> Given an estimated genome size (*G*) and a tolerable collision rate (*p*), an appropriate *k* can be computed as $$ k = \log_4\left(\frac{G(1-p)}{p}\right) $$ . > {: .comment} > Generate k-mers count distribution > -> 1. Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} with the following parameters: -> - *"Operation type selector"*: `Count operations` -> - *"Count operations"*: `Count: count the occurrences of canonical k-mers` -> - {% icon param-collection %} *"Input sequences"*: `HiFi_collection (trim)` -> - *"k-mer size selector"*: `Set a k-mer size` -> - "*k-mer size*": `31` +>**Step 1**: Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} with the following parameters: +> 1. *"Operation type selector"*: `Count operations` +> 2. *"Count operations"*: `Count: count the occurrences of canonical k-mers` +> 3. {% icon param-collection %} *"Input sequences"*: `HiFi_collection (trim)` +> 4. *"k-mer size selector"*: `Set a k-mer size` +> 5. "*k-mer size*": `31` > -> > Selection of k-mer size -> > -> > We used 31 as *k*-mer size, as this length has demonstrated to be sufficiently long that most *k*-mers are not repetitive and is short enough to be more robust to sequencing errors. For very large (haploid size > 10 Gb) and/or very repetitive genomes, larger *k*-mer length is recommended to increase the number of unique *k*-mers. -> {: .comment} -> -> 2. Rename it `Collection meryldb` +> > Selection of k-mer size +> > +> > We used 31 as *k*-mer size, as this length has demonstrated to be sufficiently long that most *k*-mers are not repetitive and is short enough to be more robust to sequencing errors. For very large (haploid size > 10 Gb) and/or very repetitive genomes, larger *k*-mer length is recommended to increase the number of unique *k*-mers. +> {: .comment} +>
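>
> > Command-line equivalent
> > For reference, outside Galaxy the counting step above, together with the merging and histogram steps that follow, maps roughly onto the `meryl` command line like this (a sketch; the file names are placeholders):
> >
> > ```
> > meryl k=31 count output reads_01.meryl HiFi_01.fasta   # one database per input file
> > meryl union-sum output merged.meryl reads_*.meryl      # merge the per-file counts
> > meryl histogram merged.meryl > merged.hist             # k-mer multiplicity histogram
> > ```
> {: .comment}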
+
>**Step 2**: Rename the output as `meryldb`
>
>**Step 3**: Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} again with the following parameters:
> 1. *"Operation type selector"*: `Operations on sets of k-mers`
> 2. *"Operations on sets of k-mers"*: `Union-sum: return k-mers that occur in any input, set the count to the sum of the counts`
> 3. {% icon param-file %} *"Input meryldb"*: `meryldb`
>
>**Step 4**: Rename it as `Merged meryldb`
>
>**Step 5**: Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} for the third time with the following parameters:
> - *"Operation type selector"*: `Generate histogram dataset`
> - {% icon param-file %} *"Input meryldb"*: `Merged meryldb`
>
>**Step 6**: Finally, rename it as `meryldb histogram`.
>
{: .hands_on}

## Genome profiling with **GenomeScope2**

The next step is to infer the genome properties from the *k*-mer histogram generated by Meryl, for which we will use GenomeScope2. GenomeScope2 relies on a nonlinear least-squares optimization to fit a mixture of negative binomial distributions, generating estimated values for genome size, repetitiveness, and heterozygosity rates ({% cite RanalloBenavidez2020 %}).

> Estimate genome properties
>
-> 1. {% tool [GenomeScope](toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0+galaxy2) %} with the following parameters:
-> - {% icon param-file %} *"Input histogram file"*: `Meryldb histogram`
-> - *Ploidy for model to use*: `2`
-> - *"k-mer length used to calculate k-mer spectra"*: `31`
->
-> - In "*Output options*": mark `Summary of the analysis`
-> - In "*Advanced options*":
-> - *"Create testing.tsv file with model parameters"*: `Yes`
->
-> {: .comment}
+> Run {% tool [GenomeScope](toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0+galaxy2) %} with the following parameters:
+> 1. {% icon param-file %} *"Input histogram file"*: `meryldb histogram`
+> 2. *Ploidy for model to use*: `2`
+> 3. *"k-mer length used to calculate k-mer spectra"*: `31`
+> 4. In "*Output options*": mark `Summary of the analysis`
+> 5. In "*Advanced options*": set *"Create testing.tsv file with model parameters"* to `Yes`
{: .hands_on}

GenomeScope2 will generate six outputs:

- **Plots**:
  - Linear plot: *k*-mer spectra and fitted models: frequency (y-axis) versus coverage.
  - Log plot: logarithmic transformation of the previous plot.
  - Transformed linear plot: *k*-mer spectra and fitted models: frequency times coverage (y-axis) versus coverage (x-axis). This transformation increases the heights of higher-order peaks, overcoming the effect of high heterozygosity.
  - Transformed log plot: logarithmic transformation of the previous plot.
-- Model: this file includes a detailed report about the model fitting.
-- Summary: it includes the properties inferred from the model, such as genome haploid length and the percentage of heterozygosity.
+
- **Model**: this file includes a detailed report about the model fitting.
- **Summary**: it includes the properties inferred from the model, such as the genome haploid length and the percentage of heterozygosity.

Now, let's analyze the *k*-mer profiles, fitted models and estimated parameters shown below:

![Genomescope plot](../../images/vgp_assembly/genomescope_plot.png "GenomeScope2 31-mer profile. The first peak, located at coverage 25✕, corresponds to the heterozygous peak. The second peak, at coverage 50✕, corresponds to the homozygous peak. The estimate of the heterozygous portion is 0.576%. The plot also includes information about the inferred total genome length (len), genome unique length percent ('uniq'), overall heterozygosity rate ('ab'), mean *k*-mer coverage for heterozygous bases ('kcov'), read error rate ('err'), and average rate of read duplications ('dup'). It also reports the user-given parameters of *k*-mer size ('k') and ploidy ('p')."){:width="65%"}

This distribution is the result of the Poisson process underlying the generation of sequencing reads. As we can see, the *k*-mer profile follows a bimodal distribution, indicative of a diploid genome. The distribution is consistent with the theoretical diploid model (model fit > 93%). Low-frequency *k*-mers are the result of sequencing errors. GenomeScope2 estimated a haploid genome size of around 11.7 Mb, a value reasonably close to the *Saccharomyces* genome size. Additionally, it revealed that the variation across the genomic sequences is 0.576%.

> Are you expecting to purge your assembly?
> This tutorial covers purging using the program **purge_dups**. purge_dups has some default options and can try to detect coverage-based cutoffs automatically, but the VGP pipeline prefers to define these cutoffs using parameters derived from the GenomeScope2 output.
>
> _If you expect you need to purge your genome, please see the [**solo** contiging section](#solo_hic_switch) of the tutorial for details on parsing the GenomeScope2 output for purging cutoffs._
{: .comment}

# Assembly with **hifiasm**

@@ -364,73 +407,94 @@ Once we have finished the genome profiling stage, we can start the genome assem
>
{: .details}

-The output of hifiasm will be {GFA} files. These differ from FASTA files in that they are a representation of the assembly graph instead of just linear sequences, so the GFA contains information about sequences, nodes, and edges (*i.e.*, overlaps). 
This output preserves the most information about how the reads assemble in graph space, and is useful to visualize in tools such as Bandage; however, our QV tools will expect FASTA files, so we will cover the GFA to FASTA conversion step later. +The output of hifiasm will be [GFA](https://github.com/GFA-spec/GFA-spec) files. These differ from FASTA files in that they are a representation of the assembly graph instead of just linear sequences, so the GFA contains information about sequences, nodes, and edges (*i.e.*, overlaps). This output preserves the most information about how the reads assemble in graph space, and is useful to visualize in tools such as Bandage; however, our QV tools will expect FASTA files, so we will cover the GFA to FASTA conversion step later. + +## `hifiasm` assembly modes -Hifiasm can be run in multiple modes depending on data availability: +Hifiasm can be run in multiple modes depending on data availability -**Solo**: generates a pseudohaplotype assembly, resulting in a primary & an alternate assembly (fig. 5). -- _Input: only HiFi reads_ +### **Solo** mode + +**Solo**: generates a pseudohaplotype assembly, resulting in a primary & an alternate assembly. +- _Input: PacBio HiFi reads_ - _Output: scaffolded primary assembly, and alternate contigs_ -![Diagram for hifiasm solo mode.](../../images/vgp_assembly/hifiasm_solo_schematic.png "The solo pipeline creates primary and alternate contigs, which then typically undergo purging with purge_dups to reconcile the haplotypes. During the purging process, haplotigs are removed from the primary assembly and added to the alternate assembly, which is then purged to generate the final alternate set of contigs. The purged primary contigs are then carried through scaffolding with Bionano and/or Hi-C data, resulting in one final draft primary assembly to be sent to manual curation.") +![Diagram for hifiasm solo mode.](../../images/vgp_assembly/hifiasm_solo_schematic.png "The solo pipeline creates primary and alternate contigs, which then typically undergo purging with purge_dups to reconcile the haplotypes. During the purging process, haplotigs are removed from the primary assembly and added to the alternate assembly, which is then purged to generate the final alternate set of contigs. The purged primary contigs are then carried through scaffolding with Bionano and/or Hi-C data, resulting in one final draft primary assembly to be sent to manual curation.") + +### **Hi-C** phased mode -**Hi-C-phased**: generates a hap1 assembly and a hap2 assembly, which are phased using the {Hi-C} reads from the same individual (fig. 6). -- _Input: HiFi & HiC reads_ +**Hi-C-phased**: generates a hap1 assembly and a hap2 assembly, which are phased using the {Hi-C} reads from the same individual. +- _Input: PacBio HiFi & Illumina HiC reads_ - _Output: scaffolded hap1 assembly, and scaffolded hap2 assembly (assuming you run the scaffolding on **both** haplotypes)_ -![Diagram for hifiasm hic mode.](../../images/vgp_assembly/hifiasm_hic_schematic.png "The Hi-C-phased mode produces hap1 and hap2 contigs, which have been phased using the HiC information as described in {% cite Cheng2021 %}. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. 
These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.")
+![Diagram for hifiasm hic mode.](../../images/vgp_assembly/hifiasm_hic_schematic.png "The Hi-C-phased mode produces hap1 and hap2 contigs, which have been phased using the Hi-C information as described in {% cite Cheng2021 %}. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.")
+
+### **Trio** mode

-**Trio**: generates a maternal assembly and a paternal assembly, which are phased using reads from the parents (fig. 7).
-- _Input: HiFi reads from child, Illumina reads from both parents._
+**Trio**: generates a maternal assembly and a paternal assembly, which are phased using reads from the parents.
+- _Input: PacBio HiFi reads from the child, Illumina reads from both parents._
- _Output: scaffolded maternal assembly, and scaffolded paternal assembly (assuming you run the scaffolding on **both** haplotypes)_

-![Diagram for hifiasm trio mode.](../../images/vgp_assembly/hifiasm_trio_schematic.png "The trio mode produces maternal and paternal contigs, which have been phased using paternal short read data. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.")
+![Diagram for hifiasm trio mode.](../../images/vgp_assembly/hifiasm_trio_schematic.png "The trio mode produces maternal and paternal contigs, which have been phased using parental short-read data. Typically, these assemblies do not need to undergo purging, but you should always look at your assemblies' QC to make sure. These contigs are then scaffolded separately using Bionano and/or Hi-C workflows, resulting in two scaffolded assemblies.")

No matter which way you run hifiasm, you will have to evaluate the assemblies' {QC} to ensure your genome is in good shape. The VGP pipeline features several reference-free ways of evaluating assembly quality, all of which are automatically generated with our workflows; however, we will run them manually in this tutorial so we can familiarize ourselves with how each QC metric captures a different aspect of assembly quality.

## Assembly evaluation
+
+There are several tools for assessing various aspects of assembly quality:
+
- **gfastats**: manipulation & evaluation of assembly graphs and FASTA files, particularly used for summary statistics (*e.g.*, contig count, N50, NG50, etc.; see the short sketch after this list) ({% cite Formenti2022 %}).

![Schematic of N50 calculation.](../../images/vgp_assembly/n50schematic.jpg "N50 is a commonly reported statistic used to represent genome contiguity. N50 is calculated by sorting contigs according to their lengths, and then taking the halfway point of the total genome length. The size of the contig at that halfway point is the N50 value. In the pictured example, the total genome length is 400 bp, so the N50 value is 60 because the contig at the halfway point is 60 bp long. N50 can be interpreted as the value where >50% of an assembly's contigs are at that value or higher. Image adapted from Elin Videvall at The Molecular Ecologist.")

- **{BUSCO}**: assesses completeness of a genome from an evolutionarily informed functional point of view. 
BUSCO genes are genes that are expected to be present as a single copy in one haplotype for a certain clade, so their presence, absence, or duplication can inform scientists about whether an assembly is likely missing important regions, or whether it has multiple copies of them, which can indicate a need for purging ({% cite Simo2015 %}).

- **Merqury**: reference-free assessment of assembly completeness and phasing based on *k*-mers. Merqury compares *k*-mers in the reads to the *k*-mers found in the assemblies, as well as the {CN} of each *k*-mer in the assemblies ({% cite Rhie_merqury %}).
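To make the contiguity metrics above concrete, here is a minimal Python sketch (our illustration, not part of gfastats) that computes Nx and NGx from a list of contig lengths. Applied to the example in the N50 figure (a 400 bp assembly whose halfway point falls in a 60 bp contig), it returns an N50 of 60:

```python
def nx(lengths, x=50, genome_size=None):
    """Nx of an assembly; pass genome_size to compute NGx instead."""
    total = genome_size if genome_size is not None else sum(lengths)
    threshold = total * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):  # largest contigs first
        running += length
        if running >= threshold:
            return length
    return 0  # the assembly covers less than x% of genome_size

contigs = [100, 80, 60, 50, 40, 30, 20, 20]  # totals 400 bp, as in the figure
print(nx(contigs))                           # N50 -> 60
print(nx(contigs, genome_size=500))          # NG50 against a 500 bp reference size
```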
+ +
{% include _includes/cyoa-choices.html option1="hic" option2="solo" default="hic" text="Use the following buttons to switch between contiging approaches. If you are assembling with only HiFi reads for an individual, then click solo. If you have Hi-C reads for the same individual, then click hic. NOTE: If you want to learn more about purging, then please check out the solo tutorial for details on purging false duplications." %}

-
+ + + + + + ## HiC-phased assembly with **hifiasm** If you have the {Hi-C} data for the individual you are assembling with {HiFi} reads, then you can use that information to phase the {contigs}. > Hi-C-phased assembly with hifiasm -> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.18.8+galaxy1) %} with the following parameters: -> - *"Assembly mode"*: `Standard` -> - {% icon param-file %} *"Input reads"*: `HiFi_collection (trim)` (output of **Cutadapt** {% icon tool %}) -> - *"Hi-C R1 reads"*: `Hi-C_dataset_F` -> - *"Hi-C R2 reads"*: `Hi-C_dataset_R` +>**Step 1**: Run {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.19.8+galaxy0) %} with the following parameters: +> 1. *"Assembly mode"*: `Standard` +> 2. {% icon param-file %} *"Input reads"*: `HiFi_collection (trim)` (output of **Cutadapt** {% icon tool %}) +> 3. In *"Options for Hi-C-partition*" select `Specify` +> - *"Hi-C R1 reads"*: `Hi-C_dataset_F` +> - *"Hi-C R2 reads"*: `Hi-C_dataset_R` > -> 2. After the tool has finished running, rename its outputs as follows: -> - Rename the `Hi-C hap1 balanced contig graph` as `Hap1 contigs graph` and add a `#hap1` tag -> - Rename the `Hi-C hap2 balanced contig graph` as `Hap2 contigs graph` and add a `#hap2` tag +>**Step 2**:. After the tool has finished running, rename its outputs as follows: +> 1. Rename the `Hi-C hap1 balanced contig graph` as `Hap1 contigs graph` and add a `#hap1` tag +> 2. Rename the `Hi-C hap2 balanced contig graph` as `Hap2 contigs graph` and add a `#hap2` tag > {: .hands_on} -We have obtained the fully phased contig graphs (as {GFA} files) of hap1 and hap2, but these must be converted to FASTA format for subsequent steps. We will use a tool developed from the VGP: **gfastats**. gfastats is a tool suite that allows for manipulation and evaluation of FASTA and GFA files, but in this instance we will use it to convert our GFAs to FASTA files. Later on we will use it to generate standard summary statistics for our assemblies. +We have obtained the fully phased contig graphs (as {GFA} files) of hap1 and hap2, but these must be converted to FASTA format for subsequent steps. We will use a tool developed from the VGP: [`gfastats`](https://github.com/vgl-hub/gfastats). `gfastats` is a tool suite that allows for manipulation and evaluation of FASTA and GFA files, but in this instance we will use it to convert our GFAs to FASTA files. Later on, we will use it to generate standard summary statistics for our assemblies. -> Convert GFA to FASTA +> GFA to FASTA conversion for hifiasm Hi-C assembly > -> 1. {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters: -> - {% icon param-files %} *"Input GFA file"*: select `Hap1 contigs graph` and the `Hap2 contigs graph` datasets +>**Step 1**: Run {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters: +> 1. {% icon param-files %} *"Input GFA file"*: select `Hap1 contigs graph` and the `Hap2 contigs graph` datasets > -> > Select multiple datasets -> > 1. Click on {% icon param-files %} **Multiple datasets** -> > 2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest -> {: .tip} +> > Select multiple datasets +> > 1. Click on {% icon param-files %} **Multiple datasets** +> > 2. 
Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest
> {: .tip}
>
> 2. *"Tool mode"*: `Genome assembly manipulation`
> 3. *"Output format"*: `FASTA`
> 4. *"Generates the initial set of paths"*: toggle to `yes`
>
>**Step 2**: Rename the outputs as `Hap1 contigs FASTA` and `Hap2 contigs FASTA`
>
{: .hands_on}

@@ -438,12 +502,12 @@
>
>
> gfastats will provide us with the following statistics:
>
-> - No. of contigs: The total number of contigs in the assembly.
-> - Largest contig: The length of the largest contig in the assembly.
-> - Total length: The total number of bases in the assembly.
-> - Nx: The largest contig length, *L*, such that contigs of length >= *L* account for at least *x*% of the bases of the assembly.
-> - NGx: The contig length such that using equal or longer length contigs produces *x*% of the length of the reference genome, rather than *x*% of the assembly length.
-> - GC content: the percentage of nitrogenous bases which are either guanine or cytosine.
+> - **No. of contigs**: The total number of contigs in the assembly.
+> - **Largest contig**: The length of the largest contig in the assembly.
+> - **Total length**: The total number of bases in the assembly.
+> - **Nx**: The largest contig length, *L*, such that contigs of length >= *L* account for at least *x*% of the bases of the assembly.
+> - **NGx**: The contig length such that using equal or longer length contigs produces *x*% of the length of the reference genome, rather than *x*% of the assembly length.
+> - **GC content**: the percentage of nitrogenous bases which are either guanine or cytosine.
>
{: .comment}

Let's use gfastats to get a basic idea of what our assembly looks like. We'll run

> Assembly evaluation with gfastats
>
-> 1. {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `Hap1 contigs graph` and the `Hap2 contigs graph` datasets
-> - *"Expected genome size"*: `11747160` (remember we calculated this value earlier, so it should be in your history!)
-> - *"Generates the initial set of paths*": toggle to `yes`
-> 2. Rename the outputs as `Hap1 stats` and `Hap2 stats`
-> 3. {% tool [Column join](toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `Hap1 stats` and the `Hap2 stats` datasets
-> 4. Rename the output as `gfastats on hap1 and hap2 (full)`
-> 5. {% tool [Search in textfiles](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `gfastats on hap1 and hap2 (full)`
-> - *"that"*: `Don't Match`
-> - *"Type of regex"*: `Basic`
-> - *"Regular Expression"*: `[Ss]caffold`
-> 6. Rename the output as `gfastats on hap1 and hap2 contigs`
+> **Step 1**: Run assembly statistics generation with {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} using the following parameters:
+>
+> 1. 
{% icon param-files %} *"Input file"*: select `Hap1 contigs graph` and the `Hap2 contigs graph` datasets +> 2. *"Tool mode": `Summary statistics generation` +> 3. *"Expected genome size"*: `11747160` (remember we calculated this value [earlier using `GenomeScope2`](#genome-profiling-with-genomescope2). It is contained within `GenomeScope2` **Summary** output that should be in your history!) +> 4. *"Thousands separator in output"*: Set to "No" +> +>
+> +> **Step 2**: Rename outputs of `gfastats` step to as `Hap1 stats` and `Hap2 stats` +> +> This would generate summary files that look like this (only the first six rows are shown): +> +> ``` +> Expected genome size 11747160 +> # scaffolds 0 +> Total scaffold length 0 +> Average scaffold length nan +> Scaffold N50 0 +> Scaffold auN 0.00 +> ``` +> +> Because we ran `gfastats` on hap1 and hap2 outputs of `hifiasm` we need to join the two outputs together for easier interpretation: +> +>
+> +> **Step 3**: Run {% tool [Column join](toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3) %} with the following parameters: +> +> {% icon param-files %} *"Input file"*: select `Hap1 stats` and the `Hap2 stats` datasets. Keep all other settings as they are. +> +>
+> +> **Step 4**: Rename the output as `gfastats on hap1 and hap2 (full)` +> +> This would generate a joined summary file that looks like this (only the first five rows are shown): +> +> ``` +> # gaps 0 0 +> # gaps in scaffolds 0 0 +> # paths 0 0 +> # segments 17 16 +> ``` +> +> Now let's extract only relevant information by excluding all lines containing the word `scaffold` since there are no scaffolds at this stage of the assembly process (only contigs): +> +>
+> +> **Step 5**: Run {% tool [Search in textfiles](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters: +> 1. {% icon param-files %} *"Input file"*: select `gfastats on hap1 and hap2 (full)` +> 2. *"that"*: `Don't Match` +> 3. *"Type of regex"*: `Basic` +> 4. *"Regular Expression"*: enter the word `scaffold` +> 5. *"Match type*": leave as `case insensitive` +> +>
+> +> **Step 6**: Rename the output as `gfastats on hap1 and hap2 contigs` > {: .hands_on} -Take a look at the _gfastats on hap1 and hap2 contigs_ output — it should have three columns: 1) name of statistic, 2) hap1 value, and 3) hap2 value. According to the report, both assemblies are quite similar; the hap1 assembly includes 16 contigs, totalling ~11.3Mbp of sequence (the `Total contig length` statistic), while the hap2 assembly includes 17 contigs, whose total length is ~12.2Mbp. (**NB**: Your values may differ slightly, or be reversed between the two haplotypes!) +Take a look at the `gfastats on hap1 and hap2 contigs` output — it has three columns: + + 1. Name of statistic + 2. Value for haplotype 1 (hap1) + 2. Value for haplotype 2 (hap2) + +According to the report, both assemblies are quite similar; the hap1 assembly includes 16 contigs, totalling ~11.3Mbp of sequence (the `Total contig length` statistic), while the hap2 assembly includes 17 contigs, whose total length is ~12.2Mbp. (**NB**: Your values may differ slightly, or be reversed between the two haplotypes!) > > @@ -487,26 +600,30 @@ Take a look at the _gfastats on hap1 and hap2 contigs_ output — it should have Next, we will use {BUSCO}, which will provide quantitative assessment of the completeness of a genome assembly in terms of expected gene content. It relies on the analysis of genes that should be present only once in a complete assembly or gene set, while allowing for rare gene duplications or losses ({% cite Simo2015 %}). > Assessing assembly completeness with BUSCO -> -> 1. {% tool [Busco](toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0) %} with the following parameters: -> - {% icon param-files %} *"Sequences to analyze"*: `Hap1 contigs FASTA` and `Hap2 contigs FASTA` -> - *"Mode"*: `Genome assemblies (DNA)` -> - *"Use Augustus instead of Metaeuk"*: `Use Metaeuk` -> - *"Auto-detect or select lineage?"*: `Select lineage` -> - *"Lineage"*: `Saccharomycetes` -> - *"Which outputs should be generated"*: `short summary text` and `summary image` -> -> > -> > -> > Remember to modify the lineage option if you are working with vertebrate genomes. -> {: .comment} -> -> 2. Rename the outputs as `BUSCO hap1` and `BUSCO hap2`. +> +> **Step 1**: Run {% tool [Busco](toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0) %} with the following parameters: +> 1. {% icon param-files %} *"Sequences to analyze"*: `Hap1 contigs FASTA` and `Hap2 contigs FASTA` +> 2. *"Lineage data source"*: `Use cached lineage data` +> 3. *"Cached database with lineage"*: `Busco v5 Lineage Datasets` +> 4. *"Mode"*: `Genome assemblies (DNA)` +> 5. *"Use Augustus instead of Metaeuk"*: `Use Metaeuk` +> 6. *"Auto-detect or select lineage?"*: `Select lineage` +> 7. *"Lineage"*: `Saccharomycetes` +> 8. *"Which outputs should be generated"*: `short summary text` and `summary image` +> +>> +>> +>> Remember to modify the *"Lineage"* option if you are working with vertebrate genomes. +> {: .comment} +> +>
+>
+> **Step 2**: Rename the outputs as `BUSCO hap1` and `BUSCO hap2`.
> {: .hands_on}

We have asked {BUSCO} to generate two particular outputs: the short summary, and a summary image.

-![BUSCO for hap1 & hap2.](../../images/vgp_assembly/busco_hap1hap2.png "BUSCO results for hap1 (top) and hap2 (bottom). Each haplotype is showing the summary image output as well as the short summary output. The summary image gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary lists these as percentages as well. In our case, neither assembly seems to have duplicated BUSCO genes (there is a very low amount of dark blue in the summary images), though hap1 seems to be marginally less complete (there is more red in the summary imaged compared to hap2).")
+![BUSCO for hap1 & hap2.](../../images/vgp_assembly/busco_hap1hap2.svg "BUSCO results for hap1 and hap2. Each haplotype shows the summary image output as well as the short summary output. The summary image gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary lists these as percentages as well. In our case, neither assembly seems to have duplicated BUSCO genes (there is a very low amount of dark blue in the summary images).")

>
>
@@ -515,82 +632,98 @@ We have asked {BUSCO} to generate two particular outputs: the short summary, and
>
> >
> >
-> > 1. Hap1 has 1864 complete and single-copy BUSCO genes, which is 87.2% of the gene set.
-> > 2. 181 BUSCO genes are missing.
+> > 1. Hap1 has 2,047 complete and single-copy BUSCO genes, which is 95.8% of the gene set.
+> > 2. 29 BUSCO genes are missing.
> >
> {: .solution}
>
{: .question}

-Despite BUSCO being robust for species that have been widely studied, it can be inaccurate when the newly assembled genome belongs to a taxonomic group that is not well represented in [OrthoDB](https://www.orthodb.org/). Merqury provides a complementary approach for assessing genome assembly quality metrics in a reference-free manner via *k*-mer copy number analysis.
+Although BUSCO is robust for species that have been widely studied, it can be inaccurate when the newly assembled genome belongs to a taxonomic group that is not well represented in [OrthoDB](https://www.orthodb.org/). Merqury provides a complementary approach for assessing genome assembly quality metrics in a reference-free manner via *k*-mer copy number analysis. Let's run the Merqury evaluation as shown below.

> k-mer based evaluation with Merqury
>
-> 1. {% tool [Merqury](toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3) %} with the following parameters:
-> - *"Evaluation mode"*: `Default mode`
-> - {% icon param-file %} *"k-mer counts database"*: `Merged meryldb`
-> - *"Number of assemblies"*: `Two assemblies
-> - {% icon param-file %} *"First genome assembly"*: `Hap1 contigs FASTA`
-> - {% icon param-file %} *"Second genome assembly"*: `Hap2 contigs FASTA`
+> Run {% tool [Merqury](toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3+galaxy3) %} with the following parameters:
+>
+> 1. *"Evaluation mode"*: `Default mode`
+> 2. {% icon param-file %} *"k-mer counts database"*: `Merged meryldb`
+> 3. *"Number of assemblies"*: `Two assemblies`
+> 4. {% icon param-file %} *"First genome assembly"*: `Hap1 contigs FASTA`
+> 5. {% icon param-file %} *"Second genome assembly"*: `Hap2 contigs FASTA`
> {: .hands_on}

-By default, Merqury generates three collections as output: stats, plots and {QV} stats. 
The "stats" collection contains the completeness statistics, while the "QV stats" collection contains the quality value statistics. Let's have a look at the assembly {CN} spectrum plot, known as the *spectra-cn* plot (fig. 7). +By default, Merqury generates three collections as output: stats, plots and {QV} stats. The "stats" collection contains the completeness statistics, while the "QV stats" collection contains the quality value statistics. Let's have a look at the assembly {CN} spectrum plot, known as the *spectra-cn.fl* plot: -![Merqury spectra-cn plot for the hap1/hap2 assemblies.](../../images/vgp_assembly/merqury_cn_plot.png "Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the Hi-Fi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve."){:width="65%"} +![Merqury spectra-cn plot for the hap1/hap2 assemblies.](../../images/vgp_assembly/merqury_cn_plot.png "Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve."){:width="65%"} -The black region in the left side corresponds to k-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be indicative of missing sequences in the assembly. The red area represents one-copy k-mers in the genome, while the blue area represents two-copy k-mers originating from homozygous sequence or haplotype-specific duplications. From this figure we can state that the diploid sequencing coverage is around 50x, which we also know from the GenomeScope2 plot we looked at earlier. +The grey region in the left side corresponds to *k*-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be a result of missing sequences in the assembly. The red area represents one-copy *k*-mers in the genome, while the blue area represents two-copy *k*-mers originating from homozygous sequences or haplotype-specific duplications. From this figure, we can state that the diploid sequencing coverage is around 50✕, which we also know from the GenomeScope2 plot we looked at [earlier](#figure-5). -To get an idea of how the *k*-mers have been distributed between our hap1 and hap2 assemblies, we should look at the *spectra-asm* output of Merqury. +To get an idea of how the *k*-mers have been distributed between our hap1 and hap2 assemblies, we should look at the *spectra-asm.fl* output of Merqury. -![Merqury spectra-asm plot for the hap1/hap2 assemblies.](../../images/vgp_assembly/merqury_hap1hap2_asm.png "Merqury ASM plot. This plot tracks the multiplicity of each k-mer found in the Hi-Fi read set and colors it according to which assemblies contain those k-mers. This can tell you which k-mers are found in only one assembly or shared between them."){:width="65%"} +![Merqury spectra-asm plot for the hap1/hap2 assemblies.](../../images/vgp_assembly/merqury_hap1hap2_asm.png "Merqury ASM plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it according to which assemblies contain those k-mers. 
This can tell you which k-mers are found in only one assembly or shared between them."){:width="65%"}

-The large green peak is centered at 50x coverage (remember that's our diploid coverage!), indicating that *k*-mers suggested by the reads to be from diploid regions are in fact shared between the two assemblies, as they should be if they are from homozygous regions. The haploid coverage *k*-mers (around ~25x coverage) are split between hap1 and hap2 assemblies, somewhat unevenly but still not as bad as it would be in an assembly without phasing data.
+The large green peak is centered at 50✕ coverage (remember that's our diploid coverage!), indicating that *k*-mers suggested by the reads to be from diploid regions are in fact shared between the two assemblies, as they should be if they are from homozygous regions. The haploid-coverage *k*-mers (~25✕ coverage) are split between the hap1 and hap2 assemblies, somewhat unevenly, but far less so than in an assembly built without phasing data.
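+
+To make the bookkeeping behind these plots concrete, here is a deliberately naive Python sketch of the spectra-cn idea (purely illustrative: unlike meryl/Merqury, it ignores reverse complements and would not scale to a real read set):
+
+```python
+from collections import Counter
+
+K = 21  # illustrative k-mer size; the real k is fixed when the meryl database is built
+
+def kmers(seq, k=K):
+    """Yield all overlapping k-mers of a sequence (no canonicalization here)."""
+    for i in range(len(seq) - k + 1):
+        yield seq[i:i + k]
+
+def spectra_cn(reads, assemblies):
+    """Bin read k-mer multiplicities by total copy number across the assemblies."""
+    read_mult = Counter(km for read in reads for km in kmers(read))
+    asm_copies = Counter(km for asm in assemblies for contig in asm for km in kmers(contig))
+    hist = Counter()
+    for km, multiplicity in read_mult.items():
+        copy_number = asm_copies.get(km, 0)  # 0 = the "read-only" k-mers
+        hist[(copy_number, multiplicity)] += 1
+    return hist
+```
+
+Plotted as stacked histograms over multiplicity, with one color per copy number, this gives a spectra-cn-style figure: homozygous *k*-mers stack at ~50✕ multiplicity with copy number 2 (one copy in each haplotype), while well-phased heterozygous *k*-mers sit at ~25✕ with copy number 1.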
+ + + + +
## Pseudohaplotype assembly with **hifiasm** -When hifiasm is run without any additional phasing data, it will do its best to generate a pseudohaplotype primary/alternate set of assemblies. These assemblies will typically contain more contigs that switch between parental blocks. Because of this, the primary assembly generated with this method can have a higher N50 value than an assembly generated with haplotype-phasing, but the contigs will contain more switch errors. +When hifiasm is run without any additional phasing data, it will do its best to generate a pseudohaplotype primary/alternate set of assemblies. These assemblies will typically contain more contigs that switch between parental blocks. Because of this, the primary assembly generated with this method can have a higher $$ N50 $$ value than an assembly generated with haplotype-phasing, but the contigs will contain more switch errors. > Pseudohaplotype assembly with hifiasm -> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.18.8+galaxy1) %} with the following parameters: -> - *"Assembly mode"*: `Standard` -> - {% icon param-file %} *"Input reads"*: `HiFi_collection (trim)` (output of **Cutadapt** {% icon tool %}) -> - *"Options for purging duplicates"*: `Specify` -> - *"Purge level"*: `Light (1)` > +> **Step 1**: Run {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.19.8+galaxy0) %} with the following parameters: > -> > A note on hifiasm purging level -> > hifiasm has an internal purging function, which we have set to `Light` here. The VGP pipeline currently disables the hifiasm internal purging, in favor of using the purge_dups suite after the fact in order to have more control over the parameters used for purging. -> {: .comment} +> 1. *"Assembly mode"*: `Standard` +> 2. {% icon param-file %} *"Input reads"*: `HiFi_collection (trim)` (output of **Cutadapt** {% icon tool %}) +> 3. *"Options for purging duplicates"*: `Specify` +> 4. *"Purge level"*: `Light (1)` +> +> +>> A note on hifiasm purging level +>> hifiasm has an internal purging function, which we have set to `Light` here. The VGP pipeline currently disables the hifiasm internal purging, in favor of using the purge_dups suite after the fact in order to have more control over the parameters used for purging. +>{: .comment} > +>
>
-> 2. After the tool has finished running, rename its outputs as follows:
-> - Rename the `primary assembly contig graph for pseudohaplotype assembly` as `Primary contigs graph` and add a `#pri` tag
-> - Rename the `alternate assembly contig graph for pseudohaplotype assemblyh` as `Alternate contigs graph` and add a `#alt` tag
+> **Step 2**: After the tool has finished running, rename its outputs as follows:
+> 1. Rename the `primary assembly contig graph for pseudohaplotype assembly` as `Primary contigs graph` and add a `#pri` tag
+> 2. Rename the `alternate assembly contig graph for pseudohaplotype assembly` as `Alternate contigs graph` and add a `#alt` tag
> {: .hands_on}

We have obtained the primary and alternate contig graphs (as {GFA} files), but these must be converted to FASTA format for subsequent steps. We will use a tool developed by the VGP: **gfastats**. gfastats is a tool suite that allows for manipulation and evaluation of FASTA and GFA files, but in this instance we will use it to convert our GFAs to FASTA files. Later on we will use it to generate standard summary statistics for our assemblies.
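+
+Conceptually, the conversion itself is simple: every segment (`S`) line in a GFA file carries a sequence that can be written out as a FASTA record. A minimal Python sketch of just that step is shown below (illustrative only; the file names are hypothetical, and gfastats itself does more, e.g. handling the path information implied by the *"Generates the initial set of paths"* option we set next):
+
+```python
+def gfa_segments_to_fasta(gfa_path, fasta_path):
+    """Write every GFA segment (S line) out as a FASTA record."""
+    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
+        for line in gfa:
+            if line.startswith("S\t"):
+                # S lines look like: S <tab> name <tab> sequence [<tab> optional tags]
+                _, name, seq = line.rstrip("\n").split("\t")[:3]
+                fasta.write(f">{name}\n{seq}\n")
+
+gfa_segments_to_fasta("primary_contigs.gfa", "primary_contigs.fasta")  # hypothetical paths
+```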
+ +
+ > convert GFA to FASTA > -> 1. {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters: -> - {% icon param-files %} *"Input GFA file"*: select `Primary contigs graph` and the `Alternate contigs graph` datasets +>> Selecting multiple datasets +>> Below we start two `gfastats` jobs by selecting two input datasets. To do this: +>> 1. Click on {% icon param-files %} **Multiple datasets** +>> 2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest +>{: .tip} > -> > Select multiple datasets -> > 1. Click on {% icon param-files %} **Multiple datasets** -> > 2. Select several files by keeping the Ctrl (or COMMAND) key pressed and clicking on the files of interest -> {: .tip} +>
>
-> - *"Tool mode"*: `Genome assembly manipulation`
-> - *"Output format"*: `FASTA`
-> - *"Generates the initial set of paths*": toggle to `yes`
-> 2. Rename the outputs as `Primary contigs FASTA` and `Alternate contigs FASTA`
+> **Step 1**: Run {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters:
+> 1. {% icon param-files %} *"Input GFA file"*: select `Primary contigs graph` and the `Alternate contigs graph` datasets
+> 2. *"Tool mode"*: `Genome assembly manipulation`
+> 3. *"Output format"*: `FASTA`
+> 4. *"Generates the initial set of paths"*: toggle to `yes`
+>
+> **Step 2**: Rename the outputs as `Primary contigs FASTA` and `Alternate contigs FASTA`
> {: .hands_on}

@@ -609,26 +742,78 @@ We have obtained the primary and alternate contig graphs (as {GFA} files), but t

Let's use gfastats to get a basic idea of what our assembly looks like. We'll run gfastats on the {GFA} files because gfastats can report graph-specific statistics as well. After generating the stats, we'll be doing some text manipulation to get the stats side-by-side in a pretty fashion.
+

> Assembly evaluation with gfastats
>
-> 1. {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `Primary contigs graph` and the `Alternate contigs graph` datasets
-> - *"Expected genome size"*: `11747160` (remember we calculated this value earlier, so it should be in your history!)
-> - *"Generates the initial set of paths*": toggle to `yes`
-> 2. Rename the outputs as `Primary stats` and `Alternate stats`
-> 3. {% tool [Column join](toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `Primary stats` and the `Alternate stats` datasets
-> 4. Rename the output as `gfastats on pri and alt (full)`
-> 5. {% tool [Search in textfiles](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters:
-> - {% icon param-files %} *"Input file"*: select `gfastats on pri and alt (full)`
-> - *"that"*: `Don't Match`
-> - *"Type of regex"*: `Basic`
-> - *"Regular Expression"*: `[Ss]caffold`
-> 6. Rename the output as `gfastats on pri and alt contigs`
+> **Step 1**: Run assembly statistics generation with {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} using the following parameters:
+>
+> 1. {% icon param-files %} *"Input file"*: select `Primary contigs graph` and the `Alternate contigs graph` datasets
+> 2. *"Tool mode"*: `Summary statistics generation`
+> 3. *"Expected genome size"*: `11747160` (remember we calculated this value earlier using [`GenomeScope2`](#genome-profiling-with-genomescope2). It is contained in the `GenomeScope2` **Summary** output, which should be in your history!)
+> 4. *"Thousands separator in output"*: set to `No`
+> 5. *"Generates the initial set of paths"*: toggle to `yes`
+>
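+> A quick aside on one statistic the summary reports: the N50 is the length such that sequences of that length or longer make up at least half of the total assembly length. A minimal Python sketch of the computation (purely illustrative, not part of the workflow):
+>
+> ```python
+> def n50(lengths):
+>     """Length at which the running total (longest first) reaches half the assembly size."""
+>     half, running = sum(lengths) / 2, 0
+>     for length in sorted(lengths, reverse=True):
+>         running += length
+>         if running >= half:
+>             return length
+>
+> print(n50([8, 7, 5, 4, 2]))  # 7: the total is 26, and 8 + 7 = 15 >= 13
+> ```
+>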
+>
+> **Step 2**: Rename the outputs of the `gfastats` step as `Primary stats` and `Alternate stats`
+>
+> This generates summary files that look like this (only the first six rows are shown):
+>
+> ```
+> Expected genome size 11747160
+> # scaffolds 25
+> Total scaffold length 18519764
+> Average scaffold length 740790.56
+> Scaffold N50 813311
+> Scaffold auN 913050.77
+> ```
+>
+> Because we ran `gfastats` on the Primary and Alternate outputs of `hifiasm`, we need to join the two outputs together for easier interpretation:
+>
+>
+>
+> **Step 3**: Run {% tool [Column join](toolshed.g2.bx.psu.edu/repos/iuc/collection_column_join/collection_column_join/0.0.3) %} with the following parameters:
+>
+> {% icon param-files %} *"Input file"*: select `Primary stats` and the `Alternate stats` datasets (these are from **Step 2** above). Keep all other settings as they are. (See the sketch below for what this join does.)
+>
+>
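+> Under the hood, this join simply pairs rows from the two files by their first column (the statistic name). A minimal Python sketch of the idea (illustrative only; the file names are hypothetical, and the Galaxy tool handles collections and formatting for you):
+>
+> ```python
+> def load(path):
+>     # assumes each line is "<statistic name><TAB><value>"
+>     return dict(line.rstrip("\n").rsplit("\t", 1) for line in open(path))
+>
+> pri = load("primary_stats.tsv")
+> alt = load("alternate_stats.tsv")
+> for stat in pri:
+>     print(stat, pri[stat], alt.get(stat, "."), sep="\t")  # "." marks values missing in one file
+> ```
+>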
+>
+> **Step 4**: Rename the output as `gfastats on Pri and Alt (full)`
+>
+> This generates a joined summary file that looks like this (only the first five rows are shown):
+>
+> ```
+> # contigs 25 10
+> # dead ends . 16
+> # disconnected components . 7
+> # edges . 6
+> # gaps 0 0
+> ```
+>
+> Now let's extract only the relevant information by excluding all lines containing the word `scaffold`, since there are no scaffolds at this stage of the assembly process (only contigs):
+>
+>
+>
+> **Step 5**: Run {% tool [Search in textfiles](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters:
+> 1. {% icon param-files %} *"Input file"*: select `gfastats on Pri and Alt (full)`
+> 2. *"that"*: `Don't Match`
+> 3. *"Type of regex"*: `Basic`
+> 4. *"Regular Expression"*: enter the word `scaffold`
+> 5. *"Match type"*: leave as `case insensitive`
+>
+>
+>
+> **Step 6**: Rename the output as `gfastats on Pri and Alt contigs`
> {: .hands_on}

-Take a look at the _gfastats on pri and alt contigs_ output — it should have three columns: 1) name of statistic, 2) primary assembly value, and 3) alternate assembly value. The report makes it clear that the two assemblies are markedly uneven: the primary assembly has 25 contigs totalling ~18.5 Mbp, while the alternate assembly has 8 contigs totalling only about 4.95 Mbp. If you'll remember that our estimated genome size is ~11.7 Mbp, then you'll see that the primary assembly has almost 2/3 more sequence than expected for a haploid representation of the genome! This is because a lot of heterozygous regions have had *both* copies of those loci placed into the primary assembly, as a result of incomplete purging. The presence of false duplications can be confirmed by looking at {BUSCO} and Merqury results.
+Take a look at the `gfastats on Pri and Alt contigs` output — it has three columns:
+
+ 1. Name of statistic
+ 2. Value for the primary assembly (Pri)
+ 3. Value for the alternate assembly (Alt)
+
+The report makes it clear that the two assemblies are markedly uneven: the primary assembly has 25 contigs totalling ~18.5 Mbp, while the alternate assembly has 8 contigs totalling only about 4.95 Mbp. If you'll remember that our estimated genome size is ~11.7 Mbp, then you'll see that the primary assembly has almost 60% more sequence than expected for a haploid representation of the genome! This is because a lot of heterozygous regions have had *both* copies of those loci placed into the primary assembly, as a result of incomplete purging. The presence of false duplications can be confirmed by looking at {BUSCO} and Merqury results.

>
>
@@ -647,26 +832,31 @@ Take a look at the _gfastats on pri and alt contigs_ output — it should have t

Next, we will use {BUSCO}, which will provide quantitative assessment of the completeness of a genome assembly in terms of expected gene content. It relies on the analysis of genes that should be present only once in a complete assembly or gene set, while allowing for rare gene duplications or losses ({% cite Simo2015 %}).

> Assessing assembly completeness with BUSCO
->
-> 1. {% tool [Busco](toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.0.0+galaxy0) %} with the following parameters:
-> - {% icon param-files %} *"Sequences to analyze"*: `Primary contigs FASTA`
-> - *"Mode"*: `Genome assemblies (DNA)`
-> - *"Use Augustus instead of Metaeuk"*: `Use Metaeuk`
-> - *"Auto-detect or select lineage?"*: `Select lineage`
-> - *"Lineage"*: `Saccharomycetes`
-> - *"Which outputs should be generated"*: `short summary text` and `summary image`
->
-> >
-> >
-> > Remember to modify the lineage option if you are working with vertebrate genomes.
-> {: .comment}
->
-> 2. Rename the outputs as `BUSCO primary contigs`.
+>
+> **Step 1**: Run {% tool [Busco](toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0) %} with the following parameters:
+> 1. {% icon param-files %} *"Sequences to analyze"*: `Primary contigs FASTA` and `Alternate contigs FASTA`
+> 2. *"Lineage data source"*: `Use cached lineage data`
+> 3. *"Cached database with lineage"*: `Busco v5 Lineage Datasets`
+> 4. *"Mode"*: `Genome assemblies (DNA)`
+> 5. *"Use Augustus instead of Metaeuk"*: `Use Metaeuk`
+> 6. *"Auto-detect or select lineage?"*: `Select lineage`
+> 7. *"Lineage"*: `Saccharomycetes`
+> 8. 
*"Which outputs should be generated"*: `short summary text` and `summary image` +> +>> +>> +>> Remember to modify the *"Lineage"* option if you are working with vertebrate genomes. +> {: .comment} +> +>
+>
+> **Step 2**: Rename the outputs as `BUSCO Pri` and `BUSCO Alt`.
> {: .hands_on}

We have asked {BUSCO} to generate two particular outputs: the short summary, and a summary image.

-![BUSCO for primary contigs.](../../images/vgp_assembly/busco_pri_unpurged.png "BUSCO results for the primary contigs. The summary image (left) gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary (right) lists these as percentages as well. In this case, this primary assembly seems to have a large amount of duplicated BUSCO genes, but is otherwise complete (i.e., not much missing content).")
+
+![BUSCO for primary and alternate contigs.](../../images/vgp_assembly/busco_pri_alt_solo.svg "BUSCO results for primary and alternate contigs. The summary image (left) gives a good overall idea of the status of BUSCO genes within the assembly, while the short summary (right) lists these as percentages as well. In this case, the primary assembly seems to have a large amount of duplicated BUSCO genes, but is otherwise complete (i.e., not much missing content).")

The BUSCO results support our hypothesis that the primary assembly is so much larger than expected due to improper purging, resulting in false duplications.

@@ -690,40 +880,42 @@ Despite BUSCO being robust for species that have been widely studied, it can be

> k-mer based evaluation with Merqury
>
-> 1. {% tool [Merqury](toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3) %} with the following parameters:
-> - *"Evaluation mode"*: `Default mode`
-> - {% icon param-file %} *"k-mer counts database"*: `Merged meryldb`
-> - *"Number of assemblies"*: `Two assemblies
-> - {% icon param-file %} *"First genome assembly"*: `Primary contigs FASTA`
-> - {% icon param-file %} *"Second genome assembly"*: `Alternate contigs FASTA`
+> Run {% tool [Merqury](toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3+galaxy3) %} with the following parameters:
+>
+> 1. *"Evaluation mode"*: `Default mode`
+> 2. {% icon param-file %} *"k-mer counts database"*: `Merged meryldb`
+> 3. *"Number of assemblies"*: `Two assemblies`
+> 4. {% icon param-file %} *"First genome assembly"*: `Primary contigs FASTA`
+> 5. {% icon param-file %} *"Second genome assembly"*: `Alternate contigs FASTA`
+>
+> (REMINDER: `Primary contigs FASTA` and `Alternate contigs FASTA` were generated [earlier](#gfa2fasta_solo))
> {: .hands_on}

-By default, Merqury generates three collections as output: stats, plots and {QV} stats. The "stats" collection contains the completeness statistics, while the "QV stats" collection contains the quality value statistics. Let's have a look at the assembly {CN} spectrum plot, known as the *spectra-cn* plot (fig. 7).
+By default, Merqury generates three collections as output: stats, plots and {QV} stats. The "stats" collection contains the completeness statistics, while the "QV stats" collection contains the quality value statistics. Let's have a look at the assembly {CN} spectrum plot, known as the *spectra-cn* plot:

-![Merqury spectra-cn plot for the pri/alt assemblies.](../../images/vgp_assembly/merqury_cn_plot.png "Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the Hi-Fi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve."){:width="65%"}
+![Merqury spectra-cn plot for the pri/alt assemblies.](../../images/vgp_assembly/merqury_cn_plot.png "Merqury CN plot. This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it by the number of times it is found in a given assembly. Merqury connects the midpoint of each histogram bin with a line, giving the illusion of a smooth curve."){:width="65%"}

-The black region in the left side corresponds to k-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be indicative of missing sequences in the assembly. The red area represents one-copy k-mers in the genome, while the blue area represents two-copy k-mers originating from homozygous sequence or haplotype-specific duplications. From this figure we can state that the diploid sequencing coverage is around 50x, which we also know from the GenomeScope2 plot we looked at earlier.
+The grey region on the left side corresponds to *k*-mers found only in the read set; it is usually indicative of sequencing error in the read set, although it can also be a result of missing sequences in the assembly. The red area represents one-copy *k*-mers in the genome, while the blue area represents two-copy *k*-mers originating from homozygous sequences or haplotype-specific duplications. From this figure, we can state that the diploid sequencing coverage is around 50✕, which we also know from the GenomeScope2 plot we looked at earlier.

-To get an idea of how the *k*-mers have been distributed between our hap1 and hap2 assemblies, we should look at the *spectra-asm* output of Merqury.
+To get an idea of how the *k*-mers have been distributed between our Primary and Alternate assemblies, we should look at the *spectra-asm* output of Merqury.

-![Merqury spectra-asm plot for the hap1/hap2 assemblies.](../../images/vgp_assembly/merqury_prialt_asm_prepurge.png "Merqury ASM plot. This plot tracks the multiplicity of each k-mer found in the Hi-Fi read set and colors it according to which assemblies contain those k-mers. This can tell you which k-mers are found in only one assembly or shared between them."){:width="65%"}
+![Merqury spectra-asm plot for the pri/alt assemblies.](../../images/vgp_assembly/merqury_prialt_asm_prepurge.png "Merqury ASM plot. 
This plot tracks the multiplicity of each k-mer found in the HiFi read set and colors it according to which assemblies contain those k-mers. This can tell you which k-mers are found in only one assembly or shared between them."){:width="65%"}

-For an idea of what a properly phased spectra-asm plot would look like, **please click over to the Hi-C phasing version of this tutorial**. A properly phased spectra-asm plot should have a large green peak centered around the point of diploid coverage (here ~50X), and the two assembly-specific peaks should be centered around the point of haploid coverage (here ~25X) and resembling each other in size.
+For an idea of what a properly phased spectra-asm plot would look like, **please [go over](#solo_hic_switch) to the Hi-C phasing version of this tutorial**. A properly phased spectra-asm plot should have a large green peak centered around the point of diploid coverage (here ~50✕), and the two assembly-specific peaks should be centered around the point of haploid coverage (here ~25✕) and should resemble each other in size.

The spectra-asm plot we have for our primary & alternate assemblies here does not resemble one that is properly phased. There is a peak of green (shared) *k*-mers around diploid coverage, indicating that some homozygous regions have been properly split between the primary and alternate assemblies; however, there is still a large red peak of primary-assembly-only *k*-mers at that coverage value, too, which means that some homozygous regions are being represented twice in the primary assembly, instead of once in the primary and once in the alternate. Additionally, for the haploid peaks, the primary-only peak (in red) is much larger than the alternate-only peak (in blue), indicating that a lot of heterozygous regions might have both their alternate alleles represented in the primary assembly, which is false duplication.

-For further confirmation, we can also look at the individual, assembly-specific {CN} plots. In the Merqury outputs, the `output_merqury.assembly_01.spectra-cn.fl` is a {CN} spectra with *k*-mers colored according to their copy number in the primary assembly.
+For further confirmation, we can also look at the individual, assembly-specific {CN} plots. In the Merqury outputs, the `output_merqury.assembly_01.spectra-cn.fl` is a {CN} spectrum with *k*-mers colored according to their copy number in the primary assembly:

-![Merqury spectra-cn plot for the pri assembly only.](../../images/vgp_assembly/merqury_prialt_priCN_prepurge.png "Merqury CN plot for the primary assembly only. This plot colors k-mers according to their copy number in the primary assembly. K-mers that are present in the reads but not the primary assembly are labelled 'read-only'."){:width="65%"}
+![Merqury spectra-cn plot for the pri assembly only.](../../images/vgp_assembly/merqury_prialt_priCN_prepurge.png "Merqury CN plot for the primary assembly only. This plot colors k-mers according to their copy number in the primary assembly. k-mers that are present in the reads but not the primary assembly are labelled 'read-only'."){:width="65%"}

-In the primary-only {CN} plot, we observe a large 2-copy (colored blue) peak at diploid coverage. Ideally, this would not be here, beacause these diploid regions would be *1-copy in both assemblies*. Purging this assembly should reconcile this by removing one copy of false duplicates, making these 2-copy *k*-mers 1-copy. 
You might notice the 'read-only' peak at haploid coverage — this is actually expected, because 'read-only' here just means that the *k*-mer in question is not seen in this specific assembly while it was in the original readset. **Often, these 'read-only' _k_-mers are actually present as alternate loci in the other assembly.** +In the primary-only {CN} plot, we observe a large 2-copy (colored blue) peak at diploid coverage. Ideally, this would not be here, because these diploid regions would be *1-copy in both assemblies*. Purging this assembly should reconcile this by removing one copy of false duplicates, making these 2-copy *k*-mers 1-copy. You might notice the 'read-only' peak at haploid coverage — this is actually expected, because 'read-only' here just means that the *k*-mer in question is not seen in this specific assembly while it was in the original readset. **Often, these 'read-only' _k_-mers are actually present as alternate loci in the other assembly.** Now that we have looked at our primary assembly with multiple {QC} metrics, we know that it should undergo purging. The VGP pipeline uses **purge_dups** to remove false duplications from the primary assembly and put them in the alternate assembly to reconcile the haplotypes. Additionally, purge_dups can also find collapsed repeats and regions of suspiciously low coverage. ## Purging the primary and alternate assemblies - Before proceeding to purging, we need to carry out some text manipulation operations on the output generated by GenomeScope2 to make it compatible with downstream tools. The goal is to extract some parameters which at a later stage will be used by **purge_dups**. ### Parsing **purge_dups** cutoffs from **GenomeScope2** output @@ -732,34 +924,47 @@ The first relevant parameter is the `estimated genome size`. > Get estimated genome size > -> 1. {% tool [Replace parts of text](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.4) %} with the following parameters: -> - {% icon param-file %} *"File to process"*: `summary` (output of **GenomeScope** {% icon tool %}) -> - *"Find pattern"*: `bp` -> - *"Replace with*": leave this field empty (as it is) -> - *"Replace all occurrences of the pattern"*: `Yes` -> - *"Find and Replace text in"*: `entire line` -> -> 2. {% tool [Replace parts of text](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.4) %} with the following parameters: -> - {% icon param-file %} *"File to process"*: output file of **Replace** {% icon tool %} -> - *"Find pattern"*: `,` -> - *"Replace with*": leave this field empty (as it is) -> - *"Replace all occurrences of the pattern"*: `Yes` -> - *"Find and Replace text in"*: `entire line` -> -> 3. {% tool [Search in textfiles (grep)](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_grep_tool/1.1.1) %} with the following parameters: -> - {% icon param-file %} *"Select lines from"*: output file of the previous step. -> - *"Type of regex"*: `Basic` -> - *"Regular Expression"*: `Haploid` -> -> 4. {% tool [Convert delimiters to TAB](Convert characters1) %} with the following parameters: -> - {% icon param-file %} *"in Dataset"*: output of **Search in textfiles** {% icon tool %} -> -> 5. {% tool [Advanced Cut](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0) %} with the following parameters: -> - {% icon param-file %} *"File to cut"*: output of **Convert delimiters to TAB** {% icon tool %} -> - *"Cut by"*: `fields` -> - *"List of Fields"*: `Column: 5` -> -> 6. 
Rename the output as `Estimated genome size`. +>**Step 1**: Open {% tool [Replace parts of text](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.4) %} +>
+>**Step 2**: Scroll down to find the *"+ Insert Find and Replace"* button and click it.
+>
+>**Step 3**: Scroll down again to find the *"+ Insert Find and Replace"* button and click it again. After this, you should have the *"Find and Replace"* panel repeated three times: *"1: Find and Replace"*, *"2: Find and Replace"*, and *"3: Find and Replace"*.
+>
+>**Step 4**: In {% icon param-file %} *"File to process"*, select the `GenomeScope summary` output (generated during the *k*-mer profiling [step](#genome-profiling-with-genomescope2)). The input file should have content that looks like this (yours may not match exactly):
+> ```
+> GenomeScope version 2.0
+> input file = ....
+> output directory = .
+> p = 2
+> k = 31
+> TESTING set to TRUE
+>
+> property min max
+> Homozygous (aa) 99.4165% 99.4241%
+> Heterozygous (ab) 0.575891% 0.583546%
+> Genome Haploid Length 11,739,321 bp 11,747,160 bp
+> Genome Repeat Length 722,921 bp 723,404 bp
+> Genome Unique Length 11,016,399 bp 11,023,755 bp
+> Model Fit 92.5159% 96.5191%
+> Read Error Rate 0.000943206% 0.000943206%
+> ```
+>
+>**Step 5**: In the first Find and Replace panel *"1: Find and Replace"* set the following parameters:
+> 1. *"Find pattern"*: `^(?!Genome Haploid Length).*\n`
+> 2. *"Find-Pattern is a regular expression"*: Toggle to `Yes`
+>
+>
+>**Step 6**: In the second Find and Replace panel *"2: Find and Replace"* set the following parameters: +> 1. *"Find pattern"*: `Genome Haploid Length\s+(\d{1,3}(?:,\d{3})*\s+bp)\s+(\d{1,3}(?:,\d{3})*)\s+bp` +> 2. *"Replace with"*: `$2` +> 3. *"Find-Pattern is a regular expression"*: Toggle to `Yes` +> +>
+>**Step 7**: In the third Find and Replace panel *"3: Find and Replace"* set the following parameters:
+> 1. *"Find pattern"*: `,` (Yes, just a comma)
+> 2. *"Replace with"*: leave this field empty
+>
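+>
+> Taken together, the three operations reduce the whole summary to a single number. A minimal Python sketch of their combined effect (outside Galaxy, purely for illustration; `summary.txt` stands in for a local copy of the `GenomeScope summary` file):
+>
+> ```python
+> import re
+>
+> text = open("summary.txt").read()
+> # 1: delete every line that does not start with "Genome Haploid Length"
+> text = re.sub(r"^(?!Genome Haploid Length).*\n", "", text, flags=re.M)
+> # 2: keep only the second (maximum) length estimate
+> text = re.sub(r"Genome Haploid Length\s+(\d{1,3}(?:,\d{3})*\s+bp)\s+(\d{1,3}(?:,\d{3})*)\s+bp", r"\2", text)
+> # 3: drop the thousands separators
+> text = text.replace(",", "")
+> print(text.strip())  # 11747160
+> ```
+>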
+>**Step 8**: Rename the output as `Estimated genome size`.
>
> >
> >
@@ -767,7 +972,7 @@ The first relevant parameter is the `estimated genome size`.
> >
> > >
> > >
-> > > The estimated genome size is 11,747,076 bp.
+> > > The estimated genome size is 11,747,160 bp.
> > >
> > {: .solution}
> >
@@ -775,25 +980,28 @@ The first relevant parameter is the `estimated genome size`.
>
{: .hands_on}

-Now let's parse the `transition between haploid & diploid` and `upper bound for the read depth estimation` parameters. The transition between haploid & diploid represents the coverage value halfway between haploid and diploid coverage, and helps purger_dups identify *haplotigs*. The upper bound parameter will be used by purge_dups as high read depth cutoff to identify *collapsed repeats*. When repeats are collapsed in an assembly, they are not as long as they actually are in the genome. This results in a pileup of reads at the collapsed region when mapping the reads back to the assembly.
+Now let's parse the `transition between haploid & diploid` and `upper bound for the read depth estimation` parameters. The transition between haploid & diploid represents the coverage value halfway between haploid and diploid coverage, and helps purge_dups identify *haplotigs*. The upper bound parameter will be used by purge_dups as the high read-depth cutoff to identify *collapsed repeats*. When repeats are collapsed in an assembly, they are not as long as they actually are in the genome. This results in a pileup of reads at the collapsed region when mapping the reads back to the assembly. Concretely, the expressions below compute the transition as round(1.5 × *k*-mer coverage) and the maximum depth as 3 × transition; with our haploid coverage of ~25✕, these come out to 38 and 114, the values used in the following steps.

> Get maximum read depth
>
-> 1. {% tool [Compute on rows](toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.0) %} with the following parameters:
-> - {% icon param-file %} *"Input file"*: `model_params` (output of **GenomeScope** {% icon tool %})
-> - For 1: Expressions:
> - *"Add expression"*: `round(1.5*c3)`
> - *"Mode of the operation"*: `Append`
-> - Click {% icon galaxy-wf-new %} Insert Expressions
-> - For 2: Expressions:
> - *"Add expression"*: `3*c7`
> - *"Mode of the operation"*: `Append`
-> 2. Rename it as `Parsing purge parameters`
-> 3. {% tool [Advanced Cut](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0) %} with the following parameters:
-> - {% icon param-file %} *"File to cut"*: `Parsing purge parameters`
-> - *"Cut by"*: `fields`
-> - *"List of Fields"*: `Column: 8`
-> 4. Rename the output as `Maximum depth`
+>**Step 1**: Run {% tool [Compute on rows](toolshed.g2.bx.psu.edu/repos/devteam/column_maker/Add_a_column1/2.0) %} with the following parameters:
+> 1. {% icon param-file %} *"Input file"*: `model_params` (output of **GenomeScope** {% icon tool %})
+> 2. For "*1: Expressions*":
> - *"Add expression"*: `round(1.5*c3)`
> - *"Mode of the operation"*: `Append`
+> 3. Click {% icon galaxy-wf-new %} Insert Expressions
+> 4. For "*2: Expressions*":
> - *"Add expression"*: `3*c7`
> - *"Mode of the operation"*: `Append`
+>
+>**Step 2**: Rename it as `Parsing purge parameters`
+>
+>**Step 3**: Run {% tool [Advanced Cut](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0) %} with the following parameters:
+> 1. {% icon param-file %} *"File to cut"*: `Parsing purge parameters`
+> 2. *"Cut by"*: `fields`
+> 3. *"List of Fields"*: `Column: 8`
+>
+>**Step 4**: Rename the output as `Maximum depth`
>
> >
> >
@@ -811,11 +1019,12 @@ Now let's parse the `transition between haploid & diploid` and `upper bound for
>
>
Now let's get the transition parameter.
>
-> 1. 
{% tool [Advanced Cut](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0) %} with the following parameters:
-> - {% icon param-file %} *"File to cut"*: `Parsing purge parameters`
-> - *"Cut by"*: `fields`
-> - *"List of Fields"*: `Column: 7`
-> 2. Rename the output as `Transition parameter`
+>**Step 5**: Run {% tool [Advanced Cut](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_cut_tool/1.1.0) %} with the following parameters:
+> 1. {% icon param-file %} *"File to cut"*: `Parsing purge parameters`
+> 2. *"Cut by"*: `fields`
+> 3. *"List of Fields"*: `Column: 7`
+>
+>**Step 6**: Rename the output as `Transition parameter`
>
> >
> >
@@ -846,39 +1055,39 @@ Initially, we need to collapse our HiFi trimmed reads collection into a single d

> Collapse the collection
>
-> 1. {% tool [Collapse Collection](toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2) %} with the following parameters:
-> - {% icon param-collection %} *"Collection of files to collapse into single dataset"*:`HiFi_collection (trim)`
-> 2. Rename the output as `HiFi reads collapsed`
+>**Step 1**: Run {% tool [Collapse Collection](toolshed.g2.bx.psu.edu/repos/nml/collapse_collections/collapse_dataset/4.2) %} with the following parameters:
+> - {% icon param-collection %} *"Collection of files to collapse into single dataset"*: `HiFi_collection (trim)`
+>
+>**Step 2**: Rename the output as `HiFi reads collapsed`
{: .hands_on}

Now, we will map the reads against the primary assembly by using Minimap2 ({% cite Li2018 %}), an alignment program designed to map long sequences.

> Map the reads to contigs with Minimap2
>
-> 1. {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Primary contigs FASTA`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-collection %} *"Select fastq dataset"*: `HiFi reads collapsed`
-> - *"Select a profile of preset options"*: `Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)`
-> - In *"Set advanced output options"*:
-> - *"Select an output format"*: `paf`
+>**Step 1**: Run {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Primary contigs FASTA`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-collection %} *"Select fastq dataset"*: `HiFi reads collapsed`
+> 5. *"Select a profile of preset options"*: `Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)`
+> 6. In *"Set advanced output options"* set *"Select an output format"*: `PAF`
>
-> 2. 
Rename the output as `Reads mapped to contigs`
+>**Step 2**: Rename the output as `Reads mapped to contigs`
{: .hands_on}

Finally, we will use the `Reads mapped to contigs` pairwise mapping format (PAF) file for calculating some statistics required in a later stage. In this step, purge_dups (listed as **Purge overlaps** in Galaxy tool panel) initially produces a read-depth histogram from base-level coverages. This information is used for estimating the coverage cutoffs, taking into account that collapsed haplotype contigs will lead to reads from both alleles mapping to those contigs, whereas if the alleles have assembled as separate contigs, then the reads will be split over the two contigs, resulting in half the read-depth ({% cite Roach2018 %}).

> Read-depth analysis
-> 1. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3) %} with the following parameters:
-> - *"Function mode"*: `Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)`
-> - {% icon param-file %} *"PAF input file"*: `Reads mapped to contigs`
-> - In *"Calcuts options"*:
-> - *"Transition between haploid and diploid"*: 38
-> - *"Upper bound for read depth"*: `114` (the previously estimated maximum depth)
-> - *"Ploidity"*: `Diploid`
->
-> 2. Rename the outputs as `PBCSTAT base coverage primary`, `Histogram plot primary` and `Calcuts cutoff primary`.
+>**Step 1**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters:
+> 1. *"Function mode"*: `Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)`
+> 2. {% icon param-file %} *"PAF input file"*: `Reads mapped to contigs`
+> 3. In *"Calcuts options"*:
+> - *"Transition between haploid and diploid"*: `38`
+> - *"Upper bound for read depth"*: `114` (the previously estimated maximum depth)
+> - *"Ploidy"*: `Diploid`
+>
+>**Step 2**: Rename the outputs as `PBCSTAT base coverage primary`, `Histogram plot primary` and `Calcuts cutoff primary`.
{: .hands_on}

Purge overlaps (purge_dups) generates three outputs:
@@ -887,28 +1096,26 @@
- Calcuts-cutoff: it includes the thresholds calculated by purge_dups.
- Histogram plot.
-
### Generation of all versus all self-alignment

Now, we will segment the draft assembly into contigs by cutting at blocks of *N*s, and use minimap2 to generate an all by all self-alignment.

> purge_dups pipeline
-> 1. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2) %} with the following parameters:
-> - *"Function mode"*: `split assembly FASTA file by 'N's (split_fa)`
-> - {% icon param-file %} *"Assembly FASTA file"*: `Primary contigs FASTA`
->
-> 2. Rename the output as `Split FASTA`
->
-> 3. 
{% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Split FASTA`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-file %} *"Select fastq dataset"*: `Split FASTA`
-> - *"Select a profile of preset options"*: `Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)`
-> - In *"Set advanced output options"*:
-> - *"Select an output format"*: `PAF`
->
-> 4. Rename the output as `Self-homology map primary`
+>**Step 1**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters:
+> 1. *"Function mode"*: `split assembly FASTA file by 'N's (split_fa)`
+> 2. {% icon param-file %} *"Assembly FASTA file"*: `Primary contigs FASTA`
+>
+>**Step 2**: Rename the output as `Split FASTA`
+>
+>**Step 3**: Run {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Split FASTA`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Split FASTA`
+> 5. *"Select a profile of preset options"*: `Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)`
+> 6. In *"Set advanced output options"*: set *"Select an output format"* to `PAF`
+>
+>**Step 4**: Rename the output as `Self-homology map primary`
{: .hands_on}

@@ -918,7 +1125,7 @@
During the final step of the purge_dups pipeline, it will use the self alignment
> Purge overlaps (purge_dups) algorithm details
>
-> In order to identify the haplotypic duplications, purge_dups uses the base-level coverage information to flag the contigs according the following criteria:
+> In order to identify the haplotypic duplications, purge_dups uses the base-level coverage information to flag the contigs according to the following criteria:
> - If more than 80% bases of a contig are above the high read depth cutoff or below the noise cutoff, it is discarded.
> - If more than 80% bases are in the diploid depth interval, it is labeled as a primary contig, otherwise it is considered further as a possible haplotig.
>
@@ -935,20 +1142,20 @@

> Resolution of haplotigs and overlaps
>
-> 1. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy5) %} with the following parameters:
-> - *"Select the purge_dups function"*: `Purge haplotigs and overlaps for an assembly (purge_dups)`
-> - {% icon param-file %} *"PAF input file"*: `Self-homology map primary`
-> - {% icon param-file %} *"Base-level coverage file"*: `PBCSTAT base coverage primary`
-> - {% icon param-file %} *"Cutoffs file"*: `calcuts cutoff primary`
+>**Step 1**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters:
+> 1. *"Select the purge_dups function"*: `Purge haplotigs and overlaps for an assembly (purge_dups)`
+> 2. 
{% icon param-file %} *"PAF input file"*: `Self-homology map primary` +> 3. {% icon param-file %} *"Base-level coverage file"*: `PBCSTAT base coverage primary` +> 4. {% icon param-file %} *"Cutoffs file"*: `calcuts cutoff primary` > -> 2. Rename the output as `purge_dups BED` +>**Step 2**: Rename the output as `purge_dups BED` > -> 3. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2) %} with the following parameters: -> - *"Select the purge_dups function"*: `Obtain sequences after purging (get_seqs)` -> - {% icon param-file %} *"Assembly FASTA file"*: `Primary contigs FASTA` -> - {% icon param-file %} *"BED input file"*: `purge_dups BED` (output of the previous step) +>**Step 3**: {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters: +> 1. *"Select the purge_dups function"*: `Obtain sequences after purging (get_seqs)` +> 2. {% icon param-file %} *"Assembly FASTA file"*: `Primary contigs FASTA` +> 3. {% icon param-file %} *"BED input file"*: `purge_dups BED` (output of the previous step) > -> 4. Rename the output `get_seq purged sequences` as `Primary contigs purged` and the `get_seq haplotype` file as `Alternate haplotype contigs` +>**Step 4**: Rename the output `get_seq purged sequences` as `Primary contigs purged` and the `get_seq haplotype` file as `Alternate haplotype contigs` > {: .hands_on} @@ -959,18 +1166,18 @@ Now we should repeat the same procedure with the alternate contigs generated by > Merge the purged sequences and the Alternate contigs > -> 1. {% tool [Concatenate datasets](cat1) %} with the following parameters: -> - {% icon param-file %} *"Concatenate Dataset"*: `Alternate contigs FASTA` -> - In *"Dataset"*: -> - {% icon param-repeat %} *"Insert Dataset"* -> - {% icon param-file %} *"Select"*: `Alternate haplotype contigs` +>**Step 1**: {% tool [Concatenate datasets](cat1) %} with the following parameters: +> 1. {% icon param-file %} *"Concatenate Dataset"*: `Alternate contigs FASTA` +> 2. In *"Dataset"*: +> 3. {% icon param-repeat %} *"Insert Dataset"* +> 4. {% icon param-file %} *"Select"*: `Alternate haplotype contigs` > > > > > > > Remember that the `Alternate haplotype contigs` file contains those contigs that were considered to be haplotypic duplications of the primary contigs. > {: .comment} > -> 2. Rename the output as `Alternate contigs full` +>**Step 2**: Rename the output as `Alternate contigs full` > {: .hands_on} @@ -978,57 +1185,55 @@ Once we have merged the files, we should run the purge_dups pipeline again, but > Process the alternate assembly with purge_dups > -> 1. {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters: -> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index` -> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Alternate contigs full` -> - *"Single or Paired-end reads"*: `Single` -> - {% icon param-collection %} *"Select fastq dataset"*: `HiFi reads collapsed` -> - *"Select a profile of preset options"*: `Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)` (**Note** `asm5` at the end!) 
-> - In *"Set advanced output options"*: -> - *"Select an output format"*: `paf` +>**Step 1**: Run {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters: +> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index` +> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Alternate contigs full` +> 3. *"Single or Paired-end reads"*: `Single` +> 4. {% icon param-collection %} *"Select fastq dataset"*: `HiFi reads collapsed` +> 5. *"Select a profile of preset options"*: `Long assembly to reference mapping (-k19 -w19 -A1 -B19 -O39,81 -E3,1 -s200 -z200 --min-occ-floor=100). Typically, the alignment will not extend to regions with 5% or higher sequence divergence. Only use this preset if the average divergence is far below 5%. (asm5)` (**Note** `asm5` at the end!) +> 6. In *"Set advanced output options"* set *"Select an output format"* to `PAF` > -> 2. Rename the output as `Reads mapped to contigs alternate` +>**Step 2**: Rename the output as `Reads mapped to contigs alternate` > -> 3. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy3) %} with the following parameters: -> - *"Function mode"*: `Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)` -> - {% icon param-file %} *"PAF input file"*: `Reads mapped to contigs alternate` -> - In *"Calcuts options"*: -> - *"Upper bound for read depth"*: `114` -> - *"Ploidity"*: `Diploid` +>**Step 3**: {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters: +> 1. *"Function mode"*: `Calculate coverage cutoff, base-level read depth and create read depth histogram for PacBio data (calcuts+pbcstats)` +> 2. {% icon param-file %} *"PAF input file"*: `Reads mapped to contigs alternate` +> 3. In *"Calcuts options"*: +> 4. *"Upper bound for read depth"*: `114` +> 5. *"Ploidy"*: `Diploid` > -> 3. Rename the outputs as `PBCSTAT base coverage alternate`, `Histogram plot alternate` and `Calcuts cutoff alternate`. +>**Step 4**: Rename the outputs as `PBCSTAT base coverage alternate`, `Histogram plot alternate` and `Calcuts cutoff alternate`. > -> 4. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2) %} with the following parameters: -> - *"Function mode"*: `split assembly FASTA file by 'N's (split_fa)` -> - {% icon param-file %} *"Assembly FASTA file"*: `Alternate contigs full` +>**Step 5**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters: +> 1. *"Function mode"*: `split assembly FASTA file by 'N's (split_fa)` +> 2. {% icon param-file %} *"Assembly FASTA file"*: `Alternate contigs full` > -> 5. Rename the output as `Split FASTA alternate` +>**Step 5**: Rename the output as `Split FASTA alternate` > -> 6. 
{% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Split FASTA alternate`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-file %} *"Select fastq dataset"*: `Split FASTA alternate`
-> - *"Select a profile of preset options"*: `Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)`
-> - In *"Set advanced output options"*:
-> - *"Select an output format"*: `PAF`
+>**Step 7**: Run {% tool [Map with minimap2](toolshed.g2.bx.psu.edu/repos/iuc/minimap2/minimap2/2.17+galaxy4) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Split FASTA alternate`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Split FASTA alternate`
+> 5. *"Select a profile of preset options"*: `Construct a self-homology map - use the same genome as query and reference (-DP -k19 -w 19 -m200) (self-homology)`
+> 6. In *"Set advanced output options"* set *"Select an output format"* to `PAF`
>
-> 7. Rename the output as `Self-homology map alternate`
+>**Step 8**: Rename the output as `Self-homology map alternate`
>
-> 8. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy5) %} with the following parameters:
-> - *"Select the purge_dups function"*: `Purge haplotigs and overlaps for an assembly (purge_dups)`
-> - {% icon param-file %} *"PAF input file"*: `Self-homology map alternate`
-> - {% icon param-file %} *"Base-level coverage file"*: `PBCSTAT base coverage alternate`
-> - {% icon param-file %} *"Cutoffs file"*: `calcuts cutoff alternate`
+>**Step 9**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters:
+> 1. *"Select the purge_dups function"*: `Purge haplotigs and overlaps for an assembly (purge_dups)`
+> 2. {% icon param-file %} *"PAF input file"*: `Self-homology map alternate`
+> 3. {% icon param-file %} *"Base-level coverage file"*: `PBCSTAT base coverage alternate`
+> 4. {% icon param-file %} *"Cutoffs file"*: `Calcuts cutoff alternate`
>
-> 9. Rename the output as `purge_dups BED alternate`
+>**Step 10**: Rename the output as `purge_dups BED alternate`
>
-> 10. {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.5+galaxy2) %} with the following parameters:
-> - *"Select the purge_dups function"*: `Obtain sequences after purging (get_seqs)`
-> - {% icon param-file %} *"Assembly FASTA file"*: `Alternate contigs full`
-> - {% icon param-file %} *"BED input file"*: `purge_dups BED alternate`
+>**Step 11**: Run {% tool [Purge overlaps](toolshed.g2.bx.psu.edu/repos/iuc/purge_dups/purge_dups/1.2.6+galaxy0) %} with the following parameters:
+> 1. *"Select the purge_dups function"*: `Obtain sequences after purging (get_seqs)`
+> 2. {% icon param-file %} *"Assembly FASTA file"*: `Alternate contigs full`
+> 3. {% icon param-file %} *"BED input file"*: `purge_dups BED alternate`
>
-> 11. Rename the outputs as `Alternate contigs purged` and `Alternate haplotype contigs`. 
+>**Step 11**: Rename the outputs as `Alternate contigs purged` and `Alternate haplotype contigs`.
>
{: .hands_on}
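+
+As a sanity check, it can help to see what the Galaxy steps above correspond to on the command line. Below is a minimal sketch of one purging round following the purge_dups README; the file names are placeholders for the datasets in your history.
+
+```bash
+# Self-align the assembly (same flags as the Galaxy self-homology preset)
+minimap2 -DP -k19 -w19 -m200 alt_contigs.fa alt_contigs.fa > self_aln.paf
+
+# Flag haplotigs and overlaps, using the base-level coverage (PBCSTAT)
+# and the coverage cutoffs (calcuts) computed earlier in the tutorial
+purge_dups -2 -T cutoffs -c PB.base.cov self_aln.paf > dups.bed
+
+# Extract the purged assembly (purged.fa) and the removed haplotigs (hap.fa)
+get_seqs -e dups.bed alt_contigs.fa
+```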
*"Regular Expression"*: `[Ss]caffold` +> +>**Step 6**: Rename the output as `gfastats on purged pri and alt contigs` +> +>**Step 7**: {% tool [Busco](toolshed.g2.bx.psu.edu/repos/iuc/busco/busco/5.5.0+galaxy0) %} with the following parameters: +> 1. {% icon param-files %} *"Sequences to analyze"*: `Primary contigs purged` +> 2. *"Mode"*: `Genome assemblies (DNA)` +> 3. *"Use Augustus instead of Metaeuk"*: `Use Metaeuk` +> 4. *"Auto-detect or select lineage?"*: `Select lineage` +> 5. *"Lineage"*: `Saccharomycetes` +> 6. *"Which outputs should be generated"*: `short summary text` and `summary image` +> +>**Step 8**: {% tool [Merqury](toolshed.g2.bx.psu.edu/repos/iuc/merqury/merqury/1.3+galaxy3) %} with the following parameters: +> 1. *"Evaluation mode"*: `Default mode` +> 2. {% icon param-file %} *"k-mer counts database"*: `Merged meryldb` +> 3. *"Number of assemblies"*: `Two assemblies +> 4. {% icon param-file %} *"First genome assembly"*: `Primary contigs purged` +> 5. {% icon param-file %} *"Second genome assembly"*: `Alternate contigs purged` > {: .hands_on} @@ -1075,11 +1285,11 @@ The summary statistics indicate that both assemblies are now of a similar size t The {BUSCO} results for the purged primary assembly look much better, since we no longer have the large amount of duplicate BUSCOs that we previously had. Additionally, there is no large increase in missing BUSCOs, indicating that we have *not* over-purged the primary assembly. -The previous metrics tell us that the primary is likely fixed after purging, but what about the previously incomplete alternate assembly? Let's see if the Merqury spectra plots show any change in how *k*-mers are split up between the two assemblies. +The previous metrics tell us that the primary is likely fixed after purging, but what about the previously incomplete alternate assembly? Let's see if the Merqury spectra plots show any change in how *k*-mers are split up between the two assemblies: ![Merqury spectra-asm plot after purging.](../../images/vgp_assembly/merqury_prialt_asm_postpurge.png "Merqury ASM plot after purging."){:width="65%"} -This looks a lot better! The diploid regions are all shared between the two assemblies (the large green peak centered at 50x, the diploid coverage value), and the haplotypic variation is shared between the primary and alternate assemblies (the red and blue peaks centered around 25x, the haploid coverage value). +This looks a lot better! The diploid regions are all shared between the two assemblies (the large green peak centered at 50x, the diploid coverage value), and the haplotypic variation is shared between the primary and alternate assemblies (the red and blue peaks centered around 25✕, the haploid coverage value). ![Merqury spectra-cn plot for primary assembly after purging.](../../images/vgp_assembly/merqury_prialt_priCN_postpurge.png "Merqury CN plot for the primary assembly only after purging."){:width="65%"} @@ -1093,13 +1303,13 @@ At this point, we have a set of contigs, which may or may not be fully phased, d > What assembly am I scaffolding?? > -> For the purposes of this tutorial, the scaffolding hands-on exercises will be referring to a primary assembly. If you have hap1 contigs or hap2 contigs, then you can also follow along just using hap1 contigs or hap2 contigs. Wherever the tutorial refers to primary contigs, just replace with whichever haplotype you are scaffolding. 
+> For the purposes of this tutorial, the scaffolding hands-on exercises refer to the Hap1 assembly produced with the Hi-C mode of hifiasm. If you generated a primary/alternate assembly instead, you can follow along using the `Primary contigs purged` or `Alternate contigs purged` datasets. Wherever the tutorial refers to Hap1 contigs, just substitute whichever haplotype you are scaffolding.
>
{: .comment}

![Schematic of scaffolding contigs to generate a draft assembly, and then curating the draft scaffolds to generate a curated assembly.](../../images/vgp_assembly/scaffoldingandcuration.png "Scaffolding uses additional information to join together contigs. Optical maps, such as those produced using the Bionano Saphyr, as well as Hi-C data such as those generated by Arima Hi-C or Dovetail OmniC kits, are used to re-order and orient contigs according to long-range information. This generates the draft scaffold-level genome. This draft genome then undergoes manual curation, which uses information such as coverage tracks or repeat annotations in order to further orient scaffolds and correct misassemblies such as mis-joins and missed joins. Image adapted from {% cite Rhie2022 %}."){:width="70%"}

-# Hybrid scaffolding with Bionano optical maps
+## Hybrid scaffolding with Bionano optical maps

In this step, the linkage information provided by optical maps is integrated with primary assembly sequences, and the overlaps are used to orient and order the contigs, resolve chimeric joins, and estimate the length of gaps between adjacent contigs. One of the advantages of optical maps is that they can easily span genomic regions that are difficult to resolve using DNA sequencing technologies ({% cite Savara2021 %}, {% cite Yuan2020 %}).

@@ -1121,35 +1331,52 @@ The *Bionano Hybrid Scaffold* tool automates the scaffolding process, which incl
 4. Align sequence maps to the hybrid scaffolds
 5. Generate AGP and FASTA files for the scaffolds.

+Before we begin, we need to upload the Bionano data:
+
+> Uploading Bionano data from Zenodo
+>
+>**Step 1**: Copy the following URL to your clipboard. You can do this by clicking on the {% icon copy %} button in the upper right corner of the box below; it appears when you hover over the box.
+>
+> ```
+> https://zenodo.org/records/5887339/files/bionano.cmap
+> ```
+>
+>**Step 2**: Upload the dataset into Galaxy
+> - set the datatype to `cmap`
+>
+>The box below explains how to upload data in case you need a refresher. Just make sure you set the datatype to `cmap`.
+>
+> {% snippet faqs/galaxy/datasets_import_via_link.md format="cmap" %}
+>
+{: .hands_on}

> Bionano hybrid scaffolding
>
-> 1. {% tool [Bionano Hybrid Scaffold](toolshed.g2.bx.psu.edu/repos/bgruening/bionano_scaffold/bionano_scaffold/3.6.1+galaxy2) %} with the following parameters:
-> - {% icon param-file %} *"NGS FASTA"*: `Primary contigs purged`
-> - {% icon param-file %} *"BioNano CMAP"*: `Bionano_dataset`
-> - *"Configuration mode"*: `VGP mode`
-> - *"Genome maps conflict filter"*: `Cut contig at conflict`
-> - *"Sequences conflict filter"*: `Cut contig at conflict`
+>**Step 1**: Run {% tool [Bionano Hybrid Scaffold](toolshed.g2.bx.psu.edu/repos/bgruening/bionano_scaffold/bionano_scaffold/3.7.0+galaxy3) %} with the following parameters:
+> 1. {% icon param-file %} *"NGS FASTA"*: `Hap1 contigs FASTA` generated during [hifiasm contigging](#hic-phased-assembly-with-hifiasm)
+> 2. {% icon param-file %} *"BioNano CMAP"*: the `Bionano_dataset` we just uploaded
+> 3. *"Configuration mode"*: `VGP mode`
+> 4. *"Genome maps conflict filter"*: `Cut contig at conflict`
*"Genome maps conflict filter"*: `Cut contig at conflict` +> 5. *"Sequences conflict filter"*: `Cut contig at conflict` > -> > -> > -> > If your data are not associated with VGP, make sure that the configuration mode fits with your samples. -> {: .comment} +>> +>> +>> If your data are not associated with VGP, make sure that the configuration mode fits with your samples. +>{: .comment} > -> 2. {% tool [Concatenate datasets](cat1) %} with the following parameters: -> - {% icon param-file %} *"Concatenate Dataset"*: `NGScontigs scaffold NCBI trimmed` (output of **Bionano Hybrid Scaffold** {% icon tool %}) -> - In *"Dataset"*: -> - {% icon param-repeat %} *"Insert Dataset"* -> - {% icon param-file %} *"Select"*: `NGScontigs not scaffolded trimmed` (output of **Bionano Hybrid Scaffold** {% icon tool %}) +>**Step 2**: Run {% tool [Concatenate datasets](cat1) %} with the following parameters: +> 1. {% icon param-file %} *"Concatenate Dataset"*: `NGScontigs scaffold NCBI trimmed` (output of **Bionano Hybrid Scaffold** {% icon tool %}) +> 2. {% icon param-repeat %} *"Insert Dataset"* +> 3. {% icon param-file %} *"Select"*: `NGScontigs not scaffolded trimmed` (output of **Bionano Hybrid Scaffold** {% icon tool %}) > -> 3. Rename the output as `Primary assembly bionano` +>**Step 3**: Rename the output as `Hap1 assembly bionano` {: .hands_on} ## Evaluating Bionano scaffolds Let's evaluate our scaffolds to see the impact of scaffolding on some key assembly statistics. -> Bionano assembly evaluation with QUAST and BUSCO +> Bionano assembly evaluation gfastats > > 1. {% tool [gfastats](toolshed.g2.bx.psu.edu/repos/bgruening/gfastats/gfastats/1.3.6+galaxy0) %} with the following parameters: > - {% icon param-files %} *"Input file"*: select `Primary assembly bionano` @@ -1158,8 +1385,6 @@ Let's evaluate our scaffolds to see the impact of scaffolding on some key assemb > {: .hands_on} - - > > > 1. How many scaffolds are in the primary assembly after the hybrid scaffolding? @@ -1168,7 +1393,7 @@ Let's evaluate our scaffolds to see the impact of scaffolding on some key assemb > > > > > > 1. The number of contigs is 17. -> > 2. The largest contig is 1.531.728 bp long. This value hasn't changed. +> > 2. The largest contig is 1,531,728 bp long. This value hasn't changed. This is likely due to the fact that this is a downsampled training dataset. > > > {: .solution} > @@ -1197,196 +1422,170 @@ Despite Hi-C generating paired-end reads, we need to map each read separately. T > Mapping Hi-C reads > -> 1. {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0) %} with the following parameters: -> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index` -> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Primary assembly bionano` -> - *"Single or Paired-end reads"*: `Single` -> - {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_F` -> - *"Set read groups information?"*: `Do not set` -> - *"Select analysis mode"*: `1.Simple Illumina mode` -> - *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) ` -> -> 2. Rename the output as `BAM forward` -> -> 3. 
-> 3. {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Primary assembly bionano`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_R`
-> - *"Set read groups information?"*: `Do not set`
-> - *"Select analysis mode"*: `1.Simple Illumina mode`
-> - *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
->
-> 4. Rename the output as `BAM reverse`
->
-> 5. {% tool [Filter and merge](toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0) %} chimeric reads from Arima Genomics with the following parameters:
+>**Step 1**: Run {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy1) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Hap1 assembly bionano`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_F`
+> 5. *"Set read groups information?"*: `Do not set`
+> 6. *"Select analysis mode"*: `1.Simple Illumina mode`
+> 7. *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
+>
+>**Step 2**: Rename the output as `BAM forward`
+>
+> Now let's do the same for the reverse Hi-C reads:
+>
+>**Step 3**: Run {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy1) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `Hap1 assembly bionano`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_R`
+> 5. *"Set read groups information?"*: `Do not set`
+> 6. *"Select analysis mode"*: `1.Simple Illumina mode`
+> 7. *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
+>
+>**Step 4**: Rename the output as `BAM reverse`
+>
+>**Step 5**: Run {% tool [Filter and merge](toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy1) %} chimeric reads from Arima Genomics with the following parameters:
-> - {% icon param-file %} *"First set of reads"*: `BAM forward`
-> - {% icon param-file %} *"Second set of reads"*: `BAM reverse`
->
-> 6. Rename it as `BAM Hi-C reads`
+> 1. {% icon param-file %} *"First set of reads"*: `BAM forward`
+> 2. {% icon param-file %} *"Second set of reads"*: `BAM reverse`
+>
+>**Step 6**: Rename the output as `BAM Hi-C reads`
{: .hands_on}

-Finally, we need to convert the BAM file to BED format and sort it.
-
-
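+
+For reference, the mapping strategy above (each Hi-C mate mapped independently as single-end reads, then name-sorted) corresponds roughly to the following command-line sketch, with placeholder file names. The **Filter and merge** (Bellerophon) step then removes chimeric alignments that span a Hi-C ligation junction and merges the two BAMs, similar to the filtering scripts in the Arima mapping pipeline.
+
+```bash
+# Index the scaffolded assembly once
+bwa-mem2 index hap1_assembly_bionano.fa
+
+# Map each Hi-C mate independently as single-end reads,
+# then sort by read name (QNAME), as in the Galaxy steps above
+bwa-mem2 mem -t 8 hap1_assembly_bionano.fa Hi-C_dataset_F.fastq.gz | samtools sort -n -o BAM_forward.bam
+bwa-mem2 mem -t 8 hap1_assembly_bionano.fa Hi-C_dataset_R.fastq.gz | samtools sort -n -o BAM_reverse.bam
+```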
## Generate initial Hi-C contact map

After mapping the Hi-C reads, the next step is to generate an initial Hi-C contact map, which will allow us to compare the Hi-C contact maps before and after using the Hi-C for scaffolding.

> Biological basis of Hi-C contact maps
>
-> Hi-C contact maps reflect the interaction frequency between genomic loci. In order to understand the Hi-C contact maps, it is necessary to take into account two factors: the higher interaction frequency between loci that reside in the same chromosome (_i.e._, in cis), and the distance-dependent decay of interaction frequency ({% cite Lajoie2015 %}).
+> Hi-C contact maps reflect the interaction frequency between genomic loci. In order to understand the Hi-C contact maps, it is necessary to take into account two factors: the higher interaction frequency between loci that reside in the same chromosome (_i.e._, in *cis*), and the distance-dependent decay of interaction frequency ({% cite Lajoie2015 %}).
>
-> The higher interaction between cis regions can be explained, at least in part, by the territorial organization of chromosomes in interphase (chromosome territories), and in a genome-wide contact map, this pattern appears as blocks of high interaction centered along the diagonal and matching individual chromosomes (fig. 12) ({% cite Cremer2010 %}, {% cite Lajoie2015 %}).
+> The higher interaction between *cis* regions can be explained, at least in part, by the territorial organization of chromosomes in interphase (chromosome territories), and in a genome-wide contact map, this pattern appears as blocks of high interaction centered along the diagonal and matching individual chromosomes ({% cite Cremer2010 %}, {% cite Lajoie2015 %}):
>
-> ![Hi-C map](../../images/vgp_assembly/hic_map.png "An example of a Hi-C map. Genomic regions are arranged along the x and y axes, and contacts are colored on the matrix like a heat map; here darker color indicates greater interaction frequency.") {:width="10%"}
+> ![Hi-C map](../../images/vgp_assembly/hic_map.png "An example of a Hi-C map. Genomic regions are arranged along the X and Y axes, and contacts are colored on the matrix like a heat map; here darker color indicates greater interaction frequency.") {:width="10%"}
>
> On the other hand, the distance-dependent decay may be due to random movement of the chromosomes, and in the contact map appears as a gradual decrease of the interaction frequency the farther away from the diagonal it moves ({% cite Lajoie2015 %}).
>
{: .comment}

> Generate a contact map with PretextMap and Pretext Snapshot
>
-> 1. {% tool [PretextMap](toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0) %} with the following parameters:
-> - {% icon param-file %} *"Input dataset in SAM or BAM format"*: `BAM Hi-C reads`
-> - *"Sort by"*: `Don't sort`
+>**Step 1**: Run {% tool [PretextMap](toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.9+galaxy0) %} with the following parameters:
+> 1. {% icon param-file %} *"Input dataset in SAM or BAM format"*: `BAM Hi-C reads`
+> 2. *"Sort by"*: `Don't sort`
>
-> 3. Rename the output as `PretextMap output`
+>**Step 2**: Rename the output as `PretextMap output`
>
-> 2. {% tool [Pretext Snapshot](toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0) %} with the following parameters:
-> - {% icon param-file %} *"Input Pretext map file"*: `PretextMap output`
-> - *"Output image format"*: `png`
-> - *"Show grid?"*: `Yes`
+>**Step 3**: Run {% tool [Pretext Snapshot](toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy1) %} with the following parameters:
+> 1. {% icon param-file %} *"Input Pretext map file"*: `PretextMap output`
+> 2. *"Output image format"*: `png`
+> 3. *"Show grid?"*: `Yes`
{: .hands_on}
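+
+Outside Galaxy, these two Pretext steps boil down to something like the sketch below. PretextMap reads SAM/BAM from standard input; file names are placeholders, and flags may differ slightly between versions, so check `--help` locally.
+
+```bash
+# Build the contact map from the filtered, name-sorted Hi-C BAM
+samtools view -h BAM_Hi-C_reads.bam | PretextMap -o contact_map.pretext
+
+# Render full-genome PNG snapshots of the contact map
+PretextSnapshot -m contact_map.pretext --sequences "=full"
+```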
Let's have a look at the Hi-C contact maps generated by Pretext Snapshot.

-![Pretext optical map](../../images/vgp_assembly/hic_map_pretext.png "Hi-C map generated by Pretext. Primary assembly full contact map generated in this training (a) Hi-C map representative of a typical missasembly (b).")
+![Pretext optical map](../../images/vgp_assembly/hic_map_pretext.png "Hi-C map generated by Pretext. Full contact map of the assembly generated in this training (a); Hi-C map representative of a typical misassembly (b).")

-In the contact generated from the Bionano-scaffolded assembly can be identified 17 scaffolds, representing each of the haploid chromosomes of our genome (fig. 13.a). The fact that all the contact signals are found around the diagonal suggest that the contigs were scaffolded in the right order. However, during the assembly of complex genomes, it is common to find in the contact maps indicators of errors during the scaffolding process, as shown in the figure 13b. In that case, a contig belonging to the second chromosome has been misplaced as part of the fourth chromosome. We can also note that the final portion of the second chromosome should be placed at the beginning, as the off-diagonal contact signal suggests.
+In the contact map generated from the Bionano-scaffolded assembly, 17 scaffolds can be identified, each representing one of the haploid chromosomes of our genome (panel **a** above). The fact that all the contact signals are found around the diagonal suggests that the contigs were scaffolded in the correct order. However, during the assembly of complex genomes, it is common to find indicators of scaffolding errors in the contact maps, as shown in panel **b**. In that case, a contig belonging to the second chromosome has been misplaced as part of the fourth chromosome. We can also note that the final portion of the second chromosome should be placed at the beginning, as the off-diagonal contact signal suggests.

-Once we have evaluated the quality of the scaffolded genome assembly, the next step consists in integrating the information contained in the HiC reads into our assembly, so that any errors identified can be resolved. For this purpose we will use SALSA2 ({% cite Ghurye2019 %}).
+Once we have evaluated the quality of the scaffolded genome assembly, the next step consists of integrating the information contained in the Hi-C reads into our assembly, so that any identified errors can be resolved. For this purpose, we will use YaHS ({% cite Zhou2022 %}).

-## SALSA2 scaffolding
+## YaHS scaffolding

-SALSA2 is an open source software that makes use of Hi-C to linearly orient and order assembled contigs along entire chromosomes ({% cite Ghurye2019 %}). One of the advantages of SALSA2 with respect to most existing Hi-C scaffolding tools is that it doesn't require the estimated number of chromosomes.
+YaHS is an open-source tool that makes use of Hi-C data to linearly orient and order assembled contigs along entire chromosomes ({% cite Zhou2022 %}). One of the advantages of YaHS over most existing Hi-C scaffolding tools is that it doesn't require an estimate of the number of chromosomes.

-> SALSA2 algorithm overview
+> YaHS algorithm overview
>
-> Initially SALSA2 uses the physical coverage of Hi-C pairs to identify suspicious regions and break the sequence at the likely point of mis-assembly. Then, a hybrid scaffold graph is constructed using edges from the Hi-C reads, scoring the edges according to a *best buddy* scheme (fig. 14a).
+>YaHS takes the alignment between Hi-C reads and contigs as input, breaking contigs at positions that lack Hi-C coverage in order to address potential assembly errors.
>
-> ![Figure 14: SALSA2 algorithm](../../images/vgp_assembly/salsa2_algorithm.png "Overview of the SALSA2 algorithm. Solid edges indicate the linkages between different contigs and dotted edges indicate the links between the ends of the same contig. B and E denote the start and end of contigs, respectively. Adapted from Ghurye et al. 2019.")
+>Scaffolding occurs in multiple rounds, with YaHS creating a contact matrix for each contig by splitting it into chunks. Hi-C contact signals are assigned to cells within contigs (intra-cells) and between contigs (inter-cells; see the figure below). Joining scores for contig pairs are calculated based on normalized contact frequencies, aiming for similar inter-cell frequencies between neighboring contigs.
>
-> From this graph scaffolds are iteratively constructed using a greedy weighted maximum matching. After each iteration, a mis-join detection step is performed to check if any of the joins made during this round are incorrect. Incorrect joins are broken and the edges blacklisted during subsequent iterations. This process continues until the majority of joins made in the prior iteration are incorrect. This provides a natural stopping condition, when accurate Hi-C links have been exhausted ({% cite Ghurye2019 %}).
+>If the restriction enzymes used to prepare the Hi-C library are provided, YaHS normalizes contact frequencies by the corresponding number of cutting sites. A scaffolding graph is then constructed with contigs as nodes and contig joins as weighted edges. The graph undergoes simplification steps, including edge filtering, tip and blunt end trimming, repeat resolution, transitive edge removal, bubble and ambiguous orientation resolution, weak edge trimming, and ambiguous edge removal.
>
-{: .comment}
-
-Before launching SALSA2, we need to carry out some modifications on our datasets.
-
-> BAM to BED conversion
->
-> 1. {% tool [bedtools BAM to BED](toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_bamtobed/2.30.0+galaxy1) %} with the following parameters:
-> - {% icon param-file %} *"Convert the following BAM file to BED"*: `BAM Hi-C reads`
-> - *"What type of BED output would you like"*: `Create a full, 12-column "blocked" BED file`
->
-> 2. Rename the output as `BED unsorted`
->
-> 3. {% tool [Sort](sort1) %} with the following parameters:
-> - {% icon param-file %} *"Sort Dataset"*: `BED unsorted`
-> - *"on column"*: `Column: 4`
-> - *"with flavor"*: `Alphabetical sort`
-> - *"everything in"*: `Ascending order`
+>Finally, the graph is traversed to assemble scaffolds along continuous paths. Optionally, a second round of assembly error correction breaks scaffolds at positions lacking sufficient Hi-C coverage. YaHS employs a hierarchical joining process with multiple rounds of scaffolding at decreasing resolutions (increasing chunk sizes), using the previous round's scaffolds as input.
+>![An overview of YaHS](../../images/vgp_assembly/yahs.png "Schematic of how Hi-C information is used to scaffold and orient contigs along chromosomes using long-range information, as well as link separated haplotype blocks into chromosome-scale haplotypes (from {% cite Zhou2022 %}).")
>
-> 4. Rename the output as `BED sorted`
->
-> 5. {% tool [Replace](toolshed.g2.bx.psu.edu/repos/bgruening/text_processing/tp_find_and_replace/1.1.3) %} with the following parameters:
-> - {% icon param-file %} *"File to process"*: `Primary assembly bionano`
-> - *"Find pattern"*: `:`
-> - *"Replace all occurrences of the pattern"*: `Yes`
-> - *"Find and Replace text in"*: `entire line`
->
-> 6. Rename the output as `Primary assembly bionano edited`
-{: .hands_on}
+>See {% cite Zhou2022 %} for the exact details.
{: .comment}

-Now we can launch SALSA2 in order to generate the hybrid scaffolding based on the Hi-C data.
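+
+As the YaHS paper emphasizes, the standalone tool needs only two inputs: the contig FASTA (with its `.fai` index) and the Hi-C alignment file. A minimal command-line sketch, with placeholder file names, might look like this:
+
+```bash
+# YaHS expects the FASTA index to sit next to the contigs
+samtools faidx hap1_assembly_bionano.fa
+
+# Single-line scaffolding run; -e sets the restriction site we also use in Galaxy below
+yahs -e CTTAAG -o yahs_out hap1_assembly_bionano.fa BAM_Hi-C_reads.bam
+
+# Main outputs: yahs_out_scaffolds_final.fa and yahs_out_scaffolds_final.agp
+```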
+
+Now we can launch YaHS in Galaxy to generate the hybrid scaffolding based on the Hi-C data.

-> Salsa scaffolding
->
>
-> 1. {% tool [SALSA](toolshed.g2.bx.psu.edu/repos/iuc/salsa/salsa/2.3+galaxy0) %} with the following parameters:
-> - {% icon param-file %} *"Initial assembly file"*: `Primary assembly bionano edited`
-> - {% icon param-file %} *"Bed alignment"*: `BED sorted`
-> - *"Restriction enzyme sequence(s)"*: `CTTAAG`
+> YaHS scaffolding
>
-> 2. Rename the output as `SALSA2 scaffold FASTA` and `SALSA2 scaffold AGP`
+>**Step 1**: Run {% tool [YaHS](toolshed.g2.bx.psu.edu/repos/iuc/yahs/yahs/1.2a.2+galaxy1) %} with the following parameters:
+> 1. {% icon param-file %} *"Input contig sequences"*: `Hap1 assembly bionano`
+> 2. {% icon param-file %} *"Alignment file of Hi-C reads to contigs"*: `BAM Hi-C reads`
+> 3. *"Restriction enzyme used in Hi-C experiment"*: set to `Enter a specific sequence`
+> 4. *"Restriction enzyme sequence(s)"*: enter `CTTAAG`
>
+>**Step 2**: Rename `YAHS on data NNN and data NNN: Final scaffolds fasta output` as `YaHS Scaffolds FASTA`
>
{: .hands_on}

## Evaluate the final genome assembly with Pretext

-Finally, we should repeat the procedure described previously for generating the contact maps, but in that case, we will use the scaffold generated by SALSA2.
+Finally, we should repeat the procedure described previously for generating the contact maps, but in this case, we will use the scaffolds generated by YaHS.

> Mapping reads against the scaffold
>
-> 1. {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `SALSA2 scaffold FASTA`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_F`
-> - *"Set read groups information?"*: `Do not set`
-> - *"Select analysis mode"*: `1.Simple Illumina mode`
-> - *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
+>**Step 1**: Run {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy1) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `YaHS Scaffolds FASTA`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_F`
+> 5. *"Set read groups information?"*: `Do not set`
+> 6. *"Select analysis mode"*: `1.Simple Illumina mode`
+> 7. *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
>
-> 2. Rename the output as `BAM forward SALSA2`
+>**Step 2**: Rename the output as `BAM forward YaHS`
>
-> 3. {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy0) %} with the following parameters:
-> - *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
-> - {% icon param-file %} *"Use the following dataset as the reference sequence"*: `SALSA2 scaffold FASTA`
-> - *"Single or Paired-end reads"*: `Single`
-> - {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_R`
-> - *"Set read groups information?"*: `Do not set`
-> - *"Select analysis mode"*: `1.Simple Illumina mode`
-> - *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
+>**Step 3**: Run {% tool [BWA-MEM2](toolshed.g2.bx.psu.edu/repos/iuc/bwa_mem2/bwa_mem2/2.2.1+galaxy1) %} with the following parameters:
+> 1. *"Will you select a reference genome from your history or use a built-in index?"*: `Use a genome from history and build index`
+> 2. {% icon param-file %} *"Use the following dataset as the reference sequence"*: `YaHS Scaffolds FASTA`
+> 3. *"Single or Paired-end reads"*: `Single`
+> 4. {% icon param-file %} *"Select fastq dataset"*: `Hi-C_dataset_R`
+> 5. *"Set read groups information?"*: `Do not set`
+> 6. *"Select analysis mode"*: `1.Simple Illumina mode`
+> 7. *"BAM sorting mode"*: `Sort by read names (i.e., the QNAME field) `
>
-> 4. Rename the output as `BAM reverse SALSA2`
+>**Step 4**: Rename the output as `BAM reverse YaHS`
>
-> 5. {% tool [Filter and merge](toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy0) %} chimeric reads from Arima Genomics with the following parameters:
-> - {% icon param-file %} *"First set of reads"*: `BAM forward SALSA2`
-> - {% icon param-file %} *"Second set of reads"*: `BAM reverse SALSA2`
+>**Step 5**: Run {% tool [Filter and merge](toolshed.g2.bx.psu.edu/repos/iuc/bellerophon/bellerophon/1.0+galaxy1) %} chimeric reads from Arima Genomics with the following parameters:
+> 1. {% icon param-file %} *"First set of reads"*: `BAM forward YaHS`
+> 2. {% icon param-file %} *"Second set of reads"*: `BAM reverse YaHS`
>
-> 6. Rename the output as `BAM Hi-C reads SALSA2`
+>**Step 6**: Rename the output as `BAM Hi-C reads YaHS`
>
-> 7. {% tool [PretextMap](toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.6+galaxy0) %} with the following parameters:
-> - {% icon param-file %} *"Input dataset in SAM or BAM format"*: `BAM Hi-C reads SALSA2`
-> - *"Sort by"*: `Don't sort`
+>**Step 7**: Run {% tool [PretextMap](toolshed.g2.bx.psu.edu/repos/iuc/pretext_map/pretext_map/0.1.9+galaxy0) %} with the following parameters:
+> 1. {% icon param-file %} *"Input dataset in SAM or BAM format"*: `BAM Hi-C reads YaHS`
+> 2. *"Sort by"*: `Don't sort`
>
-> 8. Rename the output as `PretextMap output SALSA2`
+>**Step 8**: Rename the output as `PretextMap output YaHS`
>
-> 9. {% tool [Pretext Snapshot](toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy0) %} with the following parameters:
-> - {% icon param-file %} *"Input Pretext map file"*: `PretextMap output SALSA2`
-> - *"Output image format"*: `png`
-> - *"Show grid?"*: `Yes`
+>**Step 9**: Run {% tool [Pretext Snapshot](toolshed.g2.bx.psu.edu/repos/iuc/pretext_snapshot/pretext_snapshot/0.0.3+galaxy1) %} with the following parameters:
+> 1. {% icon param-file %} *"Input Pretext map file"*: `PretextMap output YaHS`
+> 2. *"Output image format"*: `png`
*"Show grid?"*: `Yes` > {: .hands_on} -In order to evaluate the Hi-C hybrid scaffolding, we are going to compare the contact maps before and after running SALSA2 (fig. 15). +In order to evaluate the Hi-C hybrid scaffolding, we are going to compare the contact maps before and after running YaHS: -![Figure 15: Pretext final contact map](../../images/vgp_assembly/hi-c_pretext_final.png "Hi-C map generated by Pretext after the hybrid scaffolding based on Hi-C data. The red circles indicate the differences between the contact map generated after (a) and before (b) Hi-C hybrid scaffolding.") +![Pretext final contact map](../../images/vgp_assembly/hi-c_pretext_final.png "Hi-C map generated by Pretext after the hybrid scaffolding based on Hi-C data. The red circles indicate the differences between the contact map generated after (a) and before (b) Hi-C hybrid scaffolding.") Among the most notable differences that can be identified between the contact maps, it can be highlighted the regions marked with red circles, where inversion can be identified. # Conclusion -To sum up, it is worthwhile to compare the final assembly with the [_S. cerevisiae_ S288C reference genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt). +To sum up, it is worthwhile to compare the final assembly with the [S. cerevisiae_ S288C reference genome](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt). -![Table 1: Final stats](../../images/vgp_assembly/stats_conclusion.png "Comparison between the final assembly generating in this training and the reference genome. Contiguity plot using the reference genome size (a). Assemby statistics (b).") +![Table 1: Final stats](../../images/vgp_assembly/stats_conclusion.png "Comparison between the final assembly generated in this training and the reference genome. Contiguity plot using the reference genome size (a). Assembly statistics (b).") -With respect to the total sequence length, we can conclude that the size of our genome assembly is almost identical to the reference genome (fig.16a,b). Regarding the number of scaffolds, the obtained value is similar to the reference assembly, which consist in 16 chromosomes plus the mitochondrial DNA, which consists of 85,779 bp. The remaining statistics exhibit very similar values (fig. 16b). +With respect to the total sequence length, we can conclude that the size of our genome assembly is almost identical to the reference genome (figure above). Regarding the number of scaffolds, the obtained value is similar to the reference assembly, which consists of 16 chromosomes plus the mitochondrial DNA, which consists of 85,779 bp. The remaining statistics exhibit very similar values (panel **b** above). ![Comparison reference genome](../../images/vgp_assembly/hi-c_pretext_conclusion.png "Comparison between contact maps generated by using the final assembly (a) and the reference genome (b).") -If we compare the contact map of our assembled genome (fig. 17a) with the reference assembly (fig. 17b), we can see that the two are essentially identical. This means that we have achieved an almost perfect assembly at the chromosome level. +If we compare the contact map of our assembled genome (panel **a** above) with the reference assembly (panel **b** above), we can see that the two are essentially identical. This means that we have achieved an almost perfect assembly at the chromosome level.