Long vgp tutorial edits #4542

Merged
merged 29 commits into from
Jan 12, 2024
Commits
86b1460
Long VGP tutorial update
nekrut Nov 21, 2023
7befaf0
fix heading levels
shiltemann Nov 27, 2023
cb5c15f
add border around images per @nekrut's request
shiltemann Nov 27, 2023
b8e2e58
tweak figure layout
shiltemann Nov 27, 2023
81b5488
tweak FAQ layout
shiltemann Nov 27, 2023
0c156d1
move FAQ to tutorial-level FAQ folder
shiltemann Nov 27, 2023
9dd5f85
restore written instructions in FAQ
shiltemann Nov 27, 2023
4d10d79
format FAQ
shiltemann Nov 27, 2023
780fbb7
unify duplicate FAQs
shiltemann Nov 27, 2023
d346a67
rework data upload boxes and FAQs
shiltemann Nov 27, 2023
a5ca087
fix box
shiltemann Nov 27, 2023
f001c0c
fix box
shiltemann Nov 27, 2023
2c9a489
Merge branch 'main' into long_vgp_tutorial_edits
shiltemann Nov 27, 2023
241e33b
Merge branch 'galaxyproject:main' into long_vgp_tutorial_edits
nekrut Nov 28, 2023
91e0b00
further changes
nekrut Nov 28, 2023
e828f33
more changes to hic section
nekrut Nov 29, 2023
c835e09
even more changes to hic section
nekrut Nov 29, 2023
a7bed74
updated busco version and hands-on section
nekrut Nov 30, 2023
f613229
switching to solo mode
nekrut Dec 1, 2023
02c103e
gfastats update for solo
nekrut Dec 1, 2023
473255b
almos done with solo
nekrut Dec 19, 2023
d373e41
replaced salsa with yahs
nekrut Jan 10, 2024
70662e5
additional tweaks
nekrut Jan 10, 2024
5a4d0ae
fixed all tools and figures. reeady for merge
nekrut Jan 11, 2024
d0389bf
removed here links
nekrut Jan 11, 2024
543f15d
one more tweak
nekrut Jan 11, 2024
a708957
fixed snippets spacing
nekrut Jan 11, 2024
e97a09d
fixed anchors
nekrut Jan 11, 2024
32e9c5a
small final changes
bgruening Jan 12, 2024
133 changes: 80 additions & 53 deletions topics/assembly/tutorials/vgp_genome_assembly/tutorial.md
@@ -145,54 +145,78 @@
>
{: .details}

The first step is to get the datasets from Zenodo. The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.

> <hands-on-title>Data upload</hands-on-title>
>
> 1. Create a new history for this tutorial
> 2. Import the files from [Zenodo]({{ page.zenodo_link }})
>
> - Open the file {% icon galaxy-upload %} __upload__ menu
> - Click on **Rule-based** tab
> - *"Upload data as"*: `Datasets`
> - Copy the tabular data, paste it into the textbox and press <kbd>Build</kbd>
>
> ```
> Hi-C_dataset_F https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz?download=1 fastqsanger.gz Hi-C
> Hi-C_dataset_R https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz?download=1 fastqsanger.gz Hi-C
> Bionano_dataset https://zenodo.org/record/5550653/files/bionano.cmap?download=1 cmap Bionano
> ```
>
> - From **Rules** menu select `Add / Modify Column Definitions`
> - Click `Add Definition` button and select `Name`: column `A`
> - Click `Add Definition` button and select `URL`: column `B`
> - Click `Add Definition` button and select `Type`: column `C`
> - Click `Add Definition` button and select `Name Tag`: column `D`
> - Click `Apply` and press <kbd>Upload</kbd>
>
> 3. Import the remaining datasets from [Zenodo]({{ page.zenodo_link }})
>
> - Open the file {% icon galaxy-upload %} __upload__ menu
> - Click on **Rule-based** tab
> - *"Upload data as"*: `Collections`
> - Copy the tabular data, paste it into the textbox and press <kbd>Build</kbd>
>
> ```
> dataset_01 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta?download=1 fasta HiFi HiFi_collection
> dataset_02 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta?download=1 fasta HiFi HiFi_collection
> dataset_03 https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta?download=1 fasta HiFi HiFi_collection
> ```
>
> - From **Rules** menu select `Add / Modify Column Definitions`
> - Click `Add Definition` button and select `List Identifier(s)`: column `A`
> - Click `Add Definition` button and select `URL`: column `B`
> - Click `Add Definition` button and select `Type`: column `C`
> - Click `Add Definition` button and select `Group Tag`: column `D`
> - Click `Add Definition` button and select `Collection Name`: column `E`
> - Click `Apply` and press <kbd>Upload</kbd>
The first step is to get the datasets from Zenodo. Specifically, we will be uploading two datasets:

1. A set of PacBio {HiFi} reads in `fasta` format
2. A set of Illumina {Hi-C} reads in `fastqsanger.gz` format

## Uploading `fasta` datasets from Zenodo

The following two steps demonstrate how to upload three PacBio {HiFi} datasets into your Galaxy history.

> <hands-on-title><b>Uploading <tt>FASTA</tt> datasets from Zenodo</b></hands-on-title>
>
> **Step 1**: Copy the following URLs to the clipboard.
>
> (You can do this by clicking the {% icon copy %} button in the upper-right corner of the box below; it appears when you hover over the box.)
>
> ```
> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta
> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta
> https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta
> ```
>
> **Step 2**: Upload datasets into Galaxy.
>
> These datasets are in `fasta` format. Upload by following the steps shown in the figure below.
>
>![Uploading fasta files in Galaxy]({% link faqs/galaxy/images/upload_fastqsanger_via_url.png %} "Here we upload three fasta files. Compressed (.gz or .bz2) datasets are uploaded in exactly the same fashion by selecting an appropriate datatype (fasta.gz or fasta.bz2)")
{: .hands_on}

## Uploading `fastqsanger.gz` datasets from Zenodo

Illumina {Hi-C} data is uploaded in essentially the same way, as shown in the following two steps.

> <warning-title>DANGER: Make sure you choose the correct format!</warning-title>
> When selecting the datatype in the "**Type (set all)**" drop-down, make sure you select `fastqsanger` or `fastqsanger.gz`, BUT NOT `fastqcssanger` or anything else!
{: .warning}

> <hands-on-title><b>Uploading <tt>fastqsanger.gz</tt> datasets from Zenodo</b></hands-on-title>
>
> **Step 1**: Copy the following URLs to the clipboard.
>
> (You can do this by clicking the {% icon copy %} button in the upper-right corner of the box below; it appears when you hover over the box.)
>
> ```
> https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz
> https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz
> ```
>
> **Step 2**: Upload datasets into Galaxy.
>
> These datasets are in `fastqsanger.gz` format. Upload by following the steps shown in the figure below.
>
> ![Uploading fastqsanger.gz files in Galaxy]({% link /faqs/galaxy/images/upload_fastqsanger_via_url.png %} "Here we upload two fastqsanger.gz files. Uncompressed or bz2-compressed (.bz2) datasets are uploaded in exactly the same fashion by selecting an appropriate datatype (fastqsanger or fastqsanger.bz2)")
{: .hands_on}

> <warning-title>These datasets are large!</warning-title>
> Hi-C datasets are large. It will take some time (~15 min) for them to be fully uploaded. Please be patient.
{: .warning}
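
The two upload boxes above pair each URL with a Galaxy datatype chosen by hand. As a minimal sketch of that pairing, the snippet below derives a file name and datatype from each tutorial URL. The suffix-to-datatype table and the function name `guess_datatype` are assumptions made for illustration: they cover only the files used here, and a `.fastq.gz` file is not guaranteed to be `fastqsanger.gz` in general, so always confirm the datatype in the upload dialog.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed mapping, valid only for the files used in this tutorial.
DATATYPE_BY_SUFFIX = {
    ".fasta": "fasta",
    ".fastq.gz": "fastqsanger.gz",
}

URLS = [
    "https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta",
    "https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz",
    "https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz",
]


def guess_datatype(url):
    """Return (file name, datatype) for a URL, based on the file suffix."""
    filename = PurePosixPath(urlparse(url).path).name
    for suffix, datatype in DATATYPE_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return filename, datatype
    raise ValueError(f"No datatype known for {filename!r}")


pairs = [guess_datatype(url) for url in URLS]
for filename, datatype in pairs:
    print(f"{filename}\t{datatype}")
```

Raising an error for unknown suffixes, rather than guessing, mirrors the warning above: a wrong datatype is worse than no datatype.
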

## Organizing the data

If everything goes smoothly, your history will look like the one shown in Fig. 4 below. The three {HiFi} fasta files are better represented as a collection: {collection}. Importantly, the workflow we will be using to analyze our data takes a collection as input (it does not accept individual datasets). So let's create a collection using the steps outlined in the Tip {% icon tip %} "Creating a dataset collection", which you can find below Fig. 4.

![AfterUpload](../../images/vgp_assembly/making_list.svg "History after uploading HiFi and HiC data (left). Creation of a list (collection) combines all HiFi datasets into a single history item called 'HiFi data' (right). See below for instructions on how to make this collection.")

{% snippet faqs/galaxy/collections_build_list.md %}

> <details-title>Other ways to upload the data</details-title>
> You can also upload your own datasets via URLs, as illustrated above, or from your own computer. In addition, you can upload data from a major repository called [GenomeArk](https://genomeark.org). GenomeArk is integrated directly into Galaxy Upload. To use GenomeArk, follow the steps in the Tip {% icon tip %} below:
>
> {% snippet faqs/galaxy/dataset_upload_from_genomeark.md %}
{: .details}

### HiFi reads preprocessing with **cutadapt**

Adapter trimming usually means trimming the adapter sequence off the ends of reads, which is where the adapter sequence is usually located in {NGS} reads. However, due to the nature of {SMRT} sequencing technology, adapters do not have a specific, predictable location in {HiFi} reads. Additionally, the reads containing adapter sequence could be of generally lower quality compared to the rest of the reads. Thus, we will use **cutadapt** not to trim, but to remove the entire read if a read is found to have an adapter inside of it.
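
To make the "discard the whole read" idea concrete, here is a minimal sketch in Python. It is not how **cutadapt** is implemented: it uses exact substring matching, ignores reverse complements and sequencing errors, and the adapter sequence and read names are hypothetical toy values.

```python
def drop_adapter_reads(reads, adapters):
    """Keep only the reads that contain none of the adapter sequences."""
    kept = {}
    for name, sequence in reads.items():
        if not any(adapter in sequence for adapter in adapters):
            kept[name] = sequence
    return kept


# Hypothetical adapter and toy reads, for illustration only.
ADAPTER = "ACGTTGCA"
reads = {
    "read_1": "GGGGCCCCAAAATTTT",
    "read_2": "GGGG" + ADAPTER + "CCCC",  # adapter inside the read
    "read_3": "TTTTAAAACCCCGGGG",
}

kept = drop_adapter_reads(reads, [ADAPTER])
print(sorted(kept))  # read_2 is removed entirely, not trimmed
```

The point of the sketch is the decision rule: a read with an internal adapter is dropped as a whole, because there is no reliable way to tell which flank of the adapter is trustworthy.
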
@@ -209,7 +233,7 @@

> <hands-on-title>Primer removal with Cutadapt</hands-on-title>
>
> 1. {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4) %} with the following parameters:
> 1. {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/4.4+galaxy0) %} with the following parameters:
> - *"Single-end or Paired-end reads?"*: `Single-end`
> - {% icon param-collection %} *"FASTQ/A file"*: `HiFi_collection`
> - In *"Read 1 Options"*:
@@ -236,7 +260,7 @@
> >
> {: .tip}
>
> 2. Rename the output file as `HiFi_collection (trim)`.
> 2. Rename the output file as `HiFi_collection (trimmed)`.
>
> {% snippet faqs/galaxy/datasets_rename.md %}
>
@@ -291,7 +315,7 @@
> > We used 31 as *k*-mer size, as this length has been demonstrated to be sufficiently long that most *k*-mers are not repetitive, yet short enough to be more robust to sequencing errors. For very large (haploid size > 10 Gb) and/or very repetitive genomes, a larger *k*-mer length is recommended to increase the number of unique *k*-mers.
> {: .comment}
>
> 2. Rename it `Collection meryldb`
> 2. Rename it `meryldb`
>
> 3. Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} again with the following parameters:
> - *"Operation type selector"*: `Operations on sets of k-mers`
@@ -304,7 +328,7 @@
> - *"Operation type selector"*: `Generate histogram dataset`
> - {% icon param-file %} *"Input meryldb"*: `Merged meryldb`
>
> 6. Finally, rename it as `Meryldb histogram`.
> 6. Finally, rename it as `meryldb histogram`.
>
{: .hands_on}
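
The Meryl steps above boil down to two operations: count how often each *k*-mer occurs, then summarize those counts as a histogram. The sketch below illustrates both on toy data. It is a simplification and not Meryl's implementation: it uses `k=3` instead of 31, the reads are hypothetical, and it does not merge a *k*-mer with its reverse complement as real *k*-mer counters typically do.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (substring of length k) across all sequences."""
    counts = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


# Hypothetical toy reads; a real run uses the HiFi collection and k=31.
toy_reads = ["ACGTAC", "CGTACC"]
counts = count_kmers(toy_reads, k=3)
histogram = kmer_histogram(counts)
print(histogram)
```

In this toy case three *k*-mers (`CGT`, `GTA`, `TAC`) are seen twice and two (`ACG`, `ACC`) once; the histogram dataset passed to GenomeScope has the same shape, just with far larger numbers.
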

@@ -316,7 +340,7 @@
> <hands-on-title>Estimate genome properties</hands-on-title>
>
> 1. {% tool [GenomeScope](toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0+galaxy2) %} with the following parameters:
> - {% icon param-file %} *"Input histogram file"*: `Meryldb histogram`
> - {% icon param-file %} *"Input histogram file"*: `meryldb histogram`
> - *Ploidy for model to use*: `2`
> - *"k-mer length used to calculate k-mer spectra"*: `31`
>
@@ -342,6 +366,8 @@

![Genomescope plot](../../images/vgp_assembly/genomescope_plot.png "GenomeScope2 31-mer profile. The first peak located at coverage 25x corresponds to the heterozygous peak. The second peak at coverage 50x, corresponds to the homozygous peak. Estimate of the heterozygous portion is 0.576%. The plot also includes information about the inferred total genome length (len), genome unique length percent ('uniq'), overall heterozygosity rate ('ab'), mean k-mer coverage for heterozygous bases ('kcov'), read error rate ('err'), and average rate of read duplications ('dup'). It also reports the user-given parameters of k-mer size ('k') and ploidy ('p')."){:width="65%"}

<br>

This distribution is the result of the Poisson process underlying the generation of sequencing reads. As we can see, the *k*-mer profile follows a bimodal distribution, indicative of a diploid genome. The distribution is consistent with the theoretical diploid model (model fit > 93%). Low-frequency *k*-mers are the result of sequencing errors. GenomeScope2 estimated a haploid genome size of around 11.7 Mb, a value reasonably close to the *Saccharomyces* genome size. Additionally, it revealed that the variation across the genomic sequences is 0.576%.
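
The link between the peaks and the genome size can be sketched with simple arithmetic: in a diploid profile, dividing the total number of non-error *k*-mer occurrences by the homozygous peak coverage gives a rough haploid genome size. The code below is a back-of-the-envelope illustration only. GenomeScope2 fits a statistical model to the whole profile rather than using this shortcut, and the histogram values and the `error_cutoff` parameter are hypothetical.

```python
def estimate_haploid_size(histogram, homozygous_peak, error_cutoff):
    """Rough haploid genome size from a k-mer histogram.

    histogram maps multiplicity -> number of distinct k-mers.
    k-mers below error_cutoff are treated as sequencing errors and ignored.
    """
    total_occurrences = sum(
        multiplicity * n_kmers
        for multiplicity, n_kmers in histogram.items()
        if multiplicity >= error_cutoff
    )
    return round(total_occurrences / homozygous_peak)


# Hypothetical toy histogram: an error spike at low multiplicity,
# a heterozygous peak near 25x and a homozygous peak near 50x.
toy_histogram = {
    1: 10000, 2: 500,
    24: 40, 25: 60, 26: 40,
    49: 300, 50: 400, 51: 300,
}

estimate = estimate_haploid_size(toy_histogram, homozygous_peak=50, error_cutoff=5)
print(estimate)
```

Excluding the low-multiplicity spike matters: those *k*-mers come from sequencing errors, and counting them would inflate the estimate.
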

> <comment-title>Are you expecting to purge your assembly?</comment-title>
@@ -403,11 +429,12 @@
If you have the {Hi-C} data for the individual you are assembling with {HiFi} reads, then you can use that information to phase the {contigs}.

> <hands-on-title>Hi-C-phased assembly with <b>hifiasm</b></hands-on-title>
> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.18.8+galaxy1) %} with the following parameters:
> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.19.8+galaxy0) %} with the following parameters:
> - *"Assembly mode"*: `Standard`
> - {% icon param-file %} *"Input reads"*: `HiFi_collection (trim)` (output of **Cutadapt** {% icon tool %})
> - *"Hi-C R1 reads"*: `Hi-C_dataset_F`
> - *"Hi-C R2 reads"*: `Hi-C_dataset_R`
> - In *"Options for Hi-C-partition"* select `Specify`
> - *"Hi-C R1 reads"*: `Hi-C_dataset_F`
> - *"Hi-C R2 reads"*: `Hi-C_dataset_R`
>
> 2. After the tool has finished running, rename its outputs as follows:
> - Rename the `Hi-C hap1 balanced contig graph` as `Hap1 contigs graph` and add a `#hap1` tag
@@ -1191,7 +1218,7 @@
{: .details}


### Pre-processing Hi-C data

Check failure on line 1221 in topics/assembly/tutorials/vgp_genome_assembly/tutorial.md

GitHub Actions / lint

[rdjsonl] reported by reviewdog 🐶 You have skipped a heading level, please correct this. (Severity: ERROR; code: [GTN:028](https://github.com/galaxyproject/training-material/wiki/Error-Codes#gtn028); suggested replacement for the heading marker at line 1221, columns 1-4: `##`.)

<details><summary>Listing of Heading Levels</summary>

```
# Important terms to know
# VGP assembly pipeline overview
# Get data
## Uploading `fasta` datasets from Zenodo
## Uploading `fastqsanger.gz` datasets from Zenodo
## Organizing the data
### HiFi reads preprocessing with **cutadapt**
# Genome profile analysis
## Generation of _k_-mer spectra with **Meryl**
## Genome profiling with **GenomeScope2**
# Assembly with **hifiasm**
## Assembly evaluation
## HiC-phased assembly with **hifiasm**
## Pseudohaplotype assembly with **hifiasm**
## Purging the primary and alternate assemblies
### Parsing **purge_dups** cutoffs from **GenomeScope2** output
## Purging with **purge_dups**
### Read-depth analysis
### Generation of all versus all self-alignment
### Resolution of haplotigs and overlaps
### Process the alternate assembly
### Post-purge quality control
# Scaffolding
# Hybrid scaffolding with Bionano optical maps
## Evaluating Bionano scaffolds
# Hi-C scaffolding
### Pre-processing Hi-C data
### Generate initial Hi-C contact map
### SALSA2 scaffolding
### Evaluate the final genome assembly with Pretext
# Conclusion
```

</details>

Although Hi-C generates paired-end reads, we need to map each read separately. This is because most aligners assume that the distance between paired-end reads fits a known distribution, but in Hi-C data the insert size of the ligation product can vary from one base pair to hundreds of megabases ({% cite Lajoie2015 %}).
