galaxyproject · bgruening · Jan 12, 2024 · Nov 21, 2023 · Nov 27, 2023 · Nov 27, 2023
@@ -145,54 +145,78 @@
 >
 {: .details}
 
-The first step is to get the datasets from Zenodo. The VGP assembly pipeline uses data generated by a variety of technologies, including PacBio HiFi reads, Bionano optical maps, and Hi-C chromatin interaction maps.
-
-> <hands-on-title>Data upload</hands-on-title>
->
-> 1. Create a new history for this tutorial
-> 2. Import the files from [Zenodo]({{ page.zenodo_link }})
->
->    - Open the file {% icon galaxy-upload %} __upload__ menu
->    - Click on **Rule-based** tab
->    - *"Upload data as"*: `Datasets`
->    - Copy the tabular data, paste it into the textbox and press <kbd>Build</kbd>
->
->       ```
->   Hi-C_dataset_F   https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz?download=1   fastqsanger.gz    Hi-C
->   Hi-C_dataset_R   https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz?download=1   fastqsanger.gz    Hi-C
->   Bionano_dataset    https://zenodo.org/record/5550653/files/bionano.cmap?download=1   cmap    Bionano
->       ```
->
->    - From **Rules** menu select `Add / Modify Column Definitions`
->       - Click `Add Definition` button and select `Name`: column `A`
->       - Click `Add Definition` button and select `URL`: column `B`
->       - Click `Add Definition` button and select `Type`: column `C`
->       - Click `Add Definition` button and select `Name Tag`: column `D`
->    - Click `Apply` and press <kbd>Upload</kbd>
->   
-> 3. Import the remaining datasets from [Zenodo]({{ page.zenodo_link }})
->
->    - Open the file {% icon galaxy-upload %} __upload__ menu
->    - Click on **Rule-based** tab
->    - *"Upload data as"*: `Collections`
->    - Copy the tabular data, paste it into the textbox and press <kbd>Build</kbd>
->
->       ```
->   dataset_01    https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta?download=1  fasta    HiFi  HiFi_collection
->   dataset_02    https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta?download=1  fasta    HiFi  HiFi_collection
->   dataset_03    https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta?download=1  fasta    HiFi  HiFi_collection
->       ```
->
->    - From **Rules** menu select `Add / Modify Column Definitions`
->       - Click `Add Definition` button and select `List Identifier(s)`: column `A`
->       - Click `Add Definition` button and select `URL`: column `B`
->       - Click `Add Definition` button and select `Type`: column `C`
->       - Click `Add Definition` button and select `Group Tag`: column `D`
->       - Click `Add Definition` button and select `Collection Name`: column `E`
->    - Click `Apply` and press <kbd>Upload</kbd>
+The first step is to get the datasets from Zenodo. Specifically, we will be uploading two datasets:
+
+1. A set of PacBio {HiFi} reads in `fasta` format
+2. A set of Illumina {Hi-C} reads in `fastqsanger.gz` format
+
+## Uploading `fasta` datasets from Zenodo
+
+The following two steps demonstrate how to upload three PacBio {HiFi} datasets into your Galaxy history.
+
+> <hands-on-title><b>Uploading <tt>FASTA</tt> datasets from Zenodo</b></hands-on-title>
+>
+> **Step 1**: Copy the following URLs to the clipboard.
+>
+> (You can do this by clicking the {% icon copy %} button in the upper-right corner of the box below. The button appears when you hover over the box.)
+>
+>   ```
+>   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta
+>   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_02.fasta
+>   https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_03.fasta
+>   ```
+>
+> **Step 2**: Upload datasets into Galaxy.
 >
+> These datasets are in `fasta` format. Upload by following the steps shown in the figure below.
+>
+>![Uploading fasta files in Galaxy]({% link faqs/galaxy/images/upload_fastqsanger_via_url.png %} "Here we upload three fasta files. Compressed (.gz or .bz2) datasets are uploaded in exactly the same fashion by selecting an appropriate datatype (fasta.gz or fasta.bz2)")
 {: .hands_on}
 
+## Uploading `fastqsanger.gz` datasets from Zenodo
+
+Illumina {Hi-C} data is uploaded in essentially the same way, as shown in the following two steps.
+
+> <warning-title>DANGER: Make sure you choose the correct format!</warning-title>
+> When selecting the datatype in the "**Type (set all)**" drop-down, make sure you select `fastqsanger` or `fastqsanger.gz`, BUT NOT `fastqcssanger` or anything else!
+{: .warning}
+
+> <hands-on-title><b>Uploading <tt>fastqsanger.gz</tt> datasets from Zenodo</b></hands-on-title>
+>
+> **Step 1**: Copy the following URLs to the clipboard.
+>
+> (You can do this by clicking the {% icon copy %} button in the upper-right corner of the box below. The button appears when you hover over the box.)
+>
+>  ```
+>  https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz
+>  https://zenodo.org/record/5550653/files/SRR7126301_2.fastq.gz
+>  ```
+>
+> **Step 2**: Upload datasets into Galaxy.
+>
+> These datasets are in `fastqsanger.gz` format. Upload by following the steps shown in the figure below.
+>
+> ![Uploading fastqsanger.gz files in Galaxy]({% link /faqs/galaxy/images/upload_fastqsanger_via_url.png %} "Here we upload two fastqsanger.gz files. Uncompressed or bz2-compressed (.bz2) datasets are uploaded in exactly the same fashion by selecting an appropriate datatype (fastqsanger or fastqsanger.bz2)")
+{: .hands_on}
+
+> <warning-title>These datasets are large!</warning-title>
+> Hi-C datasets are large. It will take some time (~15 min) for them to be fully uploaded. Please be patient.
+{: .warning}
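
The datatype choice described above can be sketched as a small lookup. This is only an illustration: the helper name `suggest_datatype` and the suffix table are hypothetical and not part of Galaxy, and mapping `.fastq` to `fastqsanger` assumes the reads use Sanger-style quality encoding.

```python
from urllib.parse import urlparse

# Suffix-to-datatype table built only from the formats named in this tutorial.
# Assumption: .fastq files use Sanger-style quality encoding (fastqsanger).
SUFFIX_TO_DATATYPE = {
    ".fasta": "fasta",
    ".fasta.gz": "fasta.gz",
    ".fasta.bz2": "fasta.bz2",
    ".fastq": "fastqsanger",
    ".fastq.gz": "fastqsanger.gz",
    ".fastq.bz2": "fastqsanger.bz2",
}


def suggest_datatype(url):
    """Return the datatype to pick in "Type (set all)", or None if unknown."""
    # Use only the path so that query strings such as ?download=1 are ignored.
    filename = urlparse(url).path.rsplit("/", 1)[-1].lower()
    for suffix, datatype in SUFFIX_TO_DATATYPE.items():
        if filename.endswith(suffix):
            return datatype
    return None


if __name__ == "__main__":
    urls = [
        "https://zenodo.org/record/6098306/files/HiFi_synthetic_50x_01.fasta",
        "https://zenodo.org/record/5550653/files/SRR7126301_1.fastq.gz",
    ]
    for url in urls:
        print(url, "->", suggest_datatype(url))
```

Returning `None` for unrecognized suffixes is deliberate: it is safer to stop and check the format by hand than to guess a datatype.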
+
+## Organizing the data 
+
+If everything goes smoothly, your history will look as shown in Fig. 4 below. The three {HiFi} fasta files are better represented as a collection: {collection}. Importantly, the workflow we will be using for the analysis of our data takes a collection as input (it does not accept individual datasets). So let's create a collection using the steps outlined in the Tip {% icon tip %} "Creating a dataset collection" that you can find below Fig. 4.
+
+![AfterUpload](../../images/vgp_assembly/making_list.svg "History after uploading HiFi and Hi-C data (left). Creation of a list (collection) combines all HiFi datasets into a single history item called 'HiFi data' (right). See below for instructions on how to make this collection.")
+
+{% snippet faqs/galaxy/collections_build_list.md %}
+
+> <details-title>Other ways to upload the data</details-title>
+> You can also upload your own datasets via URLs, as illustrated above, or from your own computer. In addition, you can upload data from a major repository called [GenomeArk](https://genomeark.org). GenomeArk is integrated directly into Galaxy Upload. To use GenomeArk, follow the steps in the Tip {% icon tip %} below:
+>
+> {% snippet faqs/galaxy/dataset_upload_from_genomeark.md %}
+{: .details}
+
 ### HiFi reads preprocessing with **cutadapt**
 
 Adapter trimming usually means trimming the adapter sequence off the ends of reads, which is where the adapter sequence is usually located in {NGS} reads. However, due to the nature of {SMRT} sequencing technology, adapters do not have a specific, predictable location in  {HiFi} reads. Additionally, the reads containing adapter sequence could be of generally lower quality compared to the rest of the reads. Thus, we will use **cutadapt** not to trim, but to remove the entire read if a read is found to have an adapter inside of it.
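
The difference between trimming and discarding can be illustrated with a minimal sketch. It is not how **cutadapt** works internally: it uses exact substring matching only (no tolerance for sequencing errors), and the adapter sequence in the usage example is a made-up placeholder, not a real PacBio adapter.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def discard_reads_with_adapter(reads, adapters):
    """Keep only reads that contain none of the adapters on either strand.

    reads: dict mapping read name -> sequence
    adapters: iterable of adapter sequences
    """
    patterns = set()
    for adapter in adapters:
        patterns.add(adapter.upper())
        patterns.add(reverse_complement(adapter.upper()))
    # A read is dropped entirely if any pattern occurs anywhere inside it.
    return {
        name: seq
        for name, seq in reads.items()
        if not any(pattern in seq.upper() for pattern in patterns)
    }


if __name__ == "__main__":
    placeholder_adapter = "AAGGTTCC"  # hypothetical sequence for illustration
    reads = {
        "clean_read": "ACACACACACAC",
        "adapter_inside": "TTTTAAGGTTCCGGGG",
        "adapter_revcomp_inside": "CCCCGGAACCTTAAAA",
    }
    kept = discard_reads_with_adapter(reads, [placeholder_adapter])
    print(sorted(kept))
```

Only `clean_read` survives in this example; the other two reads are removed whole rather than shortened.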
@@ -209,7 +233,7 @@
 
 > <hands-on-title>Primer removal with Cutadapt</hands-on-title>
 >
-> 1. {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/3.4) %} with the following parameters:
+> 1. {% tool [Cutadapt](toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/4.4+galaxy0) %} with the following parameters:
 >    - *"Single-end or Paired-end reads?"*: `Single-end`
 >        - {% icon param-collection %} *"FASTQ/A file"*: `HiFi_collection`
 >        - In *"Read 1 Options"*:
@@ -236,7 +260,7 @@
 >    >
 >    {: .tip}
 >
-> 2. Rename the output file as `HiFi_collection (trim)`.
+> 2. Rename the output file as `HiFi_collection (trimmed)`.
 >
 > {% snippet faqs/galaxy/datasets_rename.md %}
 >
@@ -291,7 +315,7 @@
 >    > We used 31 as the *k*-mer size, as this length has been shown to be long enough that most *k*-mers are not repetitive, yet short enough to be robust to sequencing errors. For very large (haploid size > 10 Gb) and/or very repetitive genomes, a larger *k*-mer length is recommended to increase the number of unique *k*-mers.
 >    {: .comment}
 >
-> 2. Rename it `Collection meryldb`
+> 2. Rename it `meryldb`
 >
 > 3. Run {% tool [Meryl](toolshed.g2.bx.psu.edu/repos/iuc/meryl/meryl/1.3+galaxy6) %} again with the following parameters:
 >    - *"Operation type selector"*: `Operations on sets of k-mers`
@@ -304,7 +328,7 @@
 >    - *"Operation type selector"*: `Generate histogram dataset`
 >        - {% icon param-file %} *"Input meryldb"*: `Merged meryldb`
 >
-> 6. Finally, rename it as `Meryldb histogram`.
+> 6. Finally, rename it as `meryldb histogram`.
 >
 {: .hands_on}
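
The idea behind the steps above (count *k*-mers per dataset, merge the counts, then build a histogram) can be sketched in a few lines. This is a toy illustration, not Meryl's implementation: it assumes that merging means summing counts across datasets, it does not collapse reverse-complement *k*-mers, and it keeps everything in memory.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def merge_counts(per_dataset_counts):
    """Merge per-dataset k-mer counts by summing them (an assumption)."""
    merged = Counter()
    for counts in per_dataset_counts:
        merged.update(counts)
    return merged


def kmer_histogram(counts):
    """Return sorted (multiplicity, number of distinct k-mers) pairs."""
    return sorted(Counter(counts.values()).items())


if __name__ == "__main__":
    dataset_1 = count_kmers(["ACGTACGT"], k=4)
    dataset_2 = count_kmers(["ACGT"], k=4)
    merged = merge_counts([dataset_1, dataset_2])
    for multiplicity, n_kmers in kmer_histogram(merged):
        print(multiplicity, n_kmers)
```

The histogram is the only part passed on to the next step: it records how many distinct *k*-mers were seen once, twice, and so on, and discards the sequences themselves.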
 
@@ -316,7 +340,7 @@
 > <hands-on-title>Estimate genome properties</hands-on-title>
 >
 > 1. {% tool [GenomeScope](toolshed.g2.bx.psu.edu/repos/iuc/genomescope/genomescope/2.0+galaxy2) %} with the following parameters:
->    - {% icon param-file %} *"Input histogram file"*: `Meryldb histogram`
+>    - {% icon param-file %} *"Input histogram file"*: `meryldb histogram`
 >    - *Ploidy for model to use*: `2`
 >    - *"k-mer length used to calculate k-mer spectra"*: `31`
 >
@@ -342,6 +366,8 @@
 
 ![Genomescope plot](../../images/vgp_assembly/genomescope_plot.png "GenomeScope2 31-mer profile. The first peak located at coverage 25x corresponds to the heterozygous peak. The second peak at coverage 50x, corresponds to the homozygous peak. Estimate of the heterozygous portion is 0.576%. The plot also includes information about the inferred total genome length (len), genome unique length percent ('uniq'), overall heterozygosity rate ('ab'), mean k-mer coverage for heterozygous bases ('kcov'), read error rate ('err'), and average rate of read duplications ('dup'). It also reports the user-given parameters of k-mer size ('k') and ploidy ('p')."){:width="65%"}
 
+<br>
+
 This distribution is the result of the Poisson process underlying the generation of sequencing reads. As we can see, the *k*-mer profile follows a bimodal distribution, indicative of a diploid genome. The distribution is consistent with the theoretical diploid model (model fit > 93%). Low-frequency *k*-mers are the result of sequencing errors. GenomeScope2 estimated a haploid genome size of around 11.7 Mb, a value reasonably close to the *Saccharomyces* genome size. Additionally, it revealed that the variation across the genomic sequences is 0.576%.
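
As a rough intuition for where a genome size estimate comes from, the total number of non-error *k*-mer occurrences can be divided by the homozygous peak coverage. The sketch below does only that back-of-the-envelope arithmetic. It is not GenomeScope2's method, which fits a statistical model to the whole profile, and the histogram values are invented toy numbers chosen to match the 25x and 50x peaks described above.

```python
def estimate_haploid_size(histogram, homozygous_peak, error_cutoff):
    """Naive haploid genome size estimate from a k-mer histogram.

    histogram: iterable of (multiplicity, number of distinct k-mers)
    homozygous_peak: coverage of the homozygous peak (e.g. 50)
    error_cutoff: multiplicities below this are treated as sequencing errors
    """
    total_occurrences = sum(
        multiplicity * n_kmers
        for multiplicity, n_kmers in histogram
        if multiplicity >= error_cutoff
    )
    return total_occurrences / homozygous_peak


if __name__ == "__main__":
    # Toy histogram: an error peak at 1x, a heterozygous peak at 25x,
    # and a homozygous peak at 50x. These counts are made up.
    toy_histogram = [(1, 1_000_000), (25, 100_000), (50, 11_650_000)]
    size = estimate_haploid_size(toy_histogram, homozygous_peak=50, error_cutoff=5)
    print(f"{size / 1e6:.1f} Mb")
```

Each heterozygous *k*-mer sits on only one haplotype, so at half the coverage it contributes half a position to the haploid estimate, which is why both peaks can be summed in the same formula.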
 
 > <comment-title>Are you expecting to purge your assembly?</comment-title>
@@ -403,11 +429,12 @@
 If you have the {Hi-C} data for the individual you are assembling with {HiFi} reads, then you can use that information to phase the {contigs}.
 
 > <hands-on-title>Hi-C-phased assembly with <b>hifiasm</b></hands-on-title>
-> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.18.8+galaxy1) %} with the following parameters:
+> 1. {% tool [Hifiasm](toolshed.g2.bx.psu.edu/repos/bgruening/hifiasm/hifiasm/0.19.8+galaxy0) %} with the following parameters:
 >    - *"Assembly mode"*: `Standard`
 >        - {% icon param-file %} *"Input reads"*: `HiFi_collection (trimmed)` (output of **Cutadapt** {% icon tool %})
->       - *"Hi-C R1 reads"*: `Hi-C_dataset_F`
->       - *"Hi-C R2 reads"*: `Hi-C_dataset_R`
+>       - In *"Options for Hi-C-partition"* select `Specify`
+>         - *"Hi-C R1 reads"*: `Hi-C_dataset_F`
+>         - *"Hi-C R2 reads"*: `Hi-C_dataset_R`
 >
 > 2. After the tool has finished running, rename its outputs as follows:
 >   - Rename the `Hi-C hap1 balanced contig graph` as `Hap1 contigs graph` and add a `#hap1` tag
@@ -1191,7 +1218,7 @@
 {: .details}


 ### Pre-processing Hi-C data

 Although Hi-C generates paired-end reads, we need to map each read separately. This is because most aligners assume that the distance between paired-end reads fits a known distribution, but in Hi-C data the insert size of the ligation product can vary from one base pair to hundreds of megabases ({% cite Lajoie2015 %}).