Commit 61097d0: Formalise bin refinement lesson

JSBoey committed Aug 27, 2024 (1 parent: 6dca3a5)

Showing 1 changed file with 166 additions and 15 deletions: `docs/day2/ex9_refining_bins.md`

In the interests of time today, the input files have been generated and are provided in the `6.bin_refinement/` folder:

* `all_fragments.fna` is a concatenation of the bins of *fragmented* sub-contigs (fragmented to 20k)
* `all_fragments.sample1.vizbin.ann` is the annotation file containing per-subcontig coverage, label (bin ID), and length values.

!!! note "Contig fragments as input for `VizBin`"

When running `VizBin`, it is often preferable to split long contigs into smaller pieces in order to increase the density of clustering in the **t-SNE**. The data we are working with today are based on our bins output by `DAS_Tool` in the last binning exercise, but have been further processed using the `cut_up_fasta.py` script that comes with the binning tool `CONCOCT` to cut long contigs into 20k fragments. When reviewing our `VizBin` plots and outputs, it is important to remember that here we are looking at the **fragmented sub-contigs**, rather than the full complete contigs.

For future reference, and for working with your own data, a step-by-step process for generating these files from the dereplicated bins generated by `DAS_Tool` has been provided as an [Appendix](../resources/2_APPENDIX_ex9_Generating_input_files_for_VizBin.md).
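For a feel of what that fragmentation step does, here is a toy sketch. This is not the CONCOCT script itself; the 10 bp chunk size and file paths are purely illustrative (on real data you would run `cut_up_fasta.py` with a 20 kbp chunk size, as described in the Appendix):

```bash
# Toy sketch of contig fragmentation: split each (single-line) FASTA record
# into fixed-size pieces, mimicking what cut_up_fasta.py does at 20 kbp.
cat > /tmp/toy.fna <<'EOF'
>contig_1
AAAAAAAAAATTTTTTTTTTGGGGG
EOF

awk -v size=10 '
  /^>/ { name = substr($0, 2); n = 0; next }
  {
    for (i = 1; i <= length($0); i += size)
      printf(">%s.%d\n%s\n", name, n++, substr($0, i, size))
  }
' /tmp/toy.fna > /tmp/toy.chopped.fna

cat /tmp/toy.chopped.fna
# >contig_1.0
# AAAAAAAAAA
# >contig_1.1
# TTTTTTTTTT
# >contig_1.2
# GGGGG
```

Note that the final fragment can be shorter than the chunk size; the real `cut_up_fasta.py` has a `--merge_last` option to fold a short tail into the previous fragment.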

For this section, we will be working within `6.bin_refinement/`. Let's first have a look at these files.
!!! terminal-2 "Inspect `all_fragments.sample1.vizbin.ann`"

```bash
head -n 5 all_fragments.sample1.vizbin.ann
```

!!! circle-check "Terminal output"

```
coverage,label,length
17.6626,bin_0.chopped,20000
15.9561,bin_0.chopped,20000
17.294,bin_0.chopped,20000
15.8157,bin_0.chopped,20000
```

This file is a comma-delimited table (csv file) that presents the information in the way that `VizBin` expects it. The order of rows in this file corresponds to the order of contigs in the concatenated FASTA file of our fragmented bins, `all_fragments.fna`.

Create a few variations of the *.ann* file, each with different columns removed, to examine the different outputs they generate.

=== "Bin ID only"

```bash
cut -f2 -d ',' all_fragments.sample1.vizbin.ann > all_fragments.sample1.vizbin.bin_only.ann
```

=== "Bin ID and coverage without length"

```bash
cut -f1,2 -d ',' all_fragments.sample1.vizbin.ann > all_fragments.sample1.vizbin.no_length.ann
```
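To see exactly what the two `cut` commands above keep, you can run them over a tiny throwaway annotation file (the file path and values here are illustrative):

```bash
# Throwaway file in the same format as all_fragments.sample1.vizbin.ann
cat > /tmp/demo.ann <<'EOF'
coverage,label,length
17.6626,bin_0.chopped,20000
15.9561,bin_1.chopped,20000
EOF

# Keep the bin ID (second comma-delimited field) only
cut -f2 -d ',' /tmp/demo.ann
# label
# bin_0.chopped
# bin_1.chopped

# Keep coverage and bin ID, dropping length
cut -f1,2 -d ',' /tmp/demo.ann
# coverage,label
# 17.6626,bin_0.chopped
# 15.9561,bin_1.chopped
```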

!!! hint "Symbolic links for easy access"

By default, `VizBin`'s file chooser opens in your home directory. To make it easy to come back to our working directory, we can make a symbolic link (i.e. a shortcut) in our home directory that points here.

```bash
ln -sr $(pwd) ~/
```

The flags mean:

* `-s` creates a symbolic link instead of a hard link
* `-r` makes the stored link target relative to the link's location
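A minimal throwaway demonstration of what the link ends up looking like (the `/tmp` paths stand in for your home and working directories; note that `-r` requires GNU `ln`):

```bash
# Set up a pretend home and working directory under /tmp
rm -rf /tmp/vb_home /tmp/vb_work
mkdir -p /tmp/vb_home /tmp/vb_work/6.bin_refinement
cd /tmp/vb_work/6.bin_refinement

# Link the current directory into the pretend home
ln -sr "$(pwd)" /tmp/vb_home/

# The link is symbolic, and its target is stored as a relative path
ls -l /tmp/vb_home/
readlink /tmp/vb_home/6.bin_refinement
```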

---

## Projecting a *t-SNE* and examining bin clusters
If this fails to open on your PC, or if it runs prohibitively slowly, team up with a neighbour.
!!! circle-check "Terminal output"

```
all_fragments.fna example_data_unchopped sample3.vb_tbl.csv
all_fragments.sample1.vizbin.ann mock_bins VizBin-dist.jar
example_data_20k mock_bins_checkm_out
example_data_20k_cov.txt mock_bins.checkm.txt
```

4. Type the following into your Virtual Desktop terminal to initiate VizBin.

### Load input files

Once `VizBin` is open, to get started, click the 'Choose...' button then navigate to the FASTA file `all_fragments.fna`.

!!! tip "`VizBin` directory"

For now, leave all other parameters as default. Click the 'Start' button to begin.

### Contigs coloured by bin


<center>![image](../figures/ex10_bin_only_2022.png){width="600"}</center>

??? note "Additional annotations by length and coverage"

Similar to other projection techniques, we interpret the closeness of points as a proxy for how similar they are, and because of our *.ann* file we can see which contigs belong to the same bin.

!!! question "What do scaffolds look like?"

In the example above, we used fragmented scaffolds as input files for VizBin. Take a look at what unfragmented scaffolds look like in VizBin. You can run the following code to generate input files for VizBin to visualise scaffolds.

!!! terminal "code"

```bash linenums="1"
echo "label" > all_scaffolds.vizbin.ann
for fasta in example_data_unchopped/*.fna; do
bin=$(basename ${fasta} .fna)
cat ${fasta} >> all_scaffolds.fna
grep '>' ${fasta} | sed "s/.*/${bin}/g" >> all_scaffolds.vizbin.ann
done
```

Import the newly generated `all_scaffolds.fna` and `all_scaffolds.vizbin.ann` into VizBin to visualise them.

---

## Picking sequences

We can use the interactive GUI to pick the boundaries of new bins, or to identify contigs which we do not believe should be retained in the data. Have a play around with the interface, testing out the following commands:


How you proceed in this stage is up to you. You can either select bins based on their boundary, and call these the refined bins. Alternatively, you could select outlier contigs and examine these in more detail to determine whether or not they were correctly placed into the bin. Which way you proceed really depends on how well the ordination resolves your bins, and it might be that both approaches are needed.

<!--
Today, we will run through an example of selecting potentially problematic (sub)contigs, and then deciding whether or not we want to filter these contigs out of our refined bins. We can use a combination of `VizBin` and `seqmagick` to remove contigs from bins where we do not trust the placement of the contig. We are aiming to reduce each bin to a trusted set of contigs.
-->

## 1. Export `VizBin` clusters

![image](../figures/ex10_export_2022.png){width="600"}
</center>

## 2. Refining bins

VizBin is a general-purpose tool for contig/scaffold/fragment visualisation. For this workshop, we're going to attempt to refine a few bins. Here, we will:

* Diagnose and visualise contigs
* Export clusters of contigs
* Check if genome metrics improve after manual refinement

For this exercise, we will be using data from the `mock_bins/` sub-directory and additional genome metrics in `mock_bins.checkm.txt`. These files were generated from a different assembly (reads assembled using MEGAHIT) based on a modified sample 3 library, then binned using MetaBAT2 and MaxBin as per previous lessons. The assembly is quite a bit more fragmented, which saves us from fragmenting the input ourselves!

??? question "Do you remember how to check an assembly?"

Use BBMap's `stats.sh` (see [here](../day1/ex5_evaluating_assemblies.md#evaluating-the-assemblies-using-bbmap))

### Inspect the genome metrics

Open up the file named `mock_bins.checkm.txt`. Take note of the metrics of each bin and consider what we might want to improve on.
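If you want to eyeball the table sorted by completeness, a quick sketch follows. The column layout and values here are made up for illustration; check the real header of `mock_bins.checkm.txt` first, as CheckM's tab table has more columns than this:

```bash
# Toy table mimicking a CheckM --tab_table layout (values are made up)
printf 'Bin Id\tCompleteness\tContamination\n' >  /tmp/mock_checkm.txt
printf 'mock_bin_1\t92.5\t4.1\n'               >> /tmp/mock_checkm.txt
printf 'mock_bin_2\t61.0\t12.7\n'              >> /tmp/mock_checkm.txt
printf 'mock_bin_3\t99.1\t0.8\n'               >> /tmp/mock_checkm.txt

# Sort bins by completeness, highest first, keeping the header on top
head -n 1 /tmp/mock_checkm.txt
tail -n +2 /tmp/mock_checkm.txt | sort -t"$(printf '\t')" -k2,2nr
# mock_bin_3 comes first (99.1), then mock_bin_1 (92.5), then mock_bin_2 (61.0)
```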

### Prepare files for VizBin

List the contents of `mock_bins/`; there are 4 FASTA files. We will need to generate a concatenated set of sequences and a bin-contig map for colours.

!!! terminal "code"

```bash
# VizBin requires a header line in annotation files
echo "label" > all_mock_bins.label.ann

# Generate the concatenated set of sequences and the labels at the same time!
for bin in mock_bins/*.fna; do
binID=$(basename ${bin} .fna)
grep '>' ${bin} | sed "s/.*/${binID}/g" >> all_mock_bins.label.ann
cat ${bin} >> all_mock_bins.fna
done
```
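A quick sanity check that the loop produced one label per sequence, sketched here end to end with throwaway mock bins under `/tmp` (the bin names and sequences are made up):

```bash
# Build tiny throwaway mock bins so the check can be demonstrated
rm -rf /tmp/label_demo && mkdir -p /tmp/label_demo/mock_bins && cd /tmp/label_demo
printf '>c1\nACGT\n>c2\nGGCC\n' > mock_bins/bin_a.fna
printf '>c3\nTTAA\n' > mock_bins/bin_b.fna

# Same loop as above
echo "label" > all_mock_bins.label.ann
for bin in mock_bins/*.fna; do
binID=$(basename ${bin} .fna)
grep '>' ${bin} | sed "s/.*/${binID}/g" >> all_mock_bins.label.ann
cat ${bin} >> all_mock_bins.fna
done

# One label per sequence: these two numbers must match
seqs=$(grep -c '>' all_mock_bins.fna)
labels=$(( $(wc -l < all_mock_bins.label.ann) - 1 ))   # minus the header line
echo "${seqs} sequences, ${labels} labels"
# 3 sequences, 3 labels
```

If the two numbers disagree, VizBin will mis-assign colours, since it matches annotation rows to sequences by order alone.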

### Prepare output directory

We also need to prepare an output directory for the clusters that we export.

!!! terminal "code"

```bash linenums="1"
mkdir -p vb_export
```

### Load the files into VizBin

Return to the VizBin set-up dialogue in the Virtual Desktop. Select the following input files we just made:

**File to visualize:** `all_mock_bins.fna`

**Annotation file:** `all_mock_bins.label.ann`

### Export contig clusters

Make your selection around clusters that you think should form bins. Use the CheckM output from earlier to help inform your decisions.

Once you've made your selection, export the sequences into the `vb_export/` directory we made earlier. Name your new clusters in a way that you can easily recognise the original bins they came from. For example, if most of the contigs were from `mock_bin_3`, perhaps name the new cluster `mock_bin_3.cluster`. If you're splitting a bin into several clusters, we recommend you name them something like `mock_bin_3.cluster_1` and `mock_bin_3.cluster_2`.

!!! tip "Remember to add the `.fna` suffix to your new cluster filenames!"

### Check exported clusters

Moment of truth! How did your decisions impact the genome metrics of each bin? Run your selections through CheckM and see how you did!

!!! terminal "code"

```sh linenums="1"
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name CheckM_vb_exports
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 50GB
#SBATCH --cpus-per-task 10
#SBATCH --error %x_%j.err
#SBATCH --output %x_%j.out

# Load modules
module purge
module load CheckM/1.2.3-foss-2023a-Python-3.11.6

# Working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement/

# Run CheckM
checkm lineage_wf -t $SLURM_CPUS_PER_TASK \
  --pplacer_threads $SLURM_CPUS_PER_TASK \
  -x fna --tab_table -f vb_export.checkm.txt \
  vb_export/ vb_export.checkm_out/
```

### Human refinement vs automated binning

Did you do better than the automated binning software? Consider the following:

* Were the new clusters more complete?
* How much contamination was removed?
* Was there a substantial trade-off between completeness and contamination?
* How would you pre-process the sequences differently prior to manual refinement?
* Is there additional information that would have helped you in the decision-making process?

When your CheckM run finishes, check and compare how you did!
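One way to line the two CheckM runs up side by side, sketched on toy tables (the file names, column order, and values are all illustrative; on real output you would first cut the bin ID, completeness, and contamination columns from each table):

```bash
# Toy per-bin tables: bin ID, completeness, contamination (values made up)
printf 'mock_bin_1\t92.5\t4.1\n'  >  /tmp/before.txt
printf 'mock_bin_2\t61.0\t12.7\n' >> /tmp/before.txt
printf 'mock_bin_1\t93.0\t1.2\n'  >  /tmp/after.txt
printf 'mock_bin_2\t60.5\t3.4\n'  >> /tmp/after.txt

# join pairs rows on the bin ID (both files must be sorted on that column),
# then awk reorders the fields so before/after values sit next to each other
join -t"$(printf '\t')" /tmp/before.txt /tmp/after.txt \
  | awk -F'\t' 'BEGIN { OFS="\t"; print "bin","comp_before","comp_after","cont_before","cont_after" }
                { print $1, $2, $4, $3, $5 }'
```

This makes it easy to spot, for example, a bin whose contamination dropped sharply at only a small cost in completeness.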

<!--
### Food for thought: bin diagnosis and refinement
Did you do better or worse? Which metrics improved and which deteriorated? What were some trade-offs
## 2. Export potentially problematic contigs
### Select problematic contigs to examine
Try this for one or two problematic contigs (or subsets of contigs). In practice
For the subsequent step using `vizbin_count_table_2022.sh`, all exported cluster files must share a common prefix (e.g. `cluster...fna`), and all files of problematic contigs must also share a common prefix (e.g. `contigs...fna`).*
---
-->




<!--
## *(Optional)* Refining and filtering problematic contigs from bins
### Create a count table of counts of our problematic contigs across each bin
A suite of tools for creating input files for `ESOMana` can be found on github [
The tool `ESOMana` can be downloaded from [SourceForge](http://databionic-esom.sourceforge.net/).
---
-->
