Commit 61097d0: Formalise bin refinement lesson

JSBoey committed Aug 27, 2024 (1 parent: 6dca3a5)

Showing 1 changed file with 166 additions and 15 deletions: `docs/day2/ex9_refining_bins.md`

In the interests of time today, the input files have been generated and are provided in the `6.bin_refinement/` folder:

* `all_fragments.fna` is a concatenation of the bins of *fragmented* sub-contigs (fragmented to 20k)
* `all_fragments.sample1.vizbin.ann` is the annotation file containing per-subcontig coverage, label (bin ID), and length values.

!!! note "Contig fragments as input for `VizBin`"

When running `VizBin`, it is often preferable to split long contigs into smaller pieces in order to increase the density of clustering in the **t-SNE**. The data we are working with today are based on our bins output by `DAS_Tool` in the last binning exercise, but have been further processed using the `cut_up_fasta.py` script that comes with the binning tool `CONCOCT` to cut long contigs into 20k fragments. When reviewing our `VizBin` plots and outputs, it is important to remember that here we are looking at the **fragmented sub-contigs**, rather than the full complete contigs.

For future reference, and for working with your own data, a step-by-step process for generating these files from the dereplicated bins generated by `DAS_Tool` has been provided as an [Appendix](../resources/2_APPENDIX_ex9_Generating_input_files_for_VizBin.md).
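For a feel of what that fragmentation step does, here is a toy sketch. This is not the CONCOCT script itself; the 10 bp chunk size and file paths are purely illustrative (on real data you would run `cut_up_fasta.py` with a 20 kbp chunk size, as described in the Appendix):

```bash
# Toy sketch of contig fragmentation: split each (single-line) FASTA record
# into fixed-size pieces, mimicking what cut_up_fasta.py does at 20 kbp.
cat > /tmp/toy.fna <<'EOF'
>contig_1
AAAAAAAAAATTTTTTTTTTGGGGG
EOF

awk -v size=10 '
  /^>/ { name = substr($0, 2); n = 0; next }
  {
    for (i = 1; i <= length($0); i += size)
      printf(">%s.%d\n%s\n", name, n++, substr($0, i, size))
  }
' /tmp/toy.fna > /tmp/toy.chopped.fna

cat /tmp/toy.chopped.fna
# >contig_1.0
# AAAAAAAAAA
# >contig_1.1
# TTTTTTTTTT
# >contig_1.2
# GGGGG
```

Note that the final fragment can be shorter than the chunk size; the real `cut_up_fasta.py` has a `--merge_last` option to fold a short tail into the previous fragment.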

For this section, we will be working within `6.bin_refinement/`. Let's first have a look at these files.
!!! terminal-2 "Inspect `all_fragments.sample1.vizbin.ann`"

```bash
head -n 5 all_fragments.sample1.vizbin.ann
```

!!! circle-check "Terminal output"

```
coverage,label,length
17.6626,bin_0.chopped,20000
15.9561,bin_0.chopped,20000
17.294,bin_0.chopped,20000
15.8157,bin_0.chopped,20000
```

This file is a comma-delimited table (csv file) that presents the information in the way that `VizBin` expects it. The order of rows in this file corresponds to the order of contigs in the concatenated FASTA file of our fragmented bins, `all_fragments.fna`.

Create a few variations of the *.ann* file, each with different columns removed, to examine the different outputs they generate.

=== "Bin ID only"

```bash
cut -f2 -d ',' all_fragments.sample1.vizbin.ann > all_fragments.sample1.vizbin.bin_only.ann
```

=== "Bin ID and coverage without length"

```bash
cut -f1,2 -d ',' all_fragments.sample1.vizbin.ann > all_fragments.sample1.vizbin.no_length.ann
```
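To see exactly what the two `cut` commands above keep, you can run them over a tiny throwaway annotation file (the file path and values here are illustrative):

```bash
# Throwaway file in the same format as all_fragments.sample1.vizbin.ann
cat > /tmp/demo.ann <<'EOF'
coverage,label,length
17.6626,bin_0.chopped,20000
15.9561,bin_1.chopped,20000
EOF

# Keep the bin ID (second comma-delimited field) only
cut -f2 -d ',' /tmp/demo.ann
# label
# bin_0.chopped
# bin_1.chopped

# Keep coverage and bin ID, dropping length
cut -f1,2 -d ',' /tmp/demo.ann
# coverage,label
# 17.6626,bin_0.chopped
# 15.9561,bin_1.chopped
```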

!!! hint "Symbolic links for easy access"

By default, `VizBin`'s file chooser opens in your home directory. To make it easy to come back to our working directory, we can make a symbolic link (i.e. a shortcut) in our home directory that points here.

```bash
ln -sr $(pwd) ~/
```

The flags mean:

* `-s` creates a symbolic link instead of a hard link
* `-r` makes the stored link target relative to the link's location
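A minimal throwaway demonstration of what the link ends up looking like (the `/tmp` paths stand in for your home and working directories; note that `-r` requires GNU `ln`):

```bash
# Set up a pretend home and working directory under /tmp
rm -rf /tmp/vb_home /tmp/vb_work
mkdir -p /tmp/vb_home /tmp/vb_work/6.bin_refinement
cd /tmp/vb_work/6.bin_refinement

# Link the current directory into the pretend home
ln -sr "$(pwd)" /tmp/vb_home/

# The link is symbolic, and its target is stored as a relative path
ls -l /tmp/vb_home/
readlink /tmp/vb_home/6.bin_refinement
```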

---

## Projecting a *t-SNE* and examining bin clusters
If this fails to open on your PC, or if it runs prohibitively slowly, team up with a neighbour.
!!! circle-check "Terminal output"

```
all_fragments.fna example_data_unchopped sample3.vb_tbl.csv
all_fragments.sample1.vizbin.ann mock_bins VizBin-dist.jar
example_data_20k mock_bins_checkm_out
example_data_20k_cov.txt mock_bins.checkm.txt
```

4. Type the following into your Virtual Desktop terminal to initiate VizBin.

### Load input files

Once `VizBin` is open, to get started, click the 'Choose...' button then navigate to the FASTA file `all_fragments.fna`.

!!! tip "`VizBin` directory"

For now, leave all other parameters as default. Click the 'Start' button to begin.

### Contigs coloured by bin


<center>![image](../figures/ex10_bin_only_2022.png){width="600"}</center>

??? note "Additional annotations by length and coverage"

Similar to other projection techniques, we interpret the closeness of points as a proxy for how similar they are, and because of our *.ann* file we can see which contigs belong to the same bin.

!!! question "What do scaffolds look like?"

In the example above, we used fragmented scaffolds as input files for VizBin. Take a look at what unfragmented scaffolds look like in VizBin. You can run the following code to generate input files for VizBin to visualise scaffolds.

!!! terminal "code"

```bash linenums="1"
echo "label" > all_scaffolds.vizbin.ann
for fasta in example_data_unchopped/*.fna; do
bin=$(basename ${fasta} .fna)
cat ${fasta} >> all_scaffolds.fna
grep '>' ${fasta} | sed "s/.*/${bin}/g" >> all_scaffolds.vizbin.ann
done
```

Import the newly generated `all_scaffolds.fna` and `all_scaffolds.vizbin.ann` into VizBin to visualise them.

---

## Picking sequences

We can use the interactive GUI to pick the boundaries of new bins, or to identify contigs which we do not believe should be retained in the data. Have a play around with the interface, testing out the following commands:


How you proceed in this stage is up to you. You can either select bins based on their boundary, and call these the refined bins. Alternatively, you could select outlier contigs and examine these in more detail to determine whether or not they were correctly placed into the bin. Which way you proceed really depends on how well the ordination resolves your bins, and it might be that both approaches are needed.

<!--
Today, we will run through an example of selecting potentially problematic (sub)contigs, and then deciding whether or not we want to filter these contigs out of our refined bins. We can use a combination of `VizBin` and `seqmagick` to remove contigs from bins where we do not trust the placement of the contig. We are aiming to reduce each bin to a trusted set of contigs.
-->

## 1. Export `VizBin` clusters

![image](../figures/ex10_export_2022.png){width="600"}
</center>

## 2. Refining bins

VizBin is a general-purpose tool for contig/scaffold/fragment visualisation. For this workshop, we're going to attempt to refine a few bins. Here, we will:

* Diagnose and visualise contigs
* Export clusters of contigs
* Check if genome metrics improve after manual refinement

For this exercise, we will be using data from the `mock_bins/` sub-directory and additional genome metrics in `mock_bins.checkm.txt`. These files were generated from a different assembly (reads assembled using MEGAHIT) based on a modified sample 3 library, then binned using MetaBAT2 and MaxBin as per previous lessons. The assembly is quite a bit more fragmented, which saves us from fragmenting the input ourselves!

??? question "Do you remember how to check an assembly?"

Use BBMap's `stats.sh` (see [here](../day1/ex5_evaluating_assemblies.md#evaluating-the-assemblies-using-bbmap))

### Inspect the genome metrics

Open up the file named `mock_bins.checkm.txt`. Take note of the metrics of each bin and consider what we might want to improve on.
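If you want to eyeball the table sorted by completeness, a quick sketch follows. The column layout and values here are made up for illustration; check the real header of `mock_bins.checkm.txt` first, as CheckM's tab table has more columns than this:

```bash
# Toy table mimicking a CheckM --tab_table layout (values are made up)
printf 'Bin Id\tCompleteness\tContamination\n' >  /tmp/mock_checkm.txt
printf 'mock_bin_1\t92.5\t4.1\n'               >> /tmp/mock_checkm.txt
printf 'mock_bin_2\t61.0\t12.7\n'              >> /tmp/mock_checkm.txt
printf 'mock_bin_3\t99.1\t0.8\n'               >> /tmp/mock_checkm.txt

# Sort bins by completeness, highest first, keeping the header on top
head -n 1 /tmp/mock_checkm.txt
tail -n +2 /tmp/mock_checkm.txt | sort -t"$(printf '\t')" -k2,2nr
# mock_bin_3 comes first (99.1), then mock_bin_1 (92.5), then mock_bin_2 (61.0)
```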

### Prepare files for VizBin

List the contents of `mock_bins/`; there are 4 FASTA files. We will need to generate a concatenated set of sequences and a bin-contig map for colours.

!!! terminal "code"

```bash
# VizBin requires a header line in annotation files
echo "label" > all_mock_bins.label.ann

# Generate the concatenated set of sequences and the labels at the same time!
for bin in mock_bins/*.fna; do
binID=$(basename ${bin} .fna)
grep '>' ${bin} | sed "s/.*/${binID}/g" >> all_mock_bins.label.ann
cat ${bin} >> all_mock_bins.fna
done
```
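A quick sanity check that the loop produced one label per sequence, sketched here end to end with throwaway mock bins under `/tmp` (the bin names and sequences are made up):

```bash
# Build tiny throwaway mock bins so the check can be demonstrated
rm -rf /tmp/label_demo && mkdir -p /tmp/label_demo/mock_bins && cd /tmp/label_demo
printf '>c1\nACGT\n>c2\nGGCC\n' > mock_bins/bin_a.fna
printf '>c3\nTTAA\n' > mock_bins/bin_b.fna

# Same loop as above
echo "label" > all_mock_bins.label.ann
for bin in mock_bins/*.fna; do
binID=$(basename ${bin} .fna)
grep '>' ${bin} | sed "s/.*/${binID}/g" >> all_mock_bins.label.ann
cat ${bin} >> all_mock_bins.fna
done

# One label per sequence: these two numbers must match
seqs=$(grep -c '>' all_mock_bins.fna)
labels=$(( $(wc -l < all_mock_bins.label.ann) - 1 ))   # minus the header line
echo "${seqs} sequences, ${labels} labels"
# 3 sequences, 3 labels
```

If the two numbers disagree, VizBin will mis-assign colours, since it matches annotation rows to sequences by order alone.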

### Prepare output directory

We also need to prepare an output directory for the clusters that we export.

!!! terminal "code"

```bash linenums="1"
mkdir -p vb_export
```

### Load the files into VizBin

Return to the VizBin set-up dialogue in the Virtual Desktop. Select the following input files we just made:

**File to visualize:** `all_mock_bins.fna`

**Annotation file:** `all_mock_bins.label.ann`

### Export contig clusters

Make your selection around clusters that you think should form bins. Use the CheckM output from earlier to help inform your decisions.

Once you've made your selection, export the sequences into the `vb_export/` directory we made earlier. Name your new clusters in a way that you can easily recognise the original bins they came from. For example, if most of the contigs were from `mock_bin_3`, perhaps name the new cluster `mock_bin_3.cluster`. If you're splitting a bin into several clusters, we recommend you name them something like `mock_bin_3.cluster_1` and `mock_bin_3.cluster_2`.

!!! tip "Remember to add the `.fna` suffix to your new cluster filenames!"

### Check exported clusters

Moment of truth! How did your decisions impact the genome metrics of each bin? Run your selections through CheckM and see how you did!

!!! terminal "code"

```sh linenums="1"
#!/bin/bash -e
#SBATCH --account nesi02659
#SBATCH --job-name CheckM_vb_exports
#SBATCH --partition milan
#SBATCH --time 00:20:00
#SBATCH --mem 50GB
#SBATCH --cpus-per-task 10
#SBATCH --error %x_%j.err
#SBATCH --output %x_%j.out

# Load modules
module purge
module load CheckM/1.2.3-foss-2023a-Python-3.11.6

# Working directory
cd /nesi/nobackup/nesi02659/MGSS_U/<YOUR FOLDER>/6.bin_refinement/

# Run CheckM
checkm lineage_wf -t $SLURM_CPUS_PER_TASK \
  --pplacer_threads $SLURM_CPUS_PER_TASK \
  -x fna --tab_table -f vb_export.checkm.txt \
  vb_export/ vb_export.checkm_out/
```

### Human refinement vs automated binning

Did you do better than the automated binning software? Consider the following:

* Were the new clusters more complete?
* How much contamination was removed?
* Was there a substantial trade-off between completeness and contamination?
* How would you pre-process the sequences differently prior to manual refinement?
* Is there additional information that would have helped you in the decision-making process?

When your CheckM run finishes, check and compare how you did!
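One way to line the two CheckM runs up side by side, sketched on toy tables (the file names, column order, and values are all illustrative; on real output you would first cut the bin ID, completeness, and contamination columns from each table):

```bash
# Toy per-bin tables: bin ID, completeness, contamination (values made up)
printf 'mock_bin_1\t92.5\t4.1\n'  >  /tmp/before.txt
printf 'mock_bin_2\t61.0\t12.7\n' >> /tmp/before.txt
printf 'mock_bin_1\t93.0\t1.2\n'  >  /tmp/after.txt
printf 'mock_bin_2\t60.5\t3.4\n'  >> /tmp/after.txt

# join pairs rows on the bin ID (both files must be sorted on that column),
# then awk reorders the fields so before/after values sit next to each other
join -t"$(printf '\t')" /tmp/before.txt /tmp/after.txt \
  | awk -F'\t' 'BEGIN { OFS="\t"; print "bin","comp_before","comp_after","cont_before","cont_after" }
                { print $1, $2, $4, $3, $5 }'
```

This makes it easy to spot, for example, a bin whose contamination dropped sharply at only a small cost in completeness.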

<!--
### Food for thought: bin diagnosis and refinement
Did you do better or worse? Which metrics improved and which deteriorated? What were some trade-offs
## 2. Export potentially problematic contigs
### Select problematic contigs to examine
Try this for one or two problematic contigs (or subsets of contigs). In practice
For the subsequent step using `vizbin_count_table_2022.sh`, all exported cluster files must share a common prefix (e.g. `cluster...fna`), and all files of problematic contigs must also share a common prefix (e.g. `contigs...fna`).*
---
-->




<!--
## *(Optional)* Refining and filtering problematic contigs from bins
### Create a count table of counts of our problematic contigs across each bin
A suite of tools for creating input files for `ESOMana` can be found on github [
The tool `ESOMana` can be downloaded from [SourceForge](http://databionic-esom.sourceforge.net/).
---
-->
