Update examples + check if window was already processed. (#36)
* update readmes

* update version number

* remove unused packages

* change default viloca mode

* b2w and inference: check if window was processed already before adding process to process list

* reference to manuscript

* add viloca conda package
LaraFuhrmann authored Jun 11, 2024
1 parent b3498b3 commit bb554a7
Showing 10 changed files with 73 additions and 255 deletions.
25 changes: 19 additions & 6 deletions README.md
@@ -7,12 +7,20 @@ are written in different programming languages and provide error correction,
haplotype reconstruction and estimation of the frequency of the different
genetic variants present in a mixed sample.

The corresponding manuscript can be found here: https://www.biorxiv.org/content/10.1101/2024.06.06.597712v1

---

### Installation
For installation, Miniconda is recommended: https://docs.conda.io/en/latest/miniconda.html.
We recommend installing VILOCA in a clean conda environment:
```
conda create --name env_viloca --channel conda-forge --channel bioconda viloca
conda activate env_viloca
```
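
To check that the installation worked, a quick sanity test is to print the CLI help (a minimal sketch, assuming the usual `--help` flag is exposed):
```
viloca --help
```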

If you want to install the `master` branch, use:
```
conda create --name env_viloca --channel conda-forge --channel bioconda libshorah
conda activate env_viloca
pip install git+https://github.com/cbg-ethz/VILOCA@master
@@ -21,18 +29,19 @@ pip install git+https://github.com/cbg-ethz/VILOCA@master
### Example
To test your installation, run VILOCA on `tests/data_1`:
```
viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode use_quality_scores
viloca run -b test_aln.cram -f test_ref.fasta --mode use_quality_scores
```


Another example can be found in `tests/data_6`:
If the sequencing amplicon strategy is known, we recommend using the amplicon mode of the program, which takes the `<smth>.insert.bed` file as input:
`viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode use_quality_scores`
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores -z scheme.insert.bed`
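
For orientation, the insert BED file is a tab-separated table of amplicon insert coordinates on the reference; a purely hypothetical example is sketched below (the column layout shown here, chromosome, start, end, amplicon name, pool and strand, is an assumption that depends on how the scheme was generated, so check your own scheme file):
```
reference	30	410	amplicon_1	1	+
reference	370	740	amplicon_2	2	+
```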

If no information on the sequencing amplicon strategy is available, run:
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores`

If the sequencing quality scores are not trustworthy, the sequencing error parameters can also be learned:
`viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode learn_error_params`.
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode learn_error_params`.

If there is no information on the sequencing amplicon strategy available, run:
`viloca run -b test_aln.cram -f test_ref.fasta --mode use_quality_scores`

### Parameters
There are several parameters available:
@@ -70,3 +79,7 @@ This is the same setup as used in the CI at [`.github/workflows/test.yaml`](.git
```bash
poetry run python3 -m cProfile -m shorah shotgun ...
```

### Applications

You can find several applications of VILOCA at https://github.com/cbg-ethz/viloca_applications.
2 changes: 1 addition & 1 deletion environment.yml
@@ -6,4 +6,4 @@ dependencies:
- htslib >=1.9
- boost-cpp >=1.56
- pip:
- https://github.com/spaceben/shorah/releases/download/canary-eeec049/ShoRAH-0.1.0.tar.gz
- https://github.com/cbg-ethz/VILOCA/archive/refs/tags/viloca-v1.0.0.tar.gz
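
If you want to recreate this environment directly from the repository, one possible way is shown below (a sketch, assuming conda is installed and the command runs from the repository root; the environment name is arbitrary):
```
conda env create -n viloca_dev -f environment.yml
conda activate viloca_dev
```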
8 changes: 3 additions & 5 deletions pyproject.toml
@@ -1,9 +1,9 @@
[tool.poetry]
name = "VILOCA"
version = "0.1.0"
description = "SHOrt Reads Assembly into Haplotypes"
version = "1.0.0"
description = "VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data"
license = "GPL-3.0-only"
authors = ["Benjamin Langer <[email protected]>, Lara Fuhrmann <[email protected]>"]
authors = ["Ivan Topolsky", "Benjamin Langer <[email protected]>, Lara Fuhrmann <[email protected]>"]
build = "build.py"
packages = [
{ include = "viloca" }
@@ -18,9 +18,7 @@ biopython = "^1.79"
numpy = "^1.21.4"
pysam = "^0.18.0"
pybind11 = "^2.9.0"
PyYAML = "^6.0"
scipy = "^1.7.3"
bio = "^1.3.3"
pandas = "^1.3.5"

[tool.poetry.dev-dependencies]
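
For a development install driven by this `pyproject.toml`, one option is Poetry, sketched below (assumes Poetry is installed; the C++ build step configured via `build.py` may additionally need system libraries such as htslib and boost, cf. `environment.yml` above):
```
poetry install             # resolve and install the dependencies declared above
poetry run viloca --help   # invoke the CLI from the Poetry-managed environment
```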
12 changes: 6 additions & 6 deletions tests/data_1/README.md
@@ -1,10 +1,10 @@
### Sample files to test `shorah shotgun`
### Sample files to test `VILOCA`

Use files in this directory to test shorah in shotgun mode.
The reads data comes from the
Use files in this directory to test VILOCA.
The reads data comes from the
[test-data](https://github.com/cbg-ethz/V-pipe/tree/master/testdata/2VM-sim/20170904/raw_data)
of [V-pipe](https://cbg-ethz.github.io/V-pipe/)
and has been processed with the pipeline using the `bwa`
and has been processed with the pipeline using the `bwa`
[option](https://github.com/cbg-ethz/V-pipe/wiki/options#aligner):

```ini
@@ -16,8 +16,8 @@ The sorted bam file has been further compressed with samtools for space saving:

[user@host shotgun_test]$ samtools view -T test_ref.fasta -C -O cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000 -o test_aln.cram V-pipe/work/samples/2VM-sim/20170904/alignments/REF_aln.bam

You can then run `shorah shotgun` as follows
You can then run `viloca` as follows

[user@host shotgun_test]$ shorah shotgun -b test_aln.cram -f test_ref.fasta
[user@host shotgun_test]$ viloca run -b test_aln.cram -f test_ref.fasta

The output files will be `snv/SNVs_0.010000_final.vcf` and `snv/SNVs_0.010000_final.csv`.
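
Both outputs are plain text, so a quick check with standard command-line tools is enough, for example:
```
head -n 20 snv/SNVs_0.010000_final.vcf
head snv/SNVs_0.010000_final.csv
```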
2 changes: 1 addition & 1 deletion tests/data_1/shotgun_test.sh
@@ -1,4 +1,4 @@
#!/bin/bash

viloca run -a 0.1 -w 201 -x 100000 -p 0.9 -c 0 \
viloca run -a 0.1 -w 201 --mode shorah -x 100000 -p 0.9 -c 0 \
-r HXB2:2469-3713 -R 42 -f test_ref.fasta -b test_aln.cram --out_format csv "$@"
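
The trailing `"$@"` forwards any extra command-line arguments to `viloca run`, so the script can be invoked as-is or with additional flags, for example (assuming it is run from `tests/data_1`, where the input files live):
```
bash shotgun_test.sh
```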
4 changes: 2 additions & 2 deletions tests/data_5/README.md
@@ -1,9 +1,9 @@
### Test to check fil.cpp implementation accounting for long deletions

Test files `SNV.txt` and `SNVs_0.010000.txt` are obtained by running `shorah shutgun`, e.g:
Test files `SNV.txt` and `SNVs_0.010000.txt` are obtained by running `viloca run`, e.g.:

```
shorah shotgun -a 0.1 -w 42 -x 100000 -p 0.9 -c 0 -r REF:42-272 -R 42 -b test_aln.cram -f ref.fasta
viloca run -a 0.1 -w 42 -x 100000 -p 0.9 -c 0 -r REF:42-272 -R 42 -b test_aln.cram -f ref.fasta
```

The test script `test_long_deletions.py` uses `pysam` and `NumPy`, which can be installed using pip or conda.
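
For example, via pip (conda works just as well):
```
pip install pysam numpy
```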
23 changes: 7 additions & 16 deletions tests/data_6/README.md
@@ -1,27 +1,18 @@
### Sample files to test `shorah shotgun`
### Sample files to test `VILOCA`

Use files in this directory to test shorah in shotgun mode. The reads data have been generated with V-pipe's benchmarking framework (simulated with parameters: ```seq_tech~illumina__seq_mode~shotgun__seq_mode_param~nan__read_length~90__genome_size~90__coverage~100__haplos~5@5@10@5@10@[email protected]```)

The reads are from a single amplicon of length 90, meaning the reference is 90 bp long and each read is 90 bp long.

To run ShoRAH's original Gibbs sampler use the following command:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler shorah
```
or
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -z scheme.insert.bed --sampler shorah
```

To use the new inference method that exploits the sequencing quality scores, run:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler use_quality_scores --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores
```
To use the new inference method learning the sequencing error parameter:
To use the model that learns the sequencing error parameters:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler -learn_error_params --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode learn_error_params
```

In the new inference method reads are filtered (and weighted respectively) such that only a set of unique reads are processed. This mode can be switch off by setting the parameter `--non-unique_modus`, e.g.:
To run VILOCA with the insert file, run:
```
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores -z scheme.insert.bed
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler -learn_error_params --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001 --non-unique_modus
