Update examples + check if window was already processed. (#36)
* update readmes

* update version number

* remove unused packages

* change default viloca mode

* b2w and inference: check if window was processed already before adding process to process list

* reference to manuscript

* add viloca conda package
LaraFuhrmann authored Jun 11, 2024
1 parent b3498b3 commit bb554a7
Showing 10 changed files with 73 additions and 255 deletions.
25 changes: 19 additions & 6 deletions README.md
@@ -7,12 +7,20 @@ are written in different programming languages and provide error correction,
haplotype reconstruction and estimation of the frequency of the different
genetic variants present in a mixed sample.

The corresponding manuscript can be found here: https://www.biorxiv.org/content/10.1101/2024.06.06.597712v1

---

### Installation
For installation, Miniconda is recommended: https://docs.conda.io/en/latest/miniconda.html.
We recommend installing VILOCA in a clean conda environment:
```
conda create --name env_viloca --channel conda-forge --channel bioconda viloca
conda activate env_viloca
```
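
To check that the installation worked, a quick sanity test is to print the CLI help (a minimal sketch, assuming the usual `--help` flag is exposed):
```
viloca --help
```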

If you want to install the `master` branch, use:
```
conda create --name env_viloca --channel conda-forge --channel bioconda libshorah
conda activate env_viloca
pip install git+https://github.com/cbg-ethz/VILOCA@master
@@ -21,18 +29,19 @@ pip install git+https://github.com/cbg-ethz/VILOCA@master
### Example
To test your installation, run VILOCA on `tests/data_1`:
```
viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode use_quality_scores
viloca run -b test_aln.cram -f test_ref.fasta --mode use_quality_scores
```


Another example can be found in `tests/data_6`:
If the sequencing amplicon strategy is known, we recommend using the amplicon mode of the program, which takes the `<smth>.insert.bed` file as input:
`viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode use_quality_scores`
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores -z scheme.insert.bed`
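
For orientation, the insert BED file is a tab-separated table of amplicon insert coordinates on the reference; a purely hypothetical example is sketched below (the column layout shown here, chromosome, start, end, amplicon name, pool and strand, is an assumption that depends on how the scheme was generated, so check your own scheme file):
```
reference	30	410	amplicon_1	1	+
reference	370	740	amplicon_2	2	+
```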

If no information on the sequencing amplicon strategy is available, run:
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores`

If the sequencing quality scores are not trustworthy, the sequencing error parameters can also be learned:
`viloca run -b test_aln.cram -f test_ref.fasta -z scheme.insert.bed --mode learn_error_params`.
`viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode learn_error_params`.

If there is no information on the sequencing amplicon strategy available, run:
`viloca run -b test_aln.cram -f test_ref.fasta --mode use_quality_scores`

### Parameters
There are several parameters available:
@@ -70,3 +79,7 @@ This is the same setup as used in the CI at [`.github/workflows/test.yaml`](.git
```bash
poetry run python3 -m cProfile -m shorah shotgun ...
```

### Applications

You can find several applications of VILOCA at https://github.com/cbg-ethz/viloca_applications.
2 changes: 1 addition & 1 deletion environment.yml
@@ -6,4 +6,4 @@ dependencies:
- htslib >=1.9
- boost-cpp >=1.56
- pip:
- https://github.com/spaceben/shorah/releases/download/canary-eeec049/ShoRAH-0.1.0.tar.gz
- https://github.com/cbg-ethz/VILOCA/archive/refs/tags/viloca-v1.0.0.tar.gz
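
If you want to recreate this environment directly from the repository, one possible way is shown below (a sketch, assuming conda is installed and the command runs from the repository root; the environment name is arbitrary):
```
conda env create -n viloca_dev -f environment.yml
conda activate viloca_dev
```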
8 changes: 3 additions & 5 deletions pyproject.toml
@@ -1,9 +1,9 @@
[tool.poetry]
name = "VILOCA"
version = "0.1.0"
description = "SHOrt Reads Assembly into Haplotypes"
version = "1.0.0"
description = "VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data"
license = "GPL-3.0-only"
authors = ["Benjamin Langer <[email protected]>, Lara Fuhrmann <[email protected]>"]
authors = ["Ivan Topolsky", "Benjamin Langer <[email protected]>, Lara Fuhrmann <[email protected]>"]
build = "build.py"
packages = [
{ include = "viloca" }
@@ -18,9 +18,7 @@ biopython = "^1.79"
numpy = "^1.21.4"
pysam = "^0.18.0"
pybind11 = "^2.9.0"
PyYAML = "^6.0"
scipy = "^1.7.3"
bio = "^1.3.3"
pandas = "^1.3.5"

[tool.poetry.dev-dependencies]
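
For a development install driven by this `pyproject.toml`, one option is Poetry, sketched below (assumes Poetry is installed; the C++ build step configured via `build.py` may additionally need system libraries such as htslib and boost, cf. `environment.yml` above):
```
poetry install             # resolve and install the dependencies declared above
poetry run viloca --help   # invoke the CLI from the Poetry-managed environment
```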
12 changes: 6 additions & 6 deletions tests/data_1/README.md
@@ -1,10 +1,10 @@
### Sample files to test `shorah shotgun`
### Sample files to test `VILOCA`

Use files in this directory to test shorah in shotgun mode.
The reads data comes from the
Use files in this directory to test VILOCA.
The reads data comes from the
[test-data](https://github.com/cbg-ethz/V-pipe/tree/master/testdata/2VM-sim/20170904/raw_data)
of [V-pipe](https://cbg-ethz.github.io/V-pipe/)
and has been processed with the pipeline using the `bwa`
and has been processed with the pipeline using the `bwa`
[option](https://github.com/cbg-ethz/V-pipe/wiki/options#aligner):

```ini
@@ -16,8 +16,8 @@ The sorted bam file has been further compressed with samtools for space saving:

[user@host shotgun_test]$ samtools view -T test_ref.fasta -C -O cram,embed_ref,use_bzip2,use_lzma,level=9,seqs_per_slice=1000000 -o test_aln.cram V-pipe/work/samples/2VM-sim/20170904/alignments/REF_aln.bam

You can then run `shorah shotgun` as follows
You can then run `viloca` as follows

[user@host shotgun_test]$ shorah shotgun -b test_aln.cram -f test_ref.fasta
[user@host shotgun_test]$ viloca run -b test_aln.cram -f test_ref.fasta

The output files will be `snv/SNVs_0.010000_final.vcf` and `snv/SNVs_0.010000_final.csv`.
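
Both outputs are plain text, so a quick check with standard command-line tools is enough, for example:
```
head -n 20 snv/SNVs_0.010000_final.vcf
head snv/SNVs_0.010000_final.csv
```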
2 changes: 1 addition & 1 deletion tests/data_1/shotgun_test.sh
@@ -1,4 +1,4 @@
#!/bin/bash

viloca run -a 0.1 -w 201 -x 100000 -p 0.9 -c 0 \
viloca run -a 0.1 -w 201 --mode shorah -x 100000 -p 0.9 -c 0 \
-r HXB2:2469-3713 -R 42 -f test_ref.fasta -b test_aln.cram --out_format csv "$@"
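
The trailing `"$@"` forwards any extra command-line arguments to `viloca run`, so the script can be invoked as-is or with additional flags, for example (assuming it is run from `tests/data_1`, where the input files live):
```
bash shotgun_test.sh
```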
4 changes: 2 additions & 2 deletions tests/data_5/README.md
@@ -1,9 +1,9 @@
### Test to check fil.cpp implementation accounting for long deletions

Test files `SNV.txt` and `SNVs_0.010000.txt` are obtained by running `shorah shutgun`, e.g:
Test files `SNV.txt` and `SNVs_0.010000.txt` are obtained by running `viloca run`, e.g.:

```
shorah shotgun -a 0.1 -w 42 -x 100000 -p 0.9 -c 0 -r REF:42-272 -R 42 -b test_aln.cram -f ref.fasta
viloca run -a 0.1 -w 42 -x 100000 -p 0.9 -c 0 -r REF:42-272 -R 42 -b test_aln.cram -f ref.fasta
```

The test script `test_long_deletions.py` uses `pysam` and `NumPy`, which can be installed using pip or conda.
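
For example, via pip (conda works just as well):
```
pip install pysam numpy
```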
23 changes: 7 additions & 16 deletions tests/data_6/README.md
@@ -1,27 +1,18 @@
### Sample files to test `shorah shotgun`
### Sample files to test `VILOCA`

Use files in this directory to test shorah in shotgun mode. The reads data have been generated with V-pipe's benchmarking framework (simulated with parameters: ```seq_tech~illumina__seq_mode~shotgun__seq_mode_param~nan__read_length~90__genome_size~90__coverage~100__haplos~5@5@10@5@10@[email protected]```)

The reads are from a single amplicon of length 90, meaning the reference is 90 bp long and each read is 90 bp long.

To run ShoRAH's original Gibbs sampler use the following command:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler shorah
```
or
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -z scheme.insert.bed --sampler shorah
```

To use the new inference method that exploits the sequencing quality scores, run:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler use_quality_scores --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores
```
To use the new inference method learning the sequencing error parameter:
To use the model that learns the sequencing error parameters:
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler -learn_error_params --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode learn_error_params
```

In the new inference method reads are filtered (and weighted respectively) such that only a set of unique reads are processed. This mode can be switch off by setting the parameter `--non-unique_modus`, e.g.:
To run VILOCA with the insert file, run:
```
viloca run -f reference.fasta -b reads.shotgun.bam -w 90 --mode use_quality_scores -z scheme.insert.bed
```
poetry run shorah shotgun -f reference.fasta -b reads.shotgun.bam -w 90 --sampler -learn_error_params --alpha 0.0001 --n_max_haplotypes 100 --n_mfa_starts 1 --conv_thres 0.0001 --non-unique_modus
