QuantTB is a SNP-based method to identify and quantify individual strains present in tuberculosis whole-genome sequencing datasets.
These instructions guide you through installing and using QuantTB on your own local system. QuantTB has been tested on Mac OS X and Ubuntu.
Python 2.x with its development and setup packages must be installed on your system to run QuantTB (https://www.python.org/downloads/). Alternatively, Python can be installed using a package manager such as Miniconda (https://conda.io/miniconda.html).
sudo apt install python
sudo apt install python-setuptools
sudo apt install python-dev
A recent version of Java is also required:
sudo add-apt-repository ppa:linuxuprising/java
sudo apt update
sudo apt install oracle-java11-installer
Some QuantTB functionality requires additional software to be installed on your system.
- To make a reference SNP database from assembled genomes (.fna/.fa files), MUMmer v3 is required: http://mummer.sourceforge.net/
- To generate a SNP profile for a fastq readset (.fq/.fastq files), samtools (v1.7 or higher) and bwa (v0.7.17 or higher) need to be installed and on your PATH.
- samtools download: https://sourceforge.net/projects/samtools/files/samtools/1.7/
sudo apt install samtools
sudo apt install bwa
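Before moving on, it can help to confirm that the optional tools are discoverable on your PATH. The snippet below is a minimal sketch and not part of QuantTB: the helper name `missing_tools` is hypothetical, the executable names (`nucmer` for MUMmer, `samtools`, `bwa`, `java`) are assumptions about what your installation provides, and it is written as a standalone Python 3 script because it relies on `shutil.which`.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them to match your installation.
    wanted = ["java", "samtools", "bwa", "nucmer"]
    absent = missing_tools(wanted)
    if absent:
        print("Not found on PATH: " + ", ".join(absent))
    else:
        print("All optional tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default is sufficient.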
Download the latest release of QuantTB from https://github.com/AbeelLab/quanttb/releases and install it on your computing environment.
tar -zxvf quanttb-1.01.tar.gz
cd quanttb-1.01
sudo python setup.py install
QuantTB should now be installed. If things do not appear to be working, there is a log file present in the temp directory of the output folder which may help you diagnose problems.
QuantTB can be used to classify strains, make SNP databases, and obtain SNP profiles from fastq readsets.
To classify the strains in a sample against a reference SNP database, the 'quant' command is used. QuantTB accepts a list of fastq files (-f argument), or VCF files (as produced by Pilon) or .samp files (-v argument), as input. A reference SNP database (.db) can be supplied with the -db flag. QuantTB comes prepackaged with a database of 2166 TB genomes that differ by at least 100 SNPs; this is used as the default if no reference SNP database is supplied. QuantTB classifies strains using an iterative approach. The maximum number of iterations is 8 by default, but this can be changed with the '-i' flag.
# Classify a sample from the example data with the default database and save results to output1/myresults.txt
quanttb quant -f exdata/readset1.fq exdata/readset2.fq -o output1/myresults.txt
A result file listing the references observed in the sample is written to the specified location (default: output/results.txt). For a sample containing two strains, the output looks like the table below. Every row denotes the presence of a specific reference SNP profile in the corresponding sample. The relative abundances are given in the 'relabundance' column.
sample | refname | totscore | relabundance | depth |
---|---|---|---|---|
readset1 | UT0106 | 0.577 | 0.699 | 5.472 |
readset1 | I0003367-5 | 0.437 | 0.301 | 2.358 |
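To work with these results programmatically, the rows can be grouped per sample. The following is a minimal sketch, not part of QuantTB: it assumes the result file is tab-delimited with the column names shown in the table above (the delimiter is an assumption, so adjust `delimiter` if your file differs), the function name `read_results` is hypothetical, and it is written for Python 3. The embedded example data simply repeats the two rows from the table.

```python
import csv
import io
from collections import defaultdict


def read_results(handle, delimiter="\t"):
    """Group result rows by sample: {sample: [(refname, relabundance), ...]}."""
    strains = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        strains[row["sample"]].append((row["refname"], float(row["relabundance"])))
    return dict(strains)


# The two rows from the table above, written as tab-delimited text (assumed format).
example = io.StringIO(
    "sample\trefname\ttotscore\trelabundance\tdepth\n"
    "readset1\tUT0106\t0.577\t0.699\t5.472\n"
    "readset1\tI0003367-5\t0.437\t0.301\t2.358\n"
)
results = read_results(example)
for sample, found in results.items():
    # The dominant strain is the reference with the highest relative abundance.
    dominant = max(found, key=lambda item: item[1])
    print(sample, "strains:", len(found), "dominant:", dominant[0])
```

For a real run, replace the `io.StringIO` object with `open("output/results.txt")` or whichever path was passed to '-o'.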
QuantTB can optionally report antibiotic-resistance variants in the sample using a list of predetermined resistance mutations. Mutations at positions associated with antibiotic resistance are output when the '-abres' flag is given.
# Classify a sample with the default database, and output antibiotic resistance results
quanttb quant -f exdata/readset1.fq exdata/readset2.fq -o output2/myresults.txt -abres
Antibiotic resistance results for all samples are output in a separate file, 'antibioticresistances.txt'.
QuantTB can also work directly from pre-computed VCF files; one example is included:
gunzip exdata/sample1snps.vcf.gz
quanttb quant -v exdata/sample1snps.vcf -o output3/myresults.txt
Several VCF files can be classified in a single run, optionally against a user-defined database (see below).
quanttb quant -v sample1snps.vcf sample2snps.vcf sample3snps.vcf -db newdb.db -o someoutput/myresults.txt
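When 'quant' is run from a script, it can be convenient to assemble the argument list in one place. The sketch below is illustrative only: the helper `build_quant_command` is hypothetical and not part of QuantTB, and it uses only the flags described in this section (-f, -v, -o, -db, -i, -abres). It builds the command without running it.

```python
def build_quant_command(output, fastqs=(), vcfs=(), db=None,
                        iterations=None, abres=False):
    """Build the argument list for a `quanttb quant` call (does not run it)."""
    if not fastqs and not vcfs:
        raise ValueError("supply at least one fastq or VCF/.samp input")
    command = ["quanttb", "quant"]
    if fastqs:
        command += ["-f"] + list(fastqs)
    if vcfs:
        command += ["-v"] + list(vcfs)
    command += ["-o", output]
    if db is not None:
        # Without -db, QuantTB falls back to its prepackaged database.
        command += ["-db", db]
    if iterations is not None:
        command += ["-i", str(iterations)]
    if abres:
        command.append("-abres")
    return command


if __name__ == "__main__":
    print(" ".join(build_quant_command(
        "someoutput/myresults.txt",
        vcfs=["sample1snps.vcf", "sample2snps.vcf", "sample3snps.vcf"],
        db="newdb.db",
    )))
```

The returned list can be passed to `subprocess.check_call` once QuantTB is installed; keeping the arguments as a list avoids shell quoting problems with unusual file names.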
A reference SNP database is required to classify samples. QuantTB comes prepackaged with its own default database, but users can also build their own. A SNP database can be made from several sources: a list of VCF files (.vcf), a list of assembled genomes (.fna/.fa/.fasta), a list of SNP files output by MUMmer (.snps), or a list of SNP profiles (.samp) generated by QuantTB. The command to make a SNP database is 'makesnpdb'. The paths to the reference files are supplied with the '-g' argument.
# From VCF files
quanttb makesnpdb -g genome1.vcf genome2.vcf genome3.vcf ....
# From fna files. MUMmer must be installed on system
quanttb makesnpdb -g genome1.fna genome2.fna genome3.fna ....
Optionally, during construction of the SNP database, QuantTB can filter SNPs and genomes to increase classification accuracy. Within each genome, SNPs that lie within a given distance of each other can be removed using the 'reddist' argument. The default distance is 25.
# Make a database, removing SNPs in each genome that are within a distance of 50 of each other
quanttb makesnpdb -reddist 50 -g genome1.fna genome2.fna genome3.fna ....
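To make the effect of distance filtering concrete, here is a minimal sketch of one possible interpretation. It is not QuantTB's implementation: the function `filter_close_snps` is hypothetical, and it assumes that every SNP with a neighbour closer than the minimum distance is dropped (rather than keeping one SNP of each close pair) and that the comparison is strict. QuantTB may differ on both points.

```python
def filter_close_snps(positions, min_dist=25):
    """Keep only SNP positions with no neighbour closer than `min_dist`.

    Assumption: both members of a close pair are removed.
    """
    ordered = sorted(set(positions))
    kept = []
    for i, pos in enumerate(ordered):
        too_close_left = i > 0 and pos - ordered[i - 1] < min_dist
        too_close_right = i + 1 < len(ordered) and ordered[i + 1] - pos < min_dist
        if not (too_close_left or too_close_right):
            kept.append(pos)
    return kept


if __name__ == "__main__":
    snps = [100, 110, 200, 500, 520]
    # 100/110 and 500/520 are closer than 25 apart, so only 200 survives.
    print(filter_close_snps(snps, min_dist=25))
```

A larger distance removes more SNPs from dense regions, which trades the number of informative positions against the risk of including clustered, less reliable calls.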
A SNP database with the '.db' suffix is output, which can be used with the 'quant' command to classify strains within a sample.
To generate VCF files from fastq readsets with the 'variants' command, suitable versions of bwa and samtools need to be installed and on your PATH.
wget https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 -O - | tar xj ; ( cd samtools-1.7 ; make )
export PATH=/full/path/to/samtools-1.7:${PATH}
wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2 -O - | tar xj ; ( cd bwa-0.7.17 ; make )
export PATH=/full/path/to/bwa-0.7.17:${PATH}
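After building the tools, their reported versions can be compared with the minimums given earlier (samtools 1.7, bwa 0.7.17). The following is a minimal sketch with hypothetical helper names. The example strings reflect the usual form of the version lines these tools print, which is an assumption; capture the real text from your own installation.

```python
import re


def version_tuple(text):
    """Extract the first dotted version number in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """True if the version in `found` is at least the version in `minimum`."""
    return version_tuple(found) >= version_tuple(minimum)


if __name__ == "__main__":
    # Assumed example version lines, not captured from a real run.
    print(meets_minimum("samtools 1.7", "1.7"))
    print(meets_minimum("Version: 0.7.12-r1039", "0.7.17"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "1.10" would sort before "1.7".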
Fastq readsets need to be converted to a VCF file before they can be classified or used as a reference genome in the database. This can optionally be done with the QuantTB 'variants' command, which accepts paired-end or single-end fastq files as input. Variants are called against the H37Rv genome (GenBank: CP003248.2). For multiple samples, the '-f' argument can be given repeatedly.
# For one paired end readset
quanttb variants -f sample_1.fq sample_2.fq
# For two paired end readsets
quanttb variants -f sample_1.fq sample_2.fq -f sample2_1.fq sample2_2.fq
This outputs a VCF file that can be used as input to the makesnpdb or quant command.
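When many readsets are processed at once, the repeated '-f' arguments can be generated from a list of file pairs. This is a minimal sketch under the assumption that each readset is a tuple of one (single-end) or two (paired-end) fastq paths; the helper `build_variants_command` is hypothetical and only builds the argument list without running QuantTB.

```python
def build_variants_command(readsets, output=None):
    """Build the argument list for `quanttb variants`, one -f per readset."""
    command = ["quanttb", "variants"]
    for readset in readsets:
        if not 1 <= len(readset) <= 2:
            raise ValueError("a readset needs one or two fastq files")
        command.append("-f")
        command.extend(readset)
    if output is not None:
        command += ["-o", output]
    return command


if __name__ == "__main__":
    pairs = [("sample_1.fq", "sample_2.fq"), ("sample2_1.fq", "sample2_2.fq")]
    print(" ".join(build_variants_command(pairs)))
```

The printed command matches the two-readset example above and can be passed to `subprocess.check_call` as a list once QuantTB is installed.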
Usage: quantTB <command> [options]
Command: makesnpdb Make a reference SNP database
quant Quantify sample with a ref SNP db
variants Generate a vcf from sequencing readsets
Usage: quanttb quant [options] <-db> <-s>
Optional arguments:
-v [VCFSAMPLES ...]
VCF(s) or snp profiles that you want tested against
the refdb, can either be .vcf(.gz) or .samp
-f [FASTQ ...] Fastq Readset(s) that you want tested against refdb,
can specify multiple times for multiple pairs, (.fq,
.fastq)
-db DB Location of reference SNP DB file. (.db) If not
supplied a default TB db will be used
-o OUTPUT Directory/File where you want results written to
-resout Should stats from each run be output?
-i Number of iterations for classification
-abres Should resistances from each sample be output?
-k Keep temp files?
Usage: quanttb makesnpdb [options] <-g>
Required arguments:
-g [DBFILES ...]
Files you want to use to make the reference database,
Can be either .fa, .fna, .fasta, .snps, .vcf(.gz), or
.samp
Optional arguments:
-reducedist When making database, what is the minimum distance
between SNPs in a genome
-o OUTPUT Directory/File where you want the snpdb to be written
to
-k Keep temp files?
Usage: quanttb variants [options] <-f>
Required arguments:
-f FASTQ [FASTQ ...] Fastq Readset(s) that you want converted to a vcf
(using samtools,bwa and pilon), can specify multiple
times for multiple pairs, (.fq, .fastq)
Optional arguments:
-o OUTPUT Directory you want vcf written to
-k Keep temp files?
This project is licensed under the GNU General Public License; see the LICENSE file for details.
- Thomas Abeel : [email protected]