Structural Variant Machine (SV-M) to accurately predict InDels from NGS paired-end short reads as described in:
D. Grimm, J. Hagmann, D. Koenig, D. Weigel and K. Borgwardt (2013). Accurate indel prediction using paired-end short reads. BMC Genomics 14:132.
- Installation
- Usage
- Input format
- Output files and format
- Training Data
- Author and license information
To install the tool, you have to compile the source code. Type into your Linux/Mac terminal:
make all
The source code is compiled, generating two directories (build, bin). The bin directory contains the compiled tool sv-m.
To re-compile:
make clean
make all
To predict if an indel is a true or false candidate use the -predict command:
./sv-m -predict <model_file> <normalization_parameter_file> <data_file> <output_filename>
where:
<model_file>: trained SVM model file
<normalization_parameter_file>: the corresponding normalization parameter file for the trained SVM model
<data_file>: input data file with all features
<output_filename>: filename for the output file
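The prediction call can also be scripted. The following is a minimal Python sketch and not part of SV-M itself: the helper names `build_predict_command` and `run_predict` and the example file names are hypothetical, and the binary is assumed to be at ./sv-m. It only places the four positional arguments in the order shown above.

```python
import subprocess
from typing import List


def build_predict_command(model_file: str, norm_file: str, data_file: str,
                          output_file: str, binary: str = "./sv-m") -> List[str]:
    """Assemble the argument list for `sv-m -predict` in the documented order."""
    return [binary, "-predict", model_file, norm_file, data_file, output_file]


def run_predict(model_file: str, norm_file: str, data_file: str,
                output_file: str, binary: str = "./sv-m") -> None:
    """Run the prediction; subprocess raises CalledProcessError on a non-zero exit status."""
    cmd = build_predict_command(model_file, norm_file, data_file, output_file, binary)
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_predict_command(
        "model.svm", "model_normalization.param", "candidates.tab", "predictions.tab")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.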
To train a new SVM model on a set of features use the -train command:
./sv-m -train <data_filename> <output_directory>
where:
<data_filename>: input data file
<output_directory>: name of an existing empty output directory
Optional arguments:
-n: k-fold (default = 10)
-experiments: number of experiments/repeats (default = 1)
(In general several experiments are performed)
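Because training expects an existing, empty output directory, it can help to check that before starting a run. The sketch below is a hypothetical Python helper, not part of SV-M: the function names are invented for illustration and the binary is assumed to be at ./sv-m. The optional -n and -experiments flags are left out because their exact value syntax is not spelled out here.

```python
import os
import tempfile
from typing import List


def check_output_directory(path: str) -> None:
    """Raise ValueError unless `path` is an existing, empty directory."""
    if not os.path.isdir(path):
        raise ValueError(f"output directory does not exist: {path}")
    if os.listdir(path):
        raise ValueError(f"output directory is not empty: {path}")


def build_train_command(data_filename: str, output_directory: str,
                        binary: str = "./sv-m") -> List[str]:
    """Validate the output directory, then assemble the `sv-m -train` argument list."""
    check_output_directory(output_directory)
    return [binary, "-train", data_filename, output_directory]


if __name__ == "__main__":
    # A fresh temporary directory exists and is empty, so the check passes.
    out_dir = tempfile.mkdtemp()
    # "training.tab" is a hypothetical file name.
    print(" ".join(build_train_command("training.tab", out_dir)))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to start the training.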
The <model_file> and <normalization_parameter_file> can be found in the Model folder in the root directory. For a new or different set of features, these files have to be generated by performing a new training.
<data_file> format for -predict (tab separated):
<chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
<data_filename> format for -train (tab separated):
<class label: 1 for positive, -1 for negative> <chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
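The two layouts above differ only in the leading class label column. As a rough illustration, the following Python sketch formats one row for either layout; `format_row` is a hypothetical helper, and the chromosome name, positions and feature values are made up.

```python
from typing import Optional, Sequence, Union

Number = Union[int, float]


def format_row(chromosome: str, start: int, end: int,
               features: Sequence[Number], label: Optional[int] = None) -> str:
    """Return one tab-separated row.

    Without `label` the row has the layout without a class label;
    with `label` (1 or -1) the class label is placed in the first column.
    """
    if label is not None and label not in (1, -1):
        raise ValueError("class label must be 1 (positive) or -1 (negative)")
    fields = [] if label is None else [str(label)]
    fields += [chromosome, str(start), str(end)]
    fields += [str(value) for value in features]
    return "\t".join(fields)


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    print(format_row("Chr1", 100, 105, [0.5, 12]))            # row without class label
    print(format_row("Chr1", 100, 105, [0.5, 12], label=-1))  # row with class label
```

Every row in a file should carry the same number of features, in the same order as the features the model was trained on.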
<output_filename> format for -predict (tab separated):
<class label, 1 positive, -1 negative class> <probability for positive class (negative class: 1-probability of positive class)> <chromosome> <start position> <end position> <feature 1> <feature 2> ... <feature n>
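A prediction output line in this layout can be read back with a few lines of Python. This is a minimal sketch under stated assumptions: the `Prediction` class and `parse_prediction_line` are hypothetical names, the file is assumed to have no header line, and feature values are kept as strings because their types are not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prediction:
    label: int            # 1 = positive class, -1 = negative class
    probability: float    # probability for the positive class
    chromosome: str
    start: int
    end: int
    features: List[str]   # kept as raw strings

    @property
    def negative_probability(self) -> float:
        """Probability for the negative class: 1 - probability of the positive class."""
        return 1.0 - self.probability


def parse_prediction_line(line: str) -> Prediction:
    """Parse one tab-separated output line into a Prediction."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    return Prediction(
        label=int(float(fields[0])),  # accepts "1" as well as "1.0"
        probability=float(fields[1]),
        chromosome=fields[2],
        start=int(fields[3]),
        end=int(fields[4]),
        features=fields[5:],
    )


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    p = parse_prediction_line("1\t0.93\tChr1\t100\t105\t0.5\t12\n")
    print(p.chromosome, p.start, p.end, p.label, p.probability)
```

Filtering for confident calls is then a simple comparison on `probability`, with the threshold left to the user.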
After training with -train, the output directory contains the following output files:
model.svm: The trained model file
model_normalization.param: The corresponding normalization parameters for that model
results.txt: A summary of the performance of the model and the corresponding weights
experiments.tab: A tab-separated file containing the C-value, AUC and BEP value for each experiment
<C-Value> <AUC> <BEP>
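The per-experiment values in experiments.tab can be summarized with a short script. The Python sketch below is an illustration only: `read_experiments` and `best_by_auc` are hypothetical helpers, and since it is not stated whether the file has a header line, rows whose first three columns are not numeric are simply skipped.

```python
from typing import Iterable, List, Tuple

Row = Tuple[float, float, float]  # (C-value, AUC, BEP)


def read_experiments(lines: Iterable[str]) -> List[Row]:
    """Parse tab-separated <C-Value> <AUC> <BEP> rows, skipping blank or non-numeric lines."""
    rows: List[Row] = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # blank or incomplete line
        try:
            c_value, auc, bep = (float(x) for x in fields[:3])
        except ValueError:
            continue  # e.g. a possible header line
        rows.append((c_value, auc, bep))
    return rows


def best_by_auc(rows: List[Row]) -> Row:
    """Return the experiment with the highest AUC."""
    return max(rows, key=lambda row: row[1])


if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = ["1.0\t0.95\t0.90\n", "10\t0.97\t0.92\n"]
    rows = read_experiments(example)
    print("mean AUC:", sum(r[1] for r in rows) / len(rows))
    print("best experiment:", best_by_auc(rows))
```

To use it on a real run, pass an open file handle, for example `read_experiments(open("experiments.tab"))`, with the path adjusted to the chosen output directory.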
The folder trainingdata contains the Sanger-validated training data. For more detailed information and the file format, see the README file within the trainingdata folder.
Version: 0.1
Author: Dominik Gerhard Grimm
Mail: [email protected]
Date: 7 December 2011
Group: Machine Learning and Computational Biology Group (http://webdav.tuebingen.mpg.de/u/karsten/group/)
Institutes: Max Planck Institute for Developmental Biology and Max Planck Institute for Intelligent Systems (Tübingen, Germany)
This tool makes use of libSVM 3.0 (www.csie.ntu.edu.tw/~cjlin/libsvm/)