GenotypeTensors

This project provides tools to load the vectorized genotype information files (.vec/.vecp) produced with goby3 and variationanalysis. It also demonstrates how to train deep-learning models on the information in these files with PyTorch.

Installation

GenotypeTensors has been upgraded to pytorch 0.4.0.

On Windows:

conda create --name pytorch4
miniconda/Scripts/activate.bat pytorch4
conda install pytorch -c pytorch

Activate the environment before installing so that PyTorch is installed into pytorch4. Use the pip.exe from miniconda for the pip step below.

On Mac:

conda install pytorch torchvision -c pytorch

Common to all platforms:

pip install -r requirements.txt
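
After installation, a quick check that PyTorch is importable from the active environment can save a failed training run. The following is a minimal sketch, not part of the project; the helper name installed_version is hypothetical.

```python
import importlib
import importlib.util


def installed_version(package):
    """Return the package's __version__ string, or None if it is not installed."""
    if importlib.util.find_spec(package) is None:
        return None
    module = importlib.import_module(package)
    # Some packages do not expose __version__; report that explicitly.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version("torch")
    if version is None:
        print("torch is not installed in this environment")
    else:
        print("torch version:", version)
```

If the script reports that torch is missing, the conda environment is probably not activated.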

Example Training

Assuming you have downloaded a training dataset called dataset-2018-01-16 (with files dataset-2018-01-16-train.vec*, dataset-2018-01-16-validation.vec*), you can run the following to train an auto-encoder:

bin/train-autoencoder.sh --mode autoencoder \
        --problem genotyping:dataset-2018-01-16 \
        --lr 0.001  \
        --L2 1E-6   \
        --mini-batch-size 128 \
        --checkpoint-key GENOTYPE_AUTOENCODER_1 \
        --max-epochs 20

The model will be trained for 20 epochs. Best models are saved as checkpoints under the checkpoint directory, using the provided --checkpoint-key.
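
Before launching a run, it may help to confirm that the train and validation files named above are present. This sketch assumes only the naming pattern described in this README (basename-train.vec*, basename-validation.vec*); the function missing_splits is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than real data.

```python
import glob
import os
import tempfile


def missing_splits(basename, splits=("train", "validation")):
    """Return the splits for which no '<basename>-<split>.vec*' file exists."""
    missing = []
    for split in splits:
        # Escape the fixed part so only the trailing '*' acts as a wildcard.
        pattern = glob.escape(f"{basename}-{split}") + ".vec*"
        if not glob.glob(pattern):
            missing.append(split)
    return missing


# Demonstration with empty placeholder files (not real .vec data).
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "dataset-2018-01-16")
    for suffix in ("train.vec", "train.vecp"):
        open(f"{base}-{suffix}", "w").close()
    demo_missing = missing_splits(base)

print(demo_missing)  # the validation files were deliberately not created
```

For a real dataset, call missing_splits("dataset-2018-01-16") from the directory that holds the files; an empty list means both splits were found.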

You can monitor the performance metrics during training with these files:

all-perfs-GENOTYPE_AUTOENCODER_1.tsv
best-perfs-GENOTYPE_AUTOENCODER_1.tsv (restricted to performance of best models, up to latest training epoch.)
args-GENOTYPE_AUTOENCODER_1 (contains exact command line used to train the model, useful for reproducing previous runs, includes random seed)
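
The .tsv files can be inspected with any tool that reads tab-separated values. Below is a minimal sketch using Python's csv module. The column names epoch and test_loss are hypothetical placeholders, since this README does not list the actual columns; check the header row of your own file and adjust.

```python
import csv
import io


def read_perfs(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def best_row(rows, metric, lower_is_better=True):
    """Return the row with the best value of the given metric column."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda row: float(row[metric]))


# Hypothetical sample content; real column names may differ.
sample = "epoch\ttest_loss\n1\t0.50\n2\t0.25\n3\t0.30\n"
rows = read_perfs(io.StringIO(sample))
best = best_row(rows, "test_loss")
print(best["epoch"], best["test_loss"])
```

To read a real file, replace the io.StringIO(sample) argument with an open file, for example open("all-perfs-GENOTYPE_AUTOENCODER_1.tsv", newline="").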

If you do not provide the --checkpoint-key argument, a random key is generated and saved in args-*. This is convenient for hyperparameter searches, because each run gets its own key without manual bookkeeping.
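
As an illustration of such a search, the sketch below builds one command line per combination of learning rate and L2 value, leaving out --checkpoint-key so that each run receives a random key. It only prints the commands and does not run them. The grid values and the function name search_commands are assumptions chosen for the example; the flags are the ones shown in the training command above.

```python
import itertools
import shlex


def search_commands(problem, learning_rates, l2_values, max_epochs=20):
    """Build one training command per (lr, L2) pair, without --checkpoint-key."""
    commands = []
    for lr, l2 in itertools.product(learning_rates, l2_values):
        args = [
            "bin/train-autoencoder.sh",
            "--mode", "autoencoder",
            "--problem", problem,
            "--lr", str(lr),
            "--L2", str(l2),
            "--mini-batch-size", "128",
            "--max-epochs", str(max_epochs),
        ]
        # Quote each argument so the line can be pasted into a shell safely.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical grid: two learning rates and two L2 values give four runs.
commands = search_commands(
    "genotyping:dataset-2018-01-16", [0.01, 0.001], ["1E-6", "1E-5"]
)
for command in commands:
    print(command)
```

The printed lines can be saved to a text file and executed one at a time or handed to a job scheduler.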

Training somatic models

In addition to auto-encoders, the code base supports training a model to call somatic mutations. The vec files must have been created with a somatic feature mapper. In that case, you can run:

bin/train-autoencoder.sh --mode supervised_somatic \
        --problem somatic:dataset2-2018-01-17 \
        --lr 0.001  \
        --L2 1E-6   \
        --mini-batch-size 128 \
        --checkpoint-key GENOTYPE_AUTOENCODER_1 \
        --max-epochs 20

Note that we changed both the mode (now supervised_somatic) and the dataset (now somatic:dataset2-2018-01-17). Training a supervised somatic model requires specific outputs in the .vec files, which are produced by somatic feature mappers in the variationanalysis project (and by the DNANexus Convert Somatic .sbi to Tensors app).

Training genotyping models with semi-supervised training:

bin/train-autoencoder.sh --mode semisupervised_genotypes \
                --problem genotyping:/data/gen/CNG-NA12878-realigned-2018-01-30 \
                --lr 0.01 --L2 1E-6 --mini-batch-size 100 \
                --checkpoint-key GENOTYPE_SEMISUP_1 \
                --max-epochs 200 -n 500 -x 10000

