We provide two different prediction models:
- Word2Vec Embeddings with a feed-forward neural network in PyTorch
- Google Seq2Seq Prediction
The project requires Python 3.6+ and the following dependencies:
torch
torchvision
numpy
matplotlib
biovec
sklearn
To install the dependencies, run:
pip install -r requirements.txt
To run the protein-protein binding predictions, use:
python predict.py -i input_file.fasta -o output_file.fasta
Example input and output files can be found under data/test_input.fasta and data/test_output.fasta. The script uses a pre-trained model that can be found under trained_models/ffnn_model.ckpt.
The data needs to be split into a training set and a test set for later cross-validation and bootstrapping:
python scripts/preprocessing/split_data.py \
--ppi_protvecs=scripts/preprocessing/ppi_as_vec.npy \
--train_set=ppi_vec_train.npy \
--test_set=ppi_vec_test.npy
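The idea behind the split can be sketched in a few lines. This is only an illustration, not the code in split_data.py: the function name `split_indices`, the 20% test fraction, and the fixed seed are assumptions, and the sketch works on sample indices with the standard library instead of on the actual .npy arrays.

```python
import random


def split_indices(num_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) as disjoint, shuffled index lists.

    Hypothetical helper: the fraction and seed are illustrative defaults,
    not values taken from the project.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(num_samples))
    # Shuffle with a private generator so the split is reproducible.
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_samples * test_fraction))
    # The first num_test shuffled indices form the test set, the rest train.
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(10, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

The resulting index lists could then be used to select rows from the vector array before saving the two sets.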
To train the feed-forward network on the training set:
python scripts/ffnn/train_ffnn_w2v.py \
--training_set=scripts/preprocessing/ppi_vec_train.npy \
--model=trained_models/ffnn_model.ckpt \
--num_epochs=100 \
--batch_size=100
To run cross-validation on the training set:
python scripts/ffnn/validation/cross_validation.py \
--ppi_protvecs=scripts/preprocessing/ppi_vec_train.npy \
--num_epochs=100 \
--num_split=5
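For readers unfamiliar with k-fold cross-validation, the following sketch shows how a data set can be partitioned into `num_split` folds, each serving once as the validation set. It is an assumption about the general technique, not the logic of cross_validation.py: the helper `k_fold_indices` is hypothetical and it assumes the samples have already been shuffled.

```python
def k_fold_indices(num_samples, num_split=5):
    """Yield (train_idx, val_idx) for each of num_split contiguous folds.

    Hypothetical helper; assumes the samples were shuffled beforehand.
    """
    if not 2 <= num_split <= num_samples:
        raise ValueError("num_split must be between 2 and num_samples")
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [num_samples // num_split] * num_split
    for i in range(num_samples % num_split):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, num_samples))
        yield train_idx, val_idx
        start = stop


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(12, num_split=5)):
    print(fold, len(train_idx), len(val_idx))
```

In each round a fresh model would be trained on `train_idx` and evaluated on `val_idx`, and the per-fold scores averaged.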
To run bootstrapping on the test set with the trained model:
python scripts/ffnn/validation/bootstrapping.py \
--test_set=scripts/preprocessing/ppi_vec_test.npy \
--model=trained_models/ffnn_model.ckpt \
--num_boot=1000
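Bootstrapping estimates the uncertainty of a test-set metric by resampling the test set with replacement. The sketch below illustrates this for accuracy with a 95% percentile interval. The metric, the interval, and the function name `bootstrap_accuracy` are assumptions made for illustration; bootstrapping.py may compute something different.

```python
import random


def bootstrap_accuracy(correct, num_boot=1000, seed=0):
    """Return (mean, lower, upper) of bootstrapped accuracy.

    `correct` holds one 0/1 flag per test sample (1 = correct prediction).
    Hypothetical helper: accuracy and the 95% percentile interval are
    illustrative choices, not taken from the project.
    """
    n = len(correct)
    if n == 0:
        raise ValueError("correct must not be empty")
    rng = random.Random(seed)
    scores = []
    for _ in range(num_boot):
        # Draw n samples with replacement and score the resample.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(resample) / n)
    scores.sort()
    lower = scores[int(0.025 * num_boot)]
    upper = scores[min(num_boot - 1, int(0.975 * num_boot))]
    return sum(scores) / num_boot, lower, upper


# Toy input: 80 correct and 20 incorrect predictions.
mean, lower, upper = bootstrap_accuracy([1] * 80 + [0] * 20, num_boot=200)
print(round(mean, 3), lower, upper)
```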
The results of cross-validation and bootstrapping are summarised in scripts/postprocessing/final_model.html