pytorch, pytorch-transformers, nvidia-apex
We adapt the Transformer neural network model to predict GO labels for protein sequences. We trained our method on the DeepGO datasets, which were used as baselines in our paper. You can download our trained models here.
During training, we saved the model at each checkpoint. Once training finished, we kept only the checkpoint that performed best on the dev datasets. You will see these saved files in the format checkpoint-number.
The config.json shows how the Transformer model was trained. Please see this demo script, which shows how to use a trained model to evaluate a test set and how to explore some of the model properties.
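For orientation, here is a minimal sketch of loading a saved checkpoint for evaluation. It assumes the checkpoint directory follows the standard pytorch-transformers layout (config.json plus pytorch_model.bin) and a BERT-style encoder; the checkpoint name and toy input are made up, and the demo script above remains the authoritative example.

```python
# Hedged sketch: load a checkpoint directory saved in the standard
# pytorch-transformers layout and run it in evaluation mode.
import torch
from pytorch_transformers import BertModel

checkpoint_dir = "checkpoint-500"   # hypothetical checkpoint name

model = BertModel.from_pretrained(checkpoint_dir)  # reads config.json + weights
model.eval()                                       # disable dropout for evaluation

with torch.no_grad():
    # Toy batch of token ids; real inputs come from the repo's own tokenizer.
    input_ids = torch.tensor([[101, 2, 3, 4, 102]])
    sequence_output = model(input_ids)[0]          # (batch, seq_len, hidden)
```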
You can also train your own model. Your input must match the format of the input here. The high-level format is:
protein_name \t sequence \t label \t protein_vector_from_external_source \t domain_motif_in_sequence
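As a quick illustration, the snippet below reads one line of this tab-separated format into a dictionary. The field names follow the format line above; the file name is hypothetical, and the label/vector separators inside each column should be matched to the provided input examples.

```python
# Hedged sketch: parse the tab-separated training format described above.
def read_examples(path):
    examples = []
    with open(path) as f:
        for line in f:
            name, sequence, label, protein_vector, domain_motif = \
                line.rstrip("\n").split("\t")
            examples.append({
                "protein_name": name,
                "sequence": sequence,
                "label": label,                  # GO labels for this protein
                "protein_vector": protein_vector,  # vector from an external source
                "domain_motif": domain_motif,      # motifs/domains in the sequence
            })
    return examples

examples = read_examples("train.tsv")  # hypothetical file name
```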
We support 4 training options:
- Base Transformer
- Domain data (like motifs, compositional bias, etc.)
- External protein data (like 3D structure, protein-protein interaction network)
- Any combination of the above (one way to combine these inputs is sketched below).
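The sketch below is not the repo's exact architecture; it only illustrates, with made-up dimensions, how the extra inputs could be combined with the base Transformer output: concatenate the pooled sequence representation with the external protein vector and a domain/motif feature vector, then classify GO labels.

```python
# Hedged sketch: combine base Transformer output with external protein
# data and domain/motif features for multi-label GO prediction.
import torch
import torch.nn as nn

hidden_size, ext_dim, domain_dim, num_go_labels = 256, 128, 64, 932  # placeholders

classifier = nn.Linear(hidden_size + ext_dim + domain_dim, num_go_labels)

pooled_seq = torch.randn(8, hidden_size)   # from the base Transformer (option 1)
domain_vec = torch.randn(8, domain_dim)    # domain/motif data (option 2)
ext_vec    = torch.randn(8, ext_dim)       # external protein data (option 3)

logits = classifier(torch.cat([pooled_seq, domain_vec, ext_vec], dim=-1))
probs = torch.sigmoid(logits)              # independent probability per GO label
```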
You can download the most up-to-date manually annotated data from Uniprot.org. The site also provides all known motifs and domains for a given sequence. You may have to do a custom download from Uniprot for this extra information.
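One way to script such a custom download is via UniProt's REST API, sketched below. The query, return fields, and output path are assumptions (for example, ft_domain/ft_motif/ft_compbias as the domain and motif columns); check UniProt's documentation on return fields before relying on the exact names.

```python
# Hedged sketch: pull reviewed sequences plus domain/motif annotations
# from the UniProt REST API as a TSV file.
import requests

url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": "organism_id:9606 AND reviewed:true",            # example: human, Swiss-Prot
    "fields": "accession,sequence,ft_domain,ft_motif,ft_compbias",  # assumed field names
    "format": "tsv",
    "size": 500,
}
resp = requests.get(url, params=params, timeout=60)
resp.raise_for_status()

with open("uniprot_human.tsv", "w") as f:   # hypothetical output file
    f.write(resp.text)
```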
We do not have the pre-trained encoder from DeepGO that provides embeddings for proteins in a protein-protein interaction network.
We do have the pre-trained encoder that provides embeddings representing the 3D structures of proteins.
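However you obtain such an embedding, it goes into the protein_vector_from_external_source column of the training file. The snippet below shows one possible serialization; the space-separated float convention, label separator, and accession are assumptions, so match whatever the provided input examples use.

```python
# Hedged sketch: write a precomputed external embedding into one row of
# the tab-separated training format.
import numpy as np

embedding = np.random.rand(128)                         # placeholder external vector
vector_field = " ".join(f"{x:.6f}" for x in embedding)  # assumed space-separated floats

row = "\t".join([
    "P12345",                   # protein_name (hypothetical accession)
    "MKTAYIAKQR",               # sequence (toy example)
    "GO:0005524;GO:0016301",    # labels; separator is an assumption
    vector_field,               # protein_vector_from_external_source
    "none",                     # domain_motif_in_sequence
])
print(row)
```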