
GOAT: GO Annotation with the Transformer model

Libraries needed

pytorch, pytorch-transformers, nvidia-apex

Where are the pre-trained models?

We adapt the Transformer neural network model to predict GO labels for protein sequences. We trained our method on the DeepGO datasets, which we used as a baseline in our paper. You can download our trained models here.

During training, we saved the model at each checkpoint. Once training finished, we kept only the checkpoint that performed best on the dev dataset. You will see these saved files named in the format checkpoint-number.

The config.json file shows how the Transformer model was trained. Please see this demo script, which shows how to use a trained model to evaluate a test set and how to explore some of the model properties.
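
As a rough sketch, a saved checkpoint directory can be loaded with the standard pytorch-transformers from_pretrained machinery. The snippet below is illustrative only: it assumes the checkpoint follows the usual layout (config.json plus pytorch_model.bin) and uses BertModel as a stand-in for this repository's own model class; the demo script above shows the exact class and evaluation loop.

```python
# Illustrative sketch only: BertModel stands in for the repository's own model
# class, and the checkpoint path is hypothetical. See the demo script for the
# exact loading and evaluation code.
import torch
from pytorch_transformers import BertConfig, BertModel

checkpoint_dir = "path/to/checkpoint-100000"  # one of the saved checkpoint-number folders

config = BertConfig.from_pretrained(checkpoint_dir)   # records how the model was trained
model = BertModel.from_pretrained(checkpoint_dir, config=config)
model.eval()

with torch.no_grad():
    # Placeholder batch; in practice this is a tokenized amino-acid sequence.
    input_ids = torch.zeros((1, 16), dtype=torch.long)
    outputs = model(input_ids)
```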

How to train your model?

You can train your own model. Your input data must match the format shown here. The high-level format is

protein_name \t sequence \t label \t protein_vector_from_external_source \t domain_motif_in_sequence
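
For illustration, the sketch below (a hypothetical helper, not part of the repository) shows how one such tab-separated line splits into its five fields.

```python
# Hypothetical helper, not part of the repository: split one training example
# into the five tab-separated fields described above.
def parse_line(line):
    protein_name, sequence, labels, protein_vector, domains = line.rstrip("\n").split("\t")
    return {
        "protein_name": protein_name,
        "sequence": sequence,                  # amino-acid sequence
        "labels": labels,                      # GO labels to predict
        "protein_vector": protein_vector,      # embedding from an external source
        "domains": domains,                    # domains / motifs found in the sequence
    }
```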

We support 4 training options:

  1. Base Transformer
  2. Domain data (like motifs, compositional bias, etc.)
  3. External protein data (like 3D structure, protein-protein interaction network)
  4. Any combination of the above.
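
As a hypothetical illustration of how these options relate to the input columns above (the flag and field names below are not from the repository), the base Transformer consumes only the sequence and labels, options 2 and 3 additionally consume the domain column and the external protein vector, and option 4 consumes both.

```python
# Hypothetical illustration, not repository code: which input columns feed
# each of the training options listed above.
def build_model_inputs(example, use_domains=False, use_protein_vector=False):
    inputs = {"sequence": example["sequence"], "labels": example["labels"]}  # option 1
    if use_domains:                # option 2: motifs, compositional bias, ...
        inputs["domains"] = example["domains"]
    if use_protein_vector:         # option 3: 3D-structure or PPI-network embedding
        inputs["protein_vector"] = example["protein_vector"]
    return inputs                  # enabling both flags corresponds to option 4
```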

You can download the most recent manually annotated data from Uniprot.org. The site also provides all known motifs and domains for a given sequence. You may have to do a custom download from Uniprot for this extra information.

We do not have the pre-trained encoder used in DeepGO that provides embeddings for proteins in a protein-protein interaction network.

We do have the pre-trained encoder that provides embeddings representing 3D structures of proteins.
