- Definition encoder
  - BiLSTM (see the sketch after this list)
  - ELMo
  - Transformer based on the BERT strategy
- Position encoder
  - GCN
  - Onto2vec
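For a concrete sense of what a definition encoder does, here is a minimal BiLSTM sketch (not our exact architecture; the tokenization, layer sizes, and mean-pooling are assumptions) that maps a tokenized GO definition to a single fixed-size vector:

```python
import torch
import torch.nn as nn

class BiLstmDefinitionEncoder(nn.Module):
    """Toy BiLSTM encoder: token ids -> one fixed-size vector per GO definition."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2 * hidden_dim)
        return states.mean(dim=1)                      # mean-pool over tokens

# Encode two already-tokenized definitions into 512-dimensional vectors.
encoder = BiLstmDefinitionEncoder(vocab_size=10000)
fake_batch = torch.randint(1, 10000, (2, 30))   # 2 definitions, 30 tokens each (dummy ids)
vectors = encoder(fake_batch)                   # shape: (2, 512)
```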
Consider the example below. We expect child-parent terms to have similar vector embeddings, whereas two unrelated terms should have different embeddings. Moreover, child-parent terms lie in the same neighborhood of the GO hierarchy, so their position embeddings should also be similar.
Required packages: pytorch, pytorch-pretrained-bert, pytorch-geometric
We embed either the definition or the position of a GO term. The key idea is that child-parent terms often have similar definitions or positions in the GO tree, so we can embed them into comparable vectors.
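For instance, once each term has a vector, cosine similarity gives one simple way to check that related terms are "comparable". In the sketch below, GO:0008150 (biological_process) and GO:0009987 (cellular process) are a real parent-child pair and GO:0005575 (cellular_component) is unrelated to them, but the vector values are made up for illustration:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical vectors; in practice these come from the downloaded embedding files.
embedding = {
    "GO:0008150": np.array([0.9, 0.1, 0.3]),   # parent term
    "GO:0009987": np.array([0.8, 0.2, 0.3]),   # child of GO:0008150
    "GO:0005575": np.array([-0.2, 0.9, 0.1]),  # unrelated term
}

print(cosine(embedding["GO:0008150"], embedding["GO:0009987"]))  # higher (related pair)
print(cosine(embedding["GO:0008150"], embedding["GO:0005575"]))  # lower (unrelated pair)
```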
All models are already trained and ready to use. You can download the embeddings here. There are different types of embeddings; you can try any of them. For example, download these files if you want to use the BiLSTM embeddings for Tasks 1 and 2 discussed in our paper.
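The file layout depends on which embedding set you download; purely as a minimal sketch, assuming a word2vec-style text file with one GO id followed by its vector values per line (the filename below is hypothetical), the embeddings could be loaded like this:

```python
import numpy as np

def load_go_vectors(path):
    """Read an assumed whitespace-separated layout: 'GO:XXXXXXX v1 v2 ... vd' per line."""
    vectors = {}
    with open(path) as fh:
        for line in fh:
            parts = line.strip().split()
            if len(parts) < 2:
                continue
            vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

go_vectors = load_go_vectors("bilstm_go_embeddings.txt")  # hypothetical filename
print(len(go_vectors), go_vectors["GO:0008150"].shape)
```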
You can also use our trained models to produce vectors for any GO definitions; see the example script here. You will have to prepare the go.obo definition input in the format shown here.
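The exact input layout expected by the script is in the linked format file; purely as a sketch, the snippet below pulls each term's id and definition out of go.obo and writes an assumed tab-separated `id<TAB>definition` file (the output filename is hypothetical):

```python
import re

def obo_definitions(obo_path):
    """Yield (GO id, definition) pairs from a go.obo file."""
    term_id = None
    with open(obo_path) as fh:
        for line in fh:
            line = line.strip()
            if line == "[Term]":
                term_id = None
            elif line.startswith("id: GO:"):
                term_id = line[len("id: "):]
            elif line.startswith("def: ") and term_id:
                match = re.search(r'"(.*)"', line)   # the definition is quoted; refs follow
                if match:
                    yield term_id, match.group(1)

# Write one 'GO id <TAB> definition' line per term (an assumed layout; check the
# linked format file for the exact input expected by the script).
with open("go_definitions.tsv", "w") as out:
    for go_id, go_def in obo_definitions("go.obo"):
        out.write(f"{go_id}\t{go_def}\n")
```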
Alternatively, you can train your own embeddings by following the same example script. You only need to prepare your train/dev/test datasets in the same format shown here.
Almost every protein is annotated by a set of GO terms; see, for example, the UniProt database. Once you can express each GO term as a vector, then for any two proteins you can compare the sets of terms annotating them. We used the Best-Match Average metric to compare two sets; however, there are other options to explore. Our example to compare two proteins is here.
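As a sketch of the Best-Match Average idea (using cosine similarity between GO vectors; see the linked example for the exact computation in our scripts), each term in one annotation set is matched to its best-scoring term in the other set, and the two directional averages are then averaged:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_match_average(terms_a, terms_b, go_vectors):
    """Best-Match Average between two sets of GO ids, given a dict of GO vectors."""
    sims = np.array([[cosine(go_vectors[a], go_vectors[b]) for b in terms_b]
                     for a in terms_a])      # |A| x |B| similarity matrix
    a_to_b = sims.max(axis=1).mean()         # each term in A matched to its best term in B
    b_to_a = sims.max(axis=0).mean()         # each term in B matched to its best term in A
    return (a_to_b + b_to_a) / 2.0

# Toy annotations for two proteins; the vectors are made up for illustration.
go_vectors = {
    "GO:0008150": np.array([0.9, 0.1, 0.3]),
    "GO:0009987": np.array([0.8, 0.2, 0.3]),
    "GO:0005575": np.array([-0.2, 0.9, 0.1]),
}
protein1 = ["GO:0008150", "GO:0009987"]
protein2 = ["GO:0008150", "GO:0005575"]
print(best_match_average(protein1, protein2, go_vectors))
```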
We can use the UniProt database to train a model that predicts GO labels for an unknown protein sequence. In our paper, we demonstrate that GO embeddings can be used to predict GO labels not included in the training data (zero-shot learning). There are two advantages. First, many machine learning methods exclude rare labels because they often have problems when the training data contains very rare labels. GO embeddings allow us to adopt the zero-shot learning philosophy, where we train models on the labels in the training data but test them on new, unseen labels. Second, as the GO database is constantly updated with new terms, we do not need to train a brand-new model after each update.
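To illustrate the zero-shot idea only (this is a schematic, not the architecture from our paper), a protein representation can be projected into the GO-embedding space and scored against any label's embedding with a dot product, so a label never seen during training can still be scored as long as it has a GO embedding:

```python
import torch
import torch.nn as nn

class ZeroShotGoScorer(nn.Module):
    """Schematic zero-shot scorer: project a protein feature vector into the
    GO-embedding space, then score each label by a dot product with its embedding."""
    def __init__(self, protein_dim, go_dim):
        super().__init__()
        self.project = nn.Linear(protein_dim, go_dim)

    def forward(self, protein_features, go_label_vectors):
        # protein_features: (batch, protein_dim); go_label_vectors: (num_labels, go_dim)
        return self.project(protein_features) @ go_label_vectors.t()  # (batch, num_labels)

# Labels absent from the training data can be scored simply by passing their GO vectors.
scorer = ZeroShotGoScorer(protein_dim=1024, go_dim=768)
proteins = torch.randn(4, 1024)          # hypothetical protein feature vectors
unseen_labels = torch.randn(50, 768)     # GO embeddings of labels not seen in training
scores = scorer(proteins, unseen_labels)  # shape: (4, 50)
```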