EC number prediction models created using the TPOT tool. The TPOT tool can be accessed here. TPOT uses genetic programming to arrive at optimized machine learning pipelines, which were validated and used to create the following models. The models were built hierarchically, and the chosen pipelines are adapted to each EC number digit that has enough data to study. They were developed for the Master's dissertation "A Study of Machine Learning for Artificial Intelligence-Based Enzyme Classification", of the Computational Biology and Bioinformatics Master's from Lisbon's Nova University, at NOVA ITQB.

To use the models, you need Python and Anaconda. From a terminal, follow these steps:
- Clone the repository
git clone https://github.com/Ananas-bio/Tpot_ec_prediction.git
- Create and activate a conda environment using the YAML file
conda env create -f environment.yml
conda activate tpot_ec
To run the models from the terminal, use a command such as:
python ec_predict -i uniprot_test.fasta -l 3 -m c40
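The -i option takes the input FASTA file (uniprot_test.fasta in the example). The -l and -m options are optional: -l sets the EC level to predict up to and defaults to 3, and -m selects the model, either c40 or swiss, and defaults to c40.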
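If you prefer to launch predictions from a Python script rather than typing the command by hand, something like the following could work. It is only a minimal sketch: the helper names `build_command` and `run_prediction` are hypothetical and not part of this repository, and it assumes the script is started from the cloned repository folder with the tpot_ec environment active. The only validation it does is based on what is stated above (models c40 or swiss, default level 3).

```python
import subprocess
import sys

# Model names documented in this README.
KNOWN_MODELS = ("c40", "swiss")


def build_command(fasta_path, level=3, model="c40"):
    """Return the argument list for one ec_predict run (hypothetical helper)."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"model must be one of {KNOWN_MODELS}, got {model!r}")
    if level < 1:
        raise ValueError("level must be a positive integer")
    # Mirrors: python ec_predict -i <fasta> -l <level> -m <model>
    return [sys.executable, "ec_predict", "-i", str(fasta_path),
            "-l", str(level), "-m", model]


def run_prediction(fasta_path, level=3, model="c40"):
    """Run ec_predict in the current directory; raises if it exits non-zero."""
    return subprocess.run(build_command(fasta_path, level, model), check=True)


if __name__ == "__main__":
    # Print the command for each model instead of running it, so this
    # sketch can be tried without the repository or the models present.
    for model_name in KNOWN_MODELS:
        print(" ".join(build_command("uniprot_test.fasta", model=model_name)))
```

Calling `run_prediction("uniprot_test.fasta")` with no other arguments should be equivalent to the example command above, since it passes the same defaults explicitly.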