Reproducing results

Deepro Banerjee, Michael A. Jindra, Alec J. Linot, Brian F. Pfleger, Costas D. Maranas, EnZymClass: Substrate specificity prediction tool of plant acyl-ACP thioesterases based on ensemble learning, Current Research in Biotechnology, Volume 4, 2022, Pages 1-9, ISSN 2590-2628, https://doi.org/10.1016/j.crbiot.2021.12.002. (https://www.sciencedirect.com/science/article/pii/S259026282100037X)

Environment creation

Setting up a conda virtual environment

Install conda, preferably through Anaconda. For installation instructions, please refer to the official installation guide.
Set the conda channel priority by editing the .condarc file so that it contains the following:

channel_priority: flexible
channels:
- conda-forge
- bioconda
- defaults

This step is optional but recommended, as it can prevent some installation errors. The .condarc file is located in the home directory. On macOS/Linux it can be edited from the terminal using the following command:

$ nano ~/.condarc

Create a conda environment with the following command:

$ conda create -n te_env python=3.9 scikit-learn pandas jupyter blast bioconductor-kebabs=1.24.0

Install ifeatpro, ngrampro and pssmpro using pip after activating the conda environment. Run the following commands on macOS/Linux. On Windows, use WSL.

$ conda activate te_env

$ pip install ifeatpro

$ pip install ngrampro

$ pip install pssmpro
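
As a quick sanity check after the installation steps above, the following minimal Python sketch reports which packages and command-line tools are not visible from the active environment. It is an illustration rather than part of the repository: the helper name find_missing is hypothetical, and the list of import names assumes that each package is imported under its pip name (with scikit-learn imported as sklearn).

```python
import importlib.util
import shutil

# Assumption: each package's import name matches its pip/conda name,
# except scikit-learn, which is imported as "sklearn".
PYTHON_MODULES = ["sklearn", "pandas", "ifeatpro", "ngrampro", "pssmpro"]
# psiblast is part of BLAST+, installed above through the conda "blast" package.
EXECUTABLES = ["psiblast"]


def find_missing(modules, executables):
    """Return (missing_modules, missing_executables) for this environment."""
    # find_spec returns None when a top-level module cannot be located.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when a program is not on PATH.
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    mods, exes = find_missing(PYTHON_MODULES, EXECUTABLES)
    print("Missing Python modules:", mods or "none")
    print("Missing executables:", exes or "none")
```

Run it with the te_env environment activated; anything listed as missing points to an installation step that should be repeated.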

Reproducing TE substrate specificity results

The Jupyter notebook TE_SubstrateSpecificityAnalysis.ipynb, located in the notebooks/ directory, provides step-by-step instructions on how we obtained the current results. Open a Jupyter session and rerun the notebook. Some steps take several hours to run; please use multiple cores (we used 24) to obtain results within a reasonable amount of time.
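
Before starting the long-running steps, it can help to confirm how many cores the Jupyter process is actually allowed to use, which on shared clusters may be fewer than the machine has. The sketch below is a generic check and is not taken from the notebook; usable_cores is a hypothetical helper name.

```python
import os


def usable_cores():
    """Return the number of CPU cores available to the current process."""
    try:
        # Linux only: respects CPU affinity limits set by a scheduler.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # macOS and other platforms: fall back to the total core count.
        return os.cpu_count() or 1


print("Usable cores:", usable_cores())
```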

Applying EnZymClass on other protein sequence classification applications

Using EnZymClass module

Please refer to: https://github.com/deeprob/EnZymClass

Using Jupyter Notebooks

Create training and test datasets in the same format as the CSV files in the data/raw/ directory. The format should be as follows:

protein_name, protein_sequence, protein_label (for the training dataset)

protein_name, protein_sequence (for the test dataset)

Duplicate the TE_SubstrateSpecificityAnalysis.ipynb notebook in the notebooks/ directory and rename the copy according to your application area. Rename the train_raw and test_raw variables in the notebook.
Run the notebook step by step. Note that the PSSM-based features require PSSM profiles of the protein sequences, which in turn require the path to the psiblast program and a BLAST database. BLAST database creation is described in the pssmpro tutorial.
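
The dataset format described above can be checked before running the notebook. The following is a minimal sketch using only the Python standard library, not code from the repository: check_rows is a hypothetical helper, the files are assumed to have no header row, and the set of 20 standard amino acid letters is an assumption about what counts as a valid sequence.

```python
import csv
import io

# Standard one-letter amino acid codes; an assumption about what counts as
# a valid sequence, not a rule taken from the EnZymClass notebook.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_rows(handle, n_columns):
    """Return a list of (row_number, problem) for rows that look malformed.

    n_columns is 3 for a training file (name, sequence, label)
    and 2 for a test file (name, sequence).
    """
    problems = []
    for number, row in enumerate(csv.reader(handle), start=1):
        if len(row) != n_columns:
            problems.append((number, f"expected {n_columns} columns, got {len(row)}"))
            continue
        sequence = row[1].strip().upper()
        if not sequence or set(sequence) - AMINO_ACIDS:
            problems.append((number, "sequence is empty or has non-standard residues"))
    return problems


# Hypothetical two-row training file, inlined so the sketch runs as-is.
# The second row lacks a label, so it is reported.
example = io.StringIO("enzyme_a,MKTAYIAK,1\nenzyme_b,MKT1AK\n")
print(check_rows(example, n_columns=3))
```

To check a real file, pass a handle opened with open(path, newline="") in place of the inlined example. If your files do contain a header row, skip the first line before calling the function.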

cabbi-bio/ThioesteraseEnzymeSpecificity
