About

ClusterFast is a scalable, fast, CPU- and memory-efficient command-line tool for clustering orthologous proteins encoded in multiple genomic samples. ClusterFast was developed by the MiDEP group at the Malawi Liverpool Wellcome Trust Clinical Research Programme (members of H3ABionet), after they became frustrated by the long run times and large memory requirements of available clustering programs.

ClusterFast works by first grouping the most similar protein sequences encoded in a random pair of genomes (files) from the input dataset. It then chooses the longest sequence of each group for comparison with the sequences selected from another pair of genomes, and repeats this until only one file remains. This final file is then used to identify less similar, distantly related sequences. The tool is suitable for use with both prokaryotic and eukaryotic genomes. By employing this novel approach, ClusterFast substantially reduces the memory and processing time required for clustering orthologous proteins compared to other available programs.
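
The pairwise reduction described above can be sketched roughly as follows. This is an illustrative toy only, not ClusterFast's actual code: the function names (`pair_up`, `merge_pair`, `reduce_to_one`) are hypothetical, and a simple prefix test stands in for the PBLAT/BLAST similarity search that the real tool uses.

```python
import random


def pair_up(items, rng):
    """Shuffle the items and split them into pairs (an odd item is left alone)."""
    items = list(items)
    rng.shuffle(items)
    return [items[i:i + 2] for i in range(0, len(items), 2)]


def merge_pair(group, similar):
    """Group similar sequences and keep only the longest one of each group."""
    pool = sorted((seq for seqs in group for seq in seqs), key=len, reverse=True)
    representatives = []
    for seq in pool:
        # Longest sequences come first, so a sequence is kept only if no
        # longer (or equally long) representative already covers it.
        if not any(similar(seq, rep) for rep in representatives):
            representatives.append(seq)
    return representatives


def reduce_to_one(genomes, similar, seed=1234):
    """Merge randomly paired sequence sets until a single set remains."""
    rng = random.Random(seed)
    current = list(genomes.values())
    while len(current) > 1:
        current = [merge_pair(group, similar) for group in pair_up(current, rng)]
    return current[0]


def toy_similar(a, b):
    """Toy stand-in for a real similarity search (assumption): prefix match."""
    return a.startswith(b) or b.startswith(a)


genomes = {
    "sample_a": ["MKTAYIAK", "MSTNP"],
    "sample_b": ["MKTAYI", "MQQQ"],
    "sample_c": ["MSTNPKPQ"],
}
print(sorted(reduce_to_one(genomes, toy_similar)))
```

Because only one representative per group is carried into the next round, each round compares far fewer sequences than an all-against-all search would.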

Using a test dataset of 140 pneumococcal genomes (each 2.5 Mbp in size and encoding ~1500 genes), ClusterFast ran in <20 minutes on a single core. Using a test dataset of 62 bacterial genomes from multiple different species, ClusterFast ran in ~1 hour on 20 cores.

ClusterFast is written in Python and uses PBLAT (multicore BLAT), BLAST and the ProteinOrtho4.0 algorithm.

External tools

These are expected to be on the system path, or their locations can be provided through the options:

PBLAT
NCBI BLAST suite
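
Before running ClusterFast, it may help to confirm that the external tools can be found. The following is a small sketch using only the Python standard library; the helper name `missing_tools` is hypothetical, and the executable names are the defaults listed under Usage (`pblat`, `blastp`, `makeblastdb`).

```python
import shutil


def missing_tools(tools=("pblat", "blastp", "makeblastdb")):
    """Return the executables from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool is reported as missing, its location can instead be passed through the corresponding option.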

Python and module dependencies

ClusterFast requires the following; how you install them depends on your system:

Python 3+ (Python 2 not tested)
NumPy
SciPy
Pandas
Click
BioPython
NetworkX

The modules should be installed automatically when ClusterFast is installed.

Installation

Note: you may need root privileges.

Suggested method

pip install git+https://github.com/codemeleon/ClusterFast.git

Alternative method

git clone https://github.com/codemeleon/ClusterFast.git
cd ClusterFast
python setup.py install

If the installation fails, please contact your system administrator. If you discover any bugs, please let us know by emailing [email protected]

Input Files

ClusterFast takes protein sequence files (extension .faa) as input: one file per genome (sample) in the dataset, containing the translated amino acid sequences of its predicted open reading frames. File names and sequence names must not contain ___ (three underscores). These files can be created using Prokka.
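
The naming restriction above can be checked before a run. The following is a minimal sketch, not part of ClusterFast: the helper name `find_triple_underscores` is hypothetical, and it assumes that the "sequence name" is the FASTA header line (everything after `>`).

```python
from pathlib import Path


def find_triple_underscores(folder):
    """Return (file name, offending name) pairs that contain '___'."""
    problems = []
    for faa in sorted(Path(folder).glob("*.faa")):
        # Check the file name itself.
        if "___" in faa.name:
            problems.append((faa.name, faa.name))
        # Check every FASTA header line in the file.
        with faa.open() as handle:
            for line in handle:
                if line.startswith(">") and "___" in line:
                    problems.append((faa.name, line[1:].strip()))
    return problems
```

An empty list means that no file or header in the folder contains three consecutive underscores.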

Usage

clusterfast -faaf < protein_seq_folder > -identity < sequence_similarity > -ncor < #_of_cores_to_use > -outfile < outputfile > -pblat < pblat_absolute_path > -blastp < blast_absolute_path > -makeblastdb < makeblastdb_path > -sim_algo < blat|anm > -minlen < minimum_sequence_size_for_clustering > -mindiff < Sequence_difference_in_pair_sequences > -minmap < Minimum_map_length_relative_to_longer_sequence_in_pair > -seed < random_number_for_file_pairing >

--help/-h : Help
-faaf: Folder containing protein fasta files with file extension .faa
-identity: Similarity between sequences. Defaults: 0.8 for closely related samples and 0.25 for distantly related samples
-ncor: Number of processors to use
-outfile: Output file path
-pblat: Path for pblat executable. Default: pblat
-makeblastdb: Path for makeblastdb executable. Default: makeblastdb
-blastp: Path for blastp executable. Default: blastp
-evalue: BLAST evalue. Default: 1e-10
-distant: Are samples distantly related? Default: False
-seed: For random file pairing. Default: 1234
-minlen: Minimum length of sequences used in clustering. Default: 50
-mindiff: Length of the shorter sequence relative to the longer one, for a BLAST hit to be considered. Default: 0.5
-minmap: Minimum mapping length relative to the longer sequence in the pair. Default: 0.5
-conn_threshold: Connection threshold used in ProteinOrtho4.0. Default: 0.1
-adaptive: Adaptive search value as in ProteinOrtho4. Default: 0.95
-algo: Identity calculation method. Default: anm
- blast: 2*matches/(sum of length of sequences)
- anm: matches/total alignment length, as in the following example.
- "*" represents matches in the alignment. Total alignment length includes overhanging sequences, gaps in two sequences, mismatches and matches
- ADGTHADT--FGGHJJ---DFGDTJHKJLKSDFHKJLJ
- ---*****--******---***-**--******-----
- ---THADTFGFGGHJJSDFDFGFTJKHJLKSDF-----
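
The two identity formulas can be worked through on the example alignment above. This is a sketch under stated assumptions rather than ClusterFast's own code: the function names are hypothetical, a "match" is taken to be an identical residue in the same alignment column, and overhangs are represented as gap characters so that the total alignment length is simply the length of the aligned strings.

```python
def count_matches(aligned_a, aligned_b):
    """Count columns where both aligned sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))


def identity_anm(aligned_a, aligned_b):
    """anm: matches / total alignment length (gaps and overhangs included)."""
    return count_matches(aligned_a, aligned_b) / len(aligned_a)


def identity_blast(aligned_a, aligned_b):
    """blast: 2 * matches / (sum of the two ungapped sequence lengths)."""
    total = len(aligned_a.replace("-", "")) + len(aligned_b.replace("-", ""))
    return 2 * count_matches(aligned_a, aligned_b) / total


seq_a = "ADGTHADT--FGGHJJ---DFGDTJHKJLKSDFHKJLJ"
seq_b = "---THADTFGFGGHJJSDFDFGFTJKHJLKSDF-----"

print(count_matches(seq_a, seq_b))              # 22
print(round(identity_anm(seq_a, seq_b), 3))     # 0.579 (22 / 38)
print(round(identity_blast(seq_a, seq_b), 3))   # 0.698 (44 / 63)
```

For this pair the anm value is lower than the blast value because the overhangs and gaps enlarge the denominator.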

License

GPLv3

Benchmarking

ClusterFast was benchmarked against ProteinOrtho4.0 on two different datasets.

Kulohoma BW et al.

140 S. pneumoniae samples
~1600 protein sequences per sample
20 cores
ProteinOrtho4.0: ~8 hours
ClusterFast: 5 minutes (~100 times faster)
More than 90% similarity

62 different bacterial genomes

20 cores
2100 to 4500 protein sequences per genome
ProteinOrtho4.0: ~4 hours
ClusterFast: ~1 hour (4 times faster)
~70% identical clusters
~20% cluster between ProteinOrtho4. and ClusterFast had
~70% identical clusters between the two tools

Figure 1: Shows a direct comparison of the number of orthologous amino acid sequences present in the clusters produced by ClusterFast compared to those produced by ProteinOrtho. A linear relationship can be observed between the two. The size of the data points indicates the degree of overlap in the protein IDs assigned to the amino acid sequences in each ClusterFast versus ProteinOrtho cluster; the larger the data point the higher the degree of overlap.

Figure 2: Shows the frequency (dark grey) and cumulative frequency (light grey) of the non-overlapping clusters of a given content percentage similarity produced by ClusterFast compared to ProteinOrtho. All of the clusters showed >50% content similarity.

Request

All suggestions for improvement and criticism are welcome.

ToDo

Python 2.7 compatibility
More optimisations
