CAndidate Enzyme SeARch

CAESAR is a pipeline that (1) searches for candidate sequences based on reference sequences or an HMM profile, (2) clusters them, (3) selects candidates in each cluster, and (4) builds a phylogenetic tree.

Installation

Download the latest release to obtain the code

Then you can install all the Python dependencies and all the external bioinformatics tools required with the following command*:

conda env create -n caesar -f caesar_env.yml

*mamba can be used instead of conda

This will install:

  • Python packages
    • biopython
    • pyyaml
    • requests
    • psutil
  • Bioinformatics tools
    • diamond>=2.1.0
    • seqkit>=2.8.0
    • hmmer
    • mafft
    • fasttree
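
After creating the environment, it can be useful to confirm that the external tools are reachable. The sketch below checks the PATH using only the Python standard library. The executable names (for example `hmmsearch` for hmmer and `FastTree` for fasttree) are assumptions about how the conda packages expose their binaries, and `missing_tools` is a hypothetical helper that is not part of CAESAR.

```python
import shutil

# Assumed executable names for the tools listed above; adjust them if
# your installation exposes the binaries under different names.
EXPECTED_TOOLS = ["diamond", "seqkit", "hmmsearch", "mafft", "FastTree"]


def missing_tools(tools=EXPECTED_TOOLS):
    """Return the entries of `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All expected tools were found.")
```

Run it inside the activated `caesar` environment; an empty result suggests the tools are installed under the assumed names.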

Configuration file

CAESAR always requires a configuration file in YAML format. This file must contain information such as the database paths, e.g.:

trembl_db:
    - dmnd: "/home/user/trembl.dmnd"
    - faa: "/home/user/trembl.fasta"
nr_db:
    - dmnd: "/home/user/nr.dmnd"
    - faa: "/home/user/nr.fasta"
other_db:
    - dmnd: "/home/user/other.dmnd"
    - faa: "/home/user/other.faa"  # amino acid sequences in fasta format
    - fna: "/home/user/other.fna"  # nucleic acid sequences in fasta format
strain_library: "/home/user/strain_library.tsv"
date:
    - trembl: "2024_07_28"
candidate_selection:
    - strain_library
    - order  # strain available in the ATCC or DSM collection
    - other
slurm: 1  # 0: false 1: true
parallel: 1  # 0: false 1: true
module:  # if the system use module to load software...
    - diamond/2.1.2
    - hmmer/3.4

Each key of the form name_db corresponds to a database. The dmnd format is required to run diamond blastp, and the faa format is required to run hmmsearch. For the uniprot or nr databases, only one of the two formats is necessary. For any other database, both formats are required to use blastp, but only the faa file is needed if you use hmmsearch. For CAESAR to be able to provide the nucleic acid sequences of the candidates, these sequences (fna) must also be provided (for uniprot and nr they are retrieved via database queries).
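
The format rules above can be sketched as a small pre-flight check. This is an illustration, not CAESAR's own validation: `flatten_db_entry`, `required_formats`, `missing_formats` and `PUBLIC_DBS` are hypothetical names, and the rule for uniprot/nr (dmnd for blastp, faa for hmmsearch) is one reading of the text. The input mirrors what a YAML loader such as PyYAML's `yaml.safe_load` would return for the example file, where each database is a list of single-key mappings.

```python
# Assumption: these keys are the "uniprot or nr" databases of the example.
PUBLIC_DBS = {"trembl_db", "nr_db"}


def flatten_db_entry(entry):
    """Merge a list of single-key mappings into one dict."""
    merged = {}
    for item in entry:
        merged.update(item)
    return merged


def required_formats(db_name, mode):
    """Formats needed for `mode` ("blastp" or "hmmsearch")."""
    if mode == "hmmsearch":
        return {"faa"}
    if mode == "blastp":
        # uniprot/nr: dmnd alone; any other database: dmnd and faa.
        return {"dmnd"} if db_name in PUBLIC_DBS else {"dmnd", "faa"}
    raise ValueError(f"unknown mode: {mode}")


def missing_formats(config, mode):
    """Map each *_db key to the formats it lacks for `mode`."""
    missing = {}
    for key, value in config.items():
        if not key.endswith("_db"):
            continue
        lacking = required_formats(key, mode) - set(flatten_db_entry(value))
        if lacking:
            missing[key] = sorted(lacking)
    return missing


config = {
    "trembl_db": [{"dmnd": "/home/user/trembl.dmnd"},
                  {"faa": "/home/user/trembl.fasta"}],
    "other_db": [{"faa": "/home/user/other.faa"}],
    "strain_library": "/home/user/strain_library.tsv",
}
print(missing_formats(config, "blastp"))     # {'other_db': ['dmnd']}
print(missing_formats(config, "hmmsearch"))  # {}
```

Here `other_db` lacks a dmnd file, so it is reported for blastp but passes for hmmsearch.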

The strain_library key gives the path to the strain library file.

The date key is used to indicate a publication or update date for a database. This key is optional; without it, the file creation date is written to the summary.out file.
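
The fallback described above could look like the following sketch. It is not taken from CAESAR's code: `database_date` is a hypothetical helper, the `%Y_%m_%d` output format is assumed from the example value, and the file's modification time stands in for the creation date, which is not portably available on every file system.

```python
import os
from datetime import datetime


def database_date(config, db_name, db_path):
    """Return the configured date for `db_name`, else a date read from the file."""
    dates = {}
    # The example config stores dates as a list of single-key mappings.
    for item in config.get("date") or []:
        dates.update(item)
    if db_name in dates:
        return str(dates[db_name])
    # Fallback (assumption): use the modification time as a proxy for
    # the creation date and format it like the example value.
    timestamp = os.path.getmtime(db_path)
    return datetime.fromtimestamp(timestamp).strftime("%Y_%m_%d")


config = {"date": [{"trembl": "2024_07_28"}]}
print(database_date(config, "trembl", "/home/user/trembl.dmnd"))  # 2024_07_28
```

When the name is listed under `date`, the file is never touched, so the path does not need to exist in that case.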

The candidate_selection key lists the priority order for the selection. In the example above, sequences from an organism/strain found in the strain_library take priority over those from an organism available only in an external collection. The other entry can be used if you have access to the sequences in other_db.
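
As an illustration of priority-ordered selection, the sketch below picks one candidate per cluster according to the order given in candidate_selection. The data layout (pairs of sequence identifier and category) and the function `select_candidate` are assumptions made for this example; CAESAR's actual selection logic may take other criteria into account.

```python
def select_candidate(candidates, priority):
    """Pick the candidate whose category comes first in `priority`.

    `candidates` is a list of (sequence_id, category) pairs. Candidates
    whose category is not listed in `priority` are ignored.
    """
    rank = {category: index for index, category in enumerate(priority)}
    eligible = [c for c in candidates if c[1] in rank]
    if not eligible:
        return None
    # min() returns the first minimal item, so ties keep input order.
    return min(eligible, key=lambda c: rank[c[1]])


priority = ["strain_library", "order", "other"]
cluster = [("seqA", "other"), ("seqB", "order"), ("seqC", "strain_library")]
print(select_candidate(cluster, priority))  # ('seqC', 'strain_library')
```

Removing `other` from the priority list would exclude `seqA` from consideration entirely, which matches the idea that the entry is only useful when those sequences are accessible.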

The slurm, parallel and module keys are optional.

Usage

Start with blastp

python ./CAESAR/set_caesar.py blastp -q references_sequences.fasta -c config.yml

This command creates a bash script named run_caesar.sh, which is used to launch the pipeline:

bash ./run_caesar.sh

Start with hmmsearch

python ./CAESAR/set_caesar.py hmmsearch -q reference_profile.hmm -c config.yml

Then launch the pipeline:

bash ./run_caesar.sh

NB: for more details, see the Usages page on the Wiki