This pipeline, designed by team2-group1, predicts genes in the samples from team1 using several gene prediction tools and generates a merged result from their outputs.
python3
Latest Perl
bedtools
samtools
Latest Prodigal
Latest GeneMarkS-2
Latest Aragorn
Latest Barrnap
Latest Biopython (If running Bedtools)
bedtools and samtools are required for the union of GeneMarkS-2 and Prodigal results
All required tools need to be installed properly and added to $PATH
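A quick way to confirm the $PATH requirement before running the pipeline is sketched below. It is not part of the pipeline itself, and the executable names in REQUIRED_TOOLS (for example gms2.pl for GeneMarkS-2) are assumptions that may differ on your installation.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = [
    "python3", "perl", "bedtools", "samtools",
    "prodigal", "gms2.pl", "aragorn", "barrnap",
]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be located on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default shutil.which is sufficient.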
-f : Path to the input file directory (required)
-p : Run Prodigal, a prokaryotic mRNA gene prediction tool
-g : Run GeneMarkS-2, a prokaryotic mRNA gene prediction tool
-nc : Run Aragorn and Barrnap to predict tRNA/tmRNA and rRNA, respectively (optional)
-ncs : Separate Aragorn and Barrnap results into two distinct sets of nucleotide FASTA files
The default behavior still requires -f and runs both Prodigal and GeneMarkS-2, followed by bedtools.
bedtools runs whenever both Prodigal and GeneMarkS-2 are run, and produces a union folder combining the results of both tools.
Example usage: ./geneprediction_pipeline_t1.py -f <input_dir>
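The flag handling and default behavior described above can be sketched with argparse as follows. This is an illustration only, not the actual code of geneprediction_pipeline_t1.py; the function names and dest names are hypothetical, and the rule "neither -p nor -g means run both" is an assumption drawn from the description of the default behavior.

```python
import argparse


def build_parser():
    """Build a parser with the flags listed in this README (hypothetical dest names)."""
    parser = argparse.ArgumentParser(description="Gene prediction pipeline (sketch)")
    parser.add_argument("-f", dest="input_dir", required=True,
                        help="path to the input file directory")
    parser.add_argument("-p", dest="prodigal", action="store_true",
                        help="run Prodigal")
    parser.add_argument("-g", dest="genemark", action="store_true",
                        help="run GeneMarkS-2")
    parser.add_argument("-nc", dest="noncoding", action="store_true",
                        help="run Aragorn and Barrnap")
    parser.add_argument("-ncs", dest="noncoding_split", action="store_true",
                        help="keep Aragorn and Barrnap results separate")
    return parser


def resolve_tools(args):
    """Decide which tools run; assumes no -p/-g means both are run."""
    run_prodigal, run_genemark = args.prodigal, args.genemark
    if not (run_prodigal or run_genemark):
        run_prodigal = run_genemark = True
    return {
        "prodigal": run_prodigal,
        "genemark": run_genemark,
        # The union step only makes sense when both predictors ran.
        "bedtools_union": run_prodigal and run_genemark,
    }
```

For example, parsing `["-f", "genomes"]` with this sketch selects both predictors and the union step, while `["-f", "genomes", "-p"]` selects Prodigal alone.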
Results from Prodigal and GeneMarkS-2 run individually are written to their respective folders, ./prodigalresults or ./gms2results.
Output files are split into three folders: one each for GFF, fna, and faa files.
If Prodigal and GeneMarkS-2 are run in tandem, the combined output is also written to ./prodigal-genemark.
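To make the idea of a combined (union) result concrete, here is a minimal pure-Python sketch. The pipeline itself uses bedtools, and the exact union rules it applies are not described here, so this sketch makes a loud assumption: two predictions count as the same gene only when sequence ID, start, end, and strand match exactly. The function names are hypothetical.

```python
def read_gff_intervals(lines):
    """Collect (seqid, start, end, strand) for each feature line of a GFF file."""
    intervals = set()
    for line in lines:
        # Skip blank lines and GFF comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a well-formed feature line
        seqid, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        intervals.add((seqid, start, end, strand))
    return intervals


def union_predictions(prodigal_lines, genemark_lines):
    """Return the sorted union of predictions (assumption: exact-coordinate matching)."""
    merged = read_gff_intervals(prodigal_lines) | read_gff_intervals(genemark_lines)
    return sorted(merged)
```

A gene predicted identically by both tools appears once in the result, while genes predicted by only one tool are kept as they are.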
Aragorn and Barrnap results are joined by default into a single .fna file per assembly, located in ./arabarr.
The nucleotide and amino acid FASTA outputs can be validated by BLAST homology search as described below.
Version_5 database (required)
taxonomic_id list (required)
EDirect command-line utility (required)
Latest Perl (required)
python3 (required)
blast+ (required)
The validation step assumes that all requirements are installed and that the tools have been added to $PATH.
To download the database:
Use ./update_blastdb.pl --blastdb_version 5 --showall to list the available databases.
Use ./update_blastdb.pl --blastdb_version 5 [Database] --decompress to download one.
To get the taxonomy_idlist:
Use get_species_taxids.sh -n [organism]
For blastp (amino acid):
./blastp.py -d [queried_fold] -t [taxonomy_idlist] -o [outputfolder]
or blastx (DNA seqs):
./blastx.py -d [queried_fold] -t [taxonomy_idlist] -o [outputfolder]
For blastp.py or blastx.py:
-d : The folder that contains only the FASTA files you want to validate.
-t : The taxonomy_idlist for the specific organism.
-o : The output folder for your results.
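For orientation, a taxonomy-restricted BLAST+ search of the kind these options suggest can be assembled as shown below. This is a sketch, not the contents of blastp.py or blastx.py: the function name, the choice of tabular output (-outfmt 6), and the database name in the usage note are assumptions. The flags themselves (-query, -db, -taxidlist, -outfmt, -out) are standard BLAST+ options, and -taxidlist requires a version 5 database.

```python
def build_blast_command(program, query_fasta, database, taxid_list, output_file):
    """Build the argument list for one taxonomy-restricted BLAST+ search."""
    if program not in ("blastp", "blastx"):
        raise ValueError("program must be 'blastp' or 'blastx'")
    return [
        program,
        "-query", query_fasta,     # one FASTA file from the -d folder
        "-db", database,           # a version 5 BLAST database
        "-taxidlist", taxid_list,  # file produced by get_species_taxids.sh
        "-outfmt", "6",            # tabular output (an assumption of this sketch)
        "-out", output_file,
    ]
```

The resulting list can be passed to subprocess.run(command, check=True), for example with a hypothetical database name such as "nr".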
For validationP.py or validationX.py:
-s : The folder that contains only the FASTA files you want to validate.
-b : The folder that contains only the BLAST results for your FASTA files.
-o : The output folder for your results.
Your output folder will contain two folders:
knownprotein/ : FASTA files from which the sequences with no BLAST hit have been removed.
novelgene/ : FASTA files containing only the sequences with no BLAST hit.
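The split into sequences with and without BLAST hits can be sketched as follows. This is not the code of validationP.py or validationX.py; it assumes the BLAST results are in tabular format (-outfmt 6), where the first column is the query ID, and all function names are hypothetical.

```python
def read_hit_ids(blast_lines):
    """Return the set of query IDs that have at least one hit (assumes -outfmt 6)."""
    return {
        line.rstrip("\n").split("\t")[0]
        for line in blast_lines
        if line.strip() and not line.startswith("#")
    }


def parse_fasta(lines):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_hits(records, hit_ids):
    """Split records into (known, novel) by whether their ID had a BLAST hit."""
    known, novel = [], []
    for header, sequence in records:
        # The sequence ID is the header text up to the first whitespace.
        seq_id = (header.split() or [""])[0]
        (known if seq_id in hit_ids else novel).append((header, sequence))
    return known, novel
```

Records in the first list would go to knownprotein/ and records in the second to novelgene/.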