Input and Output Files

Data structure

Besides the input file, HaMStR-oneSeq needs the following three directories, which hold the data for every taxon included in your study:

genome_dir: Contains one sub-directory per taxon, each holding that taxon's proteome FASTA file. All taxa in this folder will be used for the ortholog search.
blast_dir: Contains sub-directories for the BLAST databases built from your proteomes with makeblastdb. Not every taxon in genome_dir needs a BLAST database; only taxa that should be included in the core ortholog group compilation must be present in this folder.
weight_dir: Contains feature annotation files for all taxa present in genome_dir and blast_dir. These annotation files are optional. However, to use all features of HaMStR-oneSeq, including the FAS score calculation, we recommend making them available.

HaMStR-oneSeq comes with pre-calculated data for 78 QFO species (data set 2019). If you want to work with other taxa, you can add them to HaMStR-oneSeq by following this instruction.

NOTE: you can rename genome_dir, blast_dir and weight_dir to anything you like and place them anywhere you want.

NOTE 2: we recommend checking your own data for validity before running HaMStR.
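
As a starting point for such a check, the following minimal sketch compares the taxa found in the three directories. It is not part of HaMStR-oneSeq. It assumes that each taxon is represented by a sub-directory with the same name in genome_dir and blast_dir, and that an annotation in weight_dir is either a sub-directory or a file whose name (without extension) equals the taxon name. The function names taxon_dirs and check_layout are hypothetical.

```python
from pathlib import Path


def taxon_dirs(root):
    """Return the names of the immediate sub-directories of root."""
    return {p.name for p in Path(root).iterdir() if p.is_dir()}


def check_layout(genome_dir, blast_dir, weight_dir):
    """Summarise which taxa are present in which of the three directories."""
    genome = taxon_dirs(genome_dir)
    blast = taxon_dirs(blast_dir)
    # Assumption: annotations are matched by name, either as a sub-directory
    # or as a file whose stem is the taxon name.
    weight = {p.stem if p.is_file() else p.name
              for p in Path(weight_dir).iterdir()}
    return {
        # Taxa used for the ortholog search.
        "search_taxa": sorted(genome),
        # Taxa available for the core ortholog group compilation.
        "core_taxa": sorted(blast),
        # Taxa that are searched but have no BLAST database (allowed).
        "search_only": sorted(genome - blast),
        # Taxa without feature annotation (FAS scores may be unavailable).
        "without_annotation": sorted((genome | blast) - weight),
    }
```

Calling check_layout("genome_dir", "blast_dir", "weight_dir") returns a dictionary of sorted taxon names; a non-empty "without_annotation" entry points to taxa for which the feature annotation is still missing.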

Input file

Input (or seed sequence) for HaMStR-oneSeq is a single FASTA file. For example:

>HUMAN@9606@3|P83876
MSYMLPHLHNGWQVDQAILSEEDRVVVIRFGHDWDPTCMKMDEVLYSIAEKVKNFAVIYL
VDITEVPDFNKMYELYDPCTVMFFFRNKHIMIDLGTGNNNKINWAMEDKQEMVDIIETVY
RGARKGRGLVVSPKDYSTKYRY

The taxon of this seed sequence, which is called the reference taxon and is specified with the option -refspec, must be present in the BLAST database directory (blast_dir) of HaMStR-oneSeq.
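
Both requirements can be checked before starting a run. The sketch below is only an illustration and not part of HaMStR-oneSeq: read_fasta is a plain FASTA parser, and refspec_available assumes that the reference taxon appears in blast_dir as a sub-directory named exactly like the value passed to -refspec. Both function names are hypothetical.

```python
from pathlib import Path


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def refspec_available(refspec, blast_dir):
    """Check that blast_dir holds a sub-directory named after refspec.

    Assumption: one sub-directory per taxon, named like the -refspec value.
    """
    return (Path(blast_dir) / refspec).is_dir()
```

For the seed file shown above, read_fasta returns one record whose header is HUMAN@9606@3|P83876 and whose sequence is the three lines joined together.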

Output files

For one seed sequence, the HaMStR-oneSeq output consists of the following text files (note: test is the job name you defined with the -seqName parameter):

test.extended.fa: a multiple FASTA file containing the seed and its ortholog sequences
test.phyloprofile: an input file for analysing the phylogenetic profile of the query gene using the PhyloProfile tool
test_forward.domains and, optionally, test_reverse.domains: protein domain annotation files for all the sequences present in the orthologous group. The _forward or _reverse suffix indicates the direction of the feature architecture comparison (FAS): _forward means that the query gene is used as seed and its orthologs as targets for the comparison, while _reverse means the opposite. These files can be submitted to PhyloProfile for visualisation.
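
The naming scheme above can be used to verify that a run has finished completely. The following sketch is an illustration only: it assumes that all output files of a job are written to one directory (here the hypothetical argument out_dir), and the function names expected_outputs and missing_outputs are not part of HaMStR-oneSeq.

```python
from pathlib import Path


def expected_outputs(out_dir, seq_name):
    """Map each output type to the path it should have for job seq_name."""
    out_dir = Path(out_dir)
    return {
        "extended_fasta": out_dir / f"{seq_name}.extended.fa",
        "phyloprofile": out_dir / f"{seq_name}.phyloprofile",
        "forward_domains": out_dir / f"{seq_name}_forward.domains",
        # The reverse domain file is optional.
        "reverse_domains": out_dir / f"{seq_name}_reverse.domains",
    }


def missing_outputs(out_dir, seq_name, optional=("reverse_domains",)):
    """Return the names of the required outputs that do not exist yet."""
    return sorted(
        key
        for key, path in expected_outputs(out_dir, seq_name).items()
        if key not in optional and not path.is_file()
    )
```

For a job named test, missing_outputs(out_dir, "test") returns an empty list once test.extended.fa, test.phyloprofile and test_forward.domains are all present.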

Phylogenetic profile analysis using PhyloProfile

For a rich visualisation of the information in the HaMStR-oneSeq outputs, you can load them into the PhyloProfile tool.

The main input file for PhyloProfile is test.phyloprofile, which contains a list of all orthologous gene names and the taxonomy IDs of their taxa, together with the FAS scores (if available). To analyse additional information such as the FASTA sequences or the domain annotations, you can optionally provide test.extended.fa and test_forward.domains (or test_reverse.domains) to PhyloProfile.

You can combine multiple HaMStR runs into a single phylogenetic profile input for data visualisation and data exploration:

python bin/visuals/mergePhyloprofileData.py /path/to/hamstr/output/directory /path/output/outName

in which /path/to/hamstr/output/directory is the directory containing all the individual *.phyloprofile, *.domains and *.extended.fa files.

The resulting files /path/output/outName.phyloprofile, /path/output/outName.extended.fa, /path/output/outName_forward.domains and /path/output/outName_backward.domains can then be uploaded to the PhyloProfile tool for further investigation.