Input and Output Files
Besides the input file, HaMStR-oneSeq needs the following three directories, which together cover every taxon included in your study:
- genome_dir: Contains sub-directories with the proteome FASTA files for each taxon. All taxa in this folder will be used for the ortholog search.
- blast_dir: Contains sub-directories with the BLAST databases made with makeblastdb from your proteomes. Not every taxon in genome_dir needs a BLAST database; only taxa that should be included in the core ortholog group compilation must be present in this folder.
- weight_dir: Contains feature annotation files for all taxa present in genome_dir and blast_dir. These annotation files are optional. However, to use all features of HaMStR-oneSeq, including the FAS score calculation, we recommend having these data available.
HaMStR-oneSeq comes with pre-calculated data for 78 QFO species (data set 2019). If you want to work with other taxa, you can add them to HaMStR-oneSeq by following these instructions.
NOTE: you can rename genome_dir, blast_dir and weight_dir as you like and place them anywhere you want.
NOTE 2: we recommend checking your own data for validity before running HaMStR.
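As a first validity check, the three directories can be compared against each other. The following is a minimal sketch, not part of HaMStR-oneSeq: the function names are hypothetical, and it assumes that each taxon's sub-directory carries the same name in genome_dir and blast_dir, and that each entry in weight_dir is named after its taxon (any file extension is ignored).

```python
from pathlib import Path


def taxon_dirs(directory):
    """Names of the sub-directories (assumed: one per taxon) in `directory`."""
    return {p.name for p in Path(directory).iterdir() if p.is_dir()}


def check_layout(genome_dir, blast_dir, weight_dir):
    """Compare the taxa found in the three data directories."""
    genome = taxon_dirs(genome_dir)
    blast = taxon_dirs(blast_dir)
    # Assumption: each annotation entry is named after its taxon; a file
    # extension, if present, is dropped by taking the stem.
    annotated = {p.stem for p in Path(weight_dir).iterdir()}
    return {
        # Taxa that are searched for orthologs but have no BLAST database.
        "search_only": sorted(genome - blast),
        # Taxa that have both a proteome and a BLAST database.
        "core_candidates": sorted(genome & blast),
        # Taxa for which no feature annotation was found.
        "without_annotation": sorted((genome | blast) - annotated),
    }
```

Calling check_layout("genome_dir", "blast_dir", "weight_dir") returns three sorted lists; a non-empty "without_annotation" list points to taxa for which FAS scores may not be computable.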
Input (or seed sequence) for HaMStR-oneSeq is a single FASTA file. For example:
>HUMAN@9606@3|P83876
MSYMLPHLHNGWQVDQAILSEEDRVVVIRFGHDWDPTCMKMDEVLYSIAEKVKNFAVIYL
VDITEVPDFNKMYELYDPCTVMFFFRNKHIMIDLGTGNNNKINWAMEDKQEMVDIIETVY
RGARKGRGLVVSPKDYSTKYRY
The taxon of this seed sequence, called the reference taxon and specified with the option -refspec, must be present in the BLAST database directory (blast_dir) of HaMStR-oneSeq.
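The seed requirements above can be checked before a run. The sketch below is illustrative only and uses hypothetical function names. It assumes that the seed file holds exactly one sequence and that the reference taxon appears in blast_dir as a sub-directory with the same name as the value passed to -refspec; neither assumption is confirmed by the text above.

```python
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_seed(fasta_text, blast_dir, refspec):
    """Return a list of problems with the seed input; empty if none found."""
    problems = []
    records = read_fasta(fasta_text)
    if len(records) != 1:
        problems.append(f"expected 1 seed sequence, found {len(records)}")
    elif not records[0][1]:
        problems.append("seed sequence is empty")
    # Assumption: the reference taxon is a sub-directory of blast_dir.
    if not (Path(blast_dir) / refspec).is_dir():
        problems.append(f"{refspec} not found in {blast_dir}")
    return problems
```

An empty list from check_seed means that none of these basic problems was detected; it does not guarantee that the BLAST database itself is intact.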
For one seed sequence, the HaMStR-oneSeq output consists of these text files (note: test is the job name you define with the -seqName parameter):
- test.extended.fa: a multi-FASTA file containing the seed and its ortholog sequences
- test.phyloprofile: an input file for analysing the phylogenetic profile of the query gene with the PhyloProfile tool
- test_forward.domains and, optionally, test_reverse.domains: protein domain annotation files for all sequences in the orthologous group. The _forward or _reverse suffix indicates the direction of the feature architecture comparison (FAS): _forward means that the query gene is used as seed and its orthologs as targets for the comparison, while _reverse is the other way around. These files can be submitted to PhyloProfile for visualisation.
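The file names above follow directly from the job name, so a finished run can be checked for completeness. This is a minimal sketch with hypothetical function names; it assumes that all output files are written to a single output directory, which the text above does not state.

```python
from pathlib import Path


def expected_outputs(out_dir, seq_name):
    """Return (required, optional) output paths for the job `seq_name`."""
    out = Path(out_dir)
    required = [
        out / f"{seq_name}.extended.fa",
        out / f"{seq_name}.phyloprofile",
        out / f"{seq_name}_forward.domains",
    ]
    # The reverse FAS direction is optional, so its file may be absent.
    optional = [out / f"{seq_name}_reverse.domains"]
    return required, optional


def missing_outputs(out_dir, seq_name):
    """Names of required output files that do not exist in `out_dir`."""
    required, _ = expected_outputs(out_dir, seq_name)
    return [p.name for p in required if not p.is_file()]
```

For a job named test, missing_outputs("/path/to/output", "test") returns an empty list once all three required files are present.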
For a rich visualisation of the information in the HaMStR-oneSeq outputs, you can load them into the PhyloProfile tool.
The main input file for PhyloProfile is test.phyloprofile, which contains a list of all orthologous gene names and the taxonomy IDs of their taxa, together with the FAS scores (if available). To analyse further information such as the FASTA sequences or the domain annotations, you can optionally also load test.extended.fa and test_forward.domains (or test_reverse.domains) into PhyloProfile.
You can combine multiple HaMStR runs into a single phylogenetic profile input for data visualisation and data exploration:
python bin/visuals/mergePhyloprofileData.py /path/to/hamstr/output/directory /path/output/outName
in which /path/to/hamstr/output/directory is a directory containing all the individual *.phyloprofile, *.domains and *.extended.fa files.
The resulting files /path/output/outName.phyloprofile, /path/output/outName.extended.fa, /path/output/outName_forward.domains and /path/output/outName_backward.domains can then be uploaded to the PhyloProfile tool for further investigation.
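Before merging, it can help to confirm that the input directory actually holds the three kinds of files the merge step looks for. The sketch below only lists files by the patterns named above; it does not reproduce what mergePhyloprofileData.py does internally, and the function name is hypothetical.

```python
from pathlib import Path

# File patterns named in the description of the merge step.
PATTERNS = ("*.phyloprofile", "*.domains", "*.extended.fa")


def collect_inputs(directory):
    """Map each pattern to the sorted names of matching files in `directory`."""
    base = Path(directory)
    return {pattern: sorted(p.name for p in base.glob(pattern)) for pattern in PATTERNS}
```

A pattern that maps to an empty list indicates that no run in that directory produced the corresponding file type.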