diff --git a/docs/about/index.rst b/docs/about/index.rst index 002f45c..bb6fa7d 100644 --- a/docs/about/index.rst +++ b/docs/about/index.rst @@ -3,7 +3,7 @@ About ^^^^^ -**Ortho**\logy inference using **H**\idden **M**\narkov **M**\rodels (**OrthoHMM**) was developed as +**Ortho**\logy inference using **H**\idden **M**\arkov **M**\odels (**OrthoHMM**) was developed as part of `Jacob L. Steenwyk `_'s post-doctoral work. Inferring orthology (that is, genes that have shared ancestry) is a major challenge in bioinformatics. diff --git a/docs/advanced/index.rst b/docs/advanced/index.rst index 671b130..7b5141f 100644 --- a/docs/advanced/index.rst +++ b/docs/advanced/index.rst @@ -9,22 +9,20 @@ All OrthoHMM outputs have the prefix *orthohmm* so that they are easy to find. - orthohmm_gene_count.txt - A gene count matrix per taxa for each orthogroup. -| - orthohmm_orthogroups.txt - Genes present in each orthogroup. -| - orthohmm_single_copy_orthogroups.txt - A single-column list of single-copy orthologs. -| - orthohmm_orthogroups - A directory of FASTA files wherein each file is an orthogroup. -| - orthohmm_single_copy_orthogroups - A directory of FASTA files wherein each file is a single-copy ortholog. - Headers are modified to have taxon names come before the gene identifier. - Taxon names are the file name excluding the extension. - Taxon name and gene identifier are separated by a pipe symbol "|". - This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names. +- orthohmm_working_res + - A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges. ^^^^^ @@ -32,10 +30,13 @@ This remaining sections describe the various features and options of OrthoHMM. 
- `Output directory`_ - Phmmer_ +- `Substitution matrix`_ - CPU_ - `Single-copy Threshold`_ - MCL_ - `Inflation Value`_ +- `Stop`_ +- `Start`_ - `All options`_ | @@ -44,7 +45,6 @@ This remaining sections describe the various features and options of OrthoHMM. Output directory ---------------- - Output directory name to store OrthoHMM results. This directory should already exist. By default, results files will be written to the same directory as the input directory of FASTA files. (-o, --output_directory) @@ -60,7 +60,6 @@ directory of FASTA files. (-o, --output_directory) Phmmer ------ - Path to phmmer executable from HMMER suite. By default, phmmer is assumed to be in the PATH variable; in other words, phmmer can be evoked by typing `phmmer`. @@ -72,6 +71,22 @@ can be evoked by typing `phmmer`. | +.. _`Substitution matrix`: + +Substitution matrix +------------------- +Residue alignment probabilities will be determined from the +specified substitution matrix. Supported substitution matrices +include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, +PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62. + +.. code-block:: shell + + # specify using the BLOSUM80 substitution matrix + orthohmm -x BLOSUM80 + +| + .. _CPU: CPU @@ -91,7 +106,6 @@ By default, the number of CPUs available will be auto-detected. Single-copy Threshold --------------------- - Taxon occupancy threshold when identifying single-copy orthologs. By default, the threshold is 50% taxon occupancy, which is specified as a fraction - that is, 0.5. @@ -107,7 +121,6 @@ as a fraction - that is, 0.5. MCL --- - Path to mcl executable from MCL software. By default, mcl is assumed to be in the PATH variable; in other words, mcl can be evoked by typing `mcl`. @@ -124,7 +137,6 @@ mcl can be evoked by typing `mcl`. Inflation Value --------------- - MCL inflation parameter for clustering genes into orthologous groups. Lower values are more permissive resulting in larger orthogroups. 
Higher values are stricter resulting in smaller orthogroups. @@ -137,6 +149,45 @@ The default value is 1.5. | + +.. _Stop: + +Stop +---- +Similar to other ortholog calling algorithms, different steps in the +OrthoHMM workflow can be CPU or memory intensive. Thus, users may +want to stop OrthoHMM at certain steps to facilitate more +practical resource allocation. There are three choices for when to +stop the analysis: prepare, infer, and write. + +* prepare: Stop after preparing input files for the all-by-all search +* infer: Stop after inferring the orthogroups +* write: Stop after writing sequence files for the orthogroups. Currently, this is synonymous with not specifying a step to stop the analysis at. + +.. code-block:: shell + + # stop orthohmm after preparing the all-by-all search commands + orthohmm --stop prepare + +| + +.. _Start: + +Start +----- +Start analysis from a specific intermediate step. Currently, this +can only be applied to the results from the all-by-all search. + +* search_res: Start analysis from all-by-all search results. + +.. code-block:: shell + + # start orthohmm from after the all-by-all searches are complete + orthohmm --start search_res + +| + + .. _`All options`: All options @@ -154,7 +205,9 @@ All options +------------------------------+--------------------------------------------------------------------------------+ | -p/\-\-phhmer | Path to phmmer from HMMER suite. Default: phmmer | +------------------------------+--------------------------------------------------------------------------------+ -| -c\-\-cpu | Number of parallel CPU workers to use for multithreading. Default: auto detect | +| -x/\-\-substitution_matrix | Specify substitution matrix to use for generating the HMMs. Default: BLOSUM62 | ++------------------------------+--------------------------------------------------------------------------------+ +| -c/\-\-cpu | Number of parallel CPU workers to use for multithreading. 
Default: auto detect | +------------------------------+--------------------------------------------------------------------------------+ | -s/\-\-single_copy_threshold | Taxon occupancy threshold for single-copy orthologs. Default 0.5 | +------------------------------+--------------------------------------------------------------------------------+ @@ -162,4 +215,7 @@ All options +------------------------------+--------------------------------------------------------------------------------+ | -i/\-\-inflation_value | MCL inflation parameter. Default: 1.5 | +------------------------------+--------------------------------------------------------------------------------+ - +| \-\-stop | Stop OrthoHMM run at a specific step. Default: None | ++------------------------------+--------------------------------------------------------------------------------+ +| \-\-start | Start OrthoHMM run at a specific step. Default: None | ++------------------------------+--------------------------------------------------------------------------------+ diff --git a/orthohmm/parser.py b/orthohmm/parser.py index 59ea198..18ed13b 100644 --- a/orthohmm/parser.py +++ b/orthohmm/parser.py @@ -115,7 +115,7 @@ def create_parser() -> ArgumentParser: Residue alignment probabilities will be determined from the specified substitution matrix. Supported substitution matrices include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, - PAM30, PAM70, PAM120, and PAM240. + PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62. CPU (-c, --cpu) Number of CPU workers for multithreading during sequence search.