diff --git a/docs/about/index.rst b/docs/about/index.rst
index 002f45c..bb6fa7d 100644
--- a/docs/about/index.rst
+++ b/docs/about/index.rst
@@ -3,7 +3,7 @@ About
^^^^^
-**Ortho**\logy inference using **H**\idden **M**\narkov **M**\rodels (**OrthoHMM**) was developed as
+**Ortho**\logy inference using **H**\idden **M**\arkov **M**\odels (**OrthoHMM**) was developed as
part of `Jacob L. Steenwyk `_'s post-doctoral work.
Inferring orthology (that is, genes that have shared ancestry) is a major challenge in bioinformatics.
diff --git a/docs/advanced/index.rst b/docs/advanced/index.rst
index 671b130..7b5141f 100644
--- a/docs/advanced/index.rst
+++ b/docs/advanced/index.rst
@@ -9,22 +9,20 @@ All OrthoHMM outputs have the prefix *orthohmm* so that they are easy to find.
- orthohmm_gene_count.txt
- A gene count matrix per taxa for each orthogroup.
-|
- orthohmm_orthogroups.txt
- Genes present in each orthogroup.
-|
- orthohmm_single_copy_orthogroups.txt
- A single-column list of single-copy orthologs.
-|
- orthohmm_orthogroups
- A directory of FASTA files wherein each file is an orthogroup.
-|
- orthohmm_single_copy_orthogroups
- A directory of FASTA files wherein each file is a single-copy ortholog.
- Headers are modified to have taxon names come before the gene identifier.
- Taxon names are the file name excluding the extension.
- Taxon name and gene identifier are separated by a pipe symbol "|".
- This aims to help streamline phylogenomic workflows wherein sequences will be concatenated downstream based on taxon names.
+- orthohmm_working_res
+ - A directory of intermediate files, such as all-by-all search results, network edges, and clusters inferred from network edges.
^^^^^
@@ -32,10 +30,13 @@ This remaining sections describe the various features and options of OrthoHMM.
- `Output directory`_
- Phmmer_
+- `Substitution matrix`_
- CPU_
- `Single-copy Threshold`_
- MCL_
- `Inflation Value`_
+- `Stop`_
+- `Start`_
- `All options`_
|
@@ -44,7 +45,6 @@ This remaining sections describe the various features and options of OrthoHMM.
Output directory
----------------
-
Output directory name to store OrthoHMM results. This directory should already exist.
By default, results files will be written to the same directory as the input
directory of FASTA files. (-o, --output_directory)
@@ -60,7 +60,6 @@ directory of FASTA files. (-o, --output_directory)
Phmmer
------
-
Path to phmmer executable from HMMER suite. By default, phmmer
is assumed to be in the PATH variable; in other words, phmmer
can be evoked by typing `phmmer`.
@@ -72,6 +71,22 @@ can be evoked by typing `phmmer`.
|
+.. _`Substitution matrix`:
+
+Substitution matrix
+-------------------
+Residue alignment probabilities will be determined from the
+specified substitution matrix. Supported substitution matrices
+include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90,
+PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.
+
+.. code-block:: shell
+
+ # specify using the BLOSUM80 substitution matrix
+ orthohmm -x BLOSUM80
+
+|
+
.. _CPU:
CPU
@@ -91,7 +106,6 @@ By default, the number of CPUs available will be auto-detected.
Single-copy Threshold
---------------------
-
Taxon occupancy threshold when identifying single-copy orthologs.
By default, the threshold is 50% taxon occupancy, which is specified
as a fraction - that is, 0.5.
@@ -107,7 +121,6 @@ as a fraction - that is, 0.5.
MCL
---
-
Path to mcl executable from MCL software. By default, mcl
is assumed to be in the PATH variable; in other words,
mcl can be evoked by typing `mcl`.
@@ -124,7 +137,6 @@ mcl can be evoked by typing `mcl`.
Inflation Value
---------------
-
MCL inflation parameter for clustering genes into orthologous groups.
Lower values are more permissive resulting in larger orthogroups.
Higher values are stricter resulting in smaller orthogroups.
@@ -137,6 +149,45 @@ The default value is 1.5.
|
+
+.. _Stop:
+
+Stop
+----
+Similar to other ortholog calling algorithms, different steps in the
+OrthoHMM workflow can be CPU or memory intensive. Thus, users may
+want to stop OrthoHMM at certain steps to facilitate more
+practical resource allocation. There are three choices for when to
+stop the analysis: prepare, infer, and write.
+
+* prepare: Stop after preparing input files for the all-by-all search.
+* infer: Stop after inferring the orthogroups.
+* write: Stop after writing sequence files for the orthogroups. Currently, this is equivalent to running OrthoHMM without specifying a stop step.
+
+.. code-block:: shell
+
+ # stop orthohmm after preparing the all-by-all search commands
+ orthohmm --stop prepare
+
+|
+
+.. _Start:
+
+Start
+-----
+Start analysis from a specific intermediate step. Currently, this
+can only be applied to the results from the all-by-all search.
+
+* search_res: Start analysis from all-by-all search results.
+
+.. code-block:: shell
+
+ # start orthohmm from completed all-by-all search results
+ orthohmm --start search_res
+
+|
+
+
.. _`All options`:
All options
@@ -154,7 +205,9 @@ All options
+------------------------------+--------------------------------------------------------------------------------+
| -p/\-\-phhmer | Path to phmmer from HMMER suite. Default: phmmer |
+------------------------------+--------------------------------------------------------------------------------+
-| -c\-\-cpu | Number of parallel CPU workers to use for multithreading. Default: auto detect |
+| -x/\-\-substitution_matrix | Specify substitution matrix to use for generating the HMMs. Default: BLOSUM62 |
++------------------------------+--------------------------------------------------------------------------------+
+| -c/\-\-cpu | Number of parallel CPU workers to use for multithreading. Default: auto detect |
+------------------------------+--------------------------------------------------------------------------------+
| -s/\-\-single_copy_threshold | Taxon occupancy threshold for single-copy orthologs. Default 0.5 |
+------------------------------+--------------------------------------------------------------------------------+
@@ -162,4 +215,7 @@ All options
+------------------------------+--------------------------------------------------------------------------------+
| -i/\-\-inflation_value | MCL inflation parameter. Default: 1.5 |
+------------------------------+--------------------------------------------------------------------------------+
-
+| \-\-stop | Stop OrthoHMM run at a specific step. Default: None |
++------------------------------+--------------------------------------------------------------------------------+
+| \-\-start | Start OrthoHMM run at a specific step. Default: None |
++------------------------------+--------------------------------------------------------------------------------+
diff --git a/orthohmm/parser.py b/orthohmm/parser.py
index 59ea198..18ed13b 100644
--- a/orthohmm/parser.py
+++ b/orthohmm/parser.py
@@ -115,7 +115,7 @@ def create_parser() -> ArgumentParser:
Residue alignment probabilities will be determined from the
specified substitution matrix. Supported substitution matrices
include: BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90,
- PAM30, PAM70, PAM120, and PAM240.
+ PAM30, PAM70, PAM120, and PAM240. The default is BLOSUM62.
CPU (-c, --cpu)
Number of CPU workers for multithreading during sequence search.