[MRG] Build a LineageDB interface for taxonomy databases/information (

#1651) * start making a LineageDB * further abstract taxonomy loading * replace rest of taxonomy parsing * get some basic sqlite stuff going * refactoring, simplification, optimization * fix a few things, alert on bad arg combination * add combine_tax CLI command under tax * rename 'combine_tax' to 'prepare' * allow output database type for tax prepare * format switching now works for sql and csv * adjust loading to try/fail rather than suffix * refactor taxonomy load * fix tests * basic tests for in/out formats * add tests for trying to load bad files * raise appropriate error message * fix with pytest.raises foo * some more tests * more tests, fix bug * make available_ranks work with sqlite db * re-add test for available_ranks * produce more useful errors for dups, and restore test code * catch exception for database already exists * test split idents and keep versions * align keyword args with CLI args * add more end-to-end tests * alias --taxonomy-csv to --taxonomy * 'prepare' docs * clean up
sourmash-bio · Jul 8, 2021 · eeb1874 · eeb1874
1 parent 512d961
commit eeb1874
Show file tree

Hide file tree

Showing 14 changed files with 1,003 additions and 197 deletions.
diff --git a/doc/command-line.md b/doc/command-line.md
@@ -425,11 +425,12 @@ signatures, rather than all the signatures in the database.
 
 The sourmash `tax` or `taxonomy` commands integrate taxonomic
  information into the results of `sourmash gather`. All `tax` commands
- require one or more properly formatted `taxonomy` csv files that correspond to
- the database(s) used for `gather`. Note that if using multiple databases,
- the `gather` needs to have been conducted against all desired databases
- within the same `gather` command (we cannot combine separate `gather` runs
- for the same query). For supported databases (e.g. GTDB, NCBI), we provide
+ require one or more properly formatted `taxonomy` files where the
+ identifiers correspond to those in the database(s) used for
+ `gather`. Note that if using multiple databases, the `gather` needs
+ to have been conducted against all desired databases within the same
+ `gather` command (we cannot combine separate `gather` runs for the
+ same query). For supported databases (e.g. GTDB, NCBI), we provide
  taxonomy csv files, but they can also be generated for user-generated
  databases. For more information, see [databases](databases.md).
 
@@ -454,8 +455,8 @@ As with all reference-based analysis, results can be affected by the
  results from `gather` minimizes issues associated with increasing size
  and redundancy of reference databases.
 
-For more on how `gather` works and can be used to classify signatures, see
- [classifying-signatures](classifying-signatures.md).
+For more details on how `gather` works and can be used to classify
+ signatures, see [classifying-signatures](classifying-signatures.md).
 
 
 ### `sourmash tax metagenome` - summarize metagenome content from `gather` results
@@ -469,7 +470,7 @@ example command to summarize a single `gather csv`, where the query was gathered
 ```
 sourmash tax metagenome
     --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+    --taxonomy gtdb-rs202.taxonomy.v2.csv
 ```
 
 There are three possible output formats, `csv_summary`, `lineage_summary`, and
@@ -517,7 +518,7 @@ To generate `krona`, we add `--output-format krona` to the command above, and
 ```
 sourmash tax metagenome
     --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --taxonomy gtdb-rs202.taxonomy.v2.csv \
     --output-format krona --rank species
 ```
 
@@ -545,7 +546,7 @@ To generate `lineage_summary`, we add `--output-format lineage_summary` to the s
 sourmash tax metagenome
     --gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
     --gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --taxonomy gtdb-rs202.taxonomy.v2.csv \
     --output-format krona --rank species
 ```
 
@@ -602,7 +603,7 @@ We can use `tax genome` on this gather csv to classify our "Sb47+63" mixed-strai
 ```
 sourmash tax genome
     --gather-csv 47+63_x_gtdb-rs202.gather.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+    --taxonomy gtdb-rs202.taxonomy.v2.csv
 ```
 > This command uses the default classification strategy, which uses a
 containment threshold of 0.1 (10%).
@@ -649,7 +650,7 @@ To generate `krona`, we must classify by `--rank` instead of using the
 ```
 sourmash tax genome
     --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
+    --taxonomy gtdb-rs202.taxonomy.v2.csv \
     --output-format krona --rank species
 ```
 > Note that specifying `--rank` forces classification by rank rather than
@@ -684,10 +685,35 @@ By default, `annotate` uses the name of each input gather csv to write an update
 ```
 sourmash tax annotate
     --gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
-    --taxonomy-csv gtdb-rs202.taxonomy.v2.csv
+    --taxonomy gtdb-rs202.taxonomy.v2.csv
 ```
 > This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`
 
+### `sourmash tax prepare` - prepare and/or combine taxonomy files
+
+All `sourmash tax` commands must be given one or more taxonomy files as
+parameters to the `--taxonomy` argument. These files can be either CSV
+files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
+are much faster for large taxonomies, while CSV files are easier to view
+and modify using spreadsheet software.
+
+`sourmash tax prepare` is a utility function that can ingest and validate
+multiple CSV files or sqlite3 databases, and output a CSV file or a sqlite3
+database. It can be used to combine multiple taxonomies into a single file,
+as well as change formats between CSV and sqlite3.
+
+The following command will take in two taxonomy files and combine them into
+a single taxonomy sqlite database.
+
+```
+sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
+```
+
+Input databases formats can be mixed and matched, and the output format
+can be set to CSV like so:
+```
+sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
+```
 
 ## `sourmash lca` subcommands for in-memory taxonomy integration
 
@@ -1216,9 +1242,10 @@ Briefly,
   signature files using `--query-from-file` (see below).
 
 * `index` and `lca index` take a few fixed parameters (database name,
-  taxonomy spreadsheet) and then an arbitrary number of other files
-  that contain signatures, including files, directories, and indexed
-  databases. These commands will also take `--from-file` (see below).
+  and for `lca index`, a taxonomy file) and then an arbitrary number of
+  other files that contain signatures, including files, directories,
+  and indexed databases. These commands will also take `--from-file`
+  (see below).
 
 None of these commands currently support searching, comparing, or indexing
 signatures with multiple ksizes or moltypes at the same time; you need

diff --git a/src/sourmash/cli/tax/__init__.py b/src/sourmash/cli/tax/__init__.py
@@ -7,6 +7,7 @@
 from . import metagenome
 from . import genome
 from . import annotate
+from . import prepare
 from ..utils import command_list
 from argparse import SUPPRESS, RawDescriptionHelpFormatter
 import os

diff --git a/src/sourmash/cli/tax/annotate.py b/src/sourmash/cli/tax/annotate.py
@@ -35,7 +35,7 @@ def subparser(subparsers):
         help='suppress non-error output'
     )
     subparser.add_argument(
-        '-t', '--taxonomy-csv', metavar='FILE',
+        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
         nargs="+", required=True,
         help='database lineages CSV'
     )

diff --git a/src/sourmash/cli/tax/genome.py b/src/sourmash/cli/tax/genome.py
@@ -50,7 +50,7 @@ def subparser(subparsers):
         help='suppress non-error output'
     )
     subparser.add_argument(
-        '-t', '--taxonomy-csv',  metavar='FILE',
+        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
         nargs='+', required=True,
         help='database lineages CSV'
     )

diff --git a/src/sourmash/cli/tax/metagenome.py b/src/sourmash/cli/tax/metagenome.py
@@ -48,7 +48,7 @@ def subparser(subparsers):
         help='directory for output files'
     )
     subparser.add_argument(
-        '-t', '--taxonomy-csv', metavar='FILE',
+        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
         nargs='+', required=True,
         help='database lineages CSV'
     )

diff --git a/src/sourmash/cli/tax/prepare.py b/src/sourmash/cli/tax/prepare.py
@@ -0,0 +1,60 @@
+"""combine multiple taxonomy databases into one."""
+
+usage="""
+
+    sourmash tax prepare --taxonomy-csv <taxonomy_file> [ ... ] -o <output>
+
+The 'tax prepare' command reads in one or more taxonomy databases
+and saves them into a new database. It can be used to combine databases
+in the desired order, as well as output different database formats.
+
+Please see the 'tax prepare' documentation for more details:
+  https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-prepare-prepare-and-or-combine-taxonomy-files
+"""
+
+import sourmash
+from sourmash.logging import notify, print_results, error
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('prepare',
+                                      usage=usage)
+    subparser.add_argument(
+        '-q', '--quiet', action='store_true',
+        help='suppress non-error output'
+    )
+    subparser.add_argument(
+        '-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
+        nargs="+", required=True,
+        help='database lineages'
+    )
+    subparser.add_argument(
+        '-o', '--output', required=True,
+        help='output file',
+    )
+    subparser.add_argument(
+        '-F', '--database-format',
+        help="format of output file; default is 'sql')",
+        default='sql',
+        choices=['csv', 'sql'],
+    )
+    subparser.add_argument(
+        '--keep-full-identifiers', action='store_true',
+        help='do not split identifiers on whitespace'
+    )
+    subparser.add_argument(
+        '--keep-identifier-versions', action='store_true',
+        help='after splitting identifiers, do not remove accession versions'
+    )
+    subparser.add_argument(
+        '--fail-on-missing-taxonomy', action='store_true',
+        help='fail quickly if taxonomy is not available for an identifier',
+    )
+    subparser.add_argument(
+        '-f', '--force', action = 'store_true',
+        help='continue past errors in file and taxonomy loading',
+    )
+
+def main(args):
+    import sourmash
+    return sourmash.tax.__main__.prepare(args)
diff --git a/src/sourmash/tax/__main__.py b/src/sourmash/tax/__main__.py
@@ -12,7 +12,7 @@
 from sourmash.lca.lca_utils import display_lineage
 
 from . import tax_utils
-from .tax_utils import ClassificationResult
+from .tax_utils import ClassificationResult, MultiLineageDB
 
 usage='''
 sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
@@ -61,22 +61,16 @@ def metagenome(args):
     set_quiet(args.quiet)
 
     # first, load taxonomic_assignments
-    tax_assign = {}
-    available_ranks = set()
-    for tax_csv in args.taxonomy_csv:
-
-        try:
-            this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
-                                              keep_identifier_versions = args.keep_identifier_versions,
-                                              force=args.force)
-            # maybe check for overlapping tax assignments? currently, later ones will override earlier ones
-            tax_assign.update(this_tax_assign)
-            available_ranks.update(set(avail_ranks))
-
-        except ValueError as exc:
-            error(f"ERROR: {str(exc)}")
-            sys.exit(-1)
-
+    try:
+        tax_assign = MultiLineageDB.load(args.taxonomy_csv,
+                       keep_full_identifiers=args.keep_full_identifiers,
+                       keep_identifier_versions=args.keep_identifier_versions,
+                       force=args.force)
+        available_ranks = tax_assign.available_ranks
+    except ValueError as exc:
+        error(f"ERROR: {str(exc)}")
+        sys.exit(-1)
+
     if not tax_assign:
         error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
         sys.exit(-1)
@@ -103,7 +97,7 @@ def metagenome(args):
     seen_perfect = set()
     for rank in sourmash.lca.taxlist(include_strain=False):
         summarized_gather[rank], seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
-                                                                split_identifiers=not args.keep_full_identifiers,
+                                                                keep_full_identifiers=args.keep_full_identifiers,
                                                                 keep_identifier_versions = args.keep_identifier_versions,
                                                                 seen_perfect = seen_perfect)
 
@@ -139,20 +133,16 @@ def genome(args):
     set_quiet(args.quiet)
 
     # first, load taxonomic_assignments
-    tax_assign = {}
-    available_ranks = set()
-    for tax_csv in args.taxonomy_csv:
-
-        try:
-            this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
-                                              keep_identifier_versions = args.keep_identifier_versions,
-                                              force=args.force)
-            # maybe check for overlapping tax assignments? currently later ones will override earlier ones
-            tax_assign.update(this_tax_assign)
-            available_ranks.update(set(avail_ranks))
-        except ValueError as exc:
-            error(f"ERROR: {str(exc)}")
-
+    try:
+        tax_assign = MultiLineageDB.load(args.taxonomy_csv,
+                       keep_full_identifiers=args.keep_full_identifiers,
+                       keep_identifier_versions=args.keep_identifier_versions,
+                       force=args.force)
+        available_ranks = tax_assign.available_ranks
+    except ValueError as exc:
+        error(f"ERROR: {str(exc)}")
+        sys.exit(-1)
+
     if not tax_assign:
         error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
         sys.exit(-1)
@@ -184,7 +174,7 @@ def genome(args):
     # if --rank is specified, classify to that rank
     if args.rank:
         best_at_rank, seen_perfect = tax_utils.summarize_gather_at(args.rank, tax_assign, gather_results, skip_idents=idents_missed,
-                                                     split_identifiers=not args.keep_full_identifiers,
+                                                     keep_full_identifiers=args.keep_full_identifiers,
                                                      keep_identifier_versions = args.keep_identifier_versions,
                                                      best_only=True, seen_perfect=seen_perfect)
 
@@ -210,7 +200,7 @@ def genome(args):
         for rank in tax_utils.ascending_taxlist(include_strain=False):
             # gets best_at_rank for all queries in this gather_csv
             best_at_rank, seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
-                                                         split_identifiers=not args.keep_full_identifiers,
+                                                         keep_full_identifiers=args.keep_full_identifiers,
                                                          keep_identifier_versions = args.keep_identifier_versions,
                                                          best_only=True, seen_perfect=seen_perfect)
 
@@ -257,22 +247,16 @@ def annotate(args):
     set_quiet(args.quiet)
 
     # first, load taxonomic_assignments
-    tax_assign = {}
-    this_tax_assign = None
-    for tax_csv in args.taxonomy_csv:
-
-        try:
-            this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
-                                              keep_identifier_versions = args.keep_identifier_versions,
-                                              force=args.force)
-        except ValueError as exc:
-            error(f"ERROR: {str(exc)}")
-            sys.exit(-1)
-
-        # maybe check for overlapping tax assignments? currently later ones will override earlier ones
-        if this_tax_assign:
-            tax_assign.update(this_tax_assign)
-
+    try:
+        tax_assign = MultiLineageDB.load(args.taxonomy_csv,
+                       keep_full_identifiers=args.keep_full_identifiers,
+                       keep_identifier_versions=args.keep_identifier_versions,
+                       force=args.force)
+        available_ranks = tax_assign.available_ranks
+    except ValueError as exc:
+        error(f"ERROR: {str(exc)}")
+        sys.exit(-1)
+
     if not tax_assign:
         error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
         sys.exit(-1)
@@ -300,12 +284,37 @@ def annotate(args):
             for row in gather_results:
                 match_ident = row['name']
                 lineage = tax_utils.find_match_lineage(match_ident, tax_assign, skip_idents=idents_missed,
-                                             split_identifiers=not args.keep_full_identifiers,
+                                             keep_full_identifiers=args.keep_full_identifiers,
                                              keep_identifier_versions=args.keep_identifier_versions)
                 row['lineage'] = display_lineage(lineage)
                 w.writerow(row)
 
 
+def prepare(args):
+    "Combine multiple taxonomy databases into one and/or translate formats."
+    notify("loading taxonomies...")
+    try:
+        tax_assign = MultiLineageDB.load(args.taxonomy_csv,
+                       keep_full_identifiers=args.keep_full_identifiers,
+                       keep_identifier_versions=args.keep_identifier_versions)
+    except ValueError as exc:
+        error("ERROR while loading taxonomies!")
+        error(str(exc))
+        sys.exit(-1)
+
+    notify(f"...loaded {len(tax_assign)} entries.")
+
+    notify(f"saving to '{args.output}', format {args.database_format}...")
+    try:
+        tax_assign.save(args.output, args.database_format)
+    except ValueError as exc:
+        error("ERROR while saving!")
+        error(str(exc))
+        sys.exit(-1)
+
+    notify("done!")
+
+
 def main(arglist=None):
     args = sourmash.cli.get_parser().parse_args(arglist)
     submod = getattr(sourmash.cli.sig, args.subcmd)