Skip to content

Commit

Permalink
[MRG] Build a LineageDB interface for taxonomy databases/information (
Browse files Browse the repository at this point in the history
#1651)

* start making a LineageDB

* further abstract taxonomy loading

* replace rest of taxonomy parsing

* get some basic sqlite stuff going

* refactoring, simplification, optimization

* fix a few things, alert on bad arg combination

* add combine_tax CLI command under tax

* rename 'combine_tax' to 'prepare'

* allow output database type for tax prepare

* format switching now works for sql and csv

* adjust loading to try/fail rather than suffix

* refactor taxonomy load

* fix tests

* basic tests for in/out formats

* add tests for trying to load bad files

* raise appropriate error message

* fix with pytest.raises foo

* some more tests

* more tests, fix bug

* make available_ranks work with sqlite db

* re-add test for available_ranks

* produce more useful errors for dups, and restore test code

* catch exception for database already exists

* test split idents and keep versions

* align keyword args with CLI args

* add more end-to-end tests

* alias --taxonomy-csv to --taxonomy

* 'prepare' docs

* clean up
  • Loading branch information
ctb authored Jul 8, 2021
1 parent 512d961 commit eeb1874
Show file tree
Hide file tree
Showing 14 changed files with 1,003 additions and 197 deletions.
59 changes: 43 additions & 16 deletions doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -425,11 +425,12 @@ signatures, rather than all the signatures in the database.

The sourmash `tax` or `taxonomy` commands integrate taxonomic
information into the results of `sourmash gather`. All `tax` commands
require one or more properly formatted `taxonomy` csv files that correspond to
the database(s) used for `gather`. Note that if using multiple databases,
the `gather` needs to have been conducted against all desired databases
within the same `gather` command (we cannot combine separate `gather` runs
for the same query). For supported databases (e.g. GTDB, NCBI), we provide
require one or more properly formatted `taxonomy` files where the
identifiers correspond to those in the database(s) used for
`gather`. Note that if using multiple databases, the `gather` needs
to have been conducted against all desired databases within the same
`gather` command (we cannot combine separate `gather` runs for the
same query). For supported databases (e.g. GTDB, NCBI), we provide
taxonomy csv files, but they can also be generated for user-generated
databases. For more information, see [databases](databases.md).

Expand All @@ -454,8 +455,8 @@ As with all reference-based analysis, results can be affected by the
results from `gather` minimizes issues associated with increasing size
and redundancy of reference databases.

For more on how `gather` works and can be used to classify signatures, see
[classifying-signatures](classifying-signatures.md).
For more details on how `gather` works and can be used to classify
signatures, see [classifying-signatures](classifying-signatures.md).


### `sourmash tax metagenome` - summarize metagenome content from `gather` results
Expand All @@ -469,7 +470,7 @@ example command to summarize a single `gather csv`, where the query was gathered
```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```

There are three possible output formats, `csv_summary`, `lineage_summary`, and
Expand Down Expand Up @@ -517,7 +518,7 @@ To generate `krona`, we add `--output-format krona` to the command above, and
```
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

Expand Down Expand Up @@ -545,7 +546,7 @@ To generate `lineage_summary`, we add `--output-format lineage_summary` to the s
sourmash tax metagenome
--gather-csv HSMA33MX_gather_x_gtdbrs202_k31.csv \
--gather-csv PSM6XBW3_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```

Expand Down Expand Up @@ -602,7 +603,7 @@ We can use `tax genome` on this gather csv to classify our "Sb47+63" mixed-strai
```
sourmash tax genome
--gather-csv 47+63_x_gtdb-rs202.gather.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This command uses the default classification strategy, which uses a
containment threshold of 0.1 (10%).
Expand Down Expand Up @@ -649,7 +650,7 @@ To generate `krona`, we must classify by `--rank` instead of using the
```
sourmash tax genome
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv \
--taxonomy gtdb-rs202.taxonomy.v2.csv \
--output-format krona --rank species
```
> Note that specifying `--rank` forces classification by rank rather than
Expand Down Expand Up @@ -684,10 +685,35 @@ By default, `annotate` uses the name of each input gather csv to write an update
```
sourmash tax annotate
--gather-csv Sb47+63_gather_x_gtdbrs202_k31.csv \
--taxonomy-csv gtdb-rs202.taxonomy.v2.csv
--taxonomy gtdb-rs202.taxonomy.v2.csv
```
> This will produce an annotated gather CSV, `Sb47+63_gather_x_gtdbrs202_k31.with-lineages.csv`
### `sourmash tax prepare` - prepare and/or combine taxonomy files

All `sourmash tax` commands must be given one or more taxonomy files as
parameters to the `--taxonomy` argument. These files can be either CSV
files or (as of sourmash 4.2.1) sqlite3 databases. sqlite3 databases
are much faster for large taxonomies, while CSV files are easier to view
and modify using spreadsheet software.

`sourmash tax prepare` is a utility function that can ingest and validate
multiple CSV files or sqlite3 databases, and output a CSV file or a sqlite3
database. It can be used to combine multiple taxonomies into a single file,
as well as change formats between CSV and sqlite3.

The following command will take in two taxonomy files and combine them into
a single taxonomy sqlite database.

```
sourmash tax prepare --taxonomy file1.csv file2.csv -o tax.db
```

Input databases formats can be mixed and matched, and the output format
can be set to CSV like so:
```
sourmash tax prepare --taxonomy file1.csv file2.db -o tax.csv -F csv
```

## `sourmash lca` subcommands for in-memory taxonomy integration

Expand Down Expand Up @@ -1216,9 +1242,10 @@ Briefly,
signature files using `--query-from-file` (see below).

* `index` and `lca index` take a few fixed parameters (database name,
taxonomy spreadsheet) and then an arbitrary number of other files
that contain signatures, including files, directories, and indexed
databases. These commands will also take `--from-file` (see below).
and for `lca index`, a taxonomy file) and then an arbitrary number of
other files that contain signatures, including files, directories,
and indexed databases. These commands will also take `--from-file`
(see below).

None of these commands currently support searching, comparing, or indexing
signatures with multiple ksizes or moltypes at the same time; you need
Expand Down
1 change: 1 addition & 0 deletions src/sourmash/cli/tax/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@
from . import metagenome
from . import genome
from . import annotate
from . import prepare
from ..utils import command_list
from argparse import SUPPRESS, RawDescriptionHelpFormatter
import os
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/annotate.py
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ def subparser(subparsers):
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs="+", required=True,
help='database lineages CSV'
)
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/genome.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ def subparser(subparsers):
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs='+', required=True,
help='database lineages CSV'
)
Expand Down
2 changes: 1 addition & 1 deletion src/sourmash/cli/tax/metagenome.py
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ def subparser(subparsers):
help='directory for output files'
)
subparser.add_argument(
'-t', '--taxonomy-csv', metavar='FILE',
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs='+', required=True,
help='database lineages CSV'
)
Expand Down
60 changes: 60 additions & 0 deletions src/sourmash/cli/tax/prepare.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
"""combine multiple taxonomy databases into one."""

usage="""
sourmash tax prepare --taxonomy-csv <taxonomy_file> [ ... ] -o <output>
The 'tax prepare' command reads in one or more taxonomy databases
and saves them into a new database. It can be used to combine databases
in the desired order, as well as output different database formats.
Please see the 'tax prepare' documentation for more details:
https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-tax-prepare-prepare-and-or-combine-taxonomy-files
"""

import sourmash
from sourmash.logging import notify, print_results, error


def subparser(subparsers):
subparser = subparsers.add_parser('prepare',
usage=usage)
subparser.add_argument(
'-q', '--quiet', action='store_true',
help='suppress non-error output'
)
subparser.add_argument(
'-t', '--taxonomy-csv', '--taxonomy', metavar='FILE',
nargs="+", required=True,
help='database lineages'
)
subparser.add_argument(
'-o', '--output', required=True,
help='output file',
)
subparser.add_argument(
'-F', '--database-format',
help="format of output file; default is 'sql')",
default='sql',
choices=['csv', 'sql'],
)
subparser.add_argument(
'--keep-full-identifiers', action='store_true',
help='do not split identifiers on whitespace'
)
subparser.add_argument(
'--keep-identifier-versions', action='store_true',
help='after splitting identifiers, do not remove accession versions'
)
subparser.add_argument(
'--fail-on-missing-taxonomy', action='store_true',
help='fail quickly if taxonomy is not available for an identifier',
)
subparser.add_argument(
'-f', '--force', action = 'store_true',
help='continue past errors in file and taxonomy loading',
)

def main(args):
import sourmash
return sourmash.tax.__main__.prepare(args)
111 changes: 60 additions & 51 deletions src/sourmash/tax/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
from sourmash.lca.lca_utils import display_lineage

from . import tax_utils
from .tax_utils import ClassificationResult
from .tax_utils import ClassificationResult, MultiLineageDB

usage='''
sourmash taxonomy <command> [<args>] - manipulate/work with taxonomy information.
Expand Down Expand Up @@ -61,22 +61,16 @@ def metagenome(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
available_ranks = set()
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
# maybe check for overlapping tax assignments? currently, later ones will override earlier ones
tax_assign.update(this_tax_assign)
available_ranks.update(set(avail_ranks))

except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand All @@ -103,7 +97,7 @@ def metagenome(args):
seen_perfect = set()
for rank in sourmash.lca.taxlist(include_strain=False):
summarized_gather[rank], seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
seen_perfect = seen_perfect)

Expand Down Expand Up @@ -139,20 +133,16 @@ def genome(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
available_ranks = set()
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
# maybe check for overlapping tax assignments? currently later ones will override earlier ones
tax_assign.update(this_tax_assign)
available_ranks.update(set(avail_ranks))
except ValueError as exc:
error(f"ERROR: {str(exc)}")

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand Down Expand Up @@ -184,7 +174,7 @@ def genome(args):
# if --rank is specified, classify to that rank
if args.rank:
best_at_rank, seen_perfect = tax_utils.summarize_gather_at(args.rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
best_only=True, seen_perfect=seen_perfect)

Expand All @@ -210,7 +200,7 @@ def genome(args):
for rank in tax_utils.ascending_taxlist(include_strain=False):
# gets best_at_rank for all queries in this gather_csv
best_at_rank, seen_perfect = tax_utils.summarize_gather_at(rank, tax_assign, gather_results, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
best_only=True, seen_perfect=seen_perfect)

Expand Down Expand Up @@ -257,22 +247,16 @@ def annotate(args):
set_quiet(args.quiet)

# first, load taxonomic_assignments
tax_assign = {}
this_tax_assign = None
for tax_csv in args.taxonomy_csv:

try:
this_tax_assign, _, avail_ranks = tax_utils.load_taxonomy_csv(tax_csv, split_identifiers=not args.keep_full_identifiers,
keep_identifier_versions = args.keep_identifier_versions,
force=args.force)
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

# maybe check for overlapping tax assignments? currently later ones will override earlier ones
if this_tax_assign:
tax_assign.update(this_tax_assign)

try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions,
force=args.force)
available_ranks = tax_assign.available_ranks
except ValueError as exc:
error(f"ERROR: {str(exc)}")
sys.exit(-1)

if not tax_assign:
error(f'ERROR: No taxonomic assignments loaded from {",".join(args.taxonomy_csv)}. Exiting.')
sys.exit(-1)
Expand Down Expand Up @@ -300,12 +284,37 @@ def annotate(args):
for row in gather_results:
match_ident = row['name']
lineage = tax_utils.find_match_lineage(match_ident, tax_assign, skip_idents=idents_missed,
split_identifiers=not args.keep_full_identifiers,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions)
row['lineage'] = display_lineage(lineage)
w.writerow(row)


def prepare(args):
"Combine multiple taxonomy databases into one and/or translate formats."
notify("loading taxonomies...")
try:
tax_assign = MultiLineageDB.load(args.taxonomy_csv,
keep_full_identifiers=args.keep_full_identifiers,
keep_identifier_versions=args.keep_identifier_versions)
except ValueError as exc:
error("ERROR while loading taxonomies!")
error(str(exc))
sys.exit(-1)

notify(f"...loaded {len(tax_assign)} entries.")

notify(f"saving to '{args.output}', format {args.database_format}...")
try:
tax_assign.save(args.output, args.database_format)
except ValueError as exc:
error("ERROR while saving!")
error(str(exc))
sys.exit(-1)

notify("done!")


def main(arglist=None):
args = sourmash.cli.get_parser().parse_args(arglist)
submod = getattr(sourmash.cli.sig, args.subcmd)
Expand Down
Loading

0 comments on commit eeb1874

Please sign in to comment.