Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: support proper manifest creation with --relpath for sig check and sig collect #3054

Merged
merged 31 commits into from
Mar 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
5f6ef82
load manifest paths relative to cwd
ctb Mar 1, 2024
2405e9d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
df87087
more better tests
ctb Mar 1, 2024
d7da0aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 1, 2024
0caee7f
implement --abspath, --relpath for sig check
ctb Mar 2, 2024
1fba4ee
clean up relpath a bit
ctb Mar 2, 2024
f3079b9
add --relpath to sig collect
ctb Mar 2, 2024
b17cd4f
implement --relpath for sig collect too
ctb Mar 3, 2024
72a9062
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
5cf4749
straighten out tests
ctb Mar 3, 2024
d6dfe35
add abspath/relpath to sig collect tests
ctb Mar 3, 2024
da3f165
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
b185be9
add abspath/relpath tests for sig collect
ctb Mar 3, 2024
9d2018c
Merge branch 'manifest_relpath' of https://github.com/sourmash-bio/so…
ctb Mar 3, 2024
28e0a5b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
e9b9a34
fix/update comments
ctb Mar 3, 2024
f88c662
update docs for sig check and sig collect
ctb Mar 3, 2024
4e10802
add in some documentation about relpath in sourmash internals
ctb Mar 3, 2024
6dd3e3a
path rewriting tests for sig check
ctb Mar 3, 2024
e3bb509
update docs
ctb Mar 3, 2024
703585c
add --relpath tests for sig collect
ctb Mar 3, 2024
346ed82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
637b1fc
a few more tests
ctb Mar 3, 2024
8b1bc2c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 3, 2024
3e41cdb
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2024
1ade436
upd with warning when abspath given with --relpath
ctb Mar 5, 2024
6ef3d88
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 5, 2024
afde775
Merge branch 'latest' into manifest_relpath
ctb Mar 6, 2024
0c1aa27
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 7, 2024
27e8c32
Merge branch 'manifest_relpath' of https://github.com/sourmash-bio/so…
ctb Mar 7, 2024
1e3688a
update docstring for StandaloneManifestIndex
ctb Mar 7, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 20 additions & 1 deletion doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -551,7 +551,7 @@ sourmash multigather --query <queries ...> --db <collections>
```

Note that multigather is single threaded, so it offers no substantial
efficiency gains over just running gather multiple times! Nontheless, it
efficiency gains over just running gather multiple times! Nonetheless, it
is useful for situations where you have many sketches organized in a
combined file, e.g. sketches built with `sourmash sketch
... --singleton`).
Expand Down Expand Up @@ -1959,6 +1959,16 @@ sourmash database.
`sourmash sig check` is particularly useful when working with large
collections of signatures and identifiers.

With `-m/--save-manifest-matching`, `sig check` creates a standalone
manifest. In these manifests, sourmash v4 will by default write paths
to the matched elements that are relative to the current working
directory. In some cases - when the output manifest is in different
directory - this will create manifests that do not work properly
with sourmash. The `--relpath` argument will rewrite the paths to be
relative to the manifest, while the `--abspath` argument will rewrite
paths to be absolute. The `--relpath` behavior will be the default in
sourmash v5.

### `sourmash signature collect` - collect manifests across databases

Collect manifests from across (many) files and merge into a single
Expand All @@ -1977,6 +1987,15 @@ This manifest file can be loaded directly from the command line by sourmash.
particularly useful when working with large collections of signatures and
identifiers, and has command line options for merging and updating manifests.

As with `sig check`, the standalone manifests created by `sig collect`
in sourmash v4 will by default write paths to the matched elements
relative to the current working directory. When the output manifest
is in a different directory, this will create manifests that do not work
properly with sourmash. The `--relpath` argument will rewrite the
paths to be relative to the manifest, while the `--abspath` argument
will rewrite paths to be absolute. The `--relpath` behavior will be
the default in sourmash v5.

## Advanced command-line usage

### Loading signatures and databases
Expand Down
6 changes: 5 additions & 1 deletion doc/sourmash-internals.md
Original file line number Diff line number Diff line change
Expand Up @@ -368,7 +368,11 @@ Thus, while standalone manifests can point at any kind of container,
including JSON files or LCA databases, they are most efficient when
`internal_location` points at a file with either a single sketch in
it, or a manifest that supports direct loading of sketches. Therefore,
we suggest using standalone manifest indices.
we suggest using standalone manifest indices. Note that sourmash
interprets paths to locations in standalone manifests relative to the
manifest filename; see the `--relpath` behavior in `sig check` and
`sig collect` to output manifests that deal with relative filenames
properly.

Note that searching a standalone manifest is currently done through a
linear iteration, and does not use any features of indexed containers
Expand Down
24 changes: 24 additions & 0 deletions src/sourmash/cli/sig/check.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,30 @@ def subparser(subparsers):
default="csv",
choices=["csv", "sql"],
)
subparser.add_argument(
"--abspath",
"--use-absolute-paths",
help="convert all locations to absolute paths",
action="store_true",
)
subparser.add_argument(
"--no-abspath",
help="do not convert all locations to absolute paths",
action="store_false",
dest="abspath",
)
subparser.add_argument(
"--relpath",
"--use-relative-paths",
help="convert all locations to paths relative to the output manifest",
action="store_true",
)
subparser.add_argument(
"--no-relpath",
help="do not convert all locations to paths relative to the output manifest",
action="store_false",
dest="relpath",
)

add_ksize_arg(subparser)
add_moltype_args(subparser)
Expand Down
23 changes: 22 additions & 1 deletion src/sourmash/cli/sig/collect.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,28 @@ def subparser(subparsers):
help="merge new manifests into existing",
)
subparser.add_argument(
"--abspath", help="convert all locations to absolute paths", action="store_true"
"--abspath",
"--use-absolute-paths",
help="convert all locations to absolute paths",
action="store_true",
)
subparser.add_argument(
"--no-abspath",
help="do not convert all locations to absolute paths",
action="store_false",
dest="abspath",
)
subparser.add_argument(
"--relpath",
"--use-relative-paths",
help="convert all locations to paths relative to the output manifest",
action="store_true",
)
subparser.add_argument(
"--no-relpath",
help="do not convert all locations to paths relative to the output manifest",
action="store_false",
dest="relpath",
)

add_ksize_arg(subparser)
Expand Down
18 changes: 11 additions & 7 deletions src/sourmash/index/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1113,23 +1113,26 @@ class StandaloneManifestIndex(Index):
with many signature collections underneath it, and you don't want to load
every collection each time you run sourmash.

Instead, you can run 'sourmash sig manifest <directory> -o mf.csv' to
output a manifest and then use this class to load 'mf.csv' directly.
Instead, you can run 'sourmash sig collect <directory> -o <manifest>' to
output a manifest and then use this class to load <manifest> directly.
Sketch type selection, picklists, and pattern matching will all work
directly on the manifest and will load signatures only upon demand.

One feature of this class is that absolute paths to sketches in
the 'internal_location' field of the manifests will be loaded properly.
This permits manifests to be constructed for various collections of
signatures that reside elsewhere, and not just below a single directory
prefix.
One feature of this class is that external paths to sketches in
the 'internal_location' field of the manifests will be loaded
properly. This permits manifests to be constructed for various
collections of signatures that reside elsewhere, and not just
below a single directory prefix. By default paths are interpreted
relative to the location of the manifest, unless an absolute path
is provided in the 'internal_location' field.

StandaloneManifestIndex does _not_ store signatures in memory.

This class also overlaps in concept with MultiIndex when
MultiIndex.load_from_pathlist is used to load other Index
objects. However, this class does not store any signatures in
memory, unlike MultiIndex.

"""

is_database = True
Expand All @@ -1155,6 +1158,7 @@ def load(cls, location, *, prefix=None):
m = CollectionManifest.load_from_filename(location)

if prefix is None:
# by default, calculate paths relative to manifest location.
prefix = os.path.dirname(location)

return cls(m, location, prefix=prefix)
Expand Down
63 changes: 57 additions & 6 deletions src/sourmash/sig/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1435,11 +1435,42 @@ def check(args):
else:
debug("sig check: manifest required")

# abspath/relpath checks
if args.abspath and args.relpath:
error("** Cannot specify both --abspath and --relpath; pick one!")
sys.exit(-1)

if args.relpath or args.abspath and not args.save_manifest_matching:
notify(
"** WARNING: --abspath and --relpath only have effects when saving a manifest"
)

relpath = "."
if args.relpath and args.save_manifest_matching:
output_manifest_dir = os.path.dirname(args.save_manifest_matching)
relpath = os.path.relpath(os.curdir, output_manifest_dir)

total_manifest_rows = CollectionManifest([])

# start loading!
total_rows_examined = 0
for filename in args.signatures:
# if saving a manifest, think about how to rewrite locations.
if args.abspath:
# convert to abspath
new_iloc = os.path.abspath(filename)
elif args.relpath:
# interpret paths relative to manifest directory.
if filename.startswith("/"):
notify(
f"** WARNING: cannot convert abspath {filename} into relative path."
)
new_iloc = os.path.join(relpath, filename)
else:
# default: paths are relative to cwd. This breaks when sketches
# are in subdirectories; will be deprecated for v5.
new_iloc = filename

idx = sourmash_args.load_file_as_index(filename, yield_all_files=args.force)

idx = idx.select(ksize=args.ksize, moltype=moltype)
Expand All @@ -1458,7 +1489,7 @@ def check(args):
# rewrite locations so that each signature can be found by filename
# of its container; this follows `sig collect` logic.
for row in sub_manifest.rows:
row["internal_location"] = filename
row["internal_location"] = new_iloc
total_manifest_rows.add_row(row)

# the len(sub_manifest) here should only be run when needed :)
Expand Down Expand Up @@ -1535,6 +1566,11 @@ def collect(args):
f"WARNING: --merge-previous specified, but output file '{args.output}' does not already exist?"
)

# abspath/relpath checks
if args.abspath and args.relpath:
error("** Cannot specify both --abspath and --relpath; pick one!")
sys.exit(-1)

# load previous manifest for --merge-previous. This gets tricky with
# mismatched manifest types, which we forbid.
try:
Expand Down Expand Up @@ -1579,15 +1615,16 @@ def collect(args):
# load from_file
_extend_signatures_with_from_file(args, target_attr="locations")

# convert to abspath
if args.abspath:
args.locations = [os.path.abspath(iloc) for iloc in args.locations]
relpath = None
if args.relpath:
output_manifest_dir = os.path.dirname(args.output)
relpath = os.path.relpath(os.curdir, output_manifest_dir)

# iterate through, loading all the manifests from all the locations.
for n_files, loc in enumerate(args.locations):
notify(f"Loading signature information from {loc}.")

if n_files % 100 == 0:
if n_files and n_files % 100 == 0:
notify(f"... loaded {len(collected_mf)} sigs from {n_files} files")
idx = sourmash.load_file_as_index(loc)
if idx.manifest is None and require_manifest:
Expand All @@ -1600,8 +1637,22 @@ def collect(args):

mf = sourmash_args.get_manifest(idx)

# decide how to rewrite locations to container:
if args.abspath:
# convert to abspath
new_iloc = os.path.abspath(loc)
elif args.relpath:
# interpret paths relative to manifest directory
if loc.startswith("/"):
notify(f"** WARNING: cannot convert abspath {loc} into relative path.")
new_iloc = os.path.join(relpath, loc)
else:
# default: paths are relative to cwd. This breaks when sketches
# are in subdirectories; will be deprecated for v5.
new_iloc = loc

for row in mf.rows:
row["internal_location"] = loc
row["internal_location"] = new_iloc
collected_mf.add_row(row)

if args.manifest_format == "csv":
Expand Down
15 changes: 15 additions & 0 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,21 @@ def sig_save_extension_abund(request):
return request.param


# these should both always succeed for 'sig check' and 'sig collect' output
# manifests.
@pytest.fixture(params=["--abspath", "--relpath"])
def abspath_or_relpath(request):
return request.param


# this will fail if subdirs used; see #3008. but this ensures v4 behavior of
# sig collect/sig check works, where manifest paths are interpreted relative
# to cwd.
@pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"])
def abspath_relpath_v4(request):
return request.param


# --- BEGIN - Only run tests using a particular fixture --- #
# Cribbed from: http://pythontesting.net/framework/pytest/pytest-run-tests-using-particular-fixture/
def pytest_collection_modifyitems(items, config):
Expand Down
Loading
Loading