Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum #2722

Merged
merged 28 commits into from
Feb 29, 2024

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Aug 19, 2023

This PR:

  • adds documentation for multigather to sourmash docs!
  • builds on MRG: Uniquify csv output from multigather #2065 / MRG: update #2065 (uniquify CSV output from multigather) #2721 so that tests pass.
  • adds an option -U/--output-add-query-md5sum to sourmash multigather
  • adds an option --force-allow-overwrite-output to sourmash multigather
  • CHANGES BEHAVIOR of multigather by treating query.filename == '-' as if query.filename is empty, thus replacing it with md5sum
  • CHANGES BEHAVIOR of multigather by failing loudly and clearly if output files are going to be overwritten
  • adds -E/--extension to allow output to files other than .sig

See discussion over in #2328: multigather CSV output uses signature filename as basename.

To add:

  • tests for -U;
  • implement and test -E/--extension
  • implement and test --force-allow-overwrite-output
  • fix for query_filename being None/empty in -U branch
  • documentation update for changed output behavior for multigather: '-' => using md5sum
  • documentation update for changed output behavior for multigather: fails if overwrite happens
  • fix multigather link in docs

olgabot and others added 5 commits May 28, 2022 13:23
Hello!
Hope you and yours are doing well. I am using multigather to query protein sequences against each other, and primarily use the csv file for downstream processing. As is, if the signatures in `--query query.sig` were created from one fasta file with e.g. `--singleton`, then all results iteratively overwrite each other into the same csv file. This proposed change adds the md5sum to the query file to ensure uniqueness. Other suggestions are welcome!

(not tested yet)

Warmest,
Olga
@codecov
Copy link

codecov bot commented Aug 19, 2023

Codecov Report

Attention: Patch coverage is 88.57143% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 86.78%. Comparing base (0d2359d) to head (1265d82).

Files Patch % Lines
src/sourmash/commands.py 87.50% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #2722      +/-   ##
==========================================
+ Coverage   86.76%   86.78%   +0.01%     
==========================================
  Files         136      136              
  Lines       15509    15523      +14     
  Branches     2624     2626       +2     
==========================================
+ Hits        13457    13471      +14     
  Misses       1751     1751              
  Partials      301      301              
Flag Coverage Δ
hypothesis-py 25.53% <0.00%> (-0.03%) ⬇️
python 92.86% <88.57%> (+<0.01%) ⬆️
rust 61.20% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ctb ctb changed the title WIP: fix multigather output by adding md5sum WIP: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum Aug 19, 2023
Note: PR into #2065.

This PR updates #2065 with all the changes from `latest`. This includes
the fix & update to multigather in
#2322, which:
- fixes the output of multigather to include more than one line 😆 
- prints out the output filename
- supports `--output-dir`

No functional changes are made beyond the merge, so some tests are still
failing; will discuss fixes in yet a new PR :).

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Keya Barve <[email protected]>
Co-authored-by: ccbaumler <[email protected]>
Co-authored-by: Tessa Pierce Ward <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Taylor Reiter <[email protected]>
Co-authored-by: Erik Young <[email protected]>
Co-authored-by: David Koslicki <[email protected]>
Co-authored-by: Luiz Irber <[email protected]>
Co-authored-by: Colton Baumler <[email protected]>
Co-authored-by: Luiz Irber <[email protected]>
Co-authored-by: N. Tessa Pierce-Ward <[email protected]>
Co-authored-by: Peter Cock <[email protected]>
Co-authored-by: Francesco Beghini <[email protected]>
Co-authored-by: Jason Stajich <[email protected]>
Co-authored-by: Katrin Leinweber <[email protected]>
Base automatically changed from ctb_update_2065 to olgabot-patch-2 August 21, 2023 00:56
@ctb ctb changed the base branch from olgabot-patch-2 to latest February 28, 2024 14:33
@ctb ctb changed the title WIP: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum MRG: fix multigather output by adding md5sum along with -U/--output-add-query-md5sum Feb 29, 2024
@ctb
Copy link
Contributor Author

ctb commented Feb 29, 2024

OK, I'm proposing this for merge based on a few things -

and finally, @bluegenes is doing different experiments in re output files with fastmultigather over in sourmash-bio/sourmash_plugin_branchwater#197, and so I think we have routes to explore better output over there.

It might be that a long-term future for multigather is to deprecate it in sourmash and direct people to fastmultigather, which is actually much faster than multigather.

remaining_mh = remaining_query.minhash.to_mutable()
remaining_mh += noident_mh.downsample(scaled=remaining_mh.scaled)
remaining_query.minhash = remaining_mh
output_unassigned = output_base + f".unassigned{args.extension}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the open fixes a bug I introduced in

with open(output_unassigned, 'wt') as fp:
, where I mangle tessa's original code by using SaveSignaturesToLocation on an already opened filename. The switch to using args.extension revealed the bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the other changes here are just indentation based.


with SaveSignaturesToLocation(output_unassigned) as save_sig:
# CTB: note, multigather does not save abundances
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multigather does now save abundances, so the comment was false ;)

@@ -4886,7 +4886,10 @@ def test_multigather_metagenome_output(runtmp):
cmd = cmd.split(" ")
c.run_sourmash(*cmd)

output_csv = runtmp.output("-.csv")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixa da bug

@@ -4916,15 +4919,17 @@ def test_multigather_metagenome_output_outdir(runtmp):
cmd = cmd.split(" ")
c.run_sourmash(*cmd)

output_csv = runtmp.output("savehere/-.csv")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixa da bug

testdata_glob = utils.get_test_data("gather/GCF*.sig")
testdata_sigs = glob.glob(testdata_glob)

utils.get_test_data("gather/combined.sig")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed no-op

@@ -5050,11 +5062,7 @@ def test_multigather_metagenome_query_with_sbt_addl_query(c):

assert os.path.exists(c.output("gcf_all.sbt.zip"))

another_query = utils.get_test_data(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed redundancy


assert (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed behavior

c.run_sourmash(*cmd)

out = c.last_result.out
print(out)
err = c.last_result.err
print(err)

assert "conducted gather searches on 13 signatures" in err
assert (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed behavior

@ctb
Copy link
Contributor Author

ctb commented Feb 29, 2024

@sourmash-bio/devs ready for review & merge! I've tried to clearly annotate the changed behavior and so on in the tests... good luck!

Comment on lines +1234 to +1238
output_base = query.md5sum()
elif args.output_add_query_md5sum:
# Uniquify the output file if all signatures were made from the same file (e.g. with --singleton)
assert query_filename and query_filename != "-" # first branch
output_base = os.path.basename(query_filename) + "." + query.md5sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens in the case of duplicated md5sums? I assume this will be much more likely with short sequences that are sketched with --singleton

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the output_base will be flagged as a duplicate - which I think is appropriate, if not ideal ;). People can choose to force allow override, or... do something else.

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than potential issues from duplicated md5sums, I think this looks ok.

duplicated md5sums would imply that the hashes are the same anyway, so really the issue would present as a non 1:1 query:outputfile issue -- all results would be available, just in slightly fewer files than 1 per query.

@ctb
Copy link
Contributor Author

ctb commented Feb 29, 2024

I'll merge then! Thanks!

@ctb ctb merged commit c7a1265 into latest Feb 29, 2024
41 checks passed
@ctb ctb deleted the fix_multigather_output branch February 29, 2024 21:18
@ctb
Copy link
Contributor Author

ctb commented Feb 29, 2024

😮‍💨

ctb added a commit that referenced this pull request Mar 21, 2024
# Release notes for sourmash v4.8.7

Note: This release changes the way `sourmash multigather` names output
files in some situations. Please see
#2722 for details.

Minor new features:

* support proper manifest creation with `--relpath` for `sig check` and
`sig collect` (#3054)
* fix `multigather` output by adding md5sum along with
`-U/--output-add-query-md5sum` (#2722)
* enable loading lineages from annotated gather with match_name instead
of name (#3078)

Bug fixes:

* fix output for `sketch ... --singleton` (#3066)
* fix `calculate_gather_stats` `threshold=0` bug (#3052)

Cleanup and documentation updates:

* adjust protein ksize for record/manifest (#3019)
* Resolve `sourmash gather --help` issue (#3032)
* rework the manifest documentation; do misc cleanup (#3027)
* add branchwater web to docs (#3018)

Developer updates:

* make core Manifest booleans python compatible (core) (#3007)
* safer ksize selection while still accommodating k=k*3 (#3028)
* fix clippy beta issues (#3088)
* tell dependabot to ignore upgrades to `byteorder`, `chrono`,
`once_cell`, and `wasm-bindgen` (#3065)
* update rust changelog for r0.13.0 in preparation for release (#3033)
* Allow changing storage location for a collection in RevIndex (#3015)
* Fix tox and nix configs so all tox tests execute correctly (#2992)
* Calculate all gather stats in rust; use for rocksdb gather (#2943)
* bump screed req to 1.1.3 (#3067)
* bump to v4.8.7-dev (#2989)

Dependabot updates:

* Bump DeterminateSystems/magic-nix-cache-action from 1 to 3 (#2994)
* Bump DeterminateSystems/magic-nix-cache-action from 3 to 4 (#3085)
* Bump DeterminateSystems/nix-installer-action from 4 to 9 (#2995)
* Bump DeterminateSystems/nix-installer-action from 9 to 10 (#3083)
* Bump chrono from 0.4.33 to 0.4.34 (#3000)
* Bump conda-incubator/setup-miniconda from 3.0.1 to 3.0.2 (#3046)
* Bump conda-incubator/setup-miniconda from 3.0.2 to 3.0.3 (#3057)
* Bump histogram from 0.9.0 to 0.9.1 (#3002)
* Bump itertools from 0.12.0 to 0.12.1 (#3043)
* Bump log from 0.4.20 to 0.4.21 (#3062)
* Bump num-iter from 0.1.43 to 0.1.44 (#2997)
* Bump pypa/cibuildwheel from 2.16.5 to 2.17.0 (#3084)
* Bump rayon from 1.8.1 to 1.9.0 (#3058)
* Bump roaring from 0.10.2 to 0.10.3 (#3014)
* Bump serde from 1.0.196 to 1.0.197 (#3045)
* Bump serde_json from 1.0.113 to 1.0.114 (#3044)
* Bump tempfile from 3.10.0 to 3.10.1 (#3059)
* Bump thiserror from 1.0.56 to 1.0.57 (#2999)
* Bump thiserror from 1.0.57 to 1.0.58 (#3082)
* Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)
* Bump wasm-bindgen-test from 0.3.40 to 0.3.41 (#2996)
* Bump wasm-bindgen-test from 0.3.41 to 0.3.42 (#3063)
* Bump web-sys from 0.3.67 to 0.3.68 (#2998)
* Bump web-sys from 0.3.68 to 0.3.69 (#3061)
* Revert "Bump wasm-bindgen from 0.2.91 to 0.2.92 (#3060)" (#3064)
* Update asv to 0.6.2 (#3025)
* Update pytest requirement from <8.1.0,>=6.2.4 to >=6.2.4,<8.2.0
(#3075)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants