Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sourmash gather raises AssertionError on minimal test dataset #2825

Open
vmikk opened this issue Oct 27, 2023 · 4 comments
Open

sourmash gather raises AssertionError on minimal test dataset #2825

vmikk opened this issue Oct 27, 2023 · 4 comments

Comments

@vmikk
Copy link

vmikk commented Oct 27, 2023

Dear sourmash team,

I've recently attempted to create a minimal test dataset and encountered an error when using sourmash gather.
Specifically, it raises an AssertionError.
I think that the error might be due to the presence of very few sketches in common between the sample and the database.
The error appears even with --threshold-bp 0.

Expected Behavior:
While this may be a marginal case, I believe the expected behavior in this scenario is for sourmash gather to return an empty CSV file with no results, and subsequently an exit code of 0.

Actual Behavior:
I've got an AssertionError when running the command.
Sourmash fails after Prefetch and "Doing gather to generate minimum metagenome cover".

Additional Information:

  • When I use the same sample with a larger database (e.g., gtdb-rs214-reps.k31.zip), it works as expected without any issues.
  • sourmash v.4.8.4 (from bioconda), Python 3.11.5, Linux.

Here is the full output log:

== This is sourmash version 4.8.4. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: ERR5766174_se... (k=31, DNA)
loading from 'sourmash-db.zip'...
--
loaded 6 total signatures from 1 locations.
after selecting signatures compatible with search, 6 remain.
Starting prefetch sweep across databases.
Prefetch found 6 signatures with overlap >= 0 bp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match
---------   ------- -------
25.5 kbp       0.1%    0.7%    CYXV01000001.1
Traceback (most recent call last):
  File "/usr/local/bin/sourmash", line 11, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sourmash/__main__.py", line 19, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sourmash/cli/gather.py", line 162, in main
    return sourmash.commands.gather(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sourmash/commands.py", line 830, in gather
    for result in gather_iter:
  File "/usr/local/lib/python3.11/site-packages/sourmash/search.py", line 754, in __next__
    best_result, intersect_mh = _find_best(counters, query, threshold_bp)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sourmash/search.py", line 640, in _find_best
    result = counter.peek(query.minhash, threshold_bp=threshold_bp)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sourmash/index/__init__.py", line 818, in peek
    assert cont
AssertionError

The command I executed was:

 sourmash gather \
   --threshold-bp 0 \
   --output ERR5766174_se.csv.gz \
   ERR5766174_se.sig \
   sourmash-db.zip

I've uploaded the test data here:

curl -J -O "https://owncloud.ut.ee/owncloud/s/xHE76xLr9jrDeqX/download?path=%2F&files=ERR5766174_se.sig&downloadStartSecret=wxqvxr5285r"
curl -J -O "https://owncloud.ut.ee/owncloud/s/xHE76xLr9jrDeqX/download?path=%2F&files=sourmash-db.zip&downloadStartSecret=pfgmp5y8i4a"

Thank you for providing such a valuable tool!
With kind regards,
Vladimir

PS. As this was a small test dataset, I used scaled=10 to generate sample signatures,
and the database signatures were also created with a different scale setting.
Given that it works correctly with a larger database, the issue likely isn't related to the scaling difference.

@ctb
Copy link
Contributor

ctb commented Oct 28, 2023

Thanks! I see the same error with commit 720962c, the latest latest - the detailed test case really helps 😅 . More soon!

@ctb
Copy link
Contributor

ctb commented Oct 29, 2023

I'm getting some intriguing errors with

sourmash gather ERR5766174_se.sig sourmash-db.zip --threshold-bp=0 --save-prefetch-csv prefetch.csv

as well. It seems likely that there is a bug related to handling of many different scaled values -- here the query metagenome is at scaled of 10, while the database has three different scaled values of 1, 20, and 1600 (!?):

% sourmash sig summarize ERR5766174_se.sig

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'ERR5766174_se.sig'
path filetype: MultiIndex
location: ERR5766174_se.sig
is database? no
has manifest? yes
num signatures: 1
** examining manifest...
total hashes: 4526202
summary of sketches:
   1 sketches with DNA, k=31, scaled=10               4526202 total hashes
% sourmash sig summarize sourmash-db.zip

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'sourmash-db.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/failgather/sourmash-db.zip
is database? yes
has manifest? yes
num signatures: 6
** examining manifest...
total hashes: 768424
summary of sketches:
   4 sketches with DNA, k=31, scaled=20               735189 total hashes
   1 sketches with DNA, k=31, scaled=1                16508 total hashes
   1 sketches with DNA, k=31, scaled=1600             16727 total hashes

@vmikk
Copy link
Author

vmikk commented Oct 29, 2023

Yes, scaling is a bit weird in this case 😅.
It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).

However, these sketches work well with the larger database (GTDB), so I believed the issue might not be related to the different scaling.

@ctb
Copy link
Contributor

ctb commented Nov 20, 2023

Sorry, I'm still tracking this down! I understand where the problem lies but not exactly why it's happening - working on it over in #2832.

But I wanted to respond to this:

It was done this way because the test samples vary in size, and with uniform scaling, one sample ended up with too few hashes (so maybe it was better to take a fixed number of hashes for each sample).

Note that for comparison purposes, everything needs to be at the same scaled, so all samples are scaled "up" to the highest scaled value needed for the comparison. This operation is always possible on FracMinHash sketches (just like reducing n is always possible on MinHash sketches). So, while you can create sketches with different scaled values, it doesn't mean that they will all be used in a specific comparison. We try to output good warning messages on this front but may not always succeed.

There is a brief technical discussion of this in the internals guide, and a more complete (but academic) discussion in Irber et al., 2022.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants