Add probability of overlap and weighted containment for Multisearch matches #458
Conversation
…nto utils.rs for now
This function uses the log of probabilities to prevent underflow; for the behavior of Rust's log functions, see the `f64::ln` documentation: https://doc.rust-lang.org/std/primitive.f64.html#method.ln

EDIT: update code formatting
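The log-space trick mentioned above can be sketched as follows; this is an illustrative standalone example, not the PR's actual `utils.rs` code:

```rust
// Multiplying many small per-k-mer probabilities underflows f64 to 0.0;
// summing their natural logs (f64::ln) keeps the value finite.
fn prob_product_naive(freqs: &[f64]) -> f64 {
    freqs.iter().product()
}

fn prob_product_log(freqs: &[f64]) -> f64 {
    // Work in log space: ln(a * b) = ln(a) + ln(b).
    freqs.iter().map(|f| f.ln()).sum()
}

fn main() {
    let freqs = vec![1e-300; 2];
    // Direct multiplication underflows to exactly 0.0 ...
    assert_eq!(prob_product_naive(&freqs), 0.0);
    // ... while the log-space sum stays finite (roughly -1381.5).
    let log_p = prob_product_log(&freqs);
    assert!(log_p.is_finite() && log_p < 0.0);
    println!("log probability: {log_p}");
}
```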
…ion hashes of all queries and all database minhash
…e standard library
…ainment_adjusted_log10 values to test_multisearch
It's interesting that while the containment varies from ~0.48-1, the
As a follow-up, from this notebook, I compared all human GENCODE proteins vs Botryllus schlosseri proteins. I was experimenting with how to avoid very common k-mers, and in particular hits to Titin, the largest known protein with 25,000-35,000 amino acids per protein (!!!). This method, which uses the frequency of k-mers across all queries and againsts, subsets to only the overlapping k-mers between a single query and against, multiplies each pair, and takes the sum, was successful in getting rid of the spurious matches to Titin. However, this method doesn't take the length of the query or against into account; it only uses the frequencies of the k-mers across all queries/againsts. Here are some plots to show the distribution of p-values, containment, and adjusted containment:

[Plots: adjusted p-value distribution; containment (original); containment adjusted, log10]

I think the bump to the left is all false positives, caused by spurious matches from very common k-mers.
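A minimal sketch of the scheme described above (function and variable names are mine, not from the PR): restrict to the k-mers shared by one query/against pair, multiply their global frequencies pairwise, and sum.

```rust
use std::collections::{HashMap, HashSet};

// Probability of overlap for one query/against pair, given k-mer
// frequencies computed across ALL queries and ALL againsts.
fn prob_overlap(
    query_kmers: &HashSet<String>,
    against_kmers: &HashSet<String>,
    query_freqs: &HashMap<String, f64>,
    against_freqs: &HashMap<String, f64>,
) -> f64 {
    query_kmers
        .intersection(against_kmers)
        .map(|k| {
            query_freqs.get(k).copied().unwrap_or(0.0)
                * against_freqs.get(k).copied().unwrap_or(0.0)
        })
        .sum()
}

fn main() {
    let q: HashSet<String> = ["ACG", "TTT"].iter().map(|s| s.to_string()).collect();
    let a: HashSet<String> = ["TTT", "TAC"].iter().map(|s| s.to_string()).collect();
    // Toy global frequencies, for illustration only:
    let qf = HashMap::from([("ACG".to_string(), 0.1), ("TTT".to_string(), 0.5)]);
    let af = HashMap::from([("TTT".to_string(), 0.25), ("TAC".to_string(), 0.2)]);
    // Only TTT is shared, so the sum has a single term: 0.5 * 0.25 = 0.125.
    let p = prob_overlap(&q, &a, &qf, &af);
    assert!((p - 0.125).abs() < 1e-12);
    println!("probability of overlap: {p}");
}
```

Very common k-mers (like Titin's low-complexity runs) drive this probability up, which is what lets it flag spuriously high-containment matches.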
…usted, containment_adjusted_log10
Hi @ctb, I have addressed your requests, but the tests are still failing and I have some questions. Why is there a difference in
It looks like maybe some cruft left over from using a development branch of sourmash, vs the official crates.io release. I would suggest this: try merging and then doing
….py:test_against_multisigfile`
Please let me know when ready for re-review!
I did that, but still get the same diff 😕 What I'm confused about is `sourmash_plugin_branchwater/Cargo.lock` lines 1561 to 1563 in c5f5866. I can change the `[[package]]` entry to:

```toml
[[package]]
name = "sourmash"
version = "0.17.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8ce05fed73303390f6f208d6640f390cd999db0af0b6c007d60db2794ad5fcc0"
```
If you merge in main, and then change it to the rust-lang crate, you shouldn't get merge conflicts out of it? In any case, I can deal with it in review, as long as your tests pass with whatever you have on the branch!
Ready for re-review @ctb!
src/multisearch.rs (outdated)

```diff
@@ -45,10 +168,10 @@ pub fn multisearch(
     let ksize = selection.ksize().unwrap() as f64;

-    let mut new_selection = selection;
+    let mut new_selection = selection.clone();
```
CTB: to check. This clone should maybe not be needed?
I was correct that the clone is not required; I like not having it because it consumes `selection`, so you can't reuse `selection` accidentally below.
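The point about consuming `selection` is ordinary Rust move semantics; a minimal sketch with a stand-in `Selection` struct (not sourmash's actual type):

```rust
// A non-Copy type: passing it by value moves it into the callee.
struct Selection {
    ksize: u32,
}

fn into_ksize(sel: Selection) -> u32 {
    // `sel` is consumed here; the caller can no longer use it.
    sel.ksize
}

fn main() {
    let selection = Selection { ksize: 21 };
    let ksize = into_ksize(selection); // moves `selection`
    assert_eq!(ksize, 21);
    // into_ksize(selection); // compile error: use of moved value
}
```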
Yep, fixed! This caught a real problem, because I was using `selection` accidentally below .. so yay memory safety!
src/python/tests/test_fastgather.py (outdated)

```diff
@@ -494,7 +494,11 @@ def test_against_multisigfile(runtmp, zip_against):
         "100000",
     )
     df = pandas.read_csv(g_output)
-    assert len(df) == 3
+    if zip_against:
```
I'm pretty sure this set of changes is unintentional. Can you revert to what's in main?
Overall looks great! Only one or two minor changes left and then I can approve.
Note that I bumped sourmash to sourmash v0.17.1 in Cargo.toml.
P.S. Please let me know whether or not you'd like to merge it once I approve it!
As originally explored in this notebook, some high-containment hits are the result of highly frequent k-mers, and I want to downweight the containment by the probability of overlap. As I imagine it now, the frequencies would be computed based on all queries and all database signatures.
Here is a worked example; please let me know if I am missing something:

query:
ACGTTTTT
3-mers (6 total):
ACG
CGT
GTT
TTT ×3

target:
TTTTTTTTTAC
3-mers (9 total):
TTT ×7
TTA
TAC

Containment:
intersecting k-mers in query / intersecting k-mers in target
= 3/7

Probability of overlap:
frequency of intersecting k-mers in query × frequency of intersecting k-mers in target
= 3/6 × 7/9
= 1/2 × 7/9
= 7/18

Weighted containment:
Containment / Probability of overlap = Containment × (1 / Probability of overlap)
= 3/7 × 18/7
= 54/49

Update: fix k-mers in example
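The arithmetic above can be sanity-checked by counting 3-mers directly from the two sequences; this is an illustrative sketch, not code from the PR:

```rust
use std::collections::HashMap;

// Count every k-mer (with multiplicity) in a sequence.
fn kmer_counts(seq: &str, k: usize) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for i in 0..=seq.len() - k {
        *counts.entry(&seq[i..i + k]).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let q = kmer_counts("ACGTTTTT", 3); // query
    let t = kmer_counts("TTTTTTTTTAC", 3); // target
    let q_total: usize = q.values().sum();
    let t_total: usize = t.values().sum();
    assert_eq!(q_total, 6); // ACG, CGT, GTT, TTT x3
    assert_eq!(t_total, 9); // TTT x7, TTA, TAC
    assert_eq!((q["TTT"], t["TTT"]), (3, 7));
    // Probability of overlap over the shared k-mer (TTT only):
    let prob = (q["TTT"] as f64 / q_total as f64) * (t["TTT"] as f64 / t_total as f64);
    assert!((prob - 7.0 / 18.0).abs() < 1e-12); // 3/6 * 7/9
    println!("probability of overlap: {prob}");
}
```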