Add probability of overlap and weighted containment for Multisearch matches #458
Conversation
…nto utils.rs for now
This function uses the log of probabilities to prevent underflow; for the behavior of Rust's log functions, see the `f64::ln` documentation: https://doc.rust-lang.org/std/primitive.f64.html#method.ln

EDIT: update code formatting
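The log-space trick mentioned above can be sketched as follows; this is an illustrative standalone example, not the PR's actual `utils.rs` code:

```rust
// Multiplying many small per-k-mer probabilities underflows f64 to 0.0;
// summing their natural logs (f64::ln) keeps the value finite.
fn prob_product_naive(freqs: &[f64]) -> f64 {
    freqs.iter().product()
}

fn prob_product_log(freqs: &[f64]) -> f64 {
    // Work in log space: ln(a * b) = ln(a) + ln(b).
    freqs.iter().map(|f| f.ln()).sum()
}

fn main() {
    let freqs = vec![1e-300; 2];
    // Direct multiplication underflows to exactly 0.0 ...
    assert_eq!(prob_product_naive(&freqs), 0.0);
    // ... while the log-space sum stays finite (roughly -1381.5).
    let log_p = prob_product_log(&freqs);
    assert!(log_p.is_finite() && log_p < 0.0);
    println!("log probability: {log_p}");
}
```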
…ion hashes of all queries and all database minhash
…e standard library
…ainment_adjusted_log10 values to test_multisearch
It's interesting that while the containment varies from ~0.48-1, the
As a follow-up, from this notebook, I compared all human GENCODE proteins vs Botryllus schlosseri proteins. I was experimenting with how to avoid very common k-mers, and in particular hits to Titin, the largest known protein with 25,000-35,000 amino acids per protein (!!!). This method, which uses the frequency of k-mers across all queries and againsts, subsets to only the overlapping k-mers between a single query and against, multiplies each pair, and takes the sum, was successful in getting rid of the spurious matches to Titin. However, this method doesn't take the length of the query or against into account; it only uses the frequencies of the k-mers across all queries/againsts. Here are some plots to show the distribution of p-values, containment, and adjusted containment:

[Plots: adjusted p-value distribution; containment (original); containment adjusted, log10]

I think the bump to the left is all false positives, caused by spurious matches from very common k-mers.
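A minimal sketch of the scheme described above (function and variable names are mine, not from the PR): restrict to the k-mers shared by one query/against pair, multiply their global frequencies pairwise, and sum.

```rust
use std::collections::{HashMap, HashSet};

// Probability of overlap for one query/against pair, given k-mer
// frequencies computed across ALL queries and ALL againsts.
fn prob_overlap(
    query_kmers: &HashSet<String>,
    against_kmers: &HashSet<String>,
    query_freqs: &HashMap<String, f64>,
    against_freqs: &HashMap<String, f64>,
) -> f64 {
    query_kmers
        .intersection(against_kmers)
        .map(|k| {
            query_freqs.get(k).copied().unwrap_or(0.0)
                * against_freqs.get(k).copied().unwrap_or(0.0)
        })
        .sum()
}

fn main() {
    let q: HashSet<String> = ["ACG", "TTT"].iter().map(|s| s.to_string()).collect();
    let a: HashSet<String> = ["TTT", "TAC"].iter().map(|s| s.to_string()).collect();
    // Toy global frequencies, for illustration only:
    let qf = HashMap::from([("ACG".to_string(), 0.1), ("TTT".to_string(), 0.5)]);
    let af = HashMap::from([("TTT".to_string(), 0.25), ("TAC".to_string(), 0.2)]);
    // Only TTT is shared, so the sum has a single term: 0.5 * 0.25 = 0.125.
    let p = prob_overlap(&q, &a, &qf, &af);
    assert!((p - 0.125).abs() < 1e-12);
    println!("probability of overlap: {p}");
}
```

Very common k-mers (like Titin's low-complexity runs) drive this probability up, which is what lets it flag spuriously high-containment matches.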
…usted, containment_adjusted_log10
Hi @ctb, I have addressed your requests, but the tests are still failing and I have some questions. Why is there a difference in
It looks like maybe some cruft left over from using a development branch of sourmash, vs the official crates.io release. I would suggest this: try merging and then doing
….py:test_against_multisigfile`
Please let me know when ready for re-review!
I did that, but still get the same diff 😕 What I'm confused about is `sourmash_plugin_branchwater/Cargo.lock` lines 1561 to 1563 in c5f5866. I can change the `[[package]]` entry to:

```toml
[[package]]
name = "sourmash"
version = "0.17.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8ce05fed73303390f6f208d6640f390cd999db0af0b6c007d60db2794ad5fcc0"
```
If you merge in main, and then change it to the rust-lang crate, you shouldn't get merge conflicts out of it? In any case, I can deal with it in review, as long as your tests pass with whatever you have on the branch!
Ready for re-review @ctb!
src/multisearch.rs (outdated)

```diff
@@ -45,10 +168,10 @@ pub fn multisearch(
     let ksize = selection.ksize().unwrap() as f64;

-    let mut new_selection = selection;
+    let mut new_selection = selection.clone();
```
CTB: to check. This clone should maybe not be needed?
I was correct that the clone is not required; I like not having it because it consumes `selection`, so you can't reuse `selection` accidentally below.
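The point about consuming `selection` is ordinary Rust move semantics; a minimal sketch with a stand-in `Selection` struct (not sourmash's actual type):

```rust
// A non-Copy type: passing it by value moves it into the callee.
struct Selection {
    ksize: u32,
}

fn into_ksize(sel: Selection) -> u32 {
    // `sel` is consumed here; the caller can no longer use it.
    sel.ksize
}

fn main() {
    let selection = Selection { ksize: 21 };
    let ksize = into_ksize(selection); // moves `selection`
    assert_eq!(ksize, 21);
    // into_ksize(selection); // compile error: use of moved value
}
```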
Yep, fixed! This caught a real problem, because I was using `selection` accidentally below .. so yay memory safety!
src/python/tests/test_fastgather.py (outdated)

```diff
@@ -494,7 +494,11 @@ def test_against_multisigfile(runtmp, zip_against):
         "100000",
     )
     df = pandas.read_csv(g_output)
-    assert len(df) == 3
+    if zip_against:
```
I'm pretty sure this set of changes is unintentional. Can you revert to what's in main?
Overall looks great! Only one or two minor changes left and then I can approve.
Note that I bumped sourmash to sourmash v0.17.1 in Cargo.toml.
P.S. Please let me know whether or not you'd like to merge it once I approve it!
As originally explored in this notebook, some high-containment hits are the result of highly frequent k-mers, and I want to downweight the containment by the probability of overlap. As I imagine it now, the frequencies would be computed based on all queries and all database signatures.
Here is a worked example; please let me know if I am missing something:

query:
ACGTTTTT
3-mers (6 total):
ACG
CGT
GTT
TTT ×3

target:
TTTTTTTTTAC
3-mers (9 total):
TTT ×7
TTA
TAC

Containment:
intersecting k-mers in query / intersecting k-mers in target
= 3/7

Probability of overlap:
frequency of intersecting k-mers in query × frequency of intersecting k-mers in target
= 3/6 × 7/9
= 1/2 × 7/9
= 7/18

Weighted containment:
Containment / Probability of overlap = Containment × (1 / Probability of overlap)
= 3/7 × 18/7
= 54/49

Update: fix k-mers in example
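The arithmetic above can be sanity-checked by counting 3-mers directly from the two sequences; this is an illustrative sketch, not code from the PR:

```rust
use std::collections::HashMap;

// Count every k-mer (with multiplicity) in a sequence.
fn kmer_counts(seq: &str, k: usize) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for i in 0..=seq.len() - k {
        *counts.entry(&seq[i..i + k]).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let q = kmer_counts("ACGTTTTT", 3); // query
    let t = kmer_counts("TTTTTTTTTAC", 3); // target
    let q_total: usize = q.values().sum();
    let t_total: usize = t.values().sum();
    assert_eq!(q_total, 6); // ACG, CGT, GTT, TTT x3
    assert_eq!(t_total, 9); // TTT x7, TTA, TAC
    assert_eq!((q["TTT"], t["TTT"]), (3, 7));
    // Probability of overlap over the shared k-mer (TTT only):
    let prob = (q["TTT"] as f64 / q_total as f64) * (t["TTT"] as f64 / t_total as f64);
    assert!((prob - 7.0 / 18.0).abs() < 1e-12); // 3/6 * 7/9
    println!("probability of overlap: {prob}");
}
```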