-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MRG: improve downsampling behavior on KmerMinHash
; fix RevIndex::gather
bug around scaled
.
#3342
Conversation
downsample_max_hash
in terms of downsample_scaled
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## latest #3342 +/- ##
==========================================
- Coverage 86.50% 86.42% -0.09%
==========================================
Files 137 137
Lines 16046 16069 +23
Branches 2211 2211
==========================================
+ Hits 13881 13888 +7
- Misses 1858 1874 +16
Partials 307 307
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
downsample_max_hash
in terms of downsample_scaled
KmerMinHash
; fix RevIndex::gather
bug around scaled
.
KmerMinHash
; fix RevIndex::gather
bug around scaled
.KmerMinHash
; fix RevIndex::gather
bug around scaled
.
Ready for review & merge @sourmash-bio/devs !! |
(if & when this is merged, I will cut a new core release too.) |
…nHash (#3348) Ref: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/467/files#r1797783380 Implement `TryInto<KmerMinHash>` for Signature and SigStore to avoid having to clone a (potentially big) minhash sketch.
fastmultigather
performance degradation
sourmash-bio/sourmash_plugin_branchwater#472
CodSpeed Performance ReportMerging #3342 will degrade performances by 16.6%Comparing Summary
Benchmarks breakdown
|
(self, other) | ||
} else { | ||
(other, self) | ||
}; | ||
let downsampled_mh = second.downsample_max_hash(first.max_hash)?; | ||
let downsampled_mh = second.clone().downsample_scaled(first.scaled())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can do (in a future PR, not this one) a new .downsampled_iter(scaled)
for operations like count_common
, and avoid the conversion.
The downsampled iter would iterate over values that are in the appropriate scaled value, but wouldn't need to create new minhash sketches (can reuse the largest one and stop returning values once they go over max_hash
, for example)
…ing (#3352) This PR builds on the refactoring in #3342 to do less downsampling and also avoids doing intersections twice (per #3196). Benchmarks in sourmash-bio/sourmash_plugin_branchwater#471 are pretty astonishing... Fixes #3196 --------- Co-authored-by: Luiz Irber <[email protected]>
## [0.16.0] - 2024-10-15 MSRV: 1.65 Changes/additions: * refactor `calculate_gather_stats` to disallow repeated downsampling (#3352) * improve downsampling behavior on `KmerMinHash`; fix `RevIndex::gather` bug around `scaled`. (#3342) * derive Hash for `HashFunctions` (#3344) Updates: * Bump web-sys from 0.3.70 to 0.3.72 (#3354) * Bump tempfile from 3.12.0 to 3.13.0 (#3340)
This PR does five things:
First, it swaps the implementation of
KmerMinHash::downsample_max_hash
withKmerMinHash::downsample_scaled
, and the same forKmerMinHashBTree
. Previously a call todownsample_scaled
calculated the rightmax_hash
fromscaled
, then calleddownsample_max_hash
, which then convertedmax_hash
back toscaled
. This reverses the logic so that (slightly) less work is done and, more importantly, the code is a bit more straightforward.Second, it changes the
downsample_*
functions so that they do not downsample when no downsampling is needed. As part of this the method signatures are changed to take an object, rather than a reference. This lets the functions return an unmodifiedKmerMinHash
when no downsampling is needed.Third, it turns out the
downsample_*
functions didn't check to make sure that the newscaled
value was larger than the old one, i.e. they didn't prevent upsampling. That check was added and a new error,CannotUpsampleScaled
, was added to sourmash core.Fourth, this uncovered a bug in
RevIndex::gather
where the query was downsampled to the match, even when the match was lower scaled. This PR rejiggers the code so that downsampling is done appropriately in thegather
andcalculate_gather_stats
. SinceRevIndex::gather
isn't used in the the sourmash CLI, the bug only presented in the test suite and in the branchwater plugin; see sourmash-bio/sourmash_plugin_branchwater#468 and sourmash-bio/sourmash_plugin_branchwater#467, where a fastmultigather test had to be fixed because of the incorrect scaled values output byRevIndex::gather
.Fifth, it includes #3348 from @luizirber, which adds a
Signature::try_into()
toKmerMinHash
to support the elimination of some clones.Because of the method signature change for the
downsample_*
functions, the sourmash-core version needs to be bumped to a new major version, 0.16.0.It's been a fun journey! 😅
Fixes #3343
Some notes on further changes and performance implications:
As a consequence of the
RevIndex::gather
changes, redundant downsampling has to be done inRevIndex::gather
andcalculate_gather_stats
, unless we want to change the method signature ofcalculate_gather_stats
. I decided the PR was big enough that I didn't want to do that in addition. It should not affect most use cases wherescaled
is the same, and we will see if it results in any slowdowns over in the branchwater plugin. See #3196 for an issue on all of this.We could also just insist that the query scaled is the one to pay attention to, per #2951. This would simplify the code in Python-land as well.
Overall, the performance implications of this PR are not clear. Previously downsampling was being done even when it wasn't needed, so this may speed things up quite a lot for our typical use case! On the other hand, redundant downsampling will happen in cases where there are scaled mismatches. We just need to benchmark it, I think.
Some preliminary benchmarking reported in sourmash-bio/sourmash_plugin_branchwater#430 (comment) suggests that fastgather is now much more memory effficient 🎉 so that's good!
TODO:
Err
or what if the downsampling can't be performed?