[MRG] Produce list of hashes from a sequence #1653
Codecov Report
@@ Coverage Diff @@
## latest #1653 +/- ##
==========================================
- Coverage 82.68% 82.63% -0.05%
==========================================
Files 113 113
Lines 11902 11995 +93
Branches 1511 1513 +2
==========================================
+ Hits 9841 9912 +71
- Misses 1807 1829 +22
Partials 254 254
@ctb Would you please check and see if that is the expected behavior of the new function? |
wow, looks good to me so far! It would be good to check that it works for translated sequence and for protein sequence, as well as different |
src/core/src/signature.rs
fn seq_to_hashes(&self, seq: &[u8], force: bool) -> Result<Vec<u64>, Error> {
let mut seq_hashes: Vec<u64> = Vec::new();
This will trigger the allocation of a (possibly quite large) vector to hold all the hashes. That makes sense for a function that returns all hashes, but for the more common use case of adding the hashes to the MinHash (without needing the kmer -> hash mapping) it will make performance much worse.
Suggestion: make `seq_to_hashes` a free function (not a part of the `SigsTrait` trait), move most of the implementation in `add_sequence` to it, but have `seq_to_hashes` return an `Iterator` instead. For the FFI, collect all the hashes generated by `seq_to_hashes` into a vector and return it. This way it is up to the caller (`add_sequence` here or `kmerminhash_seq_to_hashes` in the FFI) to decide whether to allocate the vector or just consume the values one by one.
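The suggestion above can be sketched roughly as follows. This is only an illustration, not the sourmash implementation: `kmer_hashes`, `toy_hash`, `min_hash` and `all_hashes` are hypothetical names, and the standard-library `DefaultHasher` stands in for whatever hashing the real code performs. It also ignores the `force` flag, reverse complements and protein translation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash function (assumption for illustration only;
// the real code hashes k-mers differently).
fn toy_hash(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

// Hypothetical free function: lazily yields one hash per k-mer window.
// Nothing is allocated here. Panics if `ksize` is 0 (a `windows` requirement).
fn kmer_hashes<'a>(seq: &'a [u8], ksize: usize) -> impl Iterator<Item = u64> + 'a {
    seq.windows(ksize).map(toy_hash)
}

// Caller A (in the role of `add_sequence`): consumes hashes one by one,
// here simply keeping the smallest, without building a vector.
fn min_hash(seq: &[u8], ksize: usize) -> Option<u64> {
    kmer_hashes(seq, ksize).min()
}

// Caller B (in the role of the FFI wrapper): needs every hash,
// so it chooses to pay for the allocation.
fn all_hashes(seq: &[u8], ksize: usize) -> Vec<u64> {
    kmer_hashes(seq, ksize).collect()
}
```

Both callers share the same iterator; only `all_hashes` allocates, which is the point of leaving the decision to the caller.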
`std::iter::from_fn` might be a shortcut, or you might want to implement `Iterator` more explicitly to have more control
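A minimal sketch of the two options, under the same caveats: `hashes_from_fn`, `KmerHashes` and `toy_hash` are hypothetical names and `DefaultHasher` is a stand-in hash, not what sourmash uses. The explicit implementation shows one kind of extra control, an exact `size_hint`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash function (assumption for illustration only).
fn toy_hash(kmer: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    kmer.hash(&mut hasher);
    hasher.finish()
}

// Option 1: `std::iter::from_fn` wraps a closure that carries the position.
fn hashes_from_fn<'a>(seq: &'a [u8], ksize: usize) -> impl Iterator<Item = u64> + 'a {
    let mut pos: usize = 0;
    std::iter::from_fn(move || {
        if ksize == 0 {
            return None;
        }
        let end = pos.checked_add(ksize)?;
        let kmer = seq.get(pos..end)?; // None once the window runs off the end
        pos += 1;
        Some(toy_hash(kmer))
    })
}

// Option 2: an explicit iterator type with the same stepping logic.
struct KmerHashes<'a> {
    seq: &'a [u8],
    ksize: usize,
    pos: usize,
}

impl<'a> KmerHashes<'a> {
    fn new(seq: &'a [u8], ksize: usize) -> Self {
        KmerHashes { seq, ksize, pos: 0 }
    }
}

impl<'a> Iterator for KmerHashes<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.ksize == 0 {
            return None;
        }
        let end = self.pos.checked_add(self.ksize)?;
        let kmer = self.seq.get(self.pos..end)?;
        self.pos += 1;
        Some(toy_hash(kmer))
    }

    // Extra control: report exactly how many k-mers remain,
    // so `collect` can reserve the right capacity up front.
    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = if self.ksize == 0 {
            0
        } else {
            (self.seq.len() + 1).saturating_sub(self.pos.saturating_add(self.ksize))
        };
        (remaining, Some(remaining))
    }
}
```

The closure version is shorter; the struct version is easier to extend with state such as error handling or a reverse-complement buffer.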
I got the point, thanks. Here is the asv benchmark as supporting evidence.
before after ratio
[21f5e632] [7d4ecb6f]
<v4.2.0^0> <mo/seq_to_hashes>
+ 29.9M 37.4M 1.25 benchmarks.PeakmemMinHashSuite.peakmem_add_sequence
+ 29.9M 36.7M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_many
+ 29.9M 36.7M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_protein
+ 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_hash
+ 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_many
+ 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_protein
+ 30.5M 37.4M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_sequence
+ 30.6M 37.5M 1.22 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_hash
+ 84.6±0.5μs 99.2±7μs 1.17 benchmarks.TimeMinHashSuite.time_add_sequence
+ 88.0±1μs 101±0.8μs 1.15 benchmarks.TimeMinAbundanceSuite.time_add_sequence
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
ERROR: InvocationError for command /home/mabuelanin/dib-dev/sourmash/.tox/asv/bin/asv continuous latest HEAD (exited with code 1)
While researching this, I found that Rust experimentally supports Generators. I think this could be very useful in terms of memory reduction and lazy execution, maybe later when the feature is stable.
Co-authored-by: Luiz Irber <[email protected]>
Note that this doesn't return the hashes returned by |
I think this difference in behavior is not related to the changes made in this PR. The protein flag is saved when the MinHash object is created (sourmash/src/sourmash/minhash.py, line 157 in eeb1874).
The hashing function is then selected accordingly (sourmash/src/sourmash/minhash.py, lines 204 to 206 in eeb1874).
From my understanding of the code, these two paths lead to two different behaviors.
∵ If the What do you think? |
I'm going to embarrass myself here, because I'm not up to digging into the code right now, but ISTR -
We do eventually want to be able to get hashes from both, and it may be a good time to ...reconsider the |
Ok, I got it now. Then the fix is as you've proposed, passing |
Yes, I think so! Note the key code in command_compute.py, for constructing sketches:
Here, See also #186 where changing the API to And, finally, see #1057 for @luizirber suggestion on Rust code reorg. This doesn't all need to get done here, of course! But we'd get a lot of value out of being able to get the hashes from both |
Err, this was already done in #1223 🙈 |
@luizirber not too close, please 😬 |
yep. I don't use rebase, but I'll manage the merge resolution. Note that github automatically switches the base to latest when the current base is merged there, too! |
Cool! Good to know 😄 |
The Are the Rust FFI files manually ignored in Codecov? I think it's from the Codecov report. I will try disabling it in the |
Main issue is that FFI files don't have their coverage measured. I guess something that can do coverage of C code (like |
There is a bit of a regression on the performance (executed with
|
Some minor comments, but this looks great!
@mr-eyes shall I merge? |
Yes! |
🎉 |
Description
minhash sketch.
TODO
Rust function seq_to_hashes.
add_sequence() to use the seq_to_hashes() function.