Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Produce list of hashes from a sequence #1653

Merged
merged 37 commits into from
Jul 21, 2021
Merged

[MRG] Produce list of hashes from a sequence #1653

merged 37 commits into from
Jul 21, 2021

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Jul 5, 2021

Description

TODO

  • Add test case
  • Implement the new Rust function seq_to_hashes.
  • Refactor add_sequence() to use the seq_to_hashes() function.
  • Python Binding.

⚠️ This PR will touch all the layers (Python, C header, Rust).

@codecov
Copy link

codecov bot commented Jul 5, 2021

Codecov Report

Merging #1653 (d25b54e) into latest (b223638) will decrease coverage by 0.04%.
The diff coverage is 79.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1653      +/-   ##
==========================================
- Coverage   82.68%   82.63%   -0.05%     
==========================================
  Files         113      113              
  Lines       11902    11995      +93     
  Branches     1511     1513       +2     
==========================================
+ Hits         9841     9912      +71     
- Misses       1807     1829      +22     
  Partials      254      254              
Flag Coverage Δ
python 90.13% <100.00%> (+0.01%) ⬆️
rust 66.45% <77.93%> (+0.16%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/ffi/minhash.rs 0.00% <0.00%> (ø)
src/core/src/signature.rs 60.84% <82.56%> (+2.66%) ⬆️
src/core/tests/minhash.rs 99.32% <92.00%> (-0.44%) ⬇️
src/sourmash/minhash.py 91.62% <100.00%> (+0.19%) ⬆️
src/core/src/sketch/minhash.rs 86.93% <0.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b223638...d25b54e. Read the comment docs.

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 5, 2021

@ctb Would you please check and see if that is the expected behavior of the new function?

@ctb
Copy link
Contributor

ctb commented Jul 5, 2021

wow, looks good to me so far!

It would be good to check that it works for translated sequence and for protein sequence, as well as different MinHash seeds.

Comment on lines 37 to 38
fn seq_to_hashes(&self, seq: &[u8], force: bool) -> Result<Vec<u64>, Error> {
let mut seq_hashes: Vec<u64> = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will trigger the allocation of a (possibly quite large) vector to hold all the hashes. Although this makes sense for a function that returns all hashes, the issue is that for the more common use case of adding the hashes to the MinHash (without needing the kmer -> hash mapping) it will make performance much worse.

Suggestion: make seq_to_hashes into a free function (not a part of the SigsTrait trait), move most of the implementation in add_sequence to it, but seq_to_hashes return an Iterator instead. For the FFI then collect all the hashes generated by seq_to_hashes into a vector and return it. This way it is up to the caller (add_sequence here or kmerminhash_seq_to_hashes in the FFI) to decide if it wants to allocate the vector or just consume the values one by one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::iter::from_fn might be a shortcut, or you might want to implement Iterator more explicitly to have more control

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got the point thanks. And here is the asv benchmark supporting evidence.

       before           after         ratio
     [21f5e632]       [7d4ecb6f]
     <v4.2.0^0>       <mo/seq_to_hashes>
+           29.9M            37.4M     1.25  benchmarks.PeakmemMinHashSuite.peakmem_add_sequence
+           29.9M            36.7M     1.23  benchmarks.PeakmemMinAbundanceSuite.peakmem_add_many
+           29.9M            36.7M     1.23  benchmarks.PeakmemMinAbundanceSuite.peakmem_add_protein
+           29.9M            36.7M     1.23  benchmarks.PeakmemMinHashSuite.peakmem_add_hash
+           29.9M            36.7M     1.23  benchmarks.PeakmemMinHashSuite.peakmem_add_many
+           29.9M            36.7M     1.23  benchmarks.PeakmemMinHashSuite.peakmem_add_protein
+           30.5M            37.4M     1.23  benchmarks.PeakmemMinAbundanceSuite.peakmem_add_sequence
+           30.6M            37.5M     1.22  benchmarks.PeakmemMinAbundanceSuite.peakmem_add_hash
+      84.6±0.5μs         99.2±7μs     1.17  benchmarks.TimeMinHashSuite.time_add_sequence
+        88.0±1μs        101±0.8μs     1.15  benchmarks.TimeMinAbundanceSuite.time_add_sequence

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
ERROR: InvocationError for command /home/mabuelanin/dib-dev/sourmash/.tox/asv/bin/asv continuous latest HEAD (exited with code 1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While researching this, I found that Rust experimentally supports Generators. I think this could be very useful in terms of memory reduction and lazy execution, maybe later when the feature is stable.

@ctb
Copy link
Contributor

ctb commented Jul 8, 2021

Note that this doesn't return the hashes returned by MinHash.add_protein. Not sure what the right fix is here, but maybe for an initial Python API, have it be def seq_to_hashes(..., *, protein=True) with a default to False?

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 10, 2021

Note that this doesn't return the hashes returned by MinHash.add_protein. Not sure what the right fix is here, but maybe for an initial Python API, have it be def seq_to_hashes(..., *, protein=True) with a default to False?

I think this different behaviors is not related to the changes made in this PR.

The protein (flag) information is saved while creating the Minhash object.

is_protein=False,

Then the hashing function is selected accordingly,

elif is_protein:
hash_function = lib.HASH_FUNCTIONS_MURMUR64_PROTEIN
ksize = ksize*3

And from my understanding to the code, these two paths leads to two different behaviors

  1. Creating Minhash object with is_protein=True, then add_sequence()
  2. Creating Minhash object with is_protein=True, then add_protein()

seq_to_hashes() is called from a Minhash object to mimic it's config. and,
∵ the add_sequence() function is initially designed to auto detect the hash function initialized when instantiate the Minhash object dna or protein and then perform the hashing.
∴ I think, the fix here is to replace this code chunk that's responsible for the protein hashing in the add_sequence() function and invoke the add_protein() function directly. Then the add_protein() & add_sequence() functions should behave like seq_to_hashes().


If the seq_to_hashes() function will be defined freely as proposed in this review comment, then we can set the is_protein flag, but what about the other parameters? we will need them too (kSize, force, dayhoff).

What do you think?

@ctb
Copy link
Contributor

ctb commented Jul 10, 2021

I'm going to embarrass myself here, because I'm not up to digging into the code right now, but ISTR -

  • add_sequence(...) is currently for adding DNA sequence, so on MinHash with is_protein set to True, it will translate the DNA sequence.
  • add_protein(...) is for adding protein sequences directly, without translation.

We do eventually want to be able to get hashes from both, and it may be a good time to ...reconsider the MinHash interface a bit. I'll look at it later today.

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 10, 2021

Ok, I got it now. Then the fix is as you've proposed, passing is_protein to seq_to_hashes() to directly call the code inside add_protein(). I will implement the iterator, and modify seq_to_hashes() to be invoked from both add_sequence() & add_protein().

@ctb
Copy link
Contributor

ctb commented Jul 10, 2021

Ok, I got it now. Then the fix is as you've proposed, passing is_protein to seq_to_hashes() to directly call the code inside add_protein(). I will implement the iterator, and modify seq_to_hashes() to be invoked from both add_sequence() & add_protein().

Yes, I think so! Note the key code in command_compute.py, for constructing sketches:

def add_seq(sigs, seq, input_is_protein, check_sequence):
    for sig in sigs:
        if input_is_protein:
            sig.add_protein(seq)
        else:
            sig.add_sequence(seq, not check_sequence)

Here, input_is_protein is True for sourmash sketch protein, and is False for sourmash sketch dna and sourmash sketch translate.

See also #186 where changing the API to add_sequence(..., sequence_type=dna/protein) is suggested.

And, finally, see #1057 for @luizirber suggestion on Rust code reorg.

This doesn't all need to get done here, of course! But we'd get a lot of value out of being able to get the hashes from both add_sequence and add_protein.

@luizirber
Copy link
Member

And, finally, see #1057 for @luizirber suggestion on Rust code reorg.

Err, this was already done in #1223 🙈

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 17, 2021

I'll take a closer look tomorrow

@luizirber not too close, please 😬

@ctb
Copy link
Contributor

ctb commented Jul 17, 2021

hi @mr-eyes this looks great on the Python side!

Over in #1673 I've added more tests that nail down the behavior even more; since I also add new behavior related to stop codons, and more work is needed on that PR, I don't want to merge it in here, though.

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 17, 2021

Over in #1673 I've added more tests that nail down the behavior even more; since I also add new behavior related to stop codons, and more work is needed on that PR, I don't want to merge it in here, though.

I'd suggest merging #1653 into latest then rebasing #1673 to latest

@ctb
Copy link
Contributor

ctb commented Jul 17, 2021

yep. I don't use rebase, but I'll manage the merge resolution. Note that github automatically switches the base to latest when the current base is merged there, too!

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 17, 2021

Cool! Good to know 😄

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 19, 2021

The asv benchmark check finally passed!

Are the Rust FFI files manually ignored in Codecov? I think it's from the Codecov report. I will try disabling it in the codecov.yml.

@luizirber
Copy link
Member

Are the Rust FFI files manually ignored in Codecov? I think it's from the Codecov report. I will try disabling it in the codecov.yml.

Main issue is that FFI files don't have their coverage measured. I guess something that can do coverage of C code (like gcov?) should work, but never managed to actually make it work... Deactivating from codecov is probably the best solution at the moment.

@luizirber
Copy link
Member

luizirber commented Jul 21, 2021

There is a bit of a regression on the performance (executed with cargo bench -- --save-baseline latest on the latest branch, and then cargo bench -- --baseline latest in this branch), but that's not really a blocker:

add_sequence/valid      time:   [3.1378 ms 3.1464 ms 3.1617 ms]
                        change: [+9.9048% +10.293% +10.750%] (p = 0.00 < 0.05)
                        Performance has regressed.
add_sequence/lowercase  time:   [3.1118 ms 3.1153 ms 3.1196 ms]
                        change: [+8.6023% +8.8512% +9.0956%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

add_sequence/invalid kmers
                        time:   [2.3074 ms 2.3094 ms 2.3109 ms]
                        change: [+6.1021% +6.3405% +6.5829%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

add_sequence/force with valid kmers
                        time:   [3.1026 ms 3.1138 ms 3.1265 ms]
                        change: [+3.7008% +4.0898% +4.4566%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Copy link
Member

@luizirber luizirber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments, but this looks great!

@ctb
Copy link
Contributor

ctb commented Jul 21, 2021

@mr-eyes shall I merge?

@mr-eyes
Copy link
Member Author

mr-eyes commented Jul 21, 2021

@mr-eyes shall I merge?

Yes!

@ctb ctb merged commit 22c858f into latest Jul 21, 2021
@ctb ctb deleted the mo/seq_to_hashes branch July 21, 2021 12:57
@ctb
Copy link
Contributor

ctb commented Jul 21, 2021

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose a function to produce a list of all hashes from a sequence
3 participants