[MRG] Produce list of hashes from a sequence #1653

mr-eyes · 2021-07-05T02:16:50Z

Description

This PR exposes a new function into the sourmash core API to retrieve all hashes for a sequence without the need to add the hashes to the minhash sketch.
Fixes Expose a function to produce a list of all hashes from a sequence #1214

TODO

Add test case
Implement the new Rust function seq_to_hashes.
Refactor add_sequence() to use the seq_to_hashes() function.
Python Binding.

⚠️ This PR will touch all the layers (Python, C header, Rust).

codecov · 2021-07-05T02:23:08Z

Codecov Report

Merging #1653 (d25b54e) into latest (b223638) will decrease coverage by 0.04%.
The diff coverage is 79.22%.

@@            Coverage Diff             @@
##           latest    #1653      +/-   ##
==========================================
- Coverage   82.68%   82.63%   -0.05%     
==========================================
  Files         113      113              
  Lines       11902    11995      +93     
  Branches     1511     1513       +2     
==========================================
+ Hits         9841     9912      +71     
- Misses       1807     1829      +22     
  Partials      254      254

Flag	Coverage Δ
python	`90.13% <100.00%> (+0.01%)`	⬆️
rust	`66.45% <77.93%> (+0.16%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/core/src/ffi/minhash.rs	`0.00% <0.00%> (ø)`
src/core/src/signature.rs	`60.84% <82.56%> (+2.66%)`	⬆️
src/core/tests/minhash.rs	`99.32% <92.00%> (-0.44%)`	⬇️
src/sourmash/minhash.py	`91.62% <100.00%> (+0.19%)`	⬆️
src/core/src/sketch/minhash.rs	`86.93% <0.00%> (+0.11%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b223638...d25b54e. Read the comment docs.

mr-eyes · 2021-07-05T22:08:40Z

@ctb Would you please check and see if that is the expected behavior of the new function?

ctb · 2021-07-05T22:11:59Z

wow, looks good to me so far!

It would be good to check that it works for translated sequence and for protein sequence, as well as different MinHash seeds.

src/core/src/signature.rs

luizirber · 2021-07-07T16:17:43Z

src/core/src/signature.rs

+    fn seq_to_hashes(&self, seq: &[u8], force: bool) -> Result<Vec<u64>, Error> {
+        let mut seq_hashes: Vec<u64> = Vec::new();


This will trigger the allocation of a (possibly quite large) vector to hold all the hashes. Although this makes sense for a function that returns all hashes, the issue is that for the more common use case of adding the hashes to the MinHash (without needing the kmer -> hash mapping) it will make performance much worse.

Suggestion: make seq_to_hashes into a free function (not a part of the SigsTrait trait), move most of the implementation in add_sequence to it, but seq_to_hashes return an Iterator instead. For the FFI then collect all the hashes generated by seq_to_hashes into a vector and return it. This way it is up to the caller (add_sequence here or kmerminhash_seq_to_hashes in the FFI) to decide if it wants to allocate the vector or just consume the values one by one.

std::iter::from_fn might be a shortcut, or you might want to implement Iterator more explicitly to have more control

I got the point thanks. And here is the asv benchmark supporting evidence.

before after ratio [21f5e632] [7d4ecb6f] <v4.2.0^0> <mo/seq_to_hashes> + 29.9M 37.4M 1.25 benchmarks.PeakmemMinHashSuite.peakmem_add_sequence + 29.9M 36.7M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_many + 29.9M 36.7M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_protein + 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_hash + 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_many + 29.9M 36.7M 1.23 benchmarks.PeakmemMinHashSuite.peakmem_add_protein + 30.5M 37.4M 1.23 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_sequence + 30.6M 37.5M 1.22 benchmarks.PeakmemMinAbundanceSuite.peakmem_add_hash + 84.6±0.5μs 99.2±7μs 1.17 benchmarks.TimeMinHashSuite.time_add_sequence + 88.0±1μs 101±0.8μs 1.15 benchmarks.TimeMinAbundanceSuite.time_add_sequence SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY. PERFORMANCE DECREASED. ERROR: InvocationError for command /home/mabuelanin/dib-dev/sourmash/.tox/asv/bin/asv continuous latest HEAD (exited with code 1)

While researching this, I found that Rust experimentally supports Generators. I think this could be very useful in terms of memory reduction and lazy execution, maybe later when the feature is stable.

src/core/src/ffi/minhash.rs

Co-authored-by: Luiz Irber <[email protected]>

ctb · 2021-07-08T17:09:57Z

Note that this doesn't return the hashes returned by MinHash.add_protein. Not sure what the right fix is here, but maybe for an initial Python API, have it be def seq_to_hashes(..., *, protein=True) with a default to False?

mr-eyes · 2021-07-10T15:36:29Z

Note that this doesn't return the hashes returned by MinHash.add_protein. Not sure what the right fix is here, but maybe for an initial Python API, have it be def seq_to_hashes(..., *, protein=True) with a default to False?

I think this different behaviors is not related to the changes made in this PR.

The protein (flag) information is saved while creating the Minhash object.

sourmash/src/sourmash/minhash.py

Line 157 in eeb1874

is_protein=False,

Then the hashing function is selected accordingly,

sourmash/src/sourmash/minhash.py

Lines 204 to 206 in eeb1874

    
           elif is_protein: 
        
               hash_function = lib.HASH_FUNCTIONS_MURMUR64_PROTEIN 
        
               ksize = ksize*3

And from my understanding to the code, these two paths leads to two different behaviors

Creating Minhash object with is_protein=True, then add_sequence()
Creating Minhash object with is_protein=True, then add_protein()

∵ seq_to_hashes() is called from a Minhash object to mimic it's config. and,
∵ the add_sequence() function is initially designed to auto detect the hash function initialized when instantiate the Minhash object dna or protein and then perform the hashing.
∴ I think, the fix here is to replace this code chunk that's responsible for the protein hashing in the add_sequence() function and invoke the add_protein() function directly. Then the add_protein() & add_sequence() functions should behave like seq_to_hashes().

If the seq_to_hashes() function will be defined freely as proposed in this review comment, then we can set the is_protein flag, but what about the other parameters? we will need them too (kSize, force, dayhoff).

What do you think?

ctb · 2021-07-10T16:33:47Z

I'm going to embarrass myself here, because I'm not up to digging into the code right now, but ISTR -

add_sequence(...) is currently for adding DNA sequence, so on MinHash with is_protein set to True, it will translate the DNA sequence.
add_protein(...) is for adding protein sequences directly, without translation.

We do eventually want to be able to get hashes from both, and it may be a good time to ...reconsider the MinHash interface a bit. I'll look at it later today.

mr-eyes · 2021-07-10T16:40:39Z

Ok, I got it now. Then the fix is as you've proposed, passing is_protein to seq_to_hashes() to directly call the code inside add_protein(). I will implement the iterator, and modify seq_to_hashes() to be invoked from both add_sequence() & add_protein().

ctb · 2021-07-10T18:25:00Z

Ok, I got it now. Then the fix is as you've proposed, passing is_protein to seq_to_hashes() to directly call the code inside add_protein(). I will implement the iterator, and modify seq_to_hashes() to be invoked from both add_sequence() & add_protein().

Yes, I think so! Note the key code in command_compute.py, for constructing sketches:

def add_seq(sigs, seq, input_is_protein, check_sequence):
    for sig in sigs:
        if input_is_protein:
            sig.add_protein(seq)
        else:
            sig.add_sequence(seq, not check_sequence)

Here, input_is_protein is True for sourmash sketch protein, and is False for sourmash sketch dna and sourmash sketch translate.

See also #186 where changing the API to add_sequence(..., sequence_type=dna/protein) is suggested.

And, finally, see #1057 for @luizirber suggestion on Rust code reorg.

This doesn't all need to get done here, of course! But we'd get a lot of value out of being able to get the hashes from both add_sequence and add_protein.

luizirber · 2021-07-12T00:51:32Z

And, finally, see #1057 for @luizirber suggestion on Rust code reorg.

Err, this was already done in #1223 🙈

src/sourmash/minhash.py

mr-eyes · 2021-07-17T22:26:17Z

I'll take a closer look tomorrow

@luizirber not too close, please 😬

ctb · 2021-07-17T22:46:51Z

hi @mr-eyes this looks great on the Python side!

Over in #1673 I've added more tests that nail down the behavior even more; since I also add new behavior related to stop codons, and more work is needed on that PR, I don't want to merge it in here, though.

mr-eyes · 2021-07-17T22:50:48Z

Over in #1673 I've added more tests that nail down the behavior even more; since I also add new behavior related to stop codons, and more work is needed on that PR, I don't want to merge it in here, though.

I'd suggest merging #1653 into latest then rebasing #1673 to latest

ctb · 2021-07-17T23:04:13Z

yep. I don't use rebase, but I'll manage the merge resolution. Note that github automatically switches the base to latest when the current base is merged there, too!

mr-eyes · 2021-07-17T23:06:04Z

Cool! Good to know 😄

mr-eyes · 2021-07-19T12:33:24Z

The asv benchmark check finally passed!

Are the Rust FFI files manually ignored in Codecov? I think it's from the Codecov report. I will try disabling it in the codecov.yml.

luizirber · 2021-07-21T03:25:08Z

Are the Rust FFI files manually ignored in Codecov? I think it's from the Codecov report. I will try disabling it in the codecov.yml.

Main issue is that FFI files don't have their coverage measured. I guess something that can do coverage of C code (like gcov?) should work, but never managed to actually make it work... Deactivating from codecov is probably the best solution at the moment.

luizirber · 2021-07-21T03:27:58Z

There is a bit of a regression on the performance (executed with cargo bench -- --save-baseline latest on the latest branch, and then cargo bench -- --baseline latest in this branch), but that's not really a blocker:

add_sequence/valid      time:   [3.1378 ms 3.1464 ms 3.1617 ms]
                        change: [+9.9048% +10.293% +10.750%] (p = 0.00 < 0.05)
                        Performance has regressed.
add_sequence/lowercase  time:   [3.1118 ms 3.1153 ms 3.1196 ms]
                        change: [+8.6023% +8.8512% +9.0956%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

add_sequence/invalid kmers
                        time:   [2.3074 ms 2.3094 ms 2.3109 ms]
                        change: [+6.1021% +6.3405% +6.5829%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

add_sequence/force with valid kmers
                        time:   [3.1026 ms 3.1138 ms 3.1265 ms]
                        change: [+3.7008% +4.0898% +4.4566%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

src/core/src/signature.rs

luizirber

Some minor comments, but this looks great!

ctb · 2021-07-21T12:53:23Z

@mr-eyes shall I merge?

mr-eyes · 2021-07-21T12:56:23Z

@mr-eyes shall I merge?

Yes!

ctb · 2021-07-21T12:57:48Z

🎉

test for seq_to_hashes

8749f1e

working seq_to_hashes

8aa54b6

Merge branch 'latest' into mo/seq_to_hashes

c6ffc0b

luizirber reviewed Jul 7, 2021

View reviewed changes

src/core/src/signature.rs Outdated Show resolved Hide resolved

luizirber reviewed Jul 7, 2021

View reviewed changes

src/core/src/signature.rs Outdated Show resolved Hide resolved

luizirber reviewed Jul 7, 2021

View reviewed changes

src/core/src/ffi/minhash.rs Outdated Show resolved Hide resolved

luizirber reviewed Jul 7, 2021

View reviewed changes

src/core/src/ffi/minhash.rs Outdated Show resolved Hide resolved

mr-eyes and others added 6 commits July 8, 2021 08:40

Update src/core/src/signature.rs

9eb1ff8

Co-authored-by: Luiz Irber <[email protected]>

Update src/core/src/signature.rs

1b03491

Co-authored-by: Luiz Irber <[email protected]>

Update src/core/src/ffi/minhash.rs

64d9330

Co-authored-by: Luiz Irber <[email protected]>

Merge branch 'latest' into mo/seq_to_hashes

ac967ff

Update src/core/src/ffi/minhash.rs

7af9b37

Co-authored-by: Luiz Irber <[email protected]>

fixes to adapt luiz's suggestions

7d4ecb6

Merge branch 'latest' into mo/seq_to_hashes

897896d

This was referenced Jul 12, 2021

Direct access to hash fn #1663

Closed

What about using different hash functions (reversible, rolling, etc.) with sourmash? #1659

Open

mr-eyes added 3 commits July 13, 2021 18:10

support is_protein in seq_to_hashes

9bfeaa4

extended tests for seq_to_hashes

050f3c0

Merge branch 'latest' into mo/seq_to_hashes

f506a08

ctb reviewed Jul 15, 2021

View reviewed changes

src/sourmash/minhash.py Outdated Show resolved Hide resolved

mr-eyes added 5 commits July 18, 2021 21:25

more rust test coverage

15dcb71

iterator refactoring

0124a28

partially enhance add_protein

c4bf154

fully enhance add_protein

fdeae5b

ignore ffi coverage

46edafd

luizirber reviewed Jul 21, 2021

View reviewed changes

src/core/src/signature.rs Outdated Show resolved Hide resolved

luizirber reviewed Jul 21, 2021

View reviewed changes

src/core/src/signature.rs Outdated Show resolved Hide resolved

luizirber reviewed Jul 21, 2021

View reviewed changes

src/core/src/signature.rs Outdated Show resolved Hide resolved

luizirber approved these changes Jul 21, 2021

View reviewed changes

mr-eyes added 2 commits July 21, 2021 14:21

final refactoring

9bed5d9

Merge branch 'latest' into mo/seq_to_hashes

d25b54e

ctb merged commit 22c858f into latest Jul 21, 2021

ctb deleted the mo/seq_to_hashes branch July 21, 2021 12:57

This was referenced Jul 23, 2021

[WIP] change k-mer hashing to use sourmash hash function spacegraphcats/spacegraphcats#429

Open

Motif predictions using protein sketch #1692

Closed

[MRG] add MinHash.kmers_and_hashes(...) and sourmash sig kmers #1695

Merged

mr-eyes mentioned this pull request Aug 1, 2021

[MRG] add option: bad_kmers_as_zeroes #1696

Merged

ctb mentioned this pull request Aug 7, 2021

Draft release notes for 4.2.2 #1705

Closed

ctb mentioned this pull request Sep 23, 2021

How to use low level functions? #489

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] Produce list of hashes from a sequence #1653

[MRG] Produce list of hashes from a sequence #1653

mr-eyes commented Jul 5, 2021 •

edited

Loading

codecov bot commented Jul 5, 2021 •

edited

Loading

mr-eyes commented Jul 5, 2021

ctb commented Jul 5, 2021

luizirber Jul 7, 2021

luizirber Jul 7, 2021

mr-eyes Jul 8, 2021

mr-eyes Jul 13, 2021

ctb commented Jul 8, 2021

mr-eyes commented Jul 10, 2021 •

edited

Loading

ctb commented Jul 10, 2021 •

edited

Loading

mr-eyes commented Jul 10, 2021

ctb commented Jul 10, 2021

luizirber commented Jul 12, 2021

mr-eyes commented Jul 17, 2021

ctb commented Jul 17, 2021

mr-eyes commented Jul 17, 2021

ctb commented Jul 17, 2021

mr-eyes commented Jul 17, 2021

mr-eyes commented Jul 19, 2021 •

edited

Loading

luizirber commented Jul 21, 2021

luizirber commented Jul 21, 2021 •

edited

Loading

luizirber left a comment

ctb commented Jul 21, 2021

mr-eyes commented Jul 21, 2021

ctb commented Jul 21, 2021

		fn seq_to_hashes(&self, seq: &[u8], force: bool) -> Result<Vec<u64>, Error> {
		let mut seq_hashes: Vec<u64> = Vec::new();

[MRG] Produce list of hashes from a sequence #1653

[MRG] Produce list of hashes from a sequence #1653

Conversation

mr-eyes commented Jul 5, 2021 • edited Loading

codecov bot commented Jul 5, 2021 • edited Loading

Codecov Report

mr-eyes commented Jul 5, 2021

ctb commented Jul 5, 2021

luizirber Jul 7, 2021

Choose a reason for hiding this comment

luizirber Jul 7, 2021

Choose a reason for hiding this comment

mr-eyes Jul 8, 2021

Choose a reason for hiding this comment

mr-eyes Jul 13, 2021

Choose a reason for hiding this comment

ctb commented Jul 8, 2021

mr-eyes commented Jul 10, 2021 • edited Loading

ctb commented Jul 10, 2021 • edited Loading

mr-eyes commented Jul 10, 2021

ctb commented Jul 10, 2021

luizirber commented Jul 12, 2021

mr-eyes commented Jul 17, 2021

ctb commented Jul 17, 2021

mr-eyes commented Jul 17, 2021

ctb commented Jul 17, 2021

mr-eyes commented Jul 17, 2021

mr-eyes commented Jul 19, 2021 • edited Loading

luizirber commented Jul 21, 2021

luizirber commented Jul 21, 2021 • edited Loading

luizirber left a comment

Choose a reason for hiding this comment

ctb commented Jul 21, 2021

mr-eyes commented Jul 21, 2021

ctb commented Jul 21, 2021

mr-eyes commented Jul 5, 2021 •

edited

Loading

codecov bot commented Jul 5, 2021 •

edited

Loading

mr-eyes commented Jul 10, 2021 •

edited

Loading

ctb commented Jul 10, 2021 •

edited

Loading

mr-eyes commented Jul 19, 2021 •

edited

Loading

luizirber commented Jul 21, 2021 •

edited

Loading