Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: Make kmers_and_hashes iterable #70

Merged
merged 22 commits into from
Oct 28, 2024
Merged

Conversation

Adamtaranto
Copy link
Collaborator

@Adamtaranto Adamtaranto commented Sep 29, 2024

  • Make kmers_and_hashes iterable
  • Decide what to do about skip_bad_kmers (force arg in SeqToHashes) - suppress output or error?
  • Integrate with count() and consume()
  • Init KmerCountTable with option to track kmers
  • Add HashMap to store hash:kmer pairs
  • Add unhash() function
  • Add dump_kmers() function for sorted kmer,count output
  • Make dump_kmers() compatible with records dropped by mincut(), maxcut(), drop(), or drop_hash() Close dump_kmers fails after removing records with maxcut #75

Close #21

@Adamtaranto Adamtaranto self-assigned this Sep 29, 2024
@Adamtaranto
Copy link
Collaborator Author

@ctb thoughts on bad kmer handling for this function?

skip_bad_kmers is being used to set the force option in SeqToHashes which decides if bad kmers should halt seq processing or just emit a hash of 0.

When skip_bad_kmers is set to True our kmers_and_hashes function will still output a blank kmer with hash 0 rather than actually "skipping" it.

I think we should have the force option in SeqToHashes permanently set to True and then only use skip_bad_kmers to decide if we will output the blank kmer:hash pair as ("",0).

Can you think of any cases where you would want to have force set to False so that only kmers up to the first bad kmer are processed?

@ctb
Copy link
Contributor

ctb commented Sep 30, 2024

I think I agree with you - good point ;)

@Adamtaranto
Copy link
Collaborator Author

Ok, I've updated the iterator so it just prints a warning msg to stderr if it sees a bad kmer, and only return includes empty ("",0) in output if skip_bad_kmers = false.

If this is too noisy then I guess we could turn off warning if skip_bad_kmers = true, though I feel it would be nice to keep as is in case the user does something wacky like try to process protein as DNA.

Can you take a look at the tests? I only know the one method for capturing warning msgs. hammer, nail, etc.

@Adamtaranto
Copy link
Collaborator Author

Previous fix for dependabot ruff action GITHUB_TOKEN prevents jobs from restarting after being interrupted by the ruff action.

reverted.

Will need to disable ruff formatting in any actions spawned by dependabot.

@Adamtaranto
Copy link
Collaborator Author

Adamtaranto commented Oct 15, 2024

@ctb ready for review. Do we want to work on making this faster before merging into main?

Low hanging efficiency fruit:

  • Revcomp the full seq and step through backwards in sliding window to find canonical kmers instead of revcomp per kmer.
  • Make consume multithreaded at the kmer level - both KmersAndHashesIter and SeqToHashes are iterators so this should be fairly simple, unless we are worried about diff threads trying to write to same hash?
  • Make multithreaded at the sequence level?
  • Consume directly from file to avoid having to load seqs in Python and pass back to rust per WIP: add consume_file #10

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent!

@ctb ctb changed the title WIP: Make kmers_and_hashes iterable MRG: Make kmers_and_hashes iterable Oct 28, 2024
@Adamtaranto
Copy link
Collaborator Author

Are we good to merge?

@ctb
Copy link
Contributor

ctb commented Oct 28, 2024

yep! go ahead!

@Adamtaranto
Copy link
Collaborator Author

Cool, I'll pull in changes from #82 when tests are finished, then merge.

* compute revcomp once, not every time

* rm comment

* run cargo fmt

---------

Co-authored-by: @ctb
@Adamtaranto Adamtaranto merged commit d400eb2 into main Oct 28, 2024
21 checks passed
@Adamtaranto Adamtaranto deleted the integrate_kmer_tracking branch October 28, 2024 03:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dump_kmers fails after removing records with maxcut Store hash to kmer map
2 participants