Different max_hash values with Rust vs Python #1402

Open
olgabot opened this issue Mar 16, 2021 · 0 comments

In this kmermaid PR (nf-core/kmermaid#131), I'm using some of @luizirber's kindly contributed Sourmash Rust code (luizirber/2021-01-27-olga-remove-protein#3) to remove k-mer hashes from constitutive genes in single cells. For each single cell, two signatures are created: one for aligned reads and one for unaligned reads. The aligned and unaligned signatures are then merged (nf-core/kmermaid#117, and nf-core/kmermaid#132 for stragglers) using the Python API in merge_rename_sigs.py, because sourmash sig merge doesn't allow for --name and I like reinventing the wheel.

Command executed:

  subtract \
      --track-abundance \
      --scaled 10 \
      --ksize 30 \
      --encoding dna \
      --output subtracted/ \
      vertebrate_mammalian--205--2021-03-15.rna.fa__only_constitutive_genes.fa__molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig \
      dna__30.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  + nxf_launch
  + /bin/bash .command.run nxf_trace
  [2021-03-16T22:08:27Z INFO  subtract] Loading queries
  [2021-03-16T22:08:28Z INFO  subtract] Loaded query signature, k=30
  [2021-03-16T22:08:28Z INFO  subtract] Loading siglist
  [2021-03-16T22:08:28Z INFO  subtract] Loaded 2 sig paths in siglist
  [2021-03-16T22:08:28Z INFO  subtract] Processed 0 sigs
  Error: Unable to load a sketch from "mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig"

  Caused by:
      No sketch matching the provided template: MinHash(KmerMinHash { num: 0, ksize: 30, hash_function: murmur64_DNA, seed: 42, max_hash: 1844674407370955264, mins: [], abunds: Some([]), md5sum: Mutex { data: None } })

But the signature definitely has the right ksize (30):

(nf-core-kmermaid-0.1.0dev--remove-ribo-kmers)
 Tue 16 Mar - 15:09  ~/code/nf-core/kmermaid--olgabot/remove-ribo-kmers/work/29/d1d3b59571362d38339a5d5abefb58   origin ☊ olgabot/remove-ribo-kmers ✔ 
  cat dna__30.txt
mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig
mouse_brown_fat_ptprc_plus_unaligned__CTGAAGTCAATGGTCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig
(nf-core-kmermaid-0.1.0dev--remove-ribo-kmers)
 Tue 16 Mar - 15:09  ~/code/nf-core/kmermaid--olgabot/remove-ribo-kmers/work/29/d1d3b59571362d38339a5d5abefb58   origin ☊ olgabot/remove-ribo-kmers ✔ 
  sourmash sig describe mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig

== This is sourmash version 3.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

<<<AACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-tru---ig'
signature filename: mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig
signature: mouse_lung__AAATGCCCAAACTGCT
source file: mouse_lung__aligned__AAATGCCCAAACTGCT_R1_trimmed.fastq.gz
md5: 7fa9454a8934f9a73f398817c7976004
k=21 molecule=DNA num=0 scaled=10 seed=42 track_abundance=1
size: 511
signature license: CC0

---
signature filename: mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig
signature: mouse_lung__AAATGCCCAAACTGCT
source file: mouse_lung__aligned__AAATGCCCAAACTGCT_R1_trimmed.fastq.gz
md5: 0ab48bda345122a624070e50f2554319
k=30 molecule=DNA num=0 scaled=10 seed=42 track_abundance=1
size: 606
signature license: CC0

---
signature filename: mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig
signature: mouse_lung__AAATGCCCAAACTGCT
source file: mouse_lung__aligned__AAATGCCCAAACTGCT_R1_trimmed.fastq.gz
md5: 4121bcaf08ae56f0458c88a82f786c01
k=51 molecule=DNA num=0 scaled=10 seed=42 track_abundance=1
size: 545
signature license: CC0

<<<AACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig'
loaded 3 signatures total.

But the max_hash values don't match! This signature was made with Sourmash 3.5.0:

(nf-core-kmermaid-0.1.0dev--remove-ribo-kmers)
 Tue 16 Mar - 15:25  ~/code/nf-core/kmermaid--olgabot/remove-ribo-kmers/work/29/d1d3b59571362d38339a5d5abefb58   origin ☊ olgabot/remove-ribo-kmers ✔ 
  jq . mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig| head -n 20
[
  {
    "class": "sourmash_signature",
    "email": "",
    "hash_function": "0.murmur64",
    "filename": "mouse_lung__aligned__AAATGCCCAAACTGCT_R1_trimmed.fastq.gz",
    "name": "mouse_lung__AAATGCCCAAACTGCT",
    "license": "CC0",
    "signatures": [
      {
        "num": 0,
        "ksize": 21,
        "seed": 42,
        "max_hash": 1844674407370955300,
        "mins": [
          1197050739821756,
          1249382636645827,
          1454265027862216,
          4466166023839276,
          6938112553264142,

Comparing the two max_hash values:

  • Rust (template in the error above): 1844674407370955264
  • Python (signature file, as printed by jq): 1844674407370955300 (the last three digits differ)

@luizirber mentioned this may be related to the scaled/max_hash changes in: #1139
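
One thing that may be worth ruling out before comparing the two numbers: both are too large to fit in a 64-bit float exactly, so the value displayed by a JSON tool is not necessarily the value stored in the file. The sketch below is only an illustration under stated assumptions. It assumes max_hash is derived from scaled by float division of 2**64 - 1 (the helper names are hypothetical, not sourmash API), and it assumes the viewer parses numbers as doubles and prints 17 significant digits, which some JSON tools do. The sample JSON string is made up for the example and is not taken from the attached files.

```python
import json
from decimal import Decimal

MAX_HASH_64 = 2**64 - 1  # assumed size of the 64-bit hash space


def max_hash_from_scaled(scaled: int) -> int:
    """Hypothetical helper: derive max_hash from scaled using float division."""
    return int(round(MAX_HASH_64 / scaled, 0))


def shown_with_17_digits(value: int) -> int:
    """Value as displayed by a viewer that stores it as a double
    and prints only 17 significant digits (assumed viewer behavior)."""
    return int(Decimal("%.17g" % float(value)))


def exact_max_hashes(sig_json: str) -> list:
    """Return (ksize, max_hash) pairs; Python's json module keeps integers exact."""
    return [
        (sketch["ksize"], sketch["max_hash"])
        for record in json.loads(sig_json)
        for sketch in record["signatures"]
    ]


computed = max_hash_from_scaled(10)
print(computed)                        # 1844674407370955264
print(shown_with_17_digits(computed))  # 1844674407370955300

# Made-up miniature signature file, only to show the exact read.
sample = '[{"signatures": [{"ksize": 30, "max_hash": 1844674407370955264}]}]'
print(exact_max_hashes(sample))
```

Under these assumptions, the value in the Rust template and the value jq shows could both come from the same stored integer, so reading max_hash with Python's json module (or another tool that preserves large integers) would show whether the file really differs from the template. If the stored values do match, the load failure would have to come from some other field of the template.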

Example Files

Here are some example signatures.

Single cell signatures, with aligned+unaligned merged

mouse_brown_fat_ptprc_plus_unaligned__CTGAAGTCAATGGTCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig.txt
mouse_lung__AAATGCCCAAACTGCT---molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig.txt

Non-single cell signatures

SRR4050379__molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig.txt
SRR4050380__molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig.txt

Signature to subtract

vertebrate_mammalian--205--2021-03-15.rna.fa__only_constitutive_genes.fa__molecule-dna__ksize-21,30,51__scaled-10__track_abundance-true.sig.txt
