Binary format differs from expectation #73

Closed
arivers opened this issue Apr 9, 2021 · 4 comments

arivers commented Apr 9, 2021

I used Dashing v0.5.6 s128 on a Linux machine to compare pre-hashed genomes. The command was:

./dashing_s128 cmp -p78 --presketched  -b -Ofull_dashing_S16_k31_dist.bin -F fullpath_hll_filelist.txt -Q fullpath_hll_filelist.txt

From the specification here, I was expecting half-matrix output: 1 byte specifying full or half matrix, 8 bytes specifying the length in np.float64, and (n*(n-1)/2)*4 bytes of data in np.float32. Note that supplying only -Q for the file path did not work.
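For reference, the entry count of a packed upper triangle and the offset of a pair (i, j) within it can be sketched as below. The row-major ordering is an assumption (it matches SciPy's condensed form) and is not confirmed for Dashing's output; the function names are illustrative only.

```python
def packed_length(n: int) -> int:
    """Number of entries in the strict upper triangle of an n x n matrix."""
    return n * (n - 1) // 2


def packed_index(i: int, j: int, n: int) -> int:
    """Offset of pair (i, j), with i < j, in a row-major packed upper triangle.

    Assumption: row i stores the entries (i, i+1) ... (i, n-1), and rows
    follow one another. This ordering is not confirmed for Dashing.
    """
    if not 0 <= i < j < n:
        raise ValueError("need 0 <= i < j < n")
    # Entries contributed by rows 0..i-1, then the offset within row i.
    return i * n - i * (i + 1) // 2 + (j - i - 1)
```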

Instead, I get a file of exactly (n**2)*4 bytes so I'm assuming I just got a square matrix of 4-byte float32 values.

The file is 422,393,406,724 bytes for n = 324,959.
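The size check above can be reproduced with a few lines of arithmetic. This is a sketch using only the numbers quoted in this issue; the 1-byte and 8-byte header sizes are my reading of the specification, not something verified against the code.

```python
n = 324_959          # number of sketches, from this issue
FLOAT32_BYTES = 4

# Expected layout: 1 flag byte + 8-byte length + packed upper triangle.
# Header sizes are an assumption taken from the reading of the spec above.
expected_packed = 1 + 8 + (n * (n - 1) // 2) * FLOAT32_BYTES

# Observed layout: a bare n x n float32 matrix with no header.
expected_square = n * n * FLOAT32_BYTES

observed = 422_393_406_724  # file size reported above

print("packed + header:", expected_packed)
print("square, no header:", expected_square)
print("observed matches square layout:", observed == expected_square)
```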

I can import the data as a Numpy memory map doing this:

import numpy as np
# mode='r' maps the file read-only so the matrix cannot be modified by accident
val = np.memmap('full_dashing_S16_k31_dist.bin', dtype=np.float32, mode='r', shape=(324959,324959))
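A slightly more defensive version of the same import might verify the file size before mapping. This is a minimal sketch that assumes a headerless, row-major n x n float32 file in native byte order; `open_square_float32` is a hypothetical helper, not part of Dashing.

```python
import os

import numpy as np


def open_square_float32(path: str, n: int) -> np.memmap:
    """Map a headerless n x n float32 matrix read-only, checking its size first."""
    expected = n * n * np.dtype(np.float32).itemsize
    actual = os.path.getsize(path)
    if actual != expected:
        raise ValueError(
            f"{path}: expected {expected} bytes for n={n}, found {actual}"
        )
    # mode="r" keeps the mapping read-only.
    return np.memmap(path, dtype=np.float32, mode="r", shape=(n, n))
```

With the file from this issue, the call would be `open_square_float32('full_dashing_S16_k31_dist.bin', 324959)`.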

I just wanted to know if this import was correct and also make you aware that the output was not what I expected. I saw in the previous issues you are working on documenting the binary format so I thought I'd pass this along. Overall, Dashing is fantastic and I really appreciate your team's hard work.

Owner

dnbaker commented Apr 9, 2021

Hi,

You're rather close, though the distances are float32, not float64, and unlike the packed upper-triangular distance, the asymmetric comparison has no bookkeeping in the file. I realize now that usage for asymmetric comparisons (the -Q option) isn't sufficiently clear, and I'll try to improve it.

The -Q option is for asymmetric comparisons. Default cmp comparisons produce upper-triangular distances (if -F or positional-only arguments are provided), but if -Q is enabled, then the output shape is (|F|, |Q|).
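As a rough illustration of the (|F|, |Q|) shape, the two dimensions can be taken from the line counts of the file lists. This sketch assumes one path per line in each list and a headerless, row-major float32 output; both helper names are hypothetical.

```python
import numpy as np


def count_paths(list_file: str) -> int:
    """Count non-empty lines in a file list (assumed: one sketch path per line)."""
    with open(list_file) as fh:
        return sum(1 for line in fh if line.strip())


def open_asymmetric(matrix_path: str, f_list: str, q_list: str) -> np.memmap:
    """Map a headerless float32 matrix of shape (|F|, |Q|) read-only."""
    shape = (count_paths(f_list), count_paths(q_list))
    return np.memmap(matrix_path, dtype=np.float32, mode="r", shape=shape)
```

Under these assumptions, row i corresponds to the i-th path given to -F and column j to the j-th path given to -Q.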

You would typically only want to use both -Q and -F for asymmetric distances like containment, where f(x_i, x_j) != f(x_j, x_i), since otherwise you could just compute the upper-triangular portion of the matrix (which is the behavior when only -F is provided, without -Q).
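To make the asymmetry concrete, here is a toy comparison on exact sets. It is only an illustration with made-up k-mer sets; Dashing estimates these quantities from sketches rather than computing them exactly.

```python
def jaccard(a: set, b: set) -> float:
    """Symmetric similarity: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def containment(a: set, b: set) -> float:
    """Asymmetric: fraction of A's elements that also occur in B."""
    return len(a & b) / len(a)


# Toy k-mer sets, invented for illustration.
x = {"ACG", "CGT", "GTA", "TAC"}
y = {"ACG", "CGT"}

print(jaccard(x, y), jaccard(y, x))          # same value in both directions
print(containment(x, y), containment(y, x))  # 0.5 and 1.0: order matters
```

Because the containment values differ by argument order, only the full (|F|, |Q|) matrix captures both directions, whereas a symmetric measure needs just the upper triangle.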

Does this help?

Thanks!

Daniel

Author

arivers commented Apr 9, 2021

Okay thanks!

I knew the actual data was in np.float32; I updated my comment to make that clear.

I ended up using -Q and -F because of this line in the README: "To generate a full, asymmetric distance matrix, provide the same path to -F and -Q."

I tried it both ways. When I ran with --presketched and -F but no -Q, I got this error:

$ dashing_s128 cmp -p1 --presketched  -b -Otest1.bin -F sample.txt
Dashing version: v0.5.6
#Path	Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll	4727565
...
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1: Invalid argument
Aborted (core dumped)

Running with --presketched, -Q and -F works:

$ dashing_s128 cmp -p1 --presketched  -b -Otest1.bin -F sample.txt -Q sample.txt
Dashing version: v0.5.6
#Path	Size (est.)
2021-04-01/GCA/000/007/545/GCA_000007545.1_ASM754v1_genomic.fna.gz.w.31.spacing.16.hll	4727565
...

Owner

dnbaker commented Apr 9, 2021

I see. That makes sense -- in fact, while investigating the problem today, I ran into the same error (Unknown error -1), fixed it, and incorporated the fix into a new release, which just finished building. Want to give it a try?

Author

arivers commented Apr 9, 2021

Yes, your new release, v0.5.7, fixed the issue. Thanks.

arivers closed this as completed Apr 9, 2021