manifest file is not consistent between sourmash sig manifest and sourmash sketch fromfile #2749
Comments
thanks @chunyuma! I'm not sure exactly what is going on, but I'd love to help debug. It sounds like it may be the third item from #1849, to wit:
but I'm not sure why that would happen internally to sourmash! If you can paste a few example lines that differ b/t the manifests, that might be helpful? Note that some manifest behavior is just weird and dumb - e.g. ref also #2033 re
Thanks for your help in debugging, @ctb. Here is a simple example that might help you debug. In this example, I have two genomes that are taxonomically very close. After I built a custom database based on these two genomes via:
I unzip the
As you can see, sourmash generates the same md5 for them. When I looked at the files under the However, when I did
Only genome1 is shown. I know sourmash perhaps considers them the "same" genome. I'm wondering if both genomes will be used in the downstream
yep, this inconsistency is concerning :). Looking into it! I'm tempted to say it's going to be fixed by #2747 but I think I may be being overoptimistic 😆
If you're using the zip file with the "correct" (dual entry)
I can replicate:
Ooh, this is a nifty bug all right. 🤯 Briefly,
instead of
This is only triggered in unusual circumstances but is still a bug... gotta think on resolution!
OK, confirmed the underlying code - the Regardless, this behavior is the root cause of the discrepancy noted here! It doesn't matter at all for manifests generated when saving sketches, because there the correct filename is recorded (and if a filename is in the manifest, it doesn't matter what the filename is - it's loaded as a signature file). I see a few different possible resolutions -
I'm leaning towards one of the first two now, and also the third for v5? I don't think either of the first two changes would break anything in sourmash currently, but of course... we'll have to see! I'll poke around at this as I have time. Either way @chunyuma I understand the problem and hopefully it won't affect your work - and if you need |
Thanks @ctb for your hard work on this issue. I'm glad we figured out the cause and found some potential solutions for it. And thanks for your suggestions.
Hello,
This seems to be a bug, but I'm not sure.
I recently used the sourmash sketch fromfile command to generate a custom database. However, the genome set may contain some duplicates. After I used sourmash sketch fromfile to generate a database.zip file, I used sourmash sig manifest to extract its manifest metadata file. I also tried unzipping the database.zip file, and it contains a SOURMASH-MANIFEST.csv file.
One interesting thing I found is that the rows in the manifest file generated by sourmash sig manifest differ from those in the SOURMASH-MANIFEST.csv file. The missing rows are the duplicated genomes (the genomes with the same md5). I'm wondering whether these duplicated genomes will be used in downstream sourmash commands. Thanks!
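For anyone who wants to check their own database for the same discrepancy, here is a minimal sketch (not part of sourmash) that compares md5 counts between the SOURMASH-MANIFEST.csv stored inside a zip and a separate manifest CSV, such as one written by sourmash sig manifest. It assumes both files are CSVs with an md5 column and optional leading comment lines starting with "#"; the function names and the manifest.csv path are hypothetical.

```python
import csv
import zipfile
from collections import Counter


def read_manifest_rows(text):
    """Parse manifest CSV text, skipping comment lines that start with '#'.

    Assumption: the manifest has a header row that includes an 'md5' column.
    """
    lines = [ln for ln in text.splitlines() if not ln.startswith("#")]
    return list(csv.DictReader(lines))


def md5_counts_from_zip(zip_path, member="SOURMASH-MANIFEST.csv"):
    """Count how many manifest rows inside the zip carry each md5."""
    with zipfile.ZipFile(zip_path) as zf:
        text = zf.read(member).decode("utf-8")
    return Counter(row["md5"] for row in read_manifest_rows(text))


def md5_counts_from_csv(csv_path):
    """Count how many rows of a standalone manifest CSV carry each md5."""
    with open(csv_path, newline="", encoding="utf-8") as fp:
        return Counter(row["md5"] for row in read_manifest_rows(fp.read()))


def compare_manifests(zip_path, csv_path):
    """Report md5s that appear more often in one manifest than in the other."""
    in_zip = md5_counts_from_zip(zip_path)
    in_csv = md5_counts_from_csv(csv_path)
    # Counter subtraction keeps only the positive differences.
    return {
        "extra_in_zip": dict(in_zip - in_csv),
        "extra_in_csv": dict(in_csv - in_zip),
    }


if __name__ == "__main__":
    # Hypothetical paths: adjust to your own files.
    print(compare_manifests("database.zip", "manifest.csv"))
```

An md5 listed under extra_in_zip is one that has more rows in the zip's internal manifest than in the standalone manifest, which is the pattern described above for duplicated genomes.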