Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

manifest file is not consistent between sourmash sig manifest and sourmash sketch fromfile #2749

Open
chunyuma opened this issue Sep 5, 2023 · 7 comments
Labels

Comments

@chunyuma
Copy link

chunyuma commented Sep 5, 2023

Hello,

This issue seems a bug but I'm not sure.

I recently utilized sourmash sketch fromfile command to generate a custom database. However, the genome set may have some duplicates. After I used sourmash sketch fromfile to generate a database.zip file, I used sourmash sig manifest to extract its manifest metadata file. I also try to unzip the database.zip file and there is a SOURMASH-MANIFEST.csv file.

One interesting thing I found is that the row in the manifest file generated from sourmash sig manifest is different from that in SOURMASH-MANIFEST.csv file. I found that those missing rows are duplicated genomes (the genomes with the same md5). I'm wondering if these duplicated genomes will be used in the downstream sourmash command. Thanks!

@ctb
Copy link
Contributor

ctb commented Sep 5, 2023

thanks @chunyuma I'm not sure what exactly is going on but I'd love to help debug!

It sounds like it may be the third item from #1849, to whit:

third, there are some interesting corner cases popping up in #1837 where the manifest may (or may not) contain all signatures in the database. One specific case is ZipFileLinearIndex, where if the manifest was generated with traverse_all_files, it may contain signatures from files that don't have .sig in the name. This results in oddities where you get different reports out of sourmash sig fileinfo depending on whether you've asked it to regenerate the manifest or not: for example, if you're looking at tests/test-data/prot/all.zip, the included manifest does contain dna-sig.noext, but if you regenerate the manifest from an index loaded without traverse_all_files=True, you'll exclude it. See the test_fileinfo_4_zip* tests as well as the test_sig_manifest_7_allzip tests for tests that explore this behavior.

but I'm not sure why that would happen internally to sourmash!

If you can paste a few example lines in that differ b/t the manifests that might be helpful?

note that some manifest behavior is just weird and dumb - e.g. sig manifest regenerates the manifest by default !? per #2034 - and that's all my fault 😭 .

ref also #2033 re fromfile.

@chunyuma
Copy link
Author

chunyuma commented Sep 7, 2023

thanks for your help in debugging, @ctb.

Here is a simple example that might help you for debug:

In this example, I have two genomes that are quite close in taxonomy. After I built a custom database based on these two genomes via:

sourmash sketch fromfile datasets.csv -p k=31,scaled=100,dna,abund -o database.zip

I unzip the database.zip file and got a SOURMASH-MANIFEST.csv file:

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz_0,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome2,/home/xxx/genome2.fna.gz

As you can see, sourmash generates the same md5 for them.

When I looked at the files under the signatures folder that generated by unzipping database.zip, I found that there are two signature file: one's name is 331a7076f08e2b1d671aedf70cd63230.sig.gz and another is 331a7076f08e2b1d671aedf70cd63230.sig.gz_0.

However, when I did sourmash sig manifest data.zip -o manifest.csv, I had the following information:

# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/331a7076f08e2b1d671aedf70cd63230.sig.gz,331a7076f08e2b1d671aedf70cd63230,331a7076,31,DNA,0,100,91554,1,genome1,/home/xxx/genome1.fasta

Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream sourmash analysis (e.g., sourmash compare). Thanks!

@ctb
Copy link
Contributor

ctb commented Sep 10, 2023

yep, this inconsistency is concerning :). Looking into it! I'm tempted to say it's going to be fixed by #2747 but I think I may be being overoptimistic 😆

Only genome1 has been shown. I know perhaps sourmash considers them as the "same" genomes. I'm wondering if both genomes will be used in the downstream sourmash analysis (e.g., sourmash compare). Thanks!

If you're using the zip file with the "correct" (dual entry) SOURMASH-MANIFEST.csv then they should both be used!

@ctb
Copy link
Contributor

ctb commented Sep 10, 2023

I can replicate:

cp podar-ref/47.fa dup/47a.fa
cp podar-ref/47.fa dup/47b.fa
cd dup
sourmash sketch dna -p k=31 *
sourmash sig cat *.sig -o dup.zip
sourmash sig summarize dup.zip # shows 2 sketches in zip
unzip dup.zip
cat SOURMASH-MANIFEST.csv
# shows 2 sketches in CSV

sourmash sig manifest dup.zip -o mf.csv
cat mf.csv
# shows 1 sketch

@ctb
Copy link
Contributor

ctb commented Sep 10, 2023

Ooh, this is a nifty bug all right. 🤯

Briefly, ZipStorage is storing the duplicate filename in a way that is incompatible with the default manifest-independent loading code use in ZipFileLinearIndex -

signatures/09a08691ce52952152f0e866a59f6261.sig.gz_0

instead of

signatures/09a08691ce52952152f0e866a59f6261_0.sig.gz

This is only triggered in unusual circumstances but is still a bug... gotta think on resolution!

@ctb
Copy link
Contributor

ctb commented Sep 12, 2023

OK, confirmed the underlying code - the FSStorage and ZipStorage classes in sbt_storage.py both avoid overwriting existing files by appending _{n} to filenames until there is no existing filename. This causes problems in code that relies on specific filename extensions (which is one reason we often try to avoid that, although perspectives differ ;).

Regardless, this behavior is the root cause of the discrepancy noted here!
(And this means this problem should crop up with FSStorage, too.)

It doesn't matter at all for manifests generated when saving sketches, because there the correct filename is recorded (and if a filename is in the manifest, it doesn't matter what the filename is - it's loaded as a signature file).

I see a few different possible resolutions -

  • change the code to use {n}_ as a prefix rather than _{n} as a suffix. Suboptimal for directory listings but for big collections those aren't useful anyway. This is a general fix.
  • use {n} internally, e.g. at end of first component of filename, or something - basically, avoid changing the extension - this is a general fix. It costs more compute than the previous fix, though, because we'll have to inspect the filename in some way.
  • for this specific problem, change the defaults so that traverse_yield_all is True, so that when creating a manifest it always tries to load the file, independent of filename. This would break semantic versioning so would need to wait 'til v5. Could be part of behavior change to sig manifest in change sourmash sig manifest to not regenerate a manifest by default #2034.

I'm leaning towards one of the first two now, and also the third for v5? I don't think either of the first two changes would break anything in sourmash currently, but of course... we'll have to see! I'll poke around at this as I have time.

Either way @chunyuma I understand the problem and hopefully it won't affect your work - and if you need sig manifest to work consistently, either use -f or --no-rebuild-manifest and you should get consistent manifests out. Thanks again for reporting this!!

@chunyuma
Copy link
Author

Thanks @ctb for your hard working on this issue. I'm glad to figure out the issue and find the potential solutions for it. And thanks for your suggestions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants