Taxonomic labels for individual contigs in metagenomic assembly #2816
Comments
I think multigather should let you do this - in brief, do
the last command would need to be run on many files. I'll dig into it when I can - might not be 'til next week tho, sorry!
ok, had a few moments to try this out. The following worked, but is probably something we could improve quite a bit! For the below to work, you'll need to be using this code: #2722. Sorry! I'll work to get this merged...
Note that we have a really fast multigather implemented in pyo3_branchwater if you're doing this on really big things. Poke me if you want more info - not as convenient as sourmash, but WAY faster and lower memory.
Hi @ctb the sourmash author 😃 ,
The error message:
So I updated sourmash to the latest version [4.8.5] and installed sourmash_plugin_branchwater, but I still got this error message. The environment in which I installed sourmash and branchwater is:
🤔 What I'm wondering is: can I just remove the -U flag from this command, or is this flag important enough that I cannot remove it? Another question: can I use a custom database to run multigather? HTH, and thanks for your contribution ❤️,
NOTE: #2722 is now merged, and this functionality will be in sourmash v4.8.7.
right, I have not had time to work further on #2722, so it's not merged. To try using it, you would need to do something like:
inside a conda environment or a Python virtualenv. Note that you'll need to have a Rust compiler and a few other things installed as well - see developer instructions for details.
Thank you so much @ctb, I finally got this working! Really appreciate your help and support on this (sorry I didn't respond earlier - this was on the back-burner for a while).
fantastic, thanks for reporting back!
Hi @ctb, the process is as follows:
the final results txt file only contains this:
What we expected is that sourmash would annotate each contig, as in your example. Can you tell us what's wrong with our process? Thanks for your patience and for the time you spend developing this useful tool,
Hi @ctb, I had a chance to revisit the data, and I realized that the run that finished only resulted in a taxonomic distribution for the sample, not individual contig annotation. From the commands, I gather that sourmash has access to this info but just isn't displaying it. Is there a way to get at the contig-wise taxonomic labels in a sample? For example, I want to know which contigs are E. coli, which are Staph, etc.
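To make the goal concrete, here is a minimal Python sketch of the collation step: given annotated gather results with one row per (contig, match) pair, keep the best-scoring lineage for each contig. It is not sourmash code. The column names `query_name`, `f_unique_to_query`, and `lineage` are assumptions about the annotated CSV layout, and the contig names and lineages in the sample data are made up for illustration.

```python
import csv
from io import StringIO


def best_lineage_per_contig(rows):
    """Map each contig to the lineage of its highest-scoring match.

    `rows` is an iterable of dicts with the (assumed) keys
    'query_name', 'f_unique_to_query', and 'lineage'.
    """
    best = {}
    for row in rows:
        contig = row["query_name"]
        score = float(row["f_unique_to_query"])
        # Keep only the match covering the largest fraction of the contig.
        if contig not in best or score > best[contig][0]:
            best[contig] = (score, row["lineage"])
    return {contig: lineage for contig, (_, lineage) in best.items()}


# Hypothetical example data, standing in for real annotated gather output.
EXAMPLE_CSV = """query_name,f_unique_to_query,lineage
contig_1,0.80,d__Bacteria;g__Escherichia;s__Escherichia coli
contig_1,0.10,d__Bacteria;g__Shigella;s__Shigella flexneri
contig_2,0.65,d__Bacteria;g__Staphylococcus;s__Staphylococcus aureus
"""

labels = best_lineage_per_contig(csv.DictReader(StringIO(EXAMPLE_CSV)))
for contig, lineage in sorted(labels.items()):
    print(contig, lineage, sep="\t")
```

Picking the single best match per contig is a design choice for short queries; a contig with several comparable matches may deserve a lowest-common-ancestor summary instead.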
hi all, I updated the code in the above comment to work properly - the last few commands needed adjustment:
expected results
when I run this, I get multiple results in the file
yields
diving in a bit deeper - debugging
Looking at the CSV output for
Tracing back further, when I look at the sketch collections I see different numbers of sketches in the singleton and in the genomes zip file:
and when I grep for the
this is because in the podar-ref-genomes.zip file all of the Nostoc contigs are combined into one sketch, while in the singleton file there are 7 different FASTA sequences. @ursky and @yuzie0314 when you run the above code do you get the same results? I can explain how to adapt all of this to run against GTDB, too, but it would be reassuring to know that you're getting these results first :)
Note #2722 is merged, so as of sourmash v4.8.7
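The difference in sketch counts can be illustrated with a small, self-contained Python sketch. It does not call sourmash; it only mimics the bookkeeping, assuming that singleton mode yields one sketch per FASTA record while the default yields one sketch per file. The function name `sketch_units` and the example headers are hypothetical.

```python
def sketch_units(fasta_text, singleton):
    """Return the names of the sketches a FASTA file would produce.

    singleton=True  -> one unit per FASTA record (one per contig)
    singleton=False -> a single unit for the whole file, named here
                       after the first record
    """
    headers = [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]
    if singleton:
        return headers
    return headers[:1]


# Hypothetical three-record genome file: a chromosome plus two plasmids.
EXAMPLE_FASTA = """>seq1 chromosome
ACGTACGT
>seq2 plasmid A
GGCCGGCC
>seq3 plasmid B
TTAATTAA
"""

per_contig = sketch_units(EXAMPLE_FASTA, singleton=True)
per_file = sketch_units(EXAMPLE_FASTA, singleton=False)
print(len(per_contig), "singleton sketches;", len(per_file), "combined sketch")
```

This is why a per-contig query collection can contain several entries for a genome that appears only once in a reference database.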
Hi @ctb,
thanks for bearing with me asking so many questions,
wonderful, and that makes sense!
v4.8.6 is now an official release that includes #2722, so you can just install it instead of using the development version.
Well,
thanks for asking them! :)
I've written a short
Hi @ctb, I have been using sourmash for a while in a metagenomic context - it's honestly an amazing package that changed the way much of the field does taxonomic annotation. However, there is one feature that I've always wanted from sourmash and still can't figure out. Simply put, I want a way to get the taxonomic annotation of every contig in a metagenomic assembly, which would enable me to do a lot of fun analysis downstream. How do I do that with sourmash without making individual files for every sequence? Can you give me a rough command list to accomplish this?