Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Taxonomic labels for individual contigs in metagenomic assembly #2816

Closed
ursky opened this issue Oct 18, 2023 · 15 comments
Closed

Taxonomic labels for individual contigs in metagenomic assembly #2816

ursky opened this issue Oct 18, 2023 · 15 comments

Comments

@ursky
Copy link

ursky commented Oct 18, 2023

Hi @ctb, I have been using Sourmash for a while in a metagenomic context - its honestly an amazing package that changed the way much of the field does taxonomic annotation. However, there is one feature that I always wanted from sourmash that I still can't figure out. Simply put, I want a way to get the taxonomic annotation of every contig in a metagenomic assembly, which would enable me to do a lot of fun analysis downstream. How do I do that with sourmash without making individual files for every sequence? Can you give me a rough command list to accomplish this?

@ctb
Copy link
Contributor

ctb commented Oct 18, 2023

I think multigather should let you do this - in brief, do

sourmash sketch dna ... --singleton -o all-contig-sketches.zip

sourmash multigather --query all-contig-sketches.zip --db databases.zip

sourmash tax genome ...

the last command would need to be run on many files. I'll dig into it when I can - might not be 'til next week tho, sorry!

@ctb
Copy link
Contributor

ctb commented Oct 19, 2023

ok, had a few moments to try this out.

The following worked, but is probably something we could improve quite a bit!

For the below to work, you'll need to be using this code: #2722. Sorry! I'll work to get this merged...

# make working dir
mkdir podar-ref-singleton
cd podar-ref-singleton

# download example data
curl -L https://osf.io/vbhy5/download -o podar-ref.tar.gz

# unpack
tar xzf podar-ref.tar.gz

# sketch twice - once with all contigs using --singleton, once combining each file
sourmash sketch dna --singleton *.fa -o podar-ref-singleton.zip
sourmash sketch dna --name-from-first *.fa -o podar-ref-genomes.zip

# run multigather
sourmash multigather --query podar-ref-singleton.zip --db podar-ref-genomes.zip -U
# all your gather results will be in *.csv

# grab lineage file
curl -L https://osf.io/4yhjw/download -o podar-ref.tax.csv

# run sourmash tax genome
mkdir tax
cd tax

sourmash tax genome -g ../*.fa.*.csv -t ../podar-ref.tax.csv -F human -o out

# all results will be in tax/out.human.txt

Note that we have a really fast multigather implemented in pyo3_branchwater if you're doing this on really big things. Poke me if you want more info - not as convenient as sourmash, but WAY faster and lower memory.

@yuzie0314
Copy link

yuzie0314 commented Jan 16, 2024

Hi @ctb the sourmash author 😃 ,
We are quite interesting with your multigather command, but we always failed to reproduce the results from your example commands/resources.

sourmash multigather --query podar-ref-singleton.zip --db podar-ref-genomes.zip **-U**

The error message:

sourmash: error: unrecognized arguments: -U

So I updated the sourmash version to latest version [4.8.5] and installed sourmash_plugin_branchwater, but still got this error message....

My environment to install sourmash and branchwater is :

mamba 1.5.3
conda 23.10.0.

🤔 what I think is can I just remove -U flag in this command? Or this flag is super important and I cannot remove it?

Another question is can I use customed db to run multigather?
I saw you were using reference signature from the same set of example genomes, and I think we could build customed db from *.fasta and have another customed-ref.tax.csv to generate the annotations at contig level.

HTH and thanks for your contribution❤️,
YJ

@ctb
Copy link
Contributor

ctb commented Jan 16, 2024

NOTE: #2722 is now merged, and this functionality will be in sourmash v4.8.7

right, I have not had time to work further on #2722, so it's not merged.

To try using it, you would need to do something like:

git clone https://github.com/sourmash-bio/sourmash -b fix_multigather_output
cd sourmash
pip install -e .

inside a conda environment or a Python virtualenv.

Note that you'll need to have a Rust compiler and a few other things installed as well - see developer instructions for details.

@ursky
Copy link
Author

ursky commented Jan 18, 2024

Thank you so much @ctb I finally got this working! Really appreciate your help and support on this (sorry i didn't respond earlier - this was on the back-burner for a while).

@ctb
Copy link
Contributor

ctb commented Jan 19, 2024

fantastic, thanks for reporting back!

@yuzie0314
Copy link

yuzie0314 commented Jan 24, 2024

Hi @ctb ,
I used your tutorial and sourmash-4.8.5.dev0 with your data, which worked to me.
However, when I tested on our assembled contigs and used gtdb-rs207.genomic-reps.dna.k31.zip and gtdb-rs207.taxonomy.csv as references, the results were not I expected.

the process is as followed:

# 1. sketch contigs signature -> gzip signatures ( ${contigs_sig} )
 sourmash sketch dna -p scaled=1000,k=31,abund \\
        --name ${meta}_fa \\
        --singleton \\
        -o ${meta}_fa.singleton.gz \\
        ${contigs}

 # 2. run multigather -> a csv gather ( ${gtdb_matches} )
 sourmash multigather --query ${contigs_sig} --db gtdb-rs207.genomic-reps.dna.k31.zip -U
 
 # 3. using tax genome
 sourmash tax genome -g ${gtdb_matches} -t gtdb-rs207.taxonomy.csv -F human -o ${meta}-contig-lineage

the final results txt file only contain this :

sample name    status    proportion   cANI   lineage
-----------    ------    ----------   ----   -------
${meta}_fa     match    26.7%     95.8%  d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae

what we expected is sourmash will annotate each contig like your example, can you guide us what's wrong with our process?
We can also provide this example data to you if you want to give it a try.

thanks for your patience and time spending on develping this useful tool,
YuJen

@ursky
Copy link
Author

ursky commented Feb 27, 2024

Hi @ctb, I had a chance to revisit the data, and I realized that the run that finished only resulted in a taxonomic distribution for the sample, not individual contig annotation. From the commands i gather that sourmash should have access to this info, just not displaying it. Is there a way to get at the contig-wise taxonomic labels in a sample? For example, I want to know which contigs are E. coli, which are Staph, etc.

@ctb
Copy link
Contributor

ctb commented Feb 27, 2024

hi all, I updated the code in the above comment to work properly - the last few commands needed adjustment:


# run sourmash tax genome
mkdir tax
cd tax

sourmash tax genome -g ../*.fa.*.csv -t ../podar-ref.tax.csv -F human -o out

expected results

when I run this, I get multiple results in the file out.human.txt, one for each contig in the .fa files - e.g.

% grep Nostoc out.human.txt

yields

NC_003270.1 Nostoc sp. PCC 7120 plasmid pCC7120epsilon DNA, complete sequence   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
NC_003272.1 Nostoc sp. PCC 7120 DNA, complete genome   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
NC_003240.1 Nostoc sp. PCC 7120 plasmid pCC7120beta DNA, complete sequence   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
NC_003273.1 Nostoc sp. PCC 7120 plasmid pCC7120delta DNA, complete sequence   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
NC_003276.1 Nostoc sp. PCC 7120 plasmid pCC7120alpha DNA, complete genome   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120
NC_003267.1 Nostoc sp. PCC 7120 plasmid pCC7120gamma DNA, complete sequence   match    100.0%     100.0%  Bacteria;Cyanobacteria;;Nostocales;Nostocaceae;Nostoc;Nostoc sp. PCC 7120

diving in a bit deeper - debugging

Looking at the CSV output for multigather, I see multiple CSV files per input .fa file - one per FASTA sequence in the input FASTA. For example, ls 35.fa.*.csv gives:

35.fa.017a37b39705d80a55b4dd5e48dbcd6e.csv
35.fa.2f96b4b20c447755d9989b7c6da67702.csv
35.fa.75814670ca749232f64b167a068227a0.csv
35.fa.87183a96ba413b60d0dd31433ea31e67.csv
35.fa.a92a259309e2f51b150f900391f4c01e.csv
35.fa.ee46cb1a4568b18bed310299b69ae83c.csv

Tracing back further, when I look at the sketch collections I see different numbers of sketches in the singleton and in the genomes zip file:

% sourmash sig summarize podar-ref-singleton.zip 

== This is sourmash version 4.8.7.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'podar-ref-singleton.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/podar-ref-singleton/podar-ref-singleton.zip
is database? yes
has manifest? yes
num signatures: 490
** examining manifest...
total hashes: 207008
summary of sketches:
   490 sketches with DNA, k=31, scaled=1000           207008 total hashes
% sourmash sig summarize podar-ref-genomes.zip 

== This is sourmash version 4.8.7.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'podar-ref-genomes.zip'
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/podar-ref-singleton/podar-ref-genomes.zip
is database? yes
has manifest? yes
num signatures: 64
** examining manifest...
total hashes: 206660
summary of sketches:
   64 sketches with DNA, k=31, scaled=1000            206660 total hashes

and when I grep for the Nostoc name above I see:

 sig grep -ci nostoc podar-ref-singleton.zip podar-ref-genomes.zip 

== This is sourmash version 4.8.7.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

(no signatures will be saved because of --silent/--count).
7 matches: podar-ref-singleton.zip
1 matches: podar-ref-genomes.zip

this is because in the podar-ref-genomes.zip file all of the Nostoc contigs are combined into one sketch, while in the singleton file there are 7 different FASTA sequences.

@ursky and @yuzie0314 when you run the above code do you get the same results?

I can explain how to adapt all of this to run against GTDB, too, but it would be reassuring to know that you're getting these results first :)

@ctb
Copy link
Contributor

ctb commented Feb 29, 2024

Note #2722 is merged, so as of sourmash v4.8.7 multigather will behave better :).

@yuzie0314
Copy link

yuzie0314 commented Mar 7, 2024

Hi @ctb,
Sorry for the late replying you.
I was palying around again useing your snippet and test set, and the results were just same as yours hopefully. 😍
In addition, I learned that I accidentally add --name ${meta}_fa to the sketch command, which was the main reason why I failed to reproduce resuls that I intended to gain.
Our testing was currently running, but I still beared a small question about why it was so slow in multigather step.
I didn't get a time to update my image to sourmash v4.8.7, and I was curious about why you said "behave better" ? Is the memory improvement or could run in several thread in a time?
Could I replace multigather with fastgather in branchwater plunge?
If so, could you please provide me some examples?

sourmash sketch dna -p scaled=1000,k=31,abund \\
        --name ${meta}_fa \\
        --singleton \\
        -o ${meta}_fa.singleton.gz \\
        ${contigs}

thanks for bearing me asking so many questions,
yuzie

@ctb
Copy link
Contributor

ctb commented Mar 7, 2024

Hi @ctb, Sorry for the late replying you. I was palying around again useing your snippet and test set, and the results were just same as yours hopefully. 😍 In addition, I learned that I accidentally add _--name ${meta}fa to the sketch command, which was the main reason why I failed to reproduce resuls that I intended to gain.

wonderful, and that makes sense!

Our testing was currently running, but I still beared a small question about why it was so slow in multigather step. I didn't get a time to update my image to sourmash v4.8.7, and I was curious about why you said "behave better" ? Is the memory improvement or could run in several thread in a time?

v4.8.6 is now an official release that includes #2722, so you can just install it, using the development version.

Could I replace multigather with fastgather in branchwater plugin? If so, could you please provide me some examples?

Well, fastmultigather from the branchwater plugin will generally be much, much, much faster, mostly because it is multithreaded. Replacing multigather with fastmultigather is possible, but I don't have the time right now to write it up - I'll circle back around in a bit to do so.

thanks for bearing me asking so many questions, yuzie

thanks for asking them! :)

@ctb
Copy link
Contributor

ctb commented Mar 22, 2024

v4.8.6 is now an official release that includes #2722, so you can just install it, using the development version.

I don't know why I said this. I was wrong - v4.8.6 did not include #2722. Apologies. Brain 🤯

However, the good news is that v4.8.7 is now out, and that DOES include #2722.

@ctb
Copy link
Contributor

ctb commented Mar 22, 2024

I've written a short fastmultigather quickstart here: #3095.

@ctb
Copy link
Contributor

ctb commented Mar 22, 2024

Between #2722 being merged, and the fastmultigather tutorial in #3095, I'm going to close this issue for now. Please don't hesitate to ask further questions here or in a new issue, depending on where you think it belongs - new issues are not a problem :)

@ctb ctb closed this as completed Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants