signalp6 parser #1886

Open
wants to merge 2 commits into
base: master


mschecht
Contributor

Hi anvi'o team!

I would like to add a new parser for anvi-import-functions that imports signalp6 annotations for gene calls. I believe it will be useful for the userbase to learn which proteins in their contigs-dbs have the potential to be secreted.

Here's an example of signalp6 with the Infant Gut Dataset:

If you just want to run the parser, paste this output into a file named prediction_results.txt; it will work with the example below.

$ cat prediction_results.txt
# SignalP-6.0	Organism: Other	Timestamp: 20220210111111
# ID	Prediction	OTHER	SP(Sec/SPI)	LIPO(Sec/SPII)	TAT(Tat/SPI)	TATLIPO(Sec/SPII)	PILIN(Sec/SPIII)	CS Position
52	LIPO	0.000000	0.000000	1.000050	0.000000	0.000000	0.000000	CS pos: 23-24. Pr: 0.9947
57	SP	0.000394	0.998806	0.000252	0.000182	0.000179	0.000159	CS pos: 32-33. Pr: 0.9711
58	LIPO	0.000000	0.000000	1.000062	0.000000	0.000000	0.000000	CS pos: 22-23. Pr: 0.9938
66	SP	0.001318	0.511084	0.486915	0.000299	0.000197	0.000167	CS pos: 32-33. Pr: 0.4850
80	SP	0.000331	0.998925	0.000253	0.000165	0.000164	0.000147	CS pos: 32-33. Pr: 0.9691
84	SP	0.000336	0.998851	0.000203	0.000207	0.000197	0.000182	CS pos: 24-25. Pr: 0.9744
86	LIPO	0.000000	0.000000	1.000077	0.000000	0.000000	0.000000	CS pos: 21-22. Pr: 0.9955
122	LIPO	0.000000	0.000000	1.000044	0.000000	0.000000	0.000000	CS pos: 22-23. Pr: 0.9970
123	LIPO	0.000000	0.000000	1.000072	0.000000	0.000000	0.000000	CS pos: 22-23. Pr: 0.9957

Quick note about the signalp6 output:

  • The parser does not import the model probabilities for the signal peptide annotation types.
  • It filters out predictions that are "OTHER" (no signal peptide annotation).
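
To make the two notes above concrete, here is a minimal sketch of that filtering logic in plain Python. It is not the code in this PR: the function name `parse_signalp6` and the returned tuple layout are assumptions for illustration, and the `OTHER` row in the sample is made up to show the filter (the real output above contains no such row).

```python
def parse_signalp6(lines):
    """Return (gene_id, prediction, cleavage_site) tuples for genes with a signal peptide."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the two '#' header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene_id, prediction = fields[0], fields[1]
        # "OTHER" means no signal peptide was predicted, so drop the row.
        if prediction == "OTHER":
            continue
        # Columns 2-7 hold the per-class probabilities, which are not kept.
        cleavage_site = fields[8] if len(fields) > 8 else ""
        hits.append((gene_id, prediction, cleavage_site))
    return hits


sample = [
    "# SignalP-6.0\tOrganism: Other\tTimestamp: 20220210111111",
    "# ID\tPrediction\tOTHER\tSP(Sec/SPI)\tLIPO(Sec/SPII)\tTAT(Tat/SPI)"
    "\tTATLIPO(Sec/SPII)\tPILIN(Sec/SPIII)\tCS Position",
    "52\tLIPO\t0.000000\t0.000000\t1.000050\t0.000000\t0.000000\t0.000000"
    "\tCS pos: 23-24. Pr: 0.9947",
    # Hypothetical row, added only to exercise the OTHER filter.
    "53\tOTHER\t1.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t",
]

hits = parse_signalp6(sample)
print(hits)
```

With this sample, only gene call 52 survives; its prediction and cleavage-site text are kept and the probability columns are discarded.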

Here is the parser in action:

# get sequences from IGD genome
anvi-get-sequences-for-gene-calls -c additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6563.db  --get-aa-sequences -o Enterococcus_faecalis_6563.fa

# Subset for a faster prediction
head -n 3001 Enterococcus_faecalis_6563.fa > Enterococcus_faecalis_6563_small.fa

# run signalp6
signalp6 --fastafile Enterococcus_faecalis_6563_small.fa --organism other --output_dir signal_peps --format txt --mode fast

# import signalp6 predictions into anvio
anvi-import-functions -c additional-files/pangenomics/external-genomes/Enterococcus_faecalis_6563.db -p signalp6 -i signal_peps/prediction_results.txt

Thanks!

@mschecht mschecht requested a review from ekiefl February 11, 2022 00:26
@mschecht mschecht self-assigned this Feb 11, 2022
@meren
Member

meren commented Feb 14, 2022

Hey @mschecht,

Thanks for adding this! Here are a few comments:

  • You shouldn't be using csv.reader when anvi'o has all the necessary utils functions to read such data in, but let's ignore that for this one.

  • More critically, we don't name our parsers based on the version number of output they can deal with. So the parser should be called signalp, and deal with versions internally. For instance, if the output file (and/or its headers) doesn't look like the expected output the parser knows how to work with, it should say "I'm only dealing with output from v6 at the moment, and this doesn't look like it". Currently there is no check. What if I send this parser any TAB-delimited file I want? You know :)
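
As a rough illustration of the kind of check described in the second point, the sketch below validates the two header lines before any rows are read. Everything here is an assumption rather than anvi'o code: the function name `check_signalp6_header` is hypothetical, `ValueError` stands in for whatever error type the parser would actually raise, and the expected columns are copied from the example output in the PR description, which was produced with `--organism other` (other modes may emit different columns).

```python
# Column names copied from the example prediction_results.txt in the PR description.
EXPECTED_COLUMNS = [
    "ID", "Prediction", "OTHER", "SP(Sec/SPI)", "LIPO(Sec/SPII)",
    "TAT(Tat/SPI)", "TATLIPO(Sec/SPII)", "PILIN(Sec/SPIII)", "CS Position",
]


def check_signalp6_header(lines):
    """Raise ValueError unless the first two lines look like SignalP v6 output."""
    if len(lines) < 2 or not lines[0].startswith("# SignalP-6"):
        raise ValueError(
            "This parser only handles SignalP v6 output at the moment, "
            "and this file does not look like it."
        )
    # The second line is '# ID<TAB>Prediction<TAB>...'; strip the leading '# '.
    columns = lines[1].rstrip("\n").lstrip("# ").split("\t")
    if columns != EXPECTED_COLUMNS:
        raise ValueError("Unexpected SignalP v6 columns: %s" % ", ".join(columns))


good_header = [
    "# SignalP-6.0\tOrganism: Other\tTimestamp: 20220210111111",
    "# " + "\t".join(EXPECTED_COLUMNS),
]
# Any other TAB-delimited file should be rejected.
bad_header = ["gene\tscore", "1\t0.5"]

check_signalp6_header(good_header)  # passes silently
try:
    check_signalp6_header(bad_header)
except ValueError as error:
    print("rejected:", error)
```

Running a check like this before parsing means an arbitrary TAB-delimited file fails early with a clear message, and supporting another SignalP version later becomes a matter of adding a branch rather than a new parser name.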
