Genes missing /gene field in .gbk output #708

Open
NonAggressiveHail opened this issue Sep 16, 2024 · 0 comments

Hello,

I am currently reannotating many P. aeruginosa genomes, and I want to use the PAO1 annotations from the Pseudomonas Genome Database, together with a couple of other proteins, as the reference for the first round of annotation. However, when PAO1 itself is annotated, not all the expected gene names are present, and I am struggling to work out why.

In my reference file, Pa_PAO1_107_annotations.gbk, one gene has the following entry:
gene complement(2694546..2694764)
/gene="PA2412"
/locus_tag="PA2412"
/db_xref="Pseudomonas Genome DB: PGD107602"
CDS complement(2694546..2694764)
/gene="PA2412"
/locus_tag="PA2412"
/product="conserved hypothetical protein"
/codon_start=1
/translation_table=11
/translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK
DCLAYIEEVWTDMRPLSLRQHMDKAAG"
/protein_id="NP_251102.1"

After converting to a fasta file with prokka-genbank_to_fasta_db, we have the following entry:
>NP_251102.1 ~~~PA2412~~~conserved hypothetical protein
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRP
LSLRQHMDKAAG
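
As a quick sanity check that the gene name survived the conversion, the header can be split on the `~~~` delimiter. This is a minimal sketch, assuming the description has the layout `EC_number~~~gene~~~product` as it appears in the entry above (the EC field is empty here). `parse_header` is a hypothetical helper written for illustration, not part of Prokka.

```python
def parse_header(line):
    """Split a '>ID EC~~~gene~~~product' FASTA header into its fields.

    The field order is an assumption read off the entry above.
    """
    seq_id, _, desc = line.lstrip(">").strip().partition(" ")
    fields = desc.split("~~~")
    # Pad with empty strings in case trailing fields are missing.
    fields += [""] * (3 - len(fields))
    ec, gene, product = fields[:3]
    return {"id": seq_id, "ec": ec, "gene": gene, "product": product}


header = ">NP_251102.1 ~~~PA2412~~~conserved hypothetical protein"
print(parse_header(header))
```

For this header the gene field comes back as `PA2412`, so the name is present in the database that is passed to `--proteins`.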

I then run Prokka with:
prokka --outdir ./Pa_PAO1_107/ --prefix Pa_PAO1_107 --proteins ../../raw_data/genomes/siderophore_annotations.db --force --locustag Pa_PAO1_107 --cpus 8 ../oriented_genomes/Pa_PAO1_107/Pa_PAO1_107_reoriented.fasta

In the output file, Pa_PAO1_107.gbk, I have no matches for PA2412; however, I do have the following entry:
CDS complement(2694064..2694282)
/locus_tag="Pa_PAO1_107_02485"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251102.1"
/note="conserved hypothetical protein"
/codon_start=1
/transl_table=11
/product="hypothetical protein"
/translation="MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK
KDCLAYIEEVWTDMRPLSLRQHMDKAAG"

You can see that the two amino acid sequences are identical:
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKKDCLAYIEEVWTDMRPLSLRQHMDKAAG
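
The same comparison can be done programmatically, which avoids eyeballing sequences that are wrapped at different columns in the two files. A minimal sketch, with the two wrapped `/translation` values copied from the entries above; `join_translation` is a hypothetical helper name.

```python
def join_translation(wrapped):
    """Remove line breaks and spaces from a wrapped /translation value."""
    return "".join(wrapped.split())


# Reference entry (Pa_PAO1_107_annotations.gbk), wrapped as in the file.
ref = """MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLKK
DCLAYIEEVWTDMRPLSLRQHMDKAAG"""

# Prokka output entry (Pa_PAO1_107.gbk), wrapped one column earlier.
out = """MTSVFDRDDIQFQVVVNHEEQYSIWPEYKEIPQGWRAAGKSGLK
KDCLAYIEEVWTDMRPLSLRQHMDKAAG"""

assert join_translation(ref) == join_translation(out)
print(len(join_translation(ref)))  # 72 residues
```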

I am unsure why, with identical amino acid sequences, this has not been annotated with /gene="PA2412". Clearly it has matched to some degree, as the inference is /inference="similar to AA sequence:siderophore_annotations.db:NP_251102.1".

For another protein it has worked as expected:
Reference entry:
gene complement(2693781..2694545)
/gene="PA2411"
/locus_tag="PA2411"
/db_xref="Pseudomonas Genome DB: PGD107600"
CDS complement(2693781..2694545)
/gene="PA2411"
/locus_tag="PA2411"
/product="probable thioesterase"
/codon_start=1
/translation_table=11
/translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGAR
MAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGF
FACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADF
LLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQR
EAEVLAVVECQVEAWRAGQGAAALAVESAAIC"
/protein_id="NP_251101.1"

Fasta entry:
>NP_251101.1 ~~~PA2411~~~probable thioesterase
MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGARMAEPLQTDLASLAQQ
LARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPLGFFACGTAAPSRRAEYDR
GFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILRADFLLCGSYRHQRRPPLACP
IRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFFIHQREAEVLAVVECQVEAWRAG
QGAAALAVESAAIC

Output .gbk entry:
CDS complement(2693299..2694063)
/gene="PA2411"
/locus_tag="Pa_PAO1_107_02484"
/inference="ab initio prediction:Prodigal:002006"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251101.1"
/codon_start=1
/transl_table=11
/product="putative thioesterase"
/translation="MGGTPVRLFCLPYSGASAMTYSRWRRKLPAWLAVRPVELPGRGA
RMAEPLQTDLASLAQQLARELHDEVRQGPYAMLGHSLGALLACEVLYALRELGCPTPL
GFFACGTAAPSRRAEYDRGFAEPKSDAELIADLRDLQGTPEEVLGNRELMSLTLPILR
ADFLLCGSYRHQRRPPLACPIRTLGGREDKASEEQLLAWAEETRSGFELELFDGGHFF
IHQREAEVLAVVECQVEAWRAGQGAAALAVESAAIC"

Why is it that for the second entry there is a gene field, but for the first there is not?
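
To see how many features are affected, the output file can be scanned for CDS entries that cite a hit from the `--proteins` database but carry no `/gene` qualifier. The sketch below is a rough plain-text check, not a full GenBank parser: it assumes a feature starts on a line of the form `KEY location`, and the function name and the trimmed `sample` text are illustrative only.

```python
import re

# Heuristic: a feature starts with a key followed by a location.
FEATURE_KEY = re.compile(
    r"^[ \t]*([A-Za-z_]+)[ \t]+(?:complement\(|join\(|<?\d)", re.MULTILINE
)


def cds_hits_without_gene(gbk_text):
    """List (locus_tag, product) for CDS features with a hit but no /gene."""
    starts = [(m.start(), m.group(1)) for m in FEATURE_KEY.finditer(gbk_text)]
    ends = [pos for pos, _ in starts[1:]] + [len(gbk_text)]
    missing = []
    for (start, key), end in zip(starts, ends):
        if key != "CDS":
            continue
        # Collapse wrapped lines so 'similar to AA\nsequence:' matches.
        flat = " ".join(gbk_text[start:end].split())
        locus = re.search(r'/locus_tag="([^"]+)"', flat)
        product = re.search(r'/product="([^"]+)"', flat)
        if "similar to AA sequence:" in flat and '/gene="' not in flat and locus:
            missing.append((locus.group(1), product.group(1) if product else ""))
    return missing


# Trimmed copies of the two output entries shown above.
sample = '''
CDS complement(2693299..2694063)
/gene="PA2411"
/locus_tag="Pa_PAO1_107_02484"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251101.1"
/product="putative thioesterase"
CDS complement(2694064..2694282)
/locus_tag="Pa_PAO1_107_02485"
/inference="similar to AA
sequence:siderophore_annotations.db:NP_251102.1"
/note="conserved hypothetical protein"
/product="hypothetical protein"
'''

print(cds_hits_without_gene(sample))
```

On the two entries above this reports only `Pa_PAO1_107_02485`. That entry is also the one whose `/product` was rewritten to "hypothetical protein", with the original product moved to `/note`, so it may be worth checking whether every feature the scan reports shares that pattern.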

Thanks
