Use Plass for euk metagenomics data #28
@liuxianghui Plass extracts all open reading frames from short reads and extends them through overlap detection. This works well for proteins that are encoded contiguously. However, eukaryotic genes contain introns, so reads cannot simply be overlapped to recover the full protein. What makes the detection of eukaryotic genes hard? The fragmentation of the genomes?
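To make the intron problem concrete, here is a toy sketch in Python. It is not Plass's actual algorithm, and the sequences (`EXON1`, `INTRON`, `EXON2`) and the helper `translate` are made up for illustration. It only shows that translating a spliced coding sequence gives the intact protein, while translating the genomic sequence (as a read spanning the exon/intron boundary would be) runs into stop codons and a frameshift.

```python
# Toy illustration only: hypothetical sequences, not real genes and not Plass code.

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate frame 0 of a DNA string; trailing partial codons are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical gene: two exons separated by a 10 bp intron.
EXON1 = "ATGGCTGAA"
INTRON = "GTAAGTTAGT"  # length not divisible by 3, contains in-frame stops
EXON2 = "AAGCTGTTT"

spliced = EXON1 + EXON2            # what a transcriptome read would cover
genomic = EXON1 + INTRON + EXON2   # what a metagenomic read would cover

protein_from_mrna = translate(spliced)
protein_from_genome = translate(genomic)

if __name__ == "__main__":
    print("spliced :", protein_from_mrna)
    print("genomic :", protein_from_genome)
    # Only the part before the first stop matches the real protein.
    print("usable prefix:", protein_from_genome.split("*")[0])
```

In this toy case the genomic translation shares only the first exon's residues with the real protein before it diverges, which is why overlapping translated genomic reads cannot bridge an intron, while transcriptome reads (already spliced) do not have this problem.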
For bacteria, the usual approach is to assemble the reads into contigs and then use Prodigal to predict the genes. However, this does not work for eukaryotes: we have to bin the contigs into genomes, identify the eukaryotic bins, and attempt a taxonomic assignment. Then we use tools such as GeneMark-ES to predict genes for each genome. (There is no tool that works on eukaryotic contigs the way Prodigal does for bacteria.) GeneMark-ES uses a self-trained model for each genome to make its predictions. Augustus has a limited set of models and only applies to specific eukaryotic genomes.
I want to extract euk genes/proteins from metagenomics data. I want to build a gene/protein catalog for euk genes.
It seems that MetaEuk is a reference-guided approach (based on MMseqs2) and Plass is a de novo approach (not relying on reference protein sequences).
I don't understand the statement in your paper about Plass on euk protein assembly.
"Our chief limitation is that, unlike nucleotide assemblers, Plass cannot place the assembled protein sequences into genomic context. Furthermore, it cannot assemble intron-containing eukaryotic proteins, although, as shown, it can assemble eukaryotic proteins from transcriptome data. Another drawback is its inability to resolve homologous proteins from closely related strains or species with sequence identities above ~95%. However, the impact on the accuracy of predicted functions is low (Fig. 2) and bacterial phenotypes are determined more by the complement of horizontally acquired accessory genes than by minor variations in protein sequences."
I understand that the methods behind MMseqs2 and Plass are different, but MMseqs2 should be able to handle the 'intron-containing eukaryotic proteins'.
Anyway, could you kindly suggest a good way to identify these eukaryotic proteins? (Predicting eukaryotic genes from binned eukaryotic genomes is very troublesome.)