-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Using more stringent parameters to avoid spurious sequences #17
Comments
@apcamargo thank you trying Plass. Yes it is hard to reliably call short genes and distinguish the frames correctly. We set the threshold of our NN to not harm the sensitivity too much. It would for sure help to change the threshold to 0.5 or even higher. Another idea for short genes would be, instead of doing a six-frame translation use some gene caller on the read and only assemble this fragments using Plass. This way a lot of the wrong frame assemblies should disappear. Alternative approach to call short genes: |
Thank you, @martin-steinegger I have 2×150bp reads that do not overlap, so I'm afraid that I won't be able to call genes in such short sequences. Thank you for the suggestion! I'll try to raise the neural network threshold and see how much of a difference it does in my dataset. In case I still get a large fraction of short peptides, I'll probably disable partial sequences by allowing only peptides with a start and a stop codon. I know that by filtering out partial peptides I'll filter out many real protein sequences but most of those partial peptides can't be functionally anotated anyway. Realistically, I don't think that I'll lose much information (especially considering that I'll aggregate my PLASS and my MEGAHIT+Prodigal proteins into a non-redundant dataset). |
@apcamargo Did you have good results from adjusting the NN option with your shorter peptides? It would be helpful to know! |
Hi @mooreryan I classified my sequences into three groups:
Here's the length histogram for each of these groups: I've noticed that there's a bimodal distribution with a short peptides peak (<60aa) and a long peptides peak (>60aa). So, to test whether raising the NN threshold would improve the reliability of the assembled peptides, I BLASTed subsets of 500 short peptides of each group. In the table below, I put the number of sequences that had at least one hit in the
As you can see, the proportion of sequences that had at least one hit is not very different. As the assembly with stricter parameters had a much lower number of peptides (27065097 vs. 32156588), I decided to keep the default parameters. |
@apcamargo Thanks, that is very helpful!! |
@apcamargo Thanks for the very interesting analysis! I was wondering what the left distributions below 60 amino acids length are. They look a bit like what you would expect of false calls with exponentially decaying length distribution (due to the random occurrence of stop codons). Could you plot them on a linear scale between 0 and 100 amino acids to get a better feeling for this? Another possibility is that at least some of them are real but very short, expressed, and active peptides like the ones recently described by Amy Bhat's group: Could you plot the fractions of BLAST matches for l<=60 aa and l > 60 aa for your three groups (complet, semi, partial)? (It might turn out ambiguous because a large fraction of them will have no homologs in the database and homologies are more difficult to find for such short proteins.) Perhaps some clustering and conservation analysis like the Bhatt lab performed could turn out interesting. They clustered their short putative proteins predicted in 1700 human-associated metagenomes and ran RNAcode (Washietl et al., 2011) on the MSAs, which classifies them as coding or noncoding based on amino acid versus nucleotide-level conservation. This might be more appropriate to continue by email. Feel free to reply to soeding at mpibpc.mpg.de |
@apcamargo Thank you again for your insightful analysis. We added a remark in the README of Plass. |
Thanks @martin-steinegger! After our conversation. Got some ideas to improve the NN for small peptides. If you plan to work on that in the future, let me know! |
@apcamargo great to hear! Currently I do not work on it. But I am happy to help integrating the networks if needed. |
Hey there,
I've been testing some PLASS assemblies with my datasets and I noticed that it retrieves ~50x times more proteins than my MEGAHIT+Prodigal predictions, which is way more than what is shown in Fig S2 of the paper.
I ran PLASS with default parameters, which means that a lot of very short peptides were retrieved (as the default value of
--min-length
is 45). I also noticed that most of them are incomplete (without * at the beginning and the end of the sequence). Even though partial peptides are to be expected in metagenome assemblies, I'm concerned that PLASS may be giving me a substantial amount of spurious sequences. In Swiss-Prot, ~8% of the prokaryotic proteins are between 45 and 100 residues. In my assemblies, 40% of the sequences fall into this interval.In the paper, you excluded peptides shorter than 100 residues (which is something that I'd expect to attain by using
--min-length 100
), but I'm apprehensive of doing that because I don't want to left short bona-fide proteins out of my assembly.Do you think raising the neural network threshold is a sound idea to solve the problem of spurious sequences? How was the default threshold (0.2) determined?
The text was updated successfully, but these errors were encountered: