Hi fishercera,
I read your approach to de novo transcriptome assembly and am interested. However, I am not a bioinformatics expert but a DIY, "learn while doing it" kind of guy.
I want to follow the process and use it on my own data.
Could you be kind enough to provide an outline of the commands you used to reach the end result of ~20,000 transcripts? This would be very helpful. Thanks in advance. My email is [email protected]
Your words are as follows:
I started with >100,000 transcripts in a de-novo transcriptome made from
pooled siblings' tissues.
What I have done to "winnow" transcripts is to filter by coverage, as here: https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification#filtering-transcripts
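For anyone following along, the idea behind this kind of filter can be sketched in a few lines of Python. This is not the script from the linked Trinity wiki page, just a stand-in under assumptions: a tab-delimited expression table with transcript IDs in the first column and one expression value (e.g. TPM) per sample in the remaining columns. The function name, the `min_tpm` cutoff of 1.0, and the example IDs are all hypothetical.

```python
import csv
import io


def ids_passing_expression(table_lines, min_tpm=1.0):
    """Return transcript IDs whose highest per-sample value is >= min_tpm.

    Assumes a tab-delimited table: header row first, then one row per
    transcript with the ID in column 1 and numeric values after it.
    """
    reader = csv.reader(table_lines, delimiter="\t")
    next(reader)  # skip the header row (sample names)
    keep = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        transcript_id = row[0]
        values = [float(v) for v in row[1:]]
        if max(values) >= min_tpm:
            keep.append(transcript_id)
    return keep


# Made-up example table (hypothetical IDs and values).
example_table = (
    "\tsampleA\tsampleB\n"
    "tx1\t0.2\t0.4\n"
    "tx2\t5.0\t0.0\n"
    "tx3\t1.0\t3.5\n"
)
print(ids_passing_expression(io.StringIO(example_table)))
```

The cutoff is a judgment call; a transcript is kept here if it reaches the threshold in at least one sample, which is only one of several reasonable rules.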
Then I take the remaining transcripts that passed that filter and I predict
ORFs with something like Transdecoder (I used GeneMarkS-T).
THEN I cluster the predicted proteome at a 70% identity threshold using
USEARCH: https://www.drive5.com/usearch/
The centroid sequences you get from that are the ones that are most
representative of each cluster. I take the headers for the centroid
proteins and use them to pull the matching nucleotide transcripts from my
assembly.
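The "use the centroid headers to pull the matching nucleotide transcripts" step could look roughly like the sketch below. The big assumption is how a protein header maps back to its transcript ID: here the protein ID is taken to be the transcript ID plus a `.p<number>` suffix (TransDecoder-style). Other ORF predictors name their proteins differently, so `protein_to_transcript_id` is a hypothetical helper you would need to adapt after looking at your own headers. All function names and example sequences are made up for illustration.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def protein_to_transcript_id(protein_header):
    """ASSUMPTION: protein ID = transcript ID + '.p<number>' suffix."""
    first_token = protein_header.split()[0]
    return re.sub(r"\.p\d+$", "", first_token)


def select_transcripts(centroid_lines, assembly_lines,
                       to_transcript_id=protein_to_transcript_id):
    """Keep assembly records whose ID matches a centroid protein."""
    wanted = {to_transcript_id(h) for h, _ in read_fasta(centroid_lines)}
    return [(h, s) for h, s in read_fasta(assembly_lines)
            if h.split()[0] in wanted]


# Made-up example data.
centroids = [">t1.p1 some annotation", "MKV", ">t3.p2", "MAA"]
assembly = [">t1 len=9", "ATGAAA", "GTT", ">t2 len=3", "ATG", ">t3 len=3", "CCC"]
for header, seq in select_transcripts(centroids, assembly):
    print(f">{header}\n{seq}")
```

With real files you would pass open file handles instead of lists, since both just need to be iterables of lines.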
This has generally ended up with a nice manageable transcriptome of ~20,000
transcripts. The N50 goes up considerably. And my BUSCO results are quite
good!
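To see why the N50 goes up after winnowing, it helps to compute it by hand. N50 is the length L such that transcripts of length L or longer account for at least half of the total assembled bases. A minimal sketch, with made-up lengths that are not from the assembly described above:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical lengths: dropping the short fragments raises the N50.
before = [200, 300, 400, 1500, 2000]
after = [1500, 2000]
print(n50(before), n50(after))
```

Removing many short, fragmentary transcripts shrinks the total length, so the halfway point is reached by longer sequences. That is also why a higher N50 alone says little about quality, and checking BUSCO alongside it is sensible.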
AlexGaithuma changed the title: "winnow" transcripts is to filter by coverage → "winnow" transcripts to filter by coverage (Dec 24, 2020)