Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"total" alignments is less than "protein_coding" #325

Open
mebbert opened this issue Apr 7, 2017 · 7 comments
Open

"total" alignments is less than "protein_coding" #325

mebbert opened this issue Apr 7, 2017 · 7 comments

Comments

@mebbert
Copy link

mebbert commented Apr 7, 2017

Hi,

I'm using bam_vs_bed to quantify the number of reads that are aligned to different transcript types (RNASeq) using a bed file where the ID is a gencode gene_type. The output for one sample is below. The apparent issue (unless I'm missing something) is that the "total" count is an order of magnitude less than the count for "protein_coding".

Thanks!

category                           alignments
total                              201301470
unitary_pseudogene                 173278
rRNA                               4248
lincRNA                            62936102
IG_C_pseudogene                    8794
Mt_tRNA                            51846
sense_intronic                     987776
IG_V_gene                          168506
misc_RNA                           32574414
Mt_rRNA                            264
polymorphic_pseudogene             169312
scaRNA                             311310
IG_J_gene                          3077
TR_J_pseudogene                    393
IG_J_pseudogene                    27
scRNA                              27
TEC                                596168
protein_coding                     1123560888
bidirectional_promoter_lncRNA      1142
TR_V_pseudogene                    2807
sense_overlapping                  903021
TR_J_gene                          19092
IG_V_pseudogene                    4222
pseudogene                         881
snRNA                              2404686
sRNA                               87
unprocessed_pseudogene             2308132
TR_V_gene                          95649
transcribed_unprocessed_pseudogene 6068130
3prime_overlapping_ncRNA           225029
miRNA                              702555
non_coding                         10232988
macro_lncRNA                       1320
antisense                          27570605
IG_C_gene                          452409
ribozyme                           1103904
IG_D_gene                          12
TR_C_gene                          851700
processed_transcript               10490019
transcribed_processed_pseudogene   698627
processed_pseudogene               2236140
snoRNA                             321205
vaultRNA                           9
transcribed_unitary_pseudogene     137381
@AndreasHeger
Copy link
Member

AndreasHeger commented Apr 10, 2017

Thanks, I will have a look. The counts come from bedtools, while the "total" value comes from the BAM file and is the number of (mapped) alignments.

@tushardave26
Copy link

@AndreasHeger - Did you get a chance to look at the issue? I am experiencing the same issue. Can you please shed some light?

@AndreasHeger
Copy link
Member

AndreasHeger commented Dec 8, 2017

Sorry for the delay. The issue is likely that double-counting occurs. I assume that the bed file is probably derived from a gene set with multiple alternative transcripts? There are probably duplicates:

chr1     10000     20000     protein_coding            # gene1, transrcipt1
chr1     10000     20000     protein_coding            # gene1, transcript2

This will be counted twice into the protein_coding bin by bedtools. To avoid this, try:

zcat input_with_duplicates.bed.gz | cgat bed2bed --merge-by-name | bgzip > input_without_duplicates.bed.gz

@AndreasHeger
Copy link
Member

I will add some text to the help message to warn users.

@AndreasHeger
Copy link
Member

Please let me know if this is likely to be the issue.

@tushardave26
Copy link

@AndreasHeger - Thanks for your suggestion. I tried it and it doesn't make any difference. Seems like input BED file doesn't have any multiple alternative transcripts.

@AndreasHeger
Copy link
Member

Thanks. Could you share some of your data? I was able to reproduce the issue using duplicated intervals, but don't know how else it comes about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants