"total" alignments is less than "protein_coding" #325

mebbert · 2017-04-07T21:53:45Z

Hi,

I'm using bam_vs_bed to quantify the number of reads that are aligned to different transcript types (RNASeq) using a bed file where the ID is a gencode gene_type. The output for one sample is below. The apparent issue (unless I'm missing something) is that the "total" count is an order of magnitude less than the count for "protein_coding".

Thanks!

category                           alignments
total                              201301470
unitary_pseudogene                 173278
rRNA                               4248
lincRNA                            62936102
IG_C_pseudogene                    8794
Mt_tRNA                            51846
sense_intronic                     987776
IG_V_gene                          168506
misc_RNA                           32574414
Mt_rRNA                            264
polymorphic_pseudogene             169312
scaRNA                             311310
IG_J_gene                          3077
TR_J_pseudogene                    393
IG_J_pseudogene                    27
scRNA                              27
TEC                                596168
protein_coding                     1123560888
bidirectional_promoter_lncRNA      1142
TR_V_pseudogene                    2807
sense_overlapping                  903021
TR_J_gene                          19092
IG_V_pseudogene                    4222
pseudogene                         881
snRNA                              2404686
sRNA                               87
unprocessed_pseudogene             2308132
TR_V_gene                          95649
transcribed_unprocessed_pseudogene 6068130
3prime_overlapping_ncRNA           225029
miRNA                              702555
non_coding                         10232988
macro_lncRNA                       1320
antisense                          27570605
IG_C_gene                          452409
ribozyme                           1103904
IG_D_gene                          12
TR_C_gene                          851700
processed_transcript               10490019
transcribed_processed_pseudogene   698627
processed_pseudogene               2236140
snoRNA                             321205
vaultRNA                           9
transcribed_unitary_pseudogene     137381

The text was updated successfully, but these errors were encountered:

AndreasHeger · 2017-04-10T13:47:52Z

Thanks, I will have a look. The counts come from bedtools, while the "total" value comes from the BAM file and is the number of (mapped) alignments.

tushardave26 · 2017-12-05T21:05:25Z

@AndreasHeger - Did you get a chance to look at the issue? I am experiencing the same issue. Can you please shed some light?

AndreasHeger · 2017-12-08T13:22:59Z

Sorry for the delay. The issue is likely that double-counting occurs. I assume that the bed file is probably derived from a gene set with multiple alternative transcripts? There are probably duplicates:

chr1     10000     20000     protein_coding            # gene1, transrcipt1
chr1     10000     20000     protein_coding            # gene1, transcript2

This will be counted twice into the protein_coding bin by bedtools. To avoid this, try:

zcat input_with_duplicates.bed.gz | cgat bed2bed --merge-by-name | bgzip > input_without_duplicates.bed.gz

AndreasHeger · 2017-12-08T13:23:35Z

I will add some text to the help message to warn users.

AndreasHeger · 2017-12-08T13:30:55Z

Please let me know if this is likely to be the issue.

tushardave26 · 2017-12-14T20:20:05Z

@AndreasHeger - Thanks for your suggestion. I tried it and it doesn't make any difference. Seems like input BED file doesn't have any multiple alternative transcripts.

AndreasHeger · 2017-12-15T16:46:26Z

Thanks. Could you share some of your data? I was able to reproduce the issue using duplicated intervals, but don't know how else it comes about.

AndreasHeger added a commit that referenced this issue Dec 8, 2017

{AH} add message to help text warning about duplicates, see #325

8c01418

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"total" alignments is less than "protein_coding" #325

"total" alignments is less than "protein_coding" #325

mebbert commented Apr 7, 2017 •

edited

Loading

AndreasHeger commented Apr 10, 2017 •

edited

Loading

tushardave26 commented Dec 5, 2017

AndreasHeger commented Dec 8, 2017 •

edited

Loading

AndreasHeger commented Dec 8, 2017

AndreasHeger commented Dec 8, 2017

tushardave26 commented Dec 14, 2017

AndreasHeger commented Dec 15, 2017

"total" alignments is less than "protein_coding" #325

"total" alignments is less than "protein_coding" #325

Comments

mebbert commented Apr 7, 2017 • edited Loading

AndreasHeger commented Apr 10, 2017 • edited Loading

tushardave26 commented Dec 5, 2017

AndreasHeger commented Dec 8, 2017 • edited Loading

AndreasHeger commented Dec 8, 2017

AndreasHeger commented Dec 8, 2017

tushardave26 commented Dec 14, 2017

AndreasHeger commented Dec 15, 2017

mebbert commented Apr 7, 2017 •

edited

Loading

AndreasHeger commented Apr 10, 2017 •

edited

Loading

AndreasHeger commented Dec 8, 2017 •

edited

Loading