LT_3 > LT_10? #2

ailurophilia · 2023-10-16T23:01:38Z

Hi,

Thanks for this very useful set of tools! I'm running the octopus3 pipeline (just with run-octopus-analysis.sh), and the pipeline is apparently completing successfully, but when I look at LT_3 and LT_10 in aggregated-stats.tsv, some samples have LT_10s that are smaller than their LT_3s. This seems like it should be impossible, since any part of my input reference that has <3x coverage certainly should have <10x coverage.

Unfortunately, since I'm working with proprietary data, it's difficult for me to provide a reproducible example, but if you have any insight into why this LT_10/LT_3 pattern might be cropping up, I'd really appreciate it!

Thanks so much!

Best,
David

ailurophilia · 2023-10-17T17:32:51Z

Updating this to report what I think may be a bug, at least based on given variable definitions for LT_3 and LT_10.

Line 317 of the makefile currently reads

\''{if($$3 < 3) lt3 += 1; else if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

so I believe that lt10 is only being incremented if coverage at a given position in the reference is < 10 but not <3, which I don't think is the desired behavior. (I presume lt10 should be incremented whenever coverage is <10.)

Does this seem correct? Thanks!

Best,
David

bryanliujiang · 2023-11-03T20:30:49Z

Hey David! Thanks for putting up with this bug for this long and going beyond to isolate the source. You are correct that there is an extraneous else statement that's resulting in the unintended behavior of LT_10 metric not also including the reads classified under LT_3. And simply removing the else would result in the correct behavior.

CURRENT:
\''{if($$3 < 3) lt3 += 1; else if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

CORRECTED:
\''{if($$3 < 3) lt3 += 1; if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

Will add this patch in a future update before closing the issue.

Best,
Bryan

bryanliujiang · 2023-11-08T09:03:39Z

Going to add another point to this in case anyone else looking might be interested.

It might still be handy to include this bug as its own separate metric displaying the difference of LT_10 and LT_3 (i.e. LT_10 minus LT_3), as this would allow one to, at a glance, get a hunch on whether a region of low coverage is likely the result of low reads or of deletion. The idea being that the closer this hypothetical diff metric is to zero, the more likely it could suggest a deletion (assuming sufficient read counts/high Leftover metric and uniform coverage).

Contrived example:

Well	LT_10	LT_3	LT_diff (LT_10 - LT_3)	deletion?
A01	0.50	0.50	0.00	likely
B01	0.50	0.05	0.45	not as likely

If one suspects generally low coverage for a plasmid, it would be expected to look more like sample B01 where 50% of the plasmid has less than 10X coverage with the occasional dip to below 3X coverage, perhaps 5% of the B01 reference. However, if 50% of the plasmid has < 10X coverage but all of that is also < 3X coverage as suggested in sample A01, it may very well be due to the fact that there were zero reads in that region due to the plasmid having a deletion there but strong coverage everywhere else.

Depending on one's preferences, the diff metric could perhaps be something like the percentage of (LT_3 / LT_10) instead of just the difference.

Of course, one could also just glance at the CIGAR string for any large deletions, but the diff metric could still be a useful redundancy in case the Contig is messy (and thus messy CIGAR).

However, one should also keep in mind confounding sources of steep LT_diff. For example, in the cases of non-uniform coverage such as due to highly polymeric regions, this can manifest in sudden low coverage and thus relying on a diff metric alone would be misleading. Though, true deletions would be corroborated by other metrics, such as the existence of non-zero n_vars and mismatching of Ref_Seq and Consensus sequences, Ds in the CIGAR, and missing regions in the samtools tview pileup view.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LT_3 > LT_10? #2

LT_3 > LT_10? #2

ailurophilia commented Oct 16, 2023 •

edited

Loading

ailurophilia commented Oct 17, 2023

bryanliujiang commented Nov 3, 2023

bryanliujiang commented Nov 8, 2023

LT_3 > LT_10? #2

LT_3 > LT_10? #2

Comments

ailurophilia commented Oct 16, 2023 • edited Loading

ailurophilia commented Oct 17, 2023

bryanliujiang commented Nov 3, 2023

bryanliujiang commented Nov 8, 2023

ailurophilia commented Oct 16, 2023 •

edited

Loading