Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LT_3 > LT_10? #2

Open
ailurophilia opened this issue Oct 16, 2023 · 3 comments
Open

LT_3 > LT_10? #2

ailurophilia opened this issue Oct 16, 2023 · 3 comments

Comments

@ailurophilia
Copy link

ailurophilia commented Oct 16, 2023

Hi,

Thanks for this very useful set of tools! I'm running the octopus3 pipeline (just with run-octopus-analysis.sh), and the pipeline is apparently completing successfully, but when I look at LT_3 and LT_10 in aggregated-stats.tsv, some samples have LT_10s that are smaller than their LT_3s. This seems like it should be impossible, since any part of my input reference that has <3x coverage certainly should have <10x coverage.

Unfortunately, since I'm working with proprietary data, it's difficult for me to provide a reproducible example, but if you have any insight into why this LT_10/LT_3 pattern might be cropping up, I'd really appreciate it!

Thanks so much!

Best,
David

@ailurophilia
Copy link
Author

Updating this to report what I think may be a bug, at least based on given variable definitions for LT_3 and LT_10.

Line 317 of the makefile currently reads

\''{if($$3 < 3) lt3 += 1; else if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

so I believe that lt10 is only being incremented if coverage at a given position in the reference is < 10 but not <3, which I don't think is the desired behavior. (I presume lt10 should be incremented whenever coverage is <10.)

Does this seem correct? Thanks!

Best,
David

@bryanliujiang
Copy link
Contributor

Hey David! Thanks for putting up with this bug for this long and going beyond to isolate the source. You are correct that there is an extraneous else statement that's resulting in the unintended behavior of LT_10 metric not also including the reads classified under LT_3. And simply removing the else would result in the correct behavior.

CURRENT:
\''{if($$3 < 3) lt3 += 1; else if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

CORRECTED:
\''{if($$3 < 3) lt3 += 1; if($$3 < 10) lt10 += 1} END {len=split(path,a,"/"); print a[len-3],a[len-1],a[len],lt10/NR,lt3/NR}'\' \

Will add this patch in a future update before closing the issue.

Best,
Bryan

@bryanliujiang
Copy link
Contributor

Going to add another point to this in case anyone else looking might be interested.

It might still be handy to include this bug as its own separate metric displaying the difference of LT_10 and LT_3 (i.e. LT_10 minus LT_3), as this would allow one to, at a glance, get a hunch on whether a region of low coverage is likely the result of low reads or of deletion. The idea being that the closer this hypothetical diff metric is to zero, the more likely it could suggest a deletion (assuming sufficient read counts/high Leftover metric and uniform coverage).

Contrived example:

Well LT_10 LT_3 LT_diff (LT_10 - LT_3) deletion?
A01 0.50 0.50 0.00 likely
B01 0.50 0.05 0.45 not as likely

If one suspects generally low coverage for a plasmid, it would be expected to look more like sample B01 where 50% of the plasmid has less than 10X coverage with the occasional dip to below 3X coverage, perhaps 5% of the B01 reference. However, if 50% of the plasmid has < 10X coverage but all of that is also < 3X coverage as suggested in sample A01, it may very well be due to the fact that there were zero reads in that region due to the plasmid having a deletion there but strong coverage everywhere else.

Depending on one's preferences, the diff metric could perhaps be something like the percentage of (LT_3 / LT_10) instead of just the difference.

Of course, one could also just glance at the CIGAR string for any large deletions, but the diff metric could still be a useful redundancy in case the Contig is messy (and thus messy CIGAR).

However, one should also keep in mind confounding sources of steep LT_diff. For example, in the cases of non-uniform coverage such as due to highly polymeric regions, this can manifest in sudden low coverage and thus relying on a diff metric alone would be misleading. Though, true deletions would be corroborated by other metrics, such as the existence of non-zero n_vars and mismatching of Ref_Seq and Consensus sequences, Ds in the CIGAR, and missing regions in the samtools tview pileup view.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants