Automatic threshold determination #126

ckuenne · 2020-05-14T15:40:01Z

Hi,

is there a systematic way to find meaningful cutoffs for -s/-w? How do you decide on these?

Best,
Carsten

VJalili · 2020-05-14T17:36:33Z

That's a good question!

The choice of stringent (-s) and weak (-w) thresholds are subject to your experiment setup and peak calling pipeline.

You may consider this example for choosing thresholds:
Based on your experiment setup, you may decide to use, e.g., 1e-8 as the p-value threshold on MACS so to minimize the number of false-positives. In order to increase the number of true-positives without the penalty of significantly increasing the number of false-positives, we suggest you use a more permissive threshold for peak calling (e.g., 1e-4), and then apply mspc to rescue weak but reproducible regions. Specifically, using 1e-4 on MACS will increase the number of true-positives with the penalty of more false-positives, then you can run MSPC on the replicates with -w 1e-4 and -s 1e-8, which will rescue the reproducible weak binding sites (i.e., increase the number of true-positives without increasing the number of false-positives). Additionally, in this setup, you may choose -g 1e-6 for a more permissive process; by default -g (combined stringency) is set to be equal to -s.

You may need to choose different thresholds if you use a different peak caller or different experiment setup.

We are interested in developing a method that can suggest values for these thresholds based on given input. Do you have any particular method in mind?

ping @marziacremona @fernandoPalluzzi

ckuenne · 2020-05-15T08:02:14Z

Not really. I'm working with multiple setups including ChIP-Seq (histones, tf), ATAC-Seq, RIP-Seq, and so on, which result in different peak sizes, distributions, and signal-to-noise ratio. Peak calling is done with MACS2, EPIC2, and MUSIC. If you plot these peaks after sorting for ascending -log10(fdr) you get something like this (example from using MACS2 with q-value 1e-4 on ATAC-Seq resulting in ~80k peaks).

A generalized algorithm would have to dynamically find lower/upper bounds here, which I have no good idea about. You could look for the slope, i.e. where it changes from linear to exponential. Or area under the curve, but that basically amounts to percentiles, which seems kinda crude.

BTW: in my experience the "extreme" peaks are often artefacts that result from erroneously calling multiple peaks as one, or are located in weird regions that should really be blacklisted. I know you can't fix issues with the peak calling itself, this should be done before your software is run. Just saying that the super peaks are often fishy, despite being highly significant and reproducible.

Edit: I just included your two suggested thresholds.

VJalili · 2020-05-19T03:46:40Z

This is a good suggestion, we can give it a try. We can work on it together if you're interested.

Regarding the second point: Yes, agree. We discussed excluding black listed regions (#85), probably we can prioritize it.

marziacremona · 2020-05-20T18:38:34Z

The automatic thresholding would be indeed a cool improvement for MSPC, but it’s very hard to do since it depends a lot on the particular experiment and on the peak caller used.
It’s no surprise that peak callers don’t do that, but they just propose a default fixed threshold… so I don’t think this should be a priority.

Regarding the extreme peaks, excluding ENCODE blacklisted regions as a first MSPC step would be certainty useful, and not complex to do.

ckuenne · 2020-05-25T07:25:16Z

I agree that it's probably impossible to automatically find the perfect threshold for every experimental variation automatically. But you might be able to identify certain types of distributions that behave similarly. And an imperfect auto threshold option would already be a great help. Just something to start off from a not totally arbitrary point.

Considering the ENCODE blacklisted regions: these were already removed from my example peaks. I guess that most people do this as part of the normal peak calling. But it sure can't hurt to have the option here.

VJalili · 2020-05-25T18:20:09Z

@ckuenne I agree suggesting a value for these thresholds is important. I think it would be a very useful addition; would you be interested leading its method development and implementation? I can help you as much needed.

Some questions probably we can start with:

Can a value be suggested for these thresholds based on the info in the input BED files (e.g., p-value)? I guess we'd prefer doing our best to avoid asking users for any additional files/information.
I guess we want to suggest values for the weak (-w) and stringent (-s) thresholds; it may also worth suggesting a value for the optional combined stringency (-g) threshold .

ckuenne · 2020-05-26T07:11:41Z

Well, if I had a viable idea for this I would probably just have written a small wrapper and that's that. Sadly, I don't have any. And this is not really something i can devote myself to right now. Sorry.

VJalili · 2020-05-26T15:58:40Z

Thanks for the suggestion anyway :) I'd keep issue open for a while in case anyone comes up a method.

VJalili added documentation question labels May 14, 2020

VJalili added the enhancement label May 25, 2020

VJalili removed the enhancement label Aug 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic threshold determination #126

Automatic threshold determination #126

ckuenne commented May 14, 2020

VJalili commented May 14, 2020 •

edited

Loading

ckuenne commented May 15, 2020 •

edited

Loading

VJalili commented May 19, 2020

marziacremona commented May 20, 2020

ckuenne commented May 25, 2020

VJalili commented May 25, 2020

ckuenne commented May 26, 2020

VJalili commented May 26, 2020

Automatic threshold determination #126

Automatic threshold determination #126

Comments

ckuenne commented May 14, 2020

VJalili commented May 14, 2020 • edited Loading

ckuenne commented May 15, 2020 • edited Loading

VJalili commented May 19, 2020

marziacremona commented May 20, 2020

ckuenne commented May 25, 2020

VJalili commented May 25, 2020

ckuenne commented May 26, 2020

VJalili commented May 26, 2020

VJalili commented May 14, 2020 •

edited

Loading

ckuenne commented May 15, 2020 •

edited

Loading