
Filtering accuracy question #297

Open
Ge0rges opened this issue Nov 8, 2024 · 9 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@Ge0rges

Ge0rges commented Nov 8, 2024

Hi Art,

I was wondering if you could guide my intuition on filtering parameters for modkit. I currently use the default filter settings (i.e. the bottom 10% of calls get filtered out). I then apply a coverage filter of at least 5 to my pileup.

I was wondering, if I wanted to be 95% sure that each data point in my pileup is real, how would I achieve this?

The sources of complexity I'm currently considering are A) whether I should take into account (if modkit doesn't already) the different stated accuracies of the basecalling models themselves for each modification type, and B) whether I should set a manual threshold.

Let me expand on B. My understanding is that modkit takes the ML tags determined by the model for each base and sets the threshold at the bottom 10th percentile, but two datasets might have different ML-tag distributions and so come up with lower or higher thresholds. This is the part that inclines me toward a fixed threshold, and why I haven't just set -p 0.95. But I'm assuming there's complexity here I don't appreciate, which is also why I decided to write before setting --filter-threshold 0.95, for example.

Thank you and apologies if you've covered this extensively already. The documentation is quite detailed but I had trouble wrapping my head around things.

@Ge0rges
Author

Ge0rges commented Nov 9, 2024

To illustrate my point, what would be the recommended protocol for determining a threshold when something like "Threshold of 0.5722656 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold." gets printed?

@Theo-Nelson

Hi George,

Not the developer, but I wrote about this on a Dorado thread recently: nanoporetech/dorado#951

For RNA modifications, I currently use 99% as a filter cutoff for the most recent models (v5.1.0 SUP).

Hope this helps!

Sincerely,
Theo

@ArtRand
Contributor

ArtRand commented Nov 12, 2024

Hello @Ge0rges,

I have to apologize for the slow response.

I was wondering, if I wanted to be 95% sure that each data point in my pileup is real, how would I achieve this?

I think there is a way to create settings that give you this desired outcome, but let me answer your more specific questions first.

A) whether I should take into account (if modkit doesn't already) the different stated accuracies of the basecalling models themselves for each modification type

Modkit doesn't do anything like this, since it would require metadata about each basecalling/modification-calling model. It simply drops the lowest-confidence 10% of calls, as you understand.

B) whether I should set a manual threshold

It's true that different model/sample combinations can produce different ML-tag probability distributions. As you've noticed, some samples will give you that warning message that the threshold value is low, and some won't. The modified base probabilities are reasonably well calibrated (you could even go and check using the data we recently released). So if you want to be at least 95% confident that a given call is "real", you should use a filter threshold of 0.95. The reason we don't have a hard threshold like this as the default in modkit is that if you have an unusual sample with systematically low-confidence calls, there is a risk that the majority of them would get filtered out by default. This isn't really the experience we want (anything, anywhere, etc.).
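
To make the percentile-versus-fixed-threshold distinction concrete, here is a small Python sketch using synthetic ML values (not modkit's actual code, just an illustration; the 0–255 ML encoding and its probability bins come from the SAM spec, and the 10% default is the one discussed above). Two simulated samples with different confidence distributions end up with different 10th-percentile cutoffs, while a fixed 0.95 threshold keeps a different fraction of calls in each:

```python
import numpy as np

# Synthetic ML-tag values (0-255) standing in for two samples' modification calls.
# Per the SAM spec, an ML value v encodes a probability in [v/256, (v+1)/256);
# the bin midpoint (v + 0.5) / 256 is used here as a point estimate.
rng = np.random.default_rng(0)
samples = {
    "high-confidence sample": rng.integers(200, 256, size=10_000),
    "lower-confidence sample": rng.integers(120, 256, size=10_000),
}

for name, ml in samples.items():
    probs = (ml + 0.5) / 256.0
    pct10 = np.quantile(probs, 0.10)      # percentile-style: drop the bottom 10%
    kept_fixed = (probs >= 0.95).mean()   # fixed 0.95 threshold instead
    print(f"{name}: 10th-percentile threshold = {pct10:.3f}, "
          f"fraction kept at a fixed 0.95 threshold = {kept_fixed:.2f}")
```

The point is only that the percentile cutoff moves with the sample's distribution, whereas a fixed threshold pins the per-call confidence but can discard most of a systematically low-confidence sample.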

So to get back to:

To illustrate my point, what would be the recommended protocol for determining a threshold when something like "Threshold of 0.5722656 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold." gets printed?

I would run sample-probs, look at the output tables, and set a per-mod, per-base threshold value for each sample. This is really the best way to get the most control, but it's also the most work and probably overkill for most applications.

Hope this helps.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Nov 12, 2024
@Ge0rges
Author

Ge0rges commented Nov 12, 2024

No worries, hope you enjoyed the long weekend.

The modified base probabilities are reasonably well calibrated (you could even go and check using the data we recently released).

Agreed, that's very impressive. I understand why it's better to forgo the complexities of tracking per-model metadata.

The reason we don't have a hard threshold like this as the default in modkit is that if you have an unusual sample with systematically low-confidence calls, there is a risk that the majority of them would get filtered out by default.

What scenario would produce systematically low-confidence calls? I definitely agree with the modkit default, but this makes me wonder if I'm being too aggressive in pursuing a 95% threshold.

I would run sample-probs, look at the output tables, and set a per-mod, per-base threshold value for each sample. This is really the best way to get the most control, but it's also the most work and probably overkill for most applications.

Why not set --filter-threshold to 0.95 directly? I understand that what you're recommending would keep the top X percentile, but that's not necessarily X confidence, correct? Especially considering that the basecall models are basically >95% accurate.

@Ge0rges
Author

Ge0rges commented Nov 13, 2024

Also, I noticed that table is missing the 4mC model; do you know what its accuracy is?

@ArtRand
Contributor

ArtRand commented Nov 13, 2024

@Ge0rges let me see if I can find it.

@ArtRand
Contributor

ArtRand commented Nov 13, 2024

@Ge0rges the latest 4mC accuracy information is presented in this video from NCM2023. I imagine the 4mC data will be released eventually, but don't hold me to that.

@Ge0rges
Author

Ge0rges commented Nov 13, 2024

85.4%, for future reference, if whoever is reading this doesn't want to watch the video. Thanks Art.

@Ge0rges
Author

Ge0rges commented Nov 24, 2024

@ArtRand I was wondering if you have any thoughts on the following:

To determine the minimum number of observations (coverage) $N$ needed to place the methylation fraction of a base (for one methylation type) above a certain confidence threshold, perhaps one could compute logit(confidence_threshold) / logit(filter_threshold)? For example, for a confidence threshold of 95%, where $X$ is the pileup filter threshold:

$N \ge \left\lceil \frac{\ln\left(\frac{0.95}{1-0.95}\right)}{\ln\left(\frac{X}{1-X}\right)} \right\rceil$

One could incorporate the model accuracy by replacing $X$ with $X \cdot A$, where $A$ is the model's accuracy.

Does this make sense to you?
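
In code, the heuristic I have in mind would look something like this (a minimal Python sketch of the bound above, assuming each filtered call is independently "correct" with probability at least $X$; the function name and values are just illustrative, not anything modkit provides):

```python
import math

def min_coverage(confidence_target: float, per_call_prob: float,
                 model_accuracy: float = 1.0) -> int:
    """Heuristic N >= ceil(ln(t/(1-t)) / ln(x/(1-x))) from the comment above,
    where t is the desired confidence and x is the per-call probability
    (optionally discounted by the model's stated accuracy)."""
    x = per_call_prob * model_accuracy
    if not 0.5 < x < 1.0:
        raise ValueError("per-call probability must be in (0.5, 1) for the bound to apply")

    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    return math.ceil(logit(confidence_target) / logit(x))

print(min_coverage(0.95, 0.95))         # 1: each call already meets the target
print(min_coverage(0.95, 0.80))         # 3: ceil(2.944 / 1.386)
print(min_coverage(0.95, 0.95, 0.854))  # 3: discounting by the 4mC accuracy above
```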
