
Filtering accuracy question #297

Open
Ge0rges opened this issue Nov 8, 2024 · 9 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@Ge0rges

Ge0rges commented Nov 8, 2024

Hi Art,

I was wondering if you could guide my intuition on filtering parameters for modkit. I currently use the default filter settings (i.e. the bottom 10% of calls get filtered out). I then apply a coverage filter of at least 5 to my pileup.

I was wondering, if I wanted to be 95% sure that each data point in my pileup is real, how would I achieve this?

The sources of complexity I'm currently considering are A) whether I should take into account (if modkit doesn't already) the different stated accuracies of the basecalling models themselves for each modification type, and B) whether I should set a manual threshold.

Let me expand on B. My understanding is that modkit takes the ML tags determined by the model for each base and sets the threshold at the bottom 10th percentile, but two datasets might have different ML-tag distributions and so come up with lower or higher thresholds. This is the part that inclines me toward a fixed threshold, and why I haven't just set -p 0.95. But I'm assuming there's complexity here I don't appreciate, which is also why I decided to write before setting --filter-threshold 0.95, for example.

Thank you and apologies if you've covered this extensively already. The documentation is quite detailed but I had trouble wrapping my head around things.

@Ge0rges
Author

Ge0rges commented Nov 9, 2024

To illustrate my point, what would be the recommended protocol for determining a threshold when something like "Threshold of 0.5722656 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold." gets printed?

@Theo-Nelson

Hi George,

Not the developer, but I wrote about this on a Dorado thread recently: nanoporetech/dorado#951

For RNA modifications, I currently use 99% as a filter cutoff for the most recent models (v5.1.0 SUP).

Hope this helps!

Sincerely,
Theo

@ArtRand
Contributor

ArtRand commented Nov 12, 2024

Hello @Ge0rges,

I have to apologize for the slow response.

I was wondering, if I wanted to be 95% sure that each data point in my pileup is real, how would I achieve this?

I think there is a way to create settings that give you this desired outcome, but let me answer your more specific questions first.

A) whether I should take into account (if modkit doesn't already) the different stated accuracies of the basecalling models themselves for each modification type

Modkit doesn't do anything like this, since it would require metadata about each basecalling/modification-calling model. It simply drops the lowest-confidence 10% of calls, as you understand.

B) whether I should set a manual threshold

It's true that different model/sample combinations can produce different ML-tag probability distributions. As you've noticed, some samples will give you that warning message that the threshold value is low, and some won't. The modified base probabilities are reasonably well calibrated (you could even go and check using the data we recently released). So if you want to be at least 95% confident that a given call is "real", you should use a filter threshold of 0.95. The reason we don't have a hard threshold like this as the default in modkit is that if you have an unusual sample with systematically low-confidence calls, there is a risk that the majority of them would get filtered out by default. This isn't really the experience we want (anything, anywhere, etc.).
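
To make the percentile-versus-fixed-threshold distinction concrete, here is a small Python sketch using synthetic ML values (not modkit's actual code, just an illustration; the 0–255 ML encoding and its probability bins come from the SAM spec, and the 10% default is the one discussed above). Two simulated samples with different confidence distributions end up with different 10th-percentile cutoffs, while a fixed 0.95 threshold keeps a different fraction of calls in each:

```python
import numpy as np

# Synthetic ML-tag values (0-255) standing in for two samples' modification calls.
# Per the SAM spec, an ML value v encodes a probability in [v/256, (v+1)/256);
# the bin midpoint (v + 0.5) / 256 is used here as a point estimate.
rng = np.random.default_rng(0)
samples = {
    "high-confidence sample": rng.integers(200, 256, size=10_000),
    "lower-confidence sample": rng.integers(120, 256, size=10_000),
}

for name, ml in samples.items():
    probs = (ml + 0.5) / 256.0
    pct10 = np.quantile(probs, 0.10)      # percentile-style: drop the bottom 10%
    kept_fixed = (probs >= 0.95).mean()   # fixed 0.95 threshold instead
    print(f"{name}: 10th-percentile threshold = {pct10:.3f}, "
          f"fraction kept at a fixed 0.95 threshold = {kept_fixed:.2f}")
```

The point is only that the percentile cutoff moves with the sample's distribution, whereas a fixed threshold pins the per-call confidence but can discard most of a systematically low-confidence sample.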

So to get back to:

To illustrate my point, what would be the recommended protocol for determining a threshold when something like "Threshold of 0.5722656 for base C is very low. Consider increasing the filter-percentile or specifying a higher threshold." gets printed?

I would run sample-probs, look at the output tables, and set a per-mod, per-base threshold value for each sample. This is really the best way to get the most control, but it's also the most work and probably overkill for most applications.

Hope this helps.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Nov 12, 2024
@Ge0rges
Author

Ge0rges commented Nov 12, 2024

No worries, hope you enjoyed the long weekend.

The modified base probabilities are reasonably well calibrated (you could even go and check using the data we recently released).

Agreed, that's very impressive. I understand why it's better to forgo the complexities of tracking per-model metadata.

The reason we don't have a hard threshold like this as the default in modkit is that if you have an unusual sample with systematically low-confidence calls, there is a risk that the majority of them would get filtered out by default.

What scenario would produce systematically low-confidence calls? I definitely agree with the modkit default, but this makes me wonder if I'm being too aggressive in pursuing a 95% threshold.

I would run sample-probs, look at the output tables, and set a per-mod, per-base threshold value for each sample. This is really the best way to get the most control, but it's also the most work and probably overkill for most applications.

Why not set --filter-threshold to 0.95 directly? I understand that what you're recommending would keep the top X percentile, but that's not necessarily X confidence, correct? Especially considering that the basecall models are basically >95% accurate.

@Ge0rges
Author

Ge0rges commented Nov 13, 2024

Also, I noticed that table is missing the 4mC model; do you know what its accuracy is?

@ArtRand
Contributor

ArtRand commented Nov 13, 2024

@Ge0rges let me see if I can find it.

@ArtRand
Contributor

ArtRand commented Nov 13, 2024

@Ge0rges the latest 4mC accuracy information is presented in this video from NCM2023. I imagine the 4mC data will be released eventually, but don't hold me to that.

@Ge0rges
Author

Ge0rges commented Nov 13, 2024

85.4%, for future reference, if whoever is reading this doesn't want to watch the video. Thanks Art.

@Ge0rges
Author

Ge0rges commented Nov 24, 2024

@ArtRand I was wondering if you have any thoughts on the following:

To determine the minimum number of observations (coverage) $N$ needed to place the methylation fraction of a base (for one methylation type) above a certain confidence threshold, perhaps one could compute logit(confidence_threshold) / logit(filter_threshold)? For example, for a confidence threshold of 95%, where $X$ is the pileup filter threshold:

$N \ge \left\lceil \frac{\ln\left(\frac{0.95}{1-0.95}\right)}{\ln\left(\frac{X}{1-X}\right)} \right\rceil$

One could incorporate the model accuracy by replacing $X$ with $X \cdot A$, where $A$ is the model's accuracy.

Does this make sense to you?
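
In code, the heuristic I have in mind would look something like this (a minimal Python sketch of the bound above, assuming each filtered call is independently "correct" with probability at least $X$; the function name and values are just illustrative, not anything modkit provides):

```python
import math

def min_coverage(confidence_target: float, per_call_prob: float,
                 model_accuracy: float = 1.0) -> int:
    """Heuristic N >= ceil(ln(t/(1-t)) / ln(x/(1-x))) from the comment above,
    where t is the desired confidence and x is the per-call probability
    (optionally discounted by the model's stated accuracy)."""
    x = per_call_prob * model_accuracy
    if not 0.5 < x < 1.0:
        raise ValueError("per-call probability must be in (0.5, 1) for the bound to apply")

    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    return math.ceil(logit(confidence_target) / logit(x))

print(min_coverage(0.95, 0.95))         # 1: each call already meets the target
print(min_coverage(0.95, 0.80))         # 3: ceil(2.944 / 1.386)
print(min_coverage(0.95, 0.95, 0.854))  # 3: discounting by the 4mC accuracy above
```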
