Filtering accuracy question #297
To illustrate my point, what would be the recommended protocol to determine a threshold when something like …
Hi George,

Not the developer, but I wrote about this on a Dorado thread recently: nanoporetech/dorado#951. For RNA modifications, I currently use 99% as a filter cutoff for the most recent models (v5.1.0 SUP).

Hope this helps!

Sincerely,
Hello @Ge0rges, I have to apologize for the slow response.
I think there is a way to create settings that give you this desired outcome, but let me answer your more specific questions first.
Modkit doesn't do anything like this, since it would require tracking metadata about each basecalling/modification-calling model. It simply drops the lowest-confidence 10% of calls, as you understand.
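For intuition, here is a minimal Python sketch of what a bottom-10% filter amounts to (this is not modkit's actual implementation, and the confidence values are synthetic stand-ins): take the 10th percentile of the per-call confidences as the pass threshold and drop everything below it.

```python
import numpy as np

def percentile_filter_threshold(call_confidences: np.ndarray, filter_frac: float = 0.10) -> float:
    """Return the confidence value below which the lowest `filter_frac`
    fraction of calls falls; calls under this value would be discarded."""
    return float(np.quantile(call_confidences, filter_frac))

# Synthetic per-call confidences (stand-ins for the max ML-tag probability of each call).
rng = np.random.default_rng(0)
confidences = rng.beta(8, 2, size=100_000)

threshold = percentile_filter_threshold(confidences, 0.10)
kept = confidences[confidences >= threshold]
print(f"threshold = {threshold:.3f}, kept {kept.size} of {confidences.size} calls")
```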
It's true that different model/sample combinations can produce different ML-tag probability distributions. As you've noticed, some samples will give you that warning message that the threshold value is low, and some won't. The modified base probabilities are reasonably well calibrated (you could even go and check using the data we recently released). So if you want to be at least 95% confident that a given call is "real", you should use a filter threshold of 0.95. The reason we don't have a hard threshold like this as the default in modkit …

So to get back to:
I would run …

Hope this helps.
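To make the calibration point above concrete, here is a hedged sketch of that kind of check: bin calls by their predicted modification probability and compare each bin's mean prediction to the observed fraction of correct calls. The inputs are synthetic; in practice pred_probs would come from the ML tags and is_correct from a ground-truth control, neither of which is generated by this code.

```python
import numpy as np

def calibration_table(pred_probs: np.ndarray, is_correct: np.ndarray, n_bins: int = 10):
    """Bin calls by predicted modification probability and compare each bin's
    mean predicted probability to the observed fraction of correct calls."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (pred_probs >= lo) & (pred_probs < hi)
        if mask.any():
            rows.append((lo, hi, pred_probs[mask].mean(), is_correct[mask].mean(), int(mask.sum())))
    return rows

# Synthetic stand-ins: real pred_probs would come from ML tags, real is_correct
# from ground truth (e.g. fully modified / fully unmodified control samples).
rng = np.random.default_rng(1)
pred_probs = rng.uniform(0.5, 1.0, size=50_000)
is_correct = rng.random(50_000) < pred_probs  # toy data, calibrated by construction

for lo, hi, mean_pred, frac_correct, n in calibration_table(pred_probs, is_correct):
    print(f"[{lo:.2f}, {hi:.2f})  predicted={mean_pred:.3f}  observed={frac_correct:.3f}  n={n}")
```

If the probabilities are well calibrated, the predicted and observed columns should track each other, which is what justifies reading a 0.95 filter threshold as roughly 95% per-call confidence.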
No worries, hope you enjoyed the long weekend.

Agreed, that's very impressive; I understand why it's better to forego the complexity of tracking per-model metadata.

What scenario would produce systematically low-confidence calls? I definitely agree with the modkit default, but this makes me wonder if I'm too aggressive in pursuing a 95% threshold.
Why not set …

Also, I noticed that table is missing the …
@Ge0rges let me see if I can find it.
@ArtRand I was wondering if you have any thoughts on the following: to determine the minimum number of observations (coverage) …

One could incorporate the model accuracy by doing …

Does this make sense to you?
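The specifics of the proposal above are cut off, but as one possible illustration (my own assumption, not necessarily what was proposed) of how per-call accuracy could feed into a minimum-coverage choice: treat each filtered call as an independent Bernoulli trial with accuracy p and find the smallest coverage at which a majority-wrong outcome at a position is sufficiently unlikely.

```python
from scipy.stats import binom

def min_coverage(per_call_accuracy: float, max_error_rate: float = 0.05, max_n: int = 100) -> int:
    """Smallest coverage n such that, if each filtered call is an independent
    Bernoulli trial with success probability `per_call_accuracy`, the chance
    that half or more of the calls at a position are wrong is below `max_error_rate`."""
    q = 1.0 - per_call_accuracy  # per-call error probability
    for n in range(1, max_n + 1):
        k = (n + 1) // 2  # smallest error count that is "half or more" of n
        if binom.sf(k - 1, n, q) < max_error_rate:  # P(errors >= k)
            return n
    raise ValueError("no coverage up to max_n meets the target error rate")

# e.g. a 0.95 filter threshold, taken at face value as ~95% per-call accuracy:
print(min_coverage(0.95, max_error_rate=0.05))
```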
Hi Art,
I was wondering if you could guide my intuition on the filtering parameters for modkit. I currently use the default filter settings (i.e. the bottom 10% of calls get filtered out). I then apply a coverage filter of at least 5 on my pileup.
I was wondering, if I wanted to be 95% sure that each data point in my pileup is real, how would I achieve this?
The source of complexity I'm considering currently is A) whether I should take into account (if modkit doesn't already) the different stated accuracies of the basecalling models themselves for each modification type, and B) whether I should set a manual threshold.
Let me expand on B. My understanding is that modkit will take the ML tags determined by the model for each base and then set the threshold at the bottom 10th percentile, but two datasets might have different means for their ML tags and so come up with lower or higher thresholds. This is the part that inclines me to set a fixed threshold, and it's why I haven't just set -p 0.95; but I'm assuming there's complexity here I don't appreciate, which is also why I decided to write before setting --filter-threshold 0.95, for example.

Thank you, and apologies if you've covered this extensively already. The documentation is quite detailed but I had trouble wrapping my head around things.
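To illustrate the concern about dataset-dependent thresholds, the sketch below compares two synthetic samples with different ML-tag probability distributions (the distributions are made up for illustration): a bottom-10% percentile filter picks a different absolute cutoff for each sample, while a fixed 0.95 threshold applies the same cutoff but keeps a different fraction of calls from each.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical samples with different ML-tag probability distributions
# (synthetic stand-ins for different model/sample combinations).
samples = {
    "sample_a": rng.beta(9, 1, size=100_000),  # generally confident calls
    "sample_b": rng.beta(4, 2, size=100_000),  # noticeably less confident calls
}

for name, probs in samples.items():
    pct_threshold = float(np.quantile(probs, 0.10))   # bottom-10% (percentile) filter
    frac_kept_fixed = float(np.mean(probs >= 0.95))   # fixed 0.95 threshold
    print(f"{name}: 10th-percentile threshold = {pct_threshold:.3f}, "
          f"fraction kept at 0.95 = {frac_kept_fixed:.3f}")
```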