Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modkit DMR result not as expected #316

Open
baibhav-bioinfo opened this issue Dec 13, 2024 · 3 comments
Open

modkit DMR result not as expected #316

baibhav-bioinfo opened this issue Dec 13, 2024 · 3 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@baibhav-bioinfo
Copy link

Hello,
Thankyou for making such a versatile software.
I am working with Nanopore DRS dataset of plant.

I ran dorado and now working with modkit for downstream analysis.

I was able to run the DMR for my 2 conditions with 3 replicates each.

(1) But as i can see, the table contains no such row (site) which have 0 modification in one condition and something in other. The table have rows (sites) which have at least 1 modification in one of the condition.

I mean there should be some sites which is only modified in one of the two condition and those should be interesting to study.

(2) Following is one row of the result
chromosome_1 13914 13915 . -0.3043135065740614 a:25 104 a:16 72 a:24.04 a:22.22 0.24038461 0.22222222 1 0.01816239316239318 1 0.03205128205128205 66 66 0.33730454254826525,0.22492942694127707 -0.10303030303030303,0.14160839160839161

whats the test used in the comparison analysis, and is the "MAP-based p-value" and "balanced MAP-based p-value" corresponds to pvalue and adjusted pvalue which we generally see and take cutoffs as 0.05 for both?

(3) Also, we can see there is not any strand information column in the DMR results table. But the input for calculating the DMR which are the bed files do have the strand information.

so, is there any way we can extract the strand information too for the DMR sites.

@ArtRand
Copy link
Contributor

ArtRand commented Dec 14, 2024

Hello @baibhav-bioinfo,

(1) But as i can see, the table contains no such row (site) which have 0 modification in one condition and something in other. The table have rows (sites) which have at least 1 modification in one of the condition.

This is a property of your data and the accuracy of the m6A model. All models are imperfect to some degree and will have a false positive rate (FPR). At the coverage you have here (104 and 72, combined) it's unlikely that you'll see zero predicted m6A sites. We've observed FPR values of 0.7% and 2% for the DRACH and m6A all-context models, respectively, in a 2-way m6A/A classification task. Given that, the row you've provided has ~20% m6A in both samples - considerably higher than the expected FPR. You may consider using a very high pass threshold for m6A detection (others have been using values as high as 0.99)

whats the test used in the comparison analysis, and is the "MAP-based p-value" and "balanced MAP-based p-value" corresponds to pvalue and adjusted pvalue which we generally see and take cutoffs as 0.05 for both?

The "MAP-based p-value" is used as well as the "balanced MAP-based p-value" since you have an equivalent number of replicas, the documentation explains how and when you should see these columns. 0.05 is an OK threshold, the default in the segmenter is 0.01.

(3) Also, we can see there is not any strand information column in the DMR results table. But the input for calculating the DMR which are the bed files do have the strand information.

Try the build I've attached to this thread. The strand will be after the "score" column.
modkit_u16_x86_64.tar.gz

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Dec 14, 2024
@baibhav-bioinfo
Copy link
Author

hello,
I was able to run the Dorado basecall + methylation call with my DRS reads. I ran the command both in m6A_DRACH context and m6A_all context, using the following commands

dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_m6A_DRACH@v1/ --device cuda:all > sample.bam

dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_m6A@v1/ --device cuda:all > sample.bam

for the DRACH motif context i got 60,000 sites per sample (10 million reads with ~1000nt per read)
but for all context run, i got >1.6 million m6A sites predicted.

Did something wrong happened in my ALL context run?
Is that number of sites possible?

Please suggest.

@baibhav-bioinfo
Copy link
Author

baibhav-bioinfo commented Dec 16, 2024

This is a property of your data and the accuracy of the m6A model. All models are imperfect to some degree and will have a false positive rate (FPR). At the coverage you have here (104 and 72, combined) it's unlikely that you'll see zero predicted m6A sites. We've observed FPR values of 0.7% and 2% for the DRACH and m6A all-context models, respectively, in a 2-way m6A/A classification task. Given that, the row you've provided has ~20% m6A in both samples - considerably higher than the expected FPR. You may consider using a very high pass threshold for m6A detection (others have been using values as high as 0.99)

Thankyou so much for the detailed response, but the main problem here i think is Actually due to the way the master table is prepared for the DMR analysis. The table includes the only sites which have been found to be modified in atleast one of the condition. So the sites which are in one and not in other are left behind.
is there any way those can be recovered?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants