Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare methylation at a position between different genotype groups #318

Open
sumudu-rangika opened this issue Dec 15, 2024 · 4 comments
Open

Comments

@sumudu-rangika
Copy link

sumudu-rangika commented Dec 15, 2024

Hi Arthur,

I want to compare 5mC methylation at a position between 3 genotypes groups within a cohort.

For ex. a C>T SNP which disrupt a reference CpG site, I want to see methylation differences at this site between CC (homo reference) and CT (heterozygotes) within the cohort. In the case of TTs methylation at both haplotypes are zero.

I see some CC individuals have ~80% methylation at this site. Also some CTs have ~80% at this site while only one haplotype being methylated and other has no methylation (I have attached a part of cohort for one such CpG site). Because, in case of CT it considers only filter passed modified and canonical base. Ndiff reads are high in this case with the C>T mutation but not counted in arriving at this methylation %. It's like methylation is over estimated here compared to CC where in CTs only half of the reads are methylated and other half having the SNP.

I am confused about how to approach this. Should I recalculate the methylation % at each position as
(Nmod / Nmod + Ncanonical + Ndiff + Nnocall), for all the individuals, instead of (Nmod / Nmod + Ncanonical) so that all reads are captured in calculating the methylation % for comparison in my case?

I came across a comment in modbam2bed by cjw85 which explains a similar case. Can I use this approach in modkit as well?

[https://github.com/nanoporetech/megalodon/issues/247]

That the interpretation of column 11 is contigent on column 5 is the case in other circumstances also, its just that such facts are often ignored. For example consider the case that 90% of reads contain a substitution to a C>T, is it really correct to conclude from a 100% value in column 11 that there is 100% methylation in the sample? No it is not, however much the casual user wants column 11 to mean that. This is why the definitions in columns 5 and 10 were chosen the way they were; the "specification" referenced in the modbam2bed documentation makes a false dichotomy between modified and unmodified. The modbam2bed avoids making this erroneous assumption, which makes the output messier but rather more truthful. It is also why the extended mode reporting the verbatim counts was added so users can derived whatever summary statistics they wish from the counts (one thing that the extended output misses is that only the sum Nmod + Ndel can be derived not the individual components).

Appreciate your thoughts on this .
Thank you in advance.

Best,
Sumudu

@marcus1487
Copy link
Contributor

The core question here is what biological question are you trying answer?

Can you compute percent methylation as (Nmod / Nmod + Ncanonical + Ndiff + Nnocall)? Absolutely! The fields are there and the result is a valid value, but I think the more important part is to ask what question you aim to answer with this value.

The standard definition for percent methylation as Nmod / (Nmod + Ncanonical) is designed to produce the most accurate estimate of methylation across reads potentially containing such methylation. If you could explain the biological question or phenomenon that you are trying to query I think that might help to figure out what metric would be best to investigate.

@sumudu-rangika
Copy link
Author

Thank you for your answer.

What I'm trying to answer is that, I have a set of individuals grouped into a drug response phenotype based on their genetic background. E.g. Normal Metabolisers. However, within the group there is still drug response variability, perhaps not captured by genetic mutations. I'm trying to understand if they have notable methylation changes within the promoter and/or CpG island in the gene primarily responsible for that drug metabolism.

I have noticed certain people have mutations that are not functionally relevant in determining the drug response phenotype but they disrupt CpG sites (for instance C>T) or create new ones. Some have these mutations in one allele, some have in both the alleles and some do not have such mutations. I'm trying to understand if these cause notable methylation changes that might contribute to changes we see within this group. In this I'm specifically looking at 5mC methylation.

If I consider Nmod / (Nmod + Ncanonical) as a measure of methylation, when a person has a mutation that disrupts a CpG site in one allele and where a another has no mutation and both alleles are methylated the same level of methylation is observed. Except the score value which is the Nvalid_cov is less in the case of a mutation.

What I'm trying so do is compare the level of methylation at each CpG site and then across the entire promoter/ CpG island among these individuals to understand whether these mutations cause any notable changes in methylation.

I selected two individuals of interest and did DMR pair for region. There was notable difference with a score 135 across the promoter. But comparing the CpG sites, difference between CpG sites of interest have a MAP-based p-value > 0.01 due to what I have explained earlier.

Your inputs are greatly appreciated. Thank you.

@ArtRand
Copy link
Contributor

ArtRand commented Dec 19, 2024

Hello @sumudu-rangika,

That's an interesting problem, thanks for explaining it.

Out of curiosity, what is the effect size you're seeing in the promoter regions? Where does the score of the promoters fall in the distribution of scores? Basically, I'm wondering if you can make a claim about the change in the methylation in the region.

If I'm understanding you correctly, the single-site MAP-based p-values are low due to one sample (the one carrying the mutation) having half of the observations of the reference-allele sample. If you assume for a second that the methylation rate of CpGs in close proximity will likely have a similar distribution. Suppose this formalism:

$$ \begin{align} p_i \sim \text{Beta}(\alpha_i, \beta_i) \\ p_j \sim \text{Beta}(\alpha_j, \beta_j) \end{align} $$

Where $p_i$ and $p_j$ are the probability of a read being methylated at a position aligning to genomic position $i$ or $j$, respectively. If $i$ and $j$ are close in linear sequence space, I'm proposing the beta distributions that generate their $p$s are likely to be similar. Now in a real biological setting, this could be a poor assumption but I'm not the first to make it see here. What you could do then is when you have a heterozygous mutation that disrupts the CpG, use a neighboring CpG's counts instead. There's no calculator in Modkit that will do this for you, unfortunately. If necessary, I can give you some code that exposes the calculation, there is a reference in the documentation.

Another approach is to use the segmenting HMM over the region. The transition probability between sites being "Same" vs "Different" is exactly intended to model the proximity relationship (granted only as a first order Markov process). Maybe try --segment first and let me know what you observe?

@sumudu-rangika
Copy link
Author

sumudu-rangika commented Dec 19, 2024

Thanks a lot for the explanation.

I've attached here my results from 2 samples of interest as an example. sample_a has 3 heterozygous SNPs in the promoter region that disrupts 3 CpG sites (3 SNPs in the same allele). sample_b has no such mutations. The CpG sites I'm looking at are
42131469 (mutation 42131469 C>T)
42132374 (mutation 42132375 G>C)
42132561 (mutation 42132561 C>T)

I ran with --segment and the results
sample_a-sample_b_segments.xlsx
show that regions with 2 out of 3 SNPs as different (highlighted in the file).

I also attached CpG sites without --segment here
sample_a-sample_b_dmr_prom_sites.xlsx

And with --region here
sample_a-sample_b_dmr_prom.xlsx

Thank you !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants