Make estimates of SNP linkage #1

Open
petercombs opened this issue Jan 16, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@petercombs
Owner

petercombs commented Jan 16, 2019

We don't yet have a great sense of what the linkage is in the populations we're looking at. Flowers et al. 2010 implies it should be small (see Figure 8), on the order of 10-25 kb.

In this case, we cannot directly assay the SNP values. However, we can assay the SNP scores. One approach to take is similar to Figure 8B in Flowers:

  • Find all pairs of adjacent SNPs that are between bin_low and bin_high bases apart.
  • Measure the correlation of the scores (either p-values or log10 p-values) between those adjacent SNPs.
  • Plot for all bin sizes.
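
A minimal sketch of the three steps above, assuming the SNPs for one chromosome are available as a sorted array of positions and a matching array of p-values. The function name `adjacent_pair_correlation` and its arguments are hypothetical, not taken from the ldplot branch:

```python
import numpy as np

def adjacent_pair_correlation(positions, pvalues, bin_edges, use_log10=True):
    """Correlate scores of adjacent SNPs, grouped by the distance between them.

    positions : sorted base-pair coordinates on one chromosome
    pvalues   : per-SNP p-values, in the same order as positions
    bin_edges : increasing distances; bin i covers [bin_edges[i], bin_edges[i+1])
    """
    positions = np.asarray(positions)
    scores = np.asarray(pvalues, dtype=float)
    if use_log10:
        scores = np.log10(scores)

    dist = np.diff(positions)              # distance between each adjacent pair
    left, right = scores[:-1], scores[1:]  # scores of the two SNPs in each pair

    corrs = np.full(len(bin_edges) - 1, np.nan)
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        in_bin = (dist >= lo) & (dist < hi)
        if in_bin.sum() >= 3:              # skip bins with too few pairs
            corrs[i] = np.corrcoef(left[in_bin], right[in_bin])[0, 1]
    return corrs
```

Bins with fewer than three pairs are left as NaN, so they show up as gaps in a plot rather than as spurious correlations of ±1.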
@petercombs petercombs added the enhancement New feature or request label Jan 16, 2019
@petercombs
Owner Author

Made a first attempt at this in the ldplot branch. The curve is very jagged, and there's fairly high correlation in some bins. See this plot with 10 bp bins:

[Image: ldtest]

@petercombs
Owner Author

Presumably what's going on here (though I should check) is that there are a lot of SNPs with very low coverage, and thus very low p-values. Maybe taking the correlation of log10 p-values would help?

@petercombs
Owner Author

Okay, using log10 p-values does seem to help, as does taking wider bins:

[Image]

Now one question is whether I should use all pairs of SNPs that are between [N, N+k) base pairs apart, or only adjacent pairs. All pairs is a little bit harder to set up, but should give more data. Is that double counting in a bad way, though? I should ask around.
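
For reference, the all-pairs variant does not need a quadratic loop if the positions are sorted. This is a sketch under that assumption; `pairs_within` is a hypothetical helper, not something in the branch:

```python
import numpy as np

def pairs_within(positions, low, high):
    """Return index arrays (i, j) with i < j and
    low <= positions[j] - positions[i] < high.

    positions must be sorted; each unordered pair is returned once.
    """
    positions = np.asarray(positions)
    n = len(positions)
    # For SNP i, valid partners j have positions in [positions[i] + low, positions[i] + high).
    start = np.searchsorted(positions, positions + low, side="left")
    stop = np.searchsorted(positions, positions + high, side="left")
    start = np.maximum(start, np.arange(n) + 1)   # enforce j > i (matters when low == 0)
    counts = np.maximum(stop - start, 0)
    i = np.repeat(np.arange(n), counts)
    ranges = [np.arange(a, b) for a, b in zip(start, stop) if b > a]
    j = np.concatenate(ranges) if ranges else np.empty(0, dtype=int)
    return i, j
```

Note that a single SNP can appear in many of the returned pairs, so the pairs are not independent observations. That is the double-counting concern raised above, and this sketch does nothing to correct for it.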

@petercombs
Owner Author

  • Hunter agrees that all pairs is probably not necessary.

  • One way to get around the noisiness is to break the SNP pairs into equal-sized groups sorted by distance (e.g., the first 100 pairs, the second 100, etc.) rather than into fixed distance bins.

  • Can also do a Spearman correlation within each bin, rather than worrying about log10 p-value vs. p-value.
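
The last two suggestions combine naturally. Below is a rough sketch, assuming the per-pair distances and the two scores of each pair are already in arrays. The names are hypothetical, and Spearman's rho is computed as the Pearson correlation of ranks with a simple ranking that does not average ties (`scipy.stats.spearmanr` would handle ties properly):

```python
import numpy as np

def rank(values):
    """Ranks 0..n-1; ties are broken by input order rather than averaged."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def spearman_by_distance_group(dist, left, right, group_size=100):
    """Sort pairs by distance, cut them into groups of group_size pairs, and
    return (median distance, Spearman rho) for each full group.

    A trailing group with fewer than group_size pairs is dropped.
    """
    dist, left, right = map(np.asarray, (dist, left, right))
    order = np.argsort(dist, kind="stable")
    out = []
    for start in range(0, len(order) - group_size + 1, group_size):
        idx = order[start:start + group_size]
        rho = np.corrcoef(rank(left[idx]), rank(right[idx]))[0, 1]
        out.append((float(np.median(dist[idx])), float(rho)))
    return out
```

Because ranks are unchanged by any increasing transform, passing p-values or log10 p-values gives the same rho, which is the point of the Spearman suggestion.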

@petercombs
Owner Author

Okay, making progress here in the ldplot branch. In addition to plotting each subtype separately, I should make one that has all the subtypes together.

@petercombs
Owner Author

petercombs commented Feb 15, 2019

So the issue I'm seeing now is that there seems to be a persistent baseline level of correlation: it never really gets below about 0.25, even between 100 kb and 1 Mb.

[Image: all_ld]

I talked to Sur, Mark, and Thomas, and some ideas are:

  • Look at the correlation of the random p-values. I need to double check exactly how I'm doing that randomization to decide whether this will do what I think it will do, but not a bad first step.
  • Estimate the standard deviation of the correlation with a jackknife procedure (repeated leave-one-out). I'm not optimistic that this will work, since there are hundreds of SNPs at these larger distances.
  • Look at the correlation between SNPs on different chromosomes. These are not physically linked, so it should go to zero. Except that because these are haploid organisms, there could be some population structure that's keeping the correlations high, even across chromosomes.
  • Look at the correlation of p-values in real GWAS or already published pooling studies.
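
For the jackknife idea, a leave-one-out estimate for a single bin could look something like the sketch below. It assumes `x` and `y` hold the two scores for each SNP pair in that bin, and the function name is hypothetical:

```python
import numpy as np

def jackknife_correlation(x, y):
    """Pearson correlation of x and y plus its leave-one-out jackknife standard error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    keep = ~np.eye(n, dtype=bool)   # row k is a mask that drops observation k
    loo = np.array([np.corrcoef(x[m], y[m])[0, 1] for m in keep])
    # Standard jackknife variance: (n - 1) / n * sum of squared deviations
    se = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))
    return np.corrcoef(x, y)[0, 1], se
```

With hundreds of pairs per bin, each leave-one-out estimate barely moves, so the resulting standard error will likely be small. It also treats the pairs as independent, which may not hold if the baseline correlation comes from population structure.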
