Add test_imputation.py file #2802

Open

jeromekelleher opened this issue Jul 17, 2023 · 27 comments · May be fixed by #2815
@jeromekelleher
Member

We want to test some specific examples against the new backward matrix code in #2799. To be certain that we're interpreting the model correctly, we can output the forward and backward matrices from Beagle, and compare them with the results we get from running the local LS hmm code.

We don't want to rely on msprime simulations producing the same results over time, so what we can do is to run a small simulation, and output to text (using ts.dump_text()), storing the results in the file as examples.

Then, we run BEAGLE on these same simulations, with a few different parameter combinations and capture the forward and backward matrices, as well as the final imputation output. We can save these outputs into the file as python arrays (or text, if that's simpler). Then we compare with our own output.

We can start with a simple example, but we should try and get a few awkward corner cases in there as well.

Note that we don't currently have a sensible high-level API for the LS model in tskit, so we should wrap the low-level API in functions locally in this file, for now.

@szhan
Member

szhan commented Jul 18, 2023

The following code produces a toy ts with 101 individuals, which are then split into a ref. panel of 100 individuals and a query set of one individual; it then dumps the nodes, edges, sites, mutations, and individuals tables to separate text files. When I ran it, I got a ref. panel with 8 variable sites and a query with 4 variable sites. I think it should be good enough as a test case? I'll keep the code here for record-keeping purposes.

import msprime
import numpy as np

seed = 1234
ts = msprime.sim_mutations(
    msprime.sim_ancestry(
        11,
        sequence_length=1e6,
        recombination_rate=1e-7,
        random_seed=seed,
    ),
    rate=1e-6,
    random_seed=seed,
)

# Split the samples into a reference panel and a single query individual.
ref_ts = ts.simplify(samples=np.arange(ts.num_samples - 2))
query_ts = ts.simplify(samples=np.arange(ts.num_samples - 2, ts.num_samples))

# Dump each table of the reference panel to its own text file.
for table in ["nodes", "edges", "sites", "mutations", "individuals"]:
    with open(f"ref_ts.{table}.txt", "w") as f:
        ref_ts.dump_text(**{table: f})

@szhan
Member

szhan commented Jul 18, 2023

Should we put those files in a new directory named, say, python/tests/data/beagle/?

@szhan
Member

szhan commented Jul 18, 2023

For input to BEAGLE, we have two gzipped VCF files, which we don't need to keep here, I think.
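For the record, they can be regenerated from the tree sequences in the snippet above with something like this (a minimal sketch, assuming the ref_ts and query_ts objects defined earlier):

import gzip

# Write gzipped VCFs for BEAGLE from the ref. panel and query tree sequences.
with gzip.open("ref.vcf.gz", "wt") as f:
    ref_ts.write_vcf(f)
with gzip.open("query.vcf.gz", "wt") as f:
    query_ts.write_vcf(f)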

@szhan
Member

szhan commented Jul 18, 2023

On second thought, we might want to keep those VCF files, just to have a record of the input files to BEAGLE used to generate the forward and backward matrices.

@jeromekelleher
Member Author

The input tree sequences should be sufficiently reproducible; we don't need to store the VCFs.

@jeromekelleher
Member Author

I'd like the examples we use for unit tests to be even smaller - 3 or 4 samples and < 10 sites.

@jeromekelleher
Member Author

Might be simpler to place the mutations "by hand" here on a small tree to start off with, like `Tree.generate_balanced`.
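Something along these lines, perhaps (a minimal sketch; the site positions and mutation nodes are arbitrary illustrations):

import tskit

# Build a small balanced tree and place mutations "by hand".
tree = tskit.Tree.generate_balanced(4, span=10)
tables = tree.tree_sequence.dump_tables()
for position, node in [(2, 0), (5, 5), (8, 4)]:
    site = tables.sites.add_row(position=position, ancestral_state="0")
    tables.mutations.add_row(site=site, node=node, derived_state="1")
ts = tables.tree_sequence()
print(ts.genotype_matrix())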

@szhan
Member

szhan commented Jul 18, 2023

For the fwd and bwd matrices, BEAGLE has scaled and shifted values, e.g., bwdVal[h] = scale * bwdVal[h] + shift, where h is the index of a haplotype in the ref. panel. I guess we should keep both the pre-adjusted and adjusted values for comparison. The lshmm and tskit versions also do this or something similar, right?
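For reference, the adjustment is just an elementwise affine transform per site (a trivial sketch; the scale and shift values here are made up):

import numpy as np

# BEAGLE-style per-site adjustment of the backward values
# (scale and shift are illustrative, not values from BEAGLE).
bwd_val = np.array([0.2, 0.5, 0.3, 0.1])
scale, shift = 0.9, 0.01
bwd_val = scale * bwd_val + shift  # bwdVal[h] = scale * bwdVal[h] + shift
print(bwd_val)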

@szhan
Member

szhan commented Jul 18, 2023

> I'd like the examples we use for unit tests to be even smaller - 3 or 4 samples and < 10 sites.

How about this?

  • 5 haploid individuals (4 for the ref. panel and 1 as the query)
  • 5 variable sites
  • 4 trees

@szhan
Member

szhan commented Jul 18, 2023

When we get the fwd and bwd matrices, we want one run with all the sites and then another run with chip-like sites, right? If so, we probably want something like 9 variable sites, setting 4 of them to missing for the latter, I think.

@szhan
Member

szhan commented Jul 18, 2023

Ah, actually, scrap the above comment. BEAGLE just computes the values for the sites in the target VCF file. The values for the in-between sites are interpolated thereafter.

@szhan
Member

szhan commented Aug 7, 2023

I've run BEAGLE 4.1 using a ref. panel of 2 diploid individuals (BEAGLE doesn't seem to take haploid individuals) and a query set consisting of 1 diploid individual. This toy dataset has 5 sites in the ref. panel and 4 sites in the query.

m = number of sites in the ref. panel = 5
n = number of haplotypes in the ref. panel = 4

BEAGLE outputs:

  • Forward prob. matrix of size (m, n) = (5, 4)
  • Prob. of switch at each site in the query, so of length m - 1 = 4
  • Prob. of mismatch at each site in the query
  • "Scaling factor" at each site in the query
  • "Shift factor" at each site in the query
  • Sum of forward prob. at each site in the query
  • Total of log10(sum of forward prob.) over all sites in the query (see the sketch below)
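The last quantity is easy to recompute from the forward matrix, e.g. (a minimal sketch; fwd here is just a stand-in for the (m, n) forward probability matrix):

import numpy as np

# Sum of forward prob. at each site, and the total of log10 of those
# sums over all sites (fwd is a placeholder matrix for illustration).
fwd = np.full((5, 4), 0.25)
site_sums = fwd.sum(axis=1)
total_log10 = np.log10(site_sums).sum()
print(site_sums, total_log10)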

@szhan szhan linked a pull request Aug 7, 2023 that will close this issue
@szhan
Member

szhan commented Aug 7, 2023

I'm storing everything in text and then converting it to a Pandas dataframe or tree sequence.

@szhan
Member

szhan commented Aug 7, 2023

Should we add an example with a simple genetic map? By default, BEAGLE 4.1 assumes that 1 Mb = 1 cM.

@szhan
Member

szhan commented Aug 7, 2023

We should probably also add the state probability matrix that combines the forward and backward probabilities.

@szhan
Member

szhan commented Aug 14, 2023

I've rerun BEAGLE using different query haplotypes by setting Ne to 10. BEAGLE sets Ne to 1 million by default. I don't fully understand why they set it so big, but I suspect it has to do with defining the switch probabilities with respect to the ratio between Ne and the ref. panel size. When Ne is set to 1 million and the ref. panel is as small as in the toy example, the switch probability is always 1; intuitively, this says that the ref. panel is so small that it is a very poor representation of the genetic diversity in a population with such a large Ne, so the model should always favour switching over mismatch.

Below are the matrices.

# Forward probability matrix
# tskit
[[0.05555556 0.38888889 0.05555556 0.05555556 0.38888889 0.05555556]
 [0.01234568 0.47530864 0.01234568 0.01234568 0.47530864 0.01234568]
 [0.02777778 0.44444444 0.02777778 0.02777778 0.44444444 0.02777778]
 [0.04166667 0.41666667 0.04166667 0.04166667 0.41666667 0.04166667]
 [0.05416667 0.39166667 0.05416667 0.05416667 0.39166667 0.05416667]]
# beagle
[[1.000000e-04 9.999000e-01 1.000000e-04 1.000000e-04 9.999000e-01
  1.000000e-04]
 [1.000000e-06 2.816427e+00 1.000000e-06 1.000000e-06 2.816427e+00
  1.000000e-06]
 [4.724000e-02 4.100000e-05 4.724000e-02 4.724000e-02 4.100000e-05
  4.724000e-02]
 [2.445010e-01 1.000000e-06 2.445010e-01 2.445010e-01 1.000000e-06
  2.445010e-01]]
# Backward probability matrix
# tskit
[[0.22222222 1.22222222 0.22222222 0.22222222 1.22222222 0.22222222]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]]
# beagle
[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.244614 0.010772 0.244614 0.244614 0.010772 0.244614]
 [0.226377 0.047246 0.226377 0.226377 0.047246 0.226377]
 [0.010973 0.478054 0.010973 0.010973 0.478054 0.010973]]

The BEAGLE values in the backward matrix now vary and make more sense. Do the _tskit.lshmm values make sense to you, @jeromekelleher?

I think we should next tweak the parametrization and/or mimic BEAGLE's scaling and shifting?
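To illustrate the Ne intuition above: with a Li-and-Stephens-style switch probability, the switch rate saturates at 1 when Ne is large relative to the panel (a toy sketch; the 0.04 * Ne * cM / n_haps form is an assumption here, not necessarily BEAGLE 4.1's exact formula):

import math

# Toy switch probability that grows with Ne relative to the ref. panel
# size (the exact scaling is an assumption, not taken from BEAGLE).
def switch_prob(dist_cM, ne, n_ref_haps):
    rho = 0.04 * ne * dist_cM / n_ref_haps
    return -math.expm1(-rho)  # 1 - exp(-rho)

print(switch_prob(1.0, 1e6, 4))  # ~1.0: huge Ne, tiny ref. panel
print(switch_prob(1.0, 10, 4))   # ~0.095: switching no longer certain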

@szhan
Member

szhan commented Aug 14, 2023

In case I forget, just jotting down some notes about BEAGLE 4.1's approach to calculating allele probabilities, which are used to get the MAP alleles at each imputed site of the query haplotypes (a rough code sketch follows the list).

h = number of haplotypes in ref. panel
m = number of genotyped markers

  1. Fill the forward probability matrix (m x h).
  2. Fill the backward probability matrix (m x h). Only the values for one site are kept at a time.
  3. Fill the state probability matrix (m x h). This combines the forward and backward values at the genotyped markers.
  4. Fill the allele probability array (of length equal to the sum of the number of marker states over the m markers).
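Here is a rough haploid sketch of steps 1-3 (a generic Li-and-Stephens forward-backward with uniform switch/mismatch parameters; this is not BEAGLE's exact scaling or parametrization):

import numpy as np

def forward_backward(ref, query, p_switch, p_mismatch):
    # ref is an (m, h) allele matrix; query is a length-m allele vector.
    m, h = ref.shape
    fwd = np.zeros((m, h))
    fwd[0] = np.where(ref[0] == query[0], 1 - p_mismatch, p_mismatch) / h
    fwd[0] /= fwd[0].sum()
    for j in range(1, m):
        emit = np.where(ref[j] == query[j], 1 - p_mismatch, p_mismatch)
        # Transition: stay with prob. 1 - p_switch, else switch uniformly.
        fwd[j] = emit * ((1 - p_switch) * fwd[j - 1] + p_switch / h)
        fwd[j] /= fwd[j].sum()  # renormalize per site to avoid underflow
    bwd = np.ones((m, h))
    for j in range(m - 2, -1, -1):
        emit = np.where(ref[j + 1] == query[j + 1], 1 - p_mismatch, p_mismatch)
        tmp = emit * bwd[j + 1]
        bwd[j] = (1 - p_switch) * tmp + p_switch * tmp.mean()
        bwd[j] /= bwd[j].sum()
    state = fwd * bwd  # step 3: combine at the genotyped markers
    state /= state.sum(axis=1, keepdims=True)
    return fwd, bwd, state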

@szhan
Member

szhan commented Aug 14, 2023

I think I understand the difference between the state probability matrix and the allele probability matrix. The former is what you get from the forward-backward algorithm, and it contains the allele probabilities at the genotyped markers (excluding the ungenotyped markers that we want to impute). The latter contains allele probabilities interpolated from the allele probabilities of the flanking genotyped markers.
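If so, the interpolation at an ungenotyped marker would look roughly like this (a sketch; I'm assuming a simple linear weighting by position, which may differ from BEAGLE 4.1's exact weighting):

import numpy as np

# Interpolate allele probabilities at position x from the flanking
# genotyped markers at x_left and x_right (linear weighting assumed).
def interpolate_allele_probs(p_left, p_right, x_left, x_right, x):
    w = (x_right - x) / (x_right - x_left)
    return w * np.asarray(p_left) + (1 - w) * np.asarray(p_right)

print(interpolate_allele_probs([0.9, 0.1], [0.5, 0.5], 0.0, 2.0, 0.5))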

@szhan
Member

szhan commented Aug 14, 2023

So, we should mimic what BEAGLE is doing, right? And interpolate the allele probabilities of the ungenotyped markers? So, we are really implementing BEAGLE 4.1's algorithm in tskit.

@szhan
Member

szhan commented Aug 14, 2023

Then, getting the MAP alleles at each ungenotyped marker site for each query haplotype is straightforward.

@jeromekelleher
Member Author

Great, thanks @szhan.

Implementing BEAGLE 4.1's algorithm would be ideal, I guess - it's very good.

@szhan
Member

szhan commented Aug 18, 2023

I've added a separate file (python/tests/test_beagle.py) to contain the code to replicate the BEAGLE 4.1 algorithm. It is temporary. I'll reorganise it later. I still don't have all the little details right, but I think I'm getting close.

@szhan
Member

szhan commented Aug 18, 2023

Some notes about some concepts used in the BEAGLE 4.1 algorithm.

  • Marker refers to a genotyped site.
  • Cluster refers to a set of adjacent markers (defined by a specified threshold).
  • Segment refers to a set of clusters of markers.

There are nested indices to keep track of at each level of organisation, as far as I can tell.

@szhan
Member

szhan commented Aug 26, 2023

Most of the implementation is there. I'm corresponding with Brian Browning to understand how reference haplotype segments are indexed.

@szhan
Member

szhan commented Aug 30, 2023

In BEAGLE 4.1, the reference haplotype segments used to index the HMM state probability matrix are defined in the function refHapSegs() in ImputationData.java.

In refHapSegs(), the reference haplotype segments are defined by a left genotyped marker (inclusive) and a right genotyped marker (exclusive) indexed with respect to the reference markers (the genotyped markers and the imputed markers). The number of such segments (n) is the number of genotyped markers plus one.

Suppose that the first and last genotyped markers are not the first and last markers in the reference panel, respectively. Also suppose that BEAGLE input parameters are set such that each segment is defined by two individual markers rather than aggregated markers.

The following steps define the left (L) and right (R) indices of the segments (a code sketch follows the list):

  • i = 0: L = 0; R = index of the 0th genotyped marker in the ref. markers, plus 1.
  • i = 1: L = index of the 0th genotyped marker in the ref. markers; R = index of the 1st genotyped marker, plus 1.
  • loop from i = 2 to n-2: L = index of the (i-1)th genotyped marker in the ref. markers; R = index of the ith genotyped marker, plus 1.
  • i = n-1: L = index of the (n-2)th genotyped marker in the ref. markers; R = number of ref. markers.
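In code, I read that as something like this (a sketch of my reading, not the actual refHapSegs() implementation; genotyped maps each genotyped-marker index to its index among the ref. markers):

def ref_hap_segs(genotyped, num_ref_markers):
    # Segment i = 0.
    segs = [(0, genotyped[0] + 1)]
    # Segments i = 1 .. n-2.
    for i in range(1, len(genotyped)):
        segs.append((genotyped[i - 1], genotyped[i] + 1))
    # Segment i = n-1.
    segs.append((genotyped[-1], num_ref_markers))
    return segs

# E.g. genotyped markers at ref. marker indices 2, 5, 7 out of 10 ref. markers:
print(ref_hap_segs([2, 5, 7], 10))  # [(0, 3), (2, 6), (5, 8), (7, 10)]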

@szhan
Member

szhan commented Aug 30, 2023

It's simpler if we don't follow their way of normalizing the probability values in the matrices. We really just want to use Equation 1 in BB2016, but in their implementation they follow Equation 2.

@szhan
Member

szhan commented Sep 4, 2023

It doesn't seem that BEAGLE uses the interpolated allele probabilities directly to get the imputed alleles. It computes the genotype probability (simply the product of the probability of allele 0 and the probability of allele 1) and then takes the genotype with the highest probability, I think. Also, I'm not getting the scaling of the probabilities right...
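For a diploid query at an imputed site, that would be something like this (a sketch of my understanding, with made-up numbers):

import numpy as np

# Genotype probabilities as products of per-haplotype allele probabilities,
# then take the most probable genotype (illustrative values only).
p_hap1 = np.array([0.9, 0.1])  # P(allele 0), P(allele 1) on haplotype 1
p_hap2 = np.array([0.3, 0.7])  # the same on haplotype 2
geno_probs = np.outer(p_hap1, p_hap2)
a1, a2 = np.unravel_index(geno_probs.argmax(), geno_probs.shape)
print(a1, a2)  # most probable genotype: 0|1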
