Add test_imputation.py file #2802

Open

jeromekelleher opened this issue Jul 17, 2023 · 27 comments · May be fixed by #2815
@jeromekelleher
Member

We want to test some specific examples against the new backward matrix code in #2799. To be certain that we're interpreting the model correctly, we can output the forward and backward matrices from Beagle, and compare them with the results we get from running the local LS hmm code.

We don't want to rely on msprime simulations producing the same results over time, so what we can do is to run a small simulation, and output to text (using ts.dump_text()), storing the results in the file as examples.

Then, we run BEAGLE on these same simulations, with a few different parameter combinations and capture the forward and backward matrices, as well as the final imputation output. We can save these outputs into the file as python arrays (or text, if that's simpler). Then we compare with our own output.

We can start with a simple example, but we should try and get a few awkward corner cases in there as well.

Note that we don't currently have a sensible high-level API for the LS model in tskit, so we should wrap the low-level API in functions locally in this file, for now.

@szhan
Member

szhan commented Jul 18, 2023

The following code produces a toy ts with 101 individuals, which are then split into a ref. panel of 100 individuals and a query set of one individual; it then dumps the nodes, edges, sites, mutations, and individuals tables to separate text files. When I ran it, I got a ref. panel with 8 variable sites and a query with 4 variable sites. I think it should be good enough as a test case? I'll keep the code here for record-keeping purposes.

import msprime
import numpy as np

seed = 1234
ts = msprime.sim_mutations(
    msprime.sim_ancestry(
        11,
        sequence_length=1e6,
        recombination_rate=1e-7,
        random_seed=seed,
    ),
    rate=1e-6,
    random_seed=seed,
)

# Split the samples into a reference panel and a single query individual.
ref_ts = ts.simplify(samples=np.arange(ts.num_samples - 2))
query_ts = ts.simplify(samples=np.arange(ts.num_samples - 2, ts.num_samples))

# Dump each table of the reference panel to its own text file.
for table in ["nodes", "edges", "sites", "mutations", "individuals"]:
    with open(f"ref_ts.{table}.txt", "w") as f:
        ref_ts.dump_text(**{table: f})

@szhan
Member

szhan commented Jul 18, 2023

Should we put those files in a new directory named, say, python/tests/data/beagle/?

@szhan
Member

szhan commented Jul 18, 2023

For input to BEAGLE, we have two gzipped VCF files, which we don't need to keep here, I think.
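For the record, they can be regenerated from the tree sequences in the snippet above with something like this (a minimal sketch, assuming the ref_ts and query_ts objects defined earlier):

import gzip

# Write gzipped VCFs for BEAGLE from the ref. panel and query tree sequences.
with gzip.open("ref.vcf.gz", "wt") as f:
    ref_ts.write_vcf(f)
with gzip.open("query.vcf.gz", "wt") as f:
    query_ts.write_vcf(f)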

@szhan
Member

szhan commented Jul 18, 2023

On second thought, we might want to keep those VCF files, just to have a record of the input files to BEAGLE used to generate the forward and backward matrices.

@jeromekelleher
Member Author

The input tree sequences should be sufficiently reproducible; we don't need to store the VCFs.

@jeromekelleher
Member Author

I'd like the examples we use for unit tests to be even smaller - 3 or 4 samples and < 10 sites.

@jeromekelleher
Member Author

Might be simpler to place the mutations "by hand" here on a small tree to start off with, like `Tree.generate_balanced`.
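Something along these lines, perhaps (a minimal sketch; the site positions and mutation nodes are arbitrary illustrations):

import tskit

# Build a small balanced tree and place mutations "by hand".
tree = tskit.Tree.generate_balanced(4, span=10)
tables = tree.tree_sequence.dump_tables()
for position, node in [(2, 0), (5, 5), (8, 4)]:
    site = tables.sites.add_row(position=position, ancestral_state="0")
    tables.mutations.add_row(site=site, node=node, derived_state="1")
ts = tables.tree_sequence()
print(ts.genotype_matrix())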

@szhan
Member

szhan commented Jul 18, 2023

For the fwd and bwd matrices, BEAGLE has scaled and shifted values, e.g., bwdVal[h] = scale * bwdVal[h] + shift, where h is the index of a haplotype in the ref. panel. I guess we should keep both the pre-adjusted and adjusted values for comparison. The lshmm and tskit versions also do this or something similar, right?
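For reference, the adjustment is just an elementwise affine transform per site (a trivial sketch; the scale and shift values here are made up):

import numpy as np

# BEAGLE-style per-site adjustment of the backward values
# (scale and shift are illustrative, not values from BEAGLE).
bwd_val = np.array([0.2, 0.5, 0.3, 0.1])
scale, shift = 0.9, 0.01
bwd_val = scale * bwd_val + shift  # bwdVal[h] = scale * bwdVal[h] + shift
print(bwd_val)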

@szhan
Member

szhan commented Jul 18, 2023

> I'd like the examples we use for unit tests to be even smaller - 3 or 4 samples and < 10 sites.

How about this?

  • 5 haploid individuals (4 for the ref. panel and 1 as the query)
  • 5 variable sites
  • 4 trees

@szhan
Member

szhan commented Jul 18, 2023

When we get the fwd and bwd matrices, we want one run with all the sites and then another run with chip-like sites, right? If so, we probably want something like 9 variable sites, setting 4 of them to missing for the latter, I think.

@szhan
Member

szhan commented Jul 18, 2023

Ah, actually, scrap the above comment. BEAGLE just computes the values for the sites in the target VCF file. The values for the in-between sites are interpolated thereafter.

@szhan
Member

szhan commented Aug 7, 2023

I've run BEAGLE 4.1 using a ref. panel of 2 diploid individuals (BEAGLE doesn't seem to take haploid individuals) and a query set consisting of 1 diploid individual. This toy dataset has 5 sites in the ref. panel and 4 sites in the query.

m = number of sites in the ref. panel = 5
n = number of haplotypes in the ref. panel = 4

BEAGLE outputs:

  • Forward prob. matrix of size (m, n) = (5, 4)
  • Prob. of switch at each site in the query, so of length m - 1 = 4
  • Prob. of mismatch at each site in the query
  • "Scaling factor" at each site in the query
  • "Shift factor" at each site in the query
  • Sum of forward prob. at each site in the query
  • Total of log10(sum of forward prob.) over all sites in the query (see the sketch below)
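The last quantity is easy to recompute from the forward matrix, e.g. (a minimal sketch; fwd here is just a stand-in for the (m, n) forward probability matrix):

import numpy as np

# Sum of forward prob. at each site, and the total of log10 of those
# sums over all sites (fwd is a placeholder matrix for illustration).
fwd = np.full((5, 4), 0.25)
site_sums = fwd.sum(axis=1)
total_log10 = np.log10(site_sums).sum()
print(site_sums, total_log10)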

@szhan szhan linked a pull request Aug 7, 2023 that will close this issue
@szhan
Member

szhan commented Aug 7, 2023

I'm storing everything in text and then converting it to a Pandas dataframe or tree sequence.

@szhan
Member

szhan commented Aug 7, 2023

Should we add an example with a simple genetic map? By default, BEAGLE 4.1 assumes that 1 Mb = 1 cM.

@szhan
Member

szhan commented Aug 7, 2023

We should probably also add the state probability matrix that combines the forward and backward probabilities.

@szhan
Member

szhan commented Aug 14, 2023

I've rerun BEAGLE using different query haplotypes by setting Ne to 10. BEAGLE sets Ne to 1 million by default. I don't fully understand why they set it so big, but I suspect it has to do with defining the switch probabilities with respect to the ratio between Ne and the ref. panel size. When Ne is set to 1 million and the ref. panel is as small as in the toy example, the switch probability is always 1; intuitively, this says that the ref. panel is so small that it is a very poor representation of the genetic diversity in a population with such a large Ne, so the model should always favour switching over mismatch.

Below are the matrices.

# Forward probability matrix
# tskit
[[0.05555556 0.38888889 0.05555556 0.05555556 0.38888889 0.05555556]
 [0.01234568 0.47530864 0.01234568 0.01234568 0.47530864 0.01234568]
 [0.02777778 0.44444444 0.02777778 0.02777778 0.44444444 0.02777778]
 [0.04166667 0.41666667 0.04166667 0.04166667 0.41666667 0.04166667]
 [0.05416667 0.39166667 0.05416667 0.05416667 0.39166667 0.05416667]]
# beagle
[[1.000000e-04 9.999000e-01 1.000000e-04 1.000000e-04 9.999000e-01
  1.000000e-04]
 [1.000000e-06 2.816427e+00 1.000000e-06 1.000000e-06 2.816427e+00
  1.000000e-06]
 [4.724000e-02 4.100000e-05 4.724000e-02 4.724000e-02 4.100000e-05
  4.724000e-02]
 [2.445010e-01 1.000000e-06 2.445010e-01 2.445010e-01 1.000000e-06
  2.445010e-01]]
# Backward probability matrix
# tskit
[[0.22222222 1.22222222 0.22222222 0.22222222 1.22222222 0.22222222]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]
 [1.         1.         1.         1.         1.         1.        ]]
# beagle
[[0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
 [0.244614 0.010772 0.244614 0.244614 0.010772 0.244614]
 [0.226377 0.047246 0.226377 0.226377 0.047246 0.226377]
 [0.010973 0.478054 0.010973 0.010973 0.478054 0.010973]]

The BEAGLE values in the backward matrix now vary and make more sense. Do the _tskit.lshmm values make sense to you, @jeromekelleher?

I think we should next tweak the parametrization and/or mimic BEAGLE's scaling and shifting?
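To illustrate the Ne intuition above: with a Li-and-Stephens-style switch probability, the switch rate saturates at 1 when Ne is large relative to the panel (a toy sketch; the 0.04 * Ne * cM / n_haps form is an assumption here, not necessarily BEAGLE 4.1's exact formula):

import math

# Toy switch probability that grows with Ne relative to the ref. panel
# size (the exact scaling is an assumption, not taken from BEAGLE).
def switch_prob(dist_cM, ne, n_ref_haps):
    rho = 0.04 * ne * dist_cM / n_ref_haps
    return -math.expm1(-rho)  # 1 - exp(-rho)

print(switch_prob(1.0, 1e6, 4))  # ~1.0: huge Ne, tiny ref. panel
print(switch_prob(1.0, 10, 4))   # ~0.095: switching no longer certain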

@szhan
Member

szhan commented Aug 14, 2023

In case I forget, just jotting down some notes about BEAGLE 4.1's approach to calculating allele probabilities, which are used to get the MAP alleles at each imputed site of the query haplotypes (a rough code sketch follows the list).

h = number of haplotypes in ref. panel
m = number of genotyped markers

  1. Fill the forward probability matrix (m x h).
  2. Fill the backward probability matrix (m x h). Only the values for one site are kept at a time.
  3. Fill the state probability matrix (m x h). This combines the forward and backward values at the genotyped markers.
  4. Fill the allele probability array (of length equal to the sum of the number of marker states over the m markers).
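Here is a rough haploid sketch of steps 1-3 (a generic Li-and-Stephens forward-backward with uniform switch/mismatch parameters; this is not BEAGLE's exact scaling or parametrization):

import numpy as np

def forward_backward(ref, query, p_switch, p_mismatch):
    # ref is an (m, h) allele matrix; query is a length-m allele vector.
    m, h = ref.shape
    fwd = np.zeros((m, h))
    fwd[0] = np.where(ref[0] == query[0], 1 - p_mismatch, p_mismatch) / h
    fwd[0] /= fwd[0].sum()
    for j in range(1, m):
        emit = np.where(ref[j] == query[j], 1 - p_mismatch, p_mismatch)
        # Transition: stay with prob. 1 - p_switch, else switch uniformly.
        fwd[j] = emit * ((1 - p_switch) * fwd[j - 1] + p_switch / h)
        fwd[j] /= fwd[j].sum()  # renormalize per site to avoid underflow
    bwd = np.ones((m, h))
    for j in range(m - 2, -1, -1):
        emit = np.where(ref[j + 1] == query[j + 1], 1 - p_mismatch, p_mismatch)
        tmp = emit * bwd[j + 1]
        bwd[j] = (1 - p_switch) * tmp + p_switch * tmp.mean()
        bwd[j] /= bwd[j].sum()
    state = fwd * bwd  # step 3: combine at the genotyped markers
    state /= state.sum(axis=1, keepdims=True)
    return fwd, bwd, state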

@szhan
Member

szhan commented Aug 14, 2023

I think I understand the difference between the state probability matrix and the allele probability matrix. The former is what you get from the forward-backward algorithm, and it contains the allele probabilities at the genotyped markers (excluding the ungenotyped markers that we want to impute). The latter contains allele probabilities interpolated from the allele probabilities of the flanking genotyped markers.
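If so, the interpolation at an ungenotyped marker would look roughly like this (a sketch; I'm assuming a simple linear weighting by position, which may differ from BEAGLE 4.1's exact weighting):

import numpy as np

# Interpolate allele probabilities at position x from the flanking
# genotyped markers at x_left and x_right (linear weighting assumed).
def interpolate_allele_probs(p_left, p_right, x_left, x_right, x):
    w = (x_right - x) / (x_right - x_left)
    return w * np.asarray(p_left) + (1 - w) * np.asarray(p_right)

print(interpolate_allele_probs([0.9, 0.1], [0.5, 0.5], 0.0, 2.0, 0.5))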

@szhan
Member

szhan commented Aug 14, 2023

So, we should mimic what BEAGLE is doing, right? And interpolate the allele probabilities of the ungenotyped markers? So, we are really implementing BEAGLE 4.1's algorithm in tskit.

@szhan
Member

szhan commented Aug 14, 2023

Then, getting the MAP alleles at each ungenotyped marker site for each query haplotype is straightforward.

@jeromekelleher
Member Author

Great, thanks @szhan.

Implementing BEAGLE 4.1's algorithm would be ideal, I guess - it's very good.

@szhan
Member

szhan commented Aug 18, 2023

I've added a separate file (python/tests/test_beagle.py) to contain the code to replicate the BEAGLE 4.1 algorithm. It is temporary. I'll reorganise it later. I still don't have all the little details right, but I think I'm getting close.

@szhan
Member

szhan commented Aug 18, 2023

Some notes about some concepts used in the BEAGLE 4.1 algorithm.

  • Marker refers to a genotyped site.
  • Cluster refers to a set of adjacent markers (defined by a specified threshold).
  • Segment refers to a set of clusters of markers.

There are nested indices to keep track of at each level of organisation, as far as I can tell.

@szhan
Member

szhan commented Aug 26, 2023

Most of the implementation is there. I'm corresponding with Brian Browning to understand how reference haplotype segments are indexed.

@szhan
Member

szhan commented Aug 30, 2023

In BEAGLE 4.1, the reference haplotype segments used to index the HMM state probability matrix are defined in the function refHapSegs() in ImputationData.java.

In refHapSegs(), the reference haplotype segments are defined by a left genotyped marker (inclusive) and a right genotyped marker (exclusive) indexed with respect to the reference markers (the genotyped markers and the imputed markers). The number of such segments (n) is the number of genotyped markers plus one.

Suppose that the first and last genotyped markers are not the first and last markers in the reference panel, respectively. Also suppose that BEAGLE input parameters are set such that each segment is defined by two individual markers rather than aggregated markers.

The following steps define the left (L) and right (R) indices of the segments (a code sketch follows the list):

  • i = 0: L = 0; R = index of the 0th genotyped marker in the ref. markers, plus 1.
  • i = 1: L = index of the 0th genotyped marker in the ref. markers; R = index of the 1st genotyped marker, plus 1.
  • loop from i = 2 to n-2: L = index of the (i-1)th genotyped marker in the ref. markers; R = index of the ith genotyped marker, plus 1.
  • i = n-1: L = index of the (n-2)th genotyped marker in the ref. markers; R = number of ref. markers.
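In code, I read that as something like this (a sketch of my reading, not the actual refHapSegs() implementation; genotyped maps each genotyped-marker index to its index among the ref. markers):

def ref_hap_segs(genotyped, num_ref_markers):
    # Segment i = 0.
    segs = [(0, genotyped[0] + 1)]
    # Segments i = 1 .. n-2.
    for i in range(1, len(genotyped)):
        segs.append((genotyped[i - 1], genotyped[i] + 1))
    # Segment i = n-1.
    segs.append((genotyped[-1], num_ref_markers))
    return segs

# E.g. genotyped markers at ref. marker indices 2, 5, 7 out of 10 ref. markers:
print(ref_hap_segs([2, 5, 7], 10))  # [(0, 3), (2, 6), (5, 8), (7, 10)]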

@szhan
Member

szhan commented Aug 30, 2023

It's simpler if we don't follow their way of normalizing the probability values in the matrices. We really just want to use Equation 1 in BB2016, but in their implementation they follow Equation 2.

@szhan
Member

szhan commented Sep 4, 2023

It doesn't seem that BEAGLE uses the interpolated allele probabilities directly to get the imputed alleles. It computes the genotype probability (simply the product of the probability of allele 0 and the probability of allele 1) and then takes the genotype with the highest probability, I think. Also, I'm not getting the scaling of the probabilities right...
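For a diploid query at an imputed site, that would be something like this (a sketch of my understanding, with made-up numbers):

import numpy as np

# Genotype probabilities as products of per-haplotype allele probabilities,
# then take the most probable genotype (illustrative values only).
p_hap1 = np.array([0.9, 0.1])  # P(allele 0), P(allele 1) on haplotype 1
p_hap2 = np.array([0.3, 0.7])  # the same on haplotype 2
geno_probs = np.outer(p_hap1, p_hap2)
a1, a2 = np.unravel_index(geno_probs.argmax(), geno_probs.shape)
print(a1, a2)  # most probable genotype: 0|1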
