Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add more linear algebra tools #1547

Open
petrelharp opened this issue Jun 30, 2021 · 27 comments
Open

add more linear algebra tools #1547

petrelharp opened this issue Jun 30, 2021 · 27 comments
Labels
enhancement New feature or request statistics

Comments

@petrelharp
Copy link
Contributor

petrelharp commented Jun 30, 2021

Let G be the "extended genotype matrix", whose rows are indexed by mutations (not sites) and samples, with G[i,j] = 1 if the sample has inherited that mutation and 0 otherwise. This is something people want to work with, but for many applications we don't need to actually return the matrix, but instead just multiply it by things. For instance:

  1. If s is a vector of effect sizes, then G^T s gives a vector of additive phenotypes computed by adding up the entries of s corresponding to all the mutations each sample has inherited
  2. G^T G is related to the (site) genetic relatedness matrix; so "matrix multiplication" statistic #1246 is something in this direction.
  3. G G^T is similarly related to the LD matrix, and multiplying the LD matrix by things is important in eg estimating heritabiliy
  4. G x is similar to what we've called trait_covariance with mode="site", windows="sites" - except that gives something per site, not per mutation.

Here "related to" means "up to some normalization that I'm not figuring out right now".

There's API and computational issues to work out here. For instance, we could just implement methods that do G x and G^T y, and use these to get G^T G x; however, that's not how we currently do things in #1246.

Preliminary brainstorming: besides #1246,

  • G^T s: maybe ts.additive_phenotypes(samples, effect_sizes) would return G^T effect_sizes. Also, we probably want ts.individual_phenotypes(individuals, effect_sizes, dominance_coefficients) that does the analogous thing for diploids. (But, maybe it'd be nice to have a single function that you can pass either sample IDs or individual IDs somehow.)

  • I need to better understand what computations are done in LD space to see if this is the right thing, but possibly we should implement G G^T x as ts.ld_matrix_multiplication(x).

  • And, maybe G x would be an option to trait_covariance? Or, mutation_covariance(x)?

@gregorgorjanc
Copy link
Member

@petrelharp sorry for being pedantic, but G^T s gives you additive genetic (or breeding) values, not phenotypes. To get phenotypes we would have to add to G^T s at least environmental values, as well as effects, appropriate for a given trait, such as sex, age, …

This nomenclature follows the established Fisher (1918) model and 100+ years of nomenclature evolution with quite strong stabilising selection on these terms: Phenotype value = Genetic (or genotypic) value + Environment value, possibly including interaction between the two. Then genetic value is decomposed into additive, dominance and epistatic values.

I propose ts.genetic_values() which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).

@jeromekelleher jeromekelleher added the enhancement New feature or request label Jul 1, 2021
@petrelharp
Copy link
Contributor Author

G^T s gives you additive genetic (or breeding) values, not phenotypes.

Yes, good point! (I was going for 'quick notes', not being careful.)

I propose ts.genetic_values() which can in principle cover nodes or samples and additive or additive + dominance cases (dominance only with samples).

Great idea!

@petrelharp
Copy link
Contributor Author

Separately - I've thought it through and I think the efficient way to multiply a vector by the LD matrix is via G G^T - i.e., by first getting breeding values per sample and then applying G to this.

@gregorgorjanc
Copy link
Member

So, two matvecs instead of a matmul and a matvec, right?

@petrelharp
Copy link
Contributor Author

So, two matvecs instead of a matmul and a matvec, right?

Right - but also, recall that in #1246 we're basically doing G^T G all at once, as a single matvec. However, I can't see how to do the equivalent for LD in a single matvec - but, it's easy with two matvecs.

@petrelharp
Copy link
Contributor Author

Here's a draft of I think a very efficient way to do (1), G^T s, which is given a vector s of length equal to the number of mutations, compute for each sample the sum of the values of s for every mutation that sample carries.

We will iterate left-to-right over the trees, maintaining at all times a vector x of length equal to the number of nodes, such that x[u] will be equal to: the sum of the values of s for all mutations that have been inherited by node u since the last time the set of samples inherited by u has changed, except for any mutations still accounted for by x[p(u)], where p(u) is the parent of u.

So: the output will be out, and suppose that the output for sample v is stored in out[k[v]].

  1. Initialize x to 0.
  2. After adding an edge with parent p and child c, traverse from the root down to p, and for each node v on this path, add x[v] to x[u] for each child u of v. If v is a sample then also add x[v] to out[k[v]]. Then set x[v] to zero.
  3. Do the same before removing an edge with parent p and child c.
  4. If a mutation i appears on node u, add s[i] to x[u].

Note that this avoids propagating mutations down the tree until "necessary"; i.e., until a maximal shared haplotype in some sense is attained - so, the complexity is, I think, like O((num edges) * (log N) + (num mutations) )

@petrelharp
Copy link
Contributor Author

A less efficient way to do this would be to do:

out = np.zeros(ts.num_samples)
for t in ts.trees(sample_lists=True):
   for m in t.mutations():
     for u in t.samples(m.node):
       out[u] += s[m.id]

... but, maybe this is good enough? I think not; see below:

This way does generalize better to the case where we want to compute phenotypes with dominance. For that case, we need to know, for each individual, how many of their nodes are below a given mutation, if the number is nonzero. This fits without our statistics framework already, but the internal state would be equal to the number of individuals, so this is not feasible. If we knew not only sample_lists but also, say, individual_lists, then that would do the trick. It would essentially be a sparse representation of the internal state from the statistics framework. However, it is not clear to me that this is very sparse: with N samples the total length of all sample_lists is still O(N^2). But maybe there is something sparser we could keep track of, like "which individuals have only one node below this one"?

@jeromekelleher
Copy link
Member

The first option sounds super interesting @petrelharp. I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here, which we might be able to reuse conceptually?

I really like this! It's using the mutation parent column for dynamic programming ❤️

@jeromekelleher
Copy link
Member

The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise? The sample_lists idea has never really proved that valuable to be honest - because it's so much work to maintain, it's often quicker to simply do the traversal. This would almost certainly be true of individuals too.

@petrelharp
Copy link
Contributor Author

I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here), which we might be able to reuse conceptually?

Ah, you're right! That "last updated time" should be the same - i.e., should give the last time that x was set to zero. So, that's probably what we need to compute a branch stat version of this (but, haven't thought that through).

@petrelharp
Copy link
Contributor Author

The generalisation to individuals is trickier - how about we get an implementation of the first algorithm implemented and tested, and then think about how to generalise?

Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").

@jeromekelleher
Copy link
Member

Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools").

Was commenting on this:

If we knew not only sample_lists but also, say, individual_lists, then that would do the trick.I

@gregorgorjanc
Copy link
Member

Is there a reason to focus on samples only? I would be interested in sums for ancestors too.

@petrelharp
Copy link
Contributor Author

Is there a reason to focus on samples only? I would be interested in sums for ancestors too.

No, I guess not - but, just to check: the sum would be only over the portion of the genome where we have their genotypes, skipping out the "missing data" portions. So, it'd be the contribution to breeding value from the chunk of ancestral haplotype carried by that ancestor. Is that what you want? (We will probably want it to do Bayesian things...)

@gregorgorjanc
Copy link
Member

Yes, "partial" ancestral genomes/individuals will have only "partial" genetic values. In some cases ancestral genomes/individuals could be known in full, say in a simulation or in a pedigreed population, so we should get the full genetic value for such ancestors in these cases (but maybe these are classed as "samples" anyway).

I think that providing genetic values for ancestors (full or partial) will be useful if we want to study evolution of a particular genome segment. I think the same applies also to other node or individual based statistics.

@gregorgorjanc
Copy link
Member

So: the output will be out, and suppose that the output for sample v is stored in out[k[v]].

@petrelharp what is k here?

@petrelharp
Copy link
Contributor Author

@petrelharp what is k here?

k is the vector of indices that lets us map from nodes (v) to index in the output - so, if we wanted output for nodes [0, 2, 4, 1] then I guess k would be a vector [0, 3, 1, 0, 2, 0, 0, ..]. This might not be the best way to arrange things.

@gregorgorjanc
Copy link
Member

gregorgorjanc commented Mar 9, 2022

Copying Slack discussion also here so we don’t loose it;)

@jeromekelleher you asked what the other multiplication (pre- or left-multiplication) does. Let G be an nSites x nSamples matrix, l be an nSites x 1 vector, and r be an nSamples x 1 vector.

Today you showed matrix-vector multiplication of G r = x where x is an nSites x 1 vector (so we get some "stats" per site) - this is 4. above at #1547 (comment) as @petrelharp mentioned this is connected to trait_covariance where we calculate covariance between 1) the number of mutations samples carry at a site and 2) sample phenotypes - so for a site i we calculate cov(r, G[i,]) = sum_over_j_samples(r_centred[j], G_centred[i,]) / nSamples (centred means that we removed the mean - for every row in G and overall in r). Of note, in a genome-wide association study, we want to regress sample phenotype onto the number of mutations samples carry, to estimate potential association between the two. The simplest estimator is to do one site at a time - here we calculate regression coefficient b_i = cov(r, G[i,]) / var(G[i,]). This reg coeff is often called allele substitution effect, which tells us if mutation at site I (=allele substitution) increases or decreases the phenotype.

The other way is l G = y where y is an nSamples x 1 vector (so we get some "stats" per sample). This is 1. above at #1547 (comment) Then 1 G = y (1 (not l!) here is a vector of 1s) gives us how many mutations - a "mutation count" (across all sites) samples carry. Further, say, l holds allele substitution values of all the mutations at the sites (assuming 0/1 mutations or that each mutation is in its own row). Then l G = y gives a "weighted mutation count" where each mutation is weighted by its effect on a phenotype - we call this additive genetic value (aka breeding value or polygenic risk score) - collective effect of all mutations, assuming additive effects only. @petrelharp proposed an algo for this above #1547 (comment)

Having the l G = y functionality would mean that we can start doing quantitative genetics simulations with tree sequences!

@gregorgorjanc
Copy link
Member

@jeromekelleher thinking a bit more about the “parsimony approach” you have taken. As you mentioned the nSamples*1 vector, with which you post multiply the genotype matrix, has to be “compressible”, as in, mutations that give rise to the vector values have to map to local tree (trees?) very well. If I am getting this right, this is a strong assumption and it won’t hold for a complex traits, where the samples vector is continuous (say something like a Gaussian) and we don’t have a good way of mapping the sample vector onto individual mutations effects, hence onto trees, but it could be useful in less complex traits?

@jeromekelleher
Copy link
Member

It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.

@gregorgorjanc
Copy link
Member

It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring.

Totally!

@brieuclehmann is this "compressed" view useful for PCA type calculations? The ancestry that PCA is revealing, is driven exclusively by the tree generating process!

@gregorgorjanc
Copy link
Member

Let G be an nSites x nSamples matrix, l be an nSites x 1 vector, and r be an nSamples x 1 vector.

Just a note that in the genetics literature G symbol is often used to denote the genetic relatedness matrix (GRM) obtained from a genotype matrix (possibly centred or even standardised see here for more).

@gregorgorjanc
Copy link
Member

Thinking about the usefulness of linear algebra operations with tree sequence ... One of key operations in data analysis is solving the least squares problem. There are many flavours of solving the least squares problem. To estimate mutation effects (= allele substitution effects) by regressing phenotypes onto genotypes, I have been using Gauss-Seidel iterative method (see this nice "Technical Note: Computing Strategies in Genome-Wide Selection" where the Gauss-Seidel Residual Update variant is described) - here we are estimating all mutation effects jointly so that we can use them in genomic prediction, not one mutation at a time as mentioned above for GWAS. In Gauss-Seidel Residual update we need the following operations (I have a simple Fortran90 implementation here):

a. a dot product of a mutation row of G with itself dot(G[mutation,],G[mutation,]) (see here)
b. a dot product of a mutation row of G with a samples vector r (dot(G[mutation,],r) (see here)
c. a product of a mutation row of G with a scalar k (G[mutation,]*k) (see here)
d. a product of sites vector l with a matrix G (l G = y) (see here)

@gregorgorjanc
Copy link
Member

I have been thinking about various quantitative genetics models and tree sequence. In this thread the mentioned linear algebra tools involve the extended genotype matrix G, so all calculations involve sites and their mutations. Would it be possible to do the same “style” of operations but involving branches instead of sites&mutations? Say, in #1547 (comment) one operation was calculating “weighted mutation count”, which is y inl G = y, where l are mutation weights (=effects). I am thinking of calculating the “sum of branch weights”, that is, start at the top of the tree sequence and accumulate weights for each branch as we go down the trees (maybe samples-to-root direction works to?). This is related to general_stat but there only sample weights are accumulated (from samples towards roots).

@jeromekelleher
Copy link
Member

jeromekelleher commented Mar 21, 2022

Possibly - you'd need to explain the details to me in person though.

@petrelharp
Copy link
Contributor Author

I think that yes, definitely! The "expected value of statistic S under infinite sites" is a good definition, and asked to anything we're considering.

@petrelharp
Copy link
Contributor Author

On this topic, see also #2882.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request statistics
Projects
None yet
Development

No branches or pull requests

3 participants