add more linear algebra tools #1547
Comments
@petrelharp sorry for being pedantic, but This nomenclature follows the established Fisher (1918) model and 100+ years of nomenclature evolution with quite strong stabilising selection on these terms: Phenotype value = Genetic (or genotypic) value + Environment value, possibly including interaction between the two. Then genetic value is decomposed into additive, dominance and epistatic values. I propose |
Yes, good point! (I was going for 'quick notes', not being careful.)
Great idea! |
Separately - I've thought it through and I think the efficient way to multiply a vector by the LD matrix is via |
So, two matvecs instead of a matmul and a matvec, right? |
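A minimal sketch of the "two matvecs" point, taking G to be the mutations-by-samples 0/1 matrix from the issue description and ignoring normalisation. The helper names (matvec, rmatvec, ld_matvec) are hypothetical, and the toy G is dense only for illustration; on a tree sequence each matvec would presumably be done without materialising G.

```python
def matvec(A, x):
    """Return A @ x for a dense matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T @ y without forming the transpose."""
    out = [0.0] * len(A[0])
    for row, yi in zip(A, y):
        for j, a in enumerate(row):
            out[j] += a * yi
    return out


def ld_matvec(G, x):
    """(G G^T) x as two matvecs: first G^T x, then G applied to the result."""
    return matvec(G, rmatvec(G, x))


# Toy extended genotype matrix: 3 mutations (rows) x 4 samples (columns).
G = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
x = [1.0, 2.0, 3.0]  # one entry per mutation

# Reference: form G G^T explicitly (the matmul we want to avoid), then multiply.
GGt = [[sum(a * b for a, b in zip(ri, rj)) for rj in G] for ri in G]
explicit = matvec(GGt, x)

fast = ld_matvec(G, x)
```

Both routes give the same vector; the second never builds the mutations-by-mutations matrix.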
Right - but also, recall that in #1246 we're basically doing |
Here's a draft of I think a very efficient way to do (1), We will iterate left-to-right over the trees, maintaining at all times a vector So: the output will be
Note that this avoids propagating mutations down the tree until "necessary"; i.e., until a maximal shared haplotype in some sense is attained - so, the complexity is, I think, like O((num edges) * (log N) + (num mutations) ) |
A less efficient way to do this would be to do:
... but, maybe this is good enough? I think not; see below: This way does generalize better to the case where we want to compute phenotypes with dominance. For that case, we need to know, for each individual, how many of their nodes are below a given mutation, if the number is nonzero. This fits within our statistics framework already, but the size of the internal state would be equal to the number of individuals, so this is not feasible. If we knew not only |
The first option sounds super interesting @petrelharp. I wonder if there's any connection with keeping track of the last time a node was updated (e.g., here), which we might be able to reuse conceptually? I really like this! It's using the mutation parent column for dynamic programming ❤️ |
The generalisation to individuals is trickier - how about we get the first algorithm implemented and tested, and then think about how to generalise? The |
Ah, you're right! That "last updated time" should be the same - i.e., should give the last time that |
Sounds good. That's a different issue to this one, anyhow (which is "linear algebra tools"). |
Was commenting on this:
|
Is there a reason to focus on samples only? I would be interested in sums for ancestors too. |
No, I guess not - but, just to check: the sum would be only over the portion of the genome where we have their genotypes, skipping out the "missing data" portions. So, it'd be the contribution to breeding value from the chunk of ancestral haplotype carried by that ancestor. Is that what you want? (We will probably want it to do Bayesian things...) |
Yes, "partial" ancestral genomes/individuals will have only "partial" genetic values. In some cases ancestral genomes/individuals could be known in full, say in a simulation or in a pedigreed population, so we should get the full genetic value for such ancestors in these cases (but maybe these are classed as "samples" anyway). I think that providing genetic values for ancestors (full or partial) will be useful if we want to study evolution of a particular genome segment. I think the same applies also to other node or individual based statistics. |
@petrelharp what is k here? |
|
Copying the Slack discussion here too so we don’t lose it ;) @jeromekelleher you asked what the other multiplication (pre- or left-multiplication) does. Let Today you showed matrix-vector multiplication of The other way is Having the |
@jeromekelleher thinking a bit more about the “parsimony approach” you have taken. As you mentioned, the nSamples*1 vector, with which you post-multiply the genotype matrix, has to be “compressible”, as in, the mutations that give rise to the vector values have to map to the local tree (trees?) very well. If I am getting this right, this is a strong assumption and it won’t hold for complex traits, where the samples vector is continuous (say something like a Gaussian) and we don’t have a good way of mapping the sample vector onto individual mutation effects, hence onto trees, but it could be useful for less complex traits? |
It's a very strong assumption @gregorgorjanc and it definitely won't hold in general. It's only going to work for things in which the underlying generative process is based on the trees - but since this is the background assumption we're making for a lot of things anyway (we think the trees are useful because they are the generating process), I think it's worth exploring. |
Totally! @brieuclehmann is this "compressed" view useful for PCA type calculations? The ancestry that PCA is revealing, is driven exclusively by the tree generating process! |
Just a note that in the genetics literature |
Thinking about the usefulness of linear algebra operations with tree sequences ... One of the key operations in data analysis is solving the least squares problem. There are many flavours of solving the least squares problem. To estimate mutation effects (= allele substitution effects) by regressing phenotypes onto genotypes, I have been using the Gauss-Seidel iterative method (see this nice "Technical Note: Computing Strategies in Genome-Wide Selection" where the Gauss-Seidel Residual Update variant is described) - here we are estimating all mutation effects jointly so that we can use them in genomic prediction, not one mutation at a time as mentioned above for GWAS. In the Gauss-Seidel Residual Update we need the following operations (I have a simple Fortran90 implementation here): a. a dot product of a mutation row of |
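To make the operations concrete, here is a rough sketch of a Gauss-Seidel residual-update sweep. It assumes a ridge-type penalty lam on the effects and a dense mutations-by-samples G; it is one reading of the residual-update idea, not a transcription of the linked Fortran code, and the names (gsru, lam, n_iter) are made up for illustration. The only operations touching G are a dot product of one mutation row with the residual vector and an axpy-style update of the residuals with that same row.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gsru(G, y, lam=1.0, n_iter=50):
    """Gauss-Seidel with residual update for (G G^T + lam I) b = G y.

    G: list of mutation rows (0/1 per sample); y: phenotypes, one per sample.
    Returns (b, e): estimated mutation effects and residuals e = y - G^T b.
    """
    b = [0.0] * len(G)
    e = [float(v) for v in y]        # residuals for the starting point b = 0
    gg = [dot(g, g) for g in G]      # g_j . g_j, fixed across sweeps
    for _ in range(n_iter):
        for j, g in enumerate(G):
            # Dot product of mutation row j with the current residuals.
            rhs = dot(g, e) + gg[j] * b[j]
            new = rhs / (gg[j] + lam)
            delta = new - b[j]
            # Residual update: e <- e - g_j * delta.
            e = [ei - gi * delta for ei, gi in zip(e, g)]
            b[j] = new
    return b, e


G = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
y = [1.0, 2.0, 0.5, -1.0]
effects, residuals = gsru(G, y, lam=1.0)
```

At a fixed point each update leaves b unchanged, which is the condition lam * b_j = g_j . (y - G^T b), i.e. the penalised normal equations.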
I have been thinking about various quantitative genetics models and tree sequence. In this thread the mentioned linear algebra tools involve the extended genotype matrix |
Possibly - you'd need to explain the details to me in person though. |
I think that yes, definitely! The "expected value of statistic S under infinite sites" is a good definition, and can be asked of anything we're considering. |
On this topic, see also #2882. |
Let `G` be the "extended genotype matrix", whose rows are indexed by mutations (not sites) and columns by samples, with `G[i,j] = 1` if the sample has inherited that mutation and `0` otherwise. This is something people want to work with, but for many applications we don't need to actually return the matrix, but instead just multiply it by things. For instance:
- if `s` is a vector of effect sizes, then `G^T s` gives a vector of additive phenotypes computed by adding up the entries of `s` corresponding to all the mutations each sample has inherited
- `G^T G` is related to the (site) genetic relatedness matrix; so "matrix multiplication" statistic #1246 is something in this direction.
- `G G^T` is similarly related to the LD matrix, and multiplying the LD matrix by things is important in e.g. estimating heritability
- `G x` is similar to what we've called `trait_covariance` with `mode="site", windows="sites"` - except that gives something per site, not per mutation.

Here "related to" means "up to some normalization that I'm not figuring out right now".
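As a small worked example of the products listed above (a toy dense matrix, no normalisation, and variable names chosen only for illustration):

```python
# Toy extended genotype matrix: 3 mutations (rows) x 4 samples (columns).
G = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
n_samples = len(G[0])

# G^T s: for each sample, add up the effect sizes of the mutations it carries.
s = [0.5, -1.0, 2.0]
phenotypes = [
    sum(s[i] for i, row in enumerate(G) if row[j]) for j in range(n_samples)
]

# G x: for each mutation, add up x over the samples that carry it.
x = [1.0, 0.0, 2.0, -1.0]
per_mutation = [sum(xj for g, xj in zip(row, x) if g) for row in G]

# G^T G: entry (j, k) counts the mutations shared by samples j and k.
shared = [
    [sum(row[j] * row[k] for row in G) for k in range(n_samples)]
    for j in range(n_samples)
]
```

Here sample 1 carries mutations 0 and 1, so its additive phenotype is 0.5 + (-1.0) = -0.5, and it shares one mutation with each of samples 0 and 2.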
There are API and computational issues to work out here. For instance, we could just implement methods that do `G x` and `G^T y`, and use these to get `G^T G x`; however, that's not how we currently do things in #1246.

Preliminary brainstorming: besides #1246,
- `G^T s`: maybe `ts.additive_phenotypes(samples, effect_sizes)` would return `G^T effect_sizes`. Also, we probably want `ts.individual_phenotypes(individuals, effect_sizes, dominance_coefficients)` that does the analogous thing for diploids. (But, maybe it'd be nice to have a single function that you can pass either sample IDs or individual IDs somehow.)
- I need to better understand what computations are done in LD space to see if this is the right thing, but possibly we should implement `G G^T x` as `ts.ld_matrix_multiplication(x)`.
- And, maybe `G x` would be an option to `trait_covariance`? Or, `mutation_covariance(x)`?
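To pin down what the two proposed phenotype functions might compute, here is a sketch written as free functions on a dense toy G rather than as tree sequence methods. None of this exists in tskit; the signatures only mirror the names floated above. The dominance convention is an assumption: an individual carrying one copy of a mutation gets h * s and one carrying two copies gets s.

```python
def additive_phenotypes(G, samples, effect_sizes):
    """Hypothetical stand-in: entries of G^T effect_sizes for the given samples."""
    return [sum(s * row[j] for row, s in zip(G, effect_sizes)) for j in samples]


def individual_phenotypes(G, individuals, effect_sizes, dominance_coefficients):
    """Hypothetical diploid analogue.

    individuals: list of (node_a, node_b) column pairs, one pair per diploid.
    Assumed convention: one copy contributes h * s, two copies contribute s.
    """
    out = []
    for a, b in individuals:
        total = 0.0
        for row, s, h in zip(G, effect_sizes, dominance_coefficients):
            copies = row[a] + row[b]
            if copies == 2:
                total += s
            elif copies == 1:
                total += h * s
        out.append(total)
    return out


G = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
effects = [0.5, -1.0, 2.0]
pairs = [(0, 1), (2, 3)]  # two diploids built from the four sample nodes

haploid = additive_phenotypes(G, [0, 1, 2, 3], effects)
diploid = individual_phenotypes(G, pairs, effects, [0.5, 0.0, 1.0])
# With h = 0.5 everywhere, the diploid value is the mean of the two haploid values.
codominant = individual_phenotypes(G, pairs, effects, [0.5, 0.5, 0.5])
```

Under this convention the additive case is recovered with h = 0.5 throughout, which is one argument for a single function that accepts either sample IDs or individual IDs.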