"matrix multiplication" statistic #1246

petrelharp · 2021-03-16T05:45:32Z

Here's a draft of the C code to compute covariances of aribtrary weighted sums of genotypes. Borrowing from @brieuclehmann's code, this should always be true:

def genetic_relatedness_matrix(ts, sample_sets, mode):
    n = len(sample_sets)
    indexes = [(n1, n2) for n1, n2 in itertools.combinations_with_replacement(range(n), 2)]
    K = np.zeros((n,n))
    K[np.triu_indices(n)] = ts.genetic_relatedness(sample_sets, indexes, mode = mode, proportion=False, span_normalise=False)
    K = K + np.triu(K,1).transpose()
    return K

ts = msprime.simulate(10, length=1e6, recombination_rate=1e-8, Ne=1e4, mutation_rate=1e-8)
W = np.random.normal(size=2 * ts.num_samples).reshape((ts.num_samples, 2))
K = genetic_relatedness_matrix(ts, [[n] for n in ts.samples()], mode='site')
A = W.T @ K @ W
B = ts.genetic_relatedness_weighted(W, indexes=[(0,0), (0, 1), (1, 0), (1,1)], mode='site', span_normalise=False).reshape((2,2))
assert np.allclose(A, B)

(... and, it is, happy day!)

TODO: input checking, and tests. Tests should be straightforward, as we can just check for the property above, but hooking it in to the testing code will take some doing.

Also TODO:

is this a good name? a good python API? I also thought about just making weights an argument to genetic_relatedness, which would call one method or the other depending on whether sample_sets or weights were present. This could be a general pattern, even.
maybe there's some tidying to do at the C level, in terms of refactoring the various general stats computing helpers? but this could be put off.

codecov · 2021-03-16T06:13:40Z

Codecov Report

Merging #1246 (039a3b8) into main (555913c) will increase coverage by 3.50%.
The diff coverage is 80.50%.

❗ Current head 039a3b8 differs from pull request most recent head 6c144d8. Consider uploading reports for the commit 6c144d8 to get more accurate results

@@            Coverage Diff             @@
##             main    #1246      +/-   ##
==========================================
+ Coverage   89.89%   93.39%   +3.50%     
==========================================
  Files          30       28       -2     
  Lines       29043    27464    -1579     
  Branches     5672     1259    -4413     
==========================================
- Hits        26107    25649     -458     
- Misses       1667     1781     +114     
+ Partials     1269       34    -1235

Flag	Coverage Δ
c-tests	`92.26% <91.89%> (+5.86%)`	⬆️
lwt-tests	`89.05% <ø> (+8.91%)`	⬆️
python-c-tests	`71.44% <75.30%> (+4.39%)`	⬆️
python-tests	`98.67% <10.00%> (-0.35%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/trees.py	`97.82% <10.00%> (-0.83%)`	⬇️
c/tskit/trees.c	`94.94% <91.89%> (+4.16%)`	⬆️
python/_tskitmodule.c	`91.12% <96.72%> (+2.71%)`	⬆️

... and 20 files with indirect coverage changes

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 555913c...6c144d8. Read the comment docs.

jeromekelleher · 2021-03-16T06:36:59Z

Very cool!

is this a good name? a good python API? I also thought about just making weights an argument to genetic_relatedness, which would call one method or the other depending on whether sample_sets or weights were present. This could be a general pattern, even.

I can't think of a better one, but I'd be +1 on adding a weights argument to genetic_relatedness. We'd need a different name though - is this the same "weight" as the W arg in the general_stat method? I can see this being confusing.

jeromekelleher

Looks great!

I think we should add a genetic_relatedness_matrix method as well as a convenience, but you probably had that in mind too?

The code all looks basically right to me, although I have scrutinised the detail. My only question really is about the "weight" term and whether this will be confusing. If it's the same W as in general stat (which I think it is?) then we're all good.

c/tskit/trees.c

petrelharp · 2021-03-16T12:56:53Z

is this the same "weight" as the W arg in the general_stat method?

It is the same. So, maybe the argument should be called W.

I think we should add a genetic_relatedness_matrix method as well as a convenience, but you probably had that in mind too?

Yeah, I guess? We still haven't gotten around to writing the divergence_matrix method either.

brieuclehmann · 2021-05-29T11:42:11Z

Just been playing around with this and it seems like there might be some numerical stability issues. I tried @petrelharp 's code above with different seeds and occasionally get an AssertionError. Actually, the SUPER WEIRD thing is that sometimes it works with the same seed and sometimes it doesn't... i.e. I run exactly the same code twice and the first time it will be fine but the second time returns, e.g., B = [inf, inf ; inf, inf]. Has anyone seen anything like this before? This is on my MacBook btw.

def genetic_relatedness_matrix(ts, sample_sets, mode):
    n = len(sample_sets)
    indexes = [(n1, n2) for n1, n2 in itertools.combinations_with_replacement(range(n), 2)]
    K = np.zeros((n,n))
    K[np.triu_indices(n)] = ts.genetic_relatedness(sample_sets, indexes, mode = mode, proportion=False, span_normalise=False)
    K = K + np.triu(K,1).transpose()
    return K

seed = 42
ts = msprime.simulate(10, length=1e6, recombination_rate=1e-8, Ne=1e4, mutation_rate=1e-8, random_seed=seed)
np.random.seed(seed)
W = np.random.normal(size=2 * ts.num_samples).reshape((ts.num_samples, 2))
K = genetic_relatedness_matrix(ts, [[n] for n in ts.samples()], mode='site')
A = W.T @ K @ W
B = ts.genetic_relatedness_weighted(W, indexes=[(0,0), (0, 1), (1, 0), (1,1)], mode='site', span_normalise=False).reshape((2,2))
assert np.allclose(A, B)

It seems to work fine on rescomp (the Oxford BDI computing cluster, which runs linux). Though this is using python v3.7.10 compared to v3.8.6. Also, the K matrices are ever so slightly different between the two machines.

Bizarre... I'll just develop on rescomp for now.

brieuclehmann · 2021-05-29T12:51:09Z

OK, still some bizarre behaviour on rescomp. See the following code to perform matrix-vector multiplication using the same tree sequence as above. I am using the old code for comparison as mat_mul then mat_mul_stat for the call to genetic_relatedness_weighted. On some occasions it returns the right output, on others not:

n_ind = int(ts.num_samples / 2)
sample_sets = [(2 * i, (2 * i) + 1) for i in range(n_ind)]
W_samples = np.array([[float(u in A) for A in sample_sets] for u in ts.samples()])
indexes = [(i, n_ind) for i in range(len(sample_sets))]
n = np.array([len(x) for x in sample_sets])
n_total = sum(n)

def mat_mul_stat(a):
    W = np.c_[W_samples, W_samples @ a]
    return ts.genetic_relatedness_weighted(
        W, indexes=indexes, mode="site", span_normalise=False
    )

def mat_mul(a):
    W = np.c_[W_samples, W_samples @ a]
    wts = np.sum(W, axis=0)
    # summary function
    def f(x):
        mx = np.sum(x[0:n_ind]) / n_total
        return np.array(
            [(x[i] - wts[i] * mx) * (x[j] - wts[j] * mx) / 2 for i, j in indexes]
        )
    return ts.general_stat(W, f, len(indexes), span_normalise=False)

np.random.seed(seed)
a = np.random.randn(n_ind)
K = genetic_relatedness_matrix(ts, sample_sets, "site")
print(mat_mul_stat(a))
print(K @ a)
print(mat_mul(a))
print(mat_mul_stat(a))
print(mat_mul_stat(a))

Returns:

[  73.45122395 -205.96299977   53.02061499  435.5723698  -352.63344965]
[  73.45122395 -205.96299977   53.02061499  432.12461049 -352.63344965]
[  73.45122395 -205.96299977   53.02061499  432.12461049 -352.63344965]
[  41.41281634 3375.49081761 3669.81766139 4035.61057064 3232.49239152]
[-842.64551716  690.48841404  967.14364331 1339.59209569  545.65397605]

petrelharp · 2021-05-30T03:48:59Z

Yep, I'm seeing that too. Bizarrely, only in an interactive shell,not at the command line. It must be that I've done something bad with memory management. I'm not sure how to track that down - can we run valgrind on the python tests?

brieuclehmann · 2021-06-03T15:26:45Z

I'm also getting the problem when I run the above code in the command line via a script... Never used valgrind before but could look into it if that would help?

jeromekelleher · 2021-06-03T16:23:54Z

I'm not sure how to track that down - can we run valgrind on the python tests?

No, Python gets in the way. We'd have to reproduce the error as a C test case to be able to use valgrind.

I have a bit of time in the morning, I could take a look if someone sends me a script to reproduce?

petrelharp · 2021-06-03T16:39:07Z

#!/usr/bin/python3
import msprime, tskit
import numpy as np
import itertools

def genetic_relatedness_matrix(ts, sample_sets, mode):
    n = len(sample_sets)
    indexes = [(n1, n2) for n1, n2 in itertools.combinations_with_replacement(range(n), 2)]
    K = np.zeros((n,n))
    K[np.triu_indices(n)] = ts.genetic_relatedness(sample_sets, indexes, mode = mode, proportion=False, span_normalise=False)
    K = K + np.triu(K,1).transpose()
    return K


ts = msprime.simulate(10, length=1e6, recombination_rate=1e-8, Ne=1e4, mutation_rate=1e-8, random_seed=42)

for seed in range(10):
    np.random.seed(seed)
    W = np.random.normal(size=2 * ts.num_samples).reshape((ts.num_samples, 2))
    K = genetic_relatedness_matrix(ts, [[n] for n in ts.samples()], mode='site')
    A = W.T @ K @ W
    B = ts.genetic_relatedness_weighted(W, indexes=[(0,0), (0, 1), (1, 0), (1,1)], mode='site', span_normalise=False).reshape((2,2))
    print('A:', A)
    print('B:', B)
    assert np.allclose(A, B)

jeromekelleher · 2021-06-03T16:52:53Z

Thanks @petrelharp, this is a weird one. Reliably fails on the 10th seed, but not if we just run that seed on its own. I'll poke around, but it nearly has to be a memory management error.

jeromekelleher · 2021-06-03T16:59:11Z

I think it's worthwhile putting in some tests to cover the remaining code paths now - this will have to be done later anyway, and it might unearth the problem we're having here. Stuff covering error paths in the Python C stuff is particularly important to exercise, subtle problems can manifest themselves in all sorts of unexpected ways and lots of time can be lost tracking them down.

petrelharp · 2021-06-03T17:09:27Z

That sounds good. I'm pretty swamped right now, might you be able to have a go at the tests, @brieuclehmann?

brieuclehmann · 2021-06-03T17:11:44Z

Yep, I'd be happy to! Please could you point me towards a relevant function/method that I could use as a template?

jeromekelleher · 2021-06-03T17:15:03Z

Thanks @brieuclehmann but it's probably better if I put in a commit to test out the code paths I'm thinking about - the testing code is quite convoluted for this stuff, and it would be a good way for me to start getting up to speed on what the function is doing anyway. I'll have a go as soon as I can (hopefully tomorrow AM)

jeromekelleher · 2021-06-04T11:09:26Z

I think that should do it - I think the issue was that the total_weights array was being malloced rather than calloced (and therefore never being initialised to 0). Most of the time the memory happened to be zero already so it worked.

I don't understand how valgrind missed this though...

c/tskit/trees.c

petrelharp · 2021-06-04T15:59:44Z

I think the issue was that the total_weights array was being malloced rather than calloced (and therefore never being initialised to 0)

Doh! Good catch.

jeromekelleher · 2021-10-04T08:42:48Z

I think we should try to get this merged @petrelharp - do you know what's needed to get it in? We don't need to have it fully documented, but it would be good to get it in to main so we can experiment.

petrelharp · 2021-10-07T07:43:38Z

It needs, at least:

better documentation for things like what indexes defaults to
tests for genetic_relatedness_weighted to be added to tests/test_tree_stats.py

The latter should be easy because we can just compare its output to what we'd get by matrix-multiplying the genetic_relatedness matrix by the weight matrix (as in the definition). Pretty quick, but not quick enough for me to have done it just now...

jeromekelleher · 2021-12-08T17:17:07Z

Do you think you could update this branch please @petrelharp? If you rebase and force push, then @brieuclehmann can open a PR against your fork with a some tests, and that should finish things up.

As far as I can see it's mainly simple interface tests needed here now?

brieuclehmann · 2022-02-22T14:41:43Z

Ping @petrelharp - I'd be happy to write some tests and more documentation in the coming days so that we can get this merged, but I think I'm not able to rebase the branch myself?

petrelharp · 2022-02-22T21:56:55Z

Done!

jeromekelleher · 2022-02-23T10:55:50Z

FYI I just pushed an update fixing the build problems.

brieuclehmann · 2022-07-26T09:54:51Z

Would it be possible to rebase this onto main? I'd like to use genetic_relatedness_weighted and decapitate in the same script, i.e. to do PCA on a decapitated tree sequence.

(Sorry for dropping the ball on writing the tests & documentation for this - I will get round to it v soon!)

petrelharp · 2022-07-26T19:53:02Z

Uh-oh, a segfault - ping @jeromekelleher?

jeromekelleher · 2022-07-27T08:17:18Z

Looks like a simple compile error because of API changes we made for C 1.0. Should be fixed.

Initialise total_weights to 0 and avoid an extra loop through arrays. Fix compile error

First pass at genetic_relatedness_weighted tests Full pass at tests for genetic_relatedness_weighted Update python/tskit/trees.py Co-authored-by: Peter Ralph <[email protected]> Add summary func to genetic_relatedness_weighted tests Fix summary function definition in docs

jeromekelleher · 2023-07-13T08:27:20Z

Closing this in favour of #2785 (which includes these commits)

petrelharp force-pushed the matmul branch from f9c1f0c to b7cb129 Compare March 16, 2021 05:46

jeromekelleher reviewed Mar 16, 2021

View reviewed changes

c/tskit/trees.c Outdated Show resolved Hide resolved

c/tskit/trees.c Outdated Show resolved Hide resolved

jeromekelleher force-pushed the matmul branch from b7cb129 to 9076387 Compare June 4, 2021 10:02

jeromekelleher reviewed Jun 4, 2021

View reviewed changes

c/tskit/trees.c Outdated Show resolved Hide resolved

petrelharp mentioned this pull request Jun 30, 2021

add more linear algebra tools #1547

Closed

brieuclehmann mentioned this pull request Sep 29, 2021

ts-PCA performance is slow compared scikit-allel #1743

Closed

petrelharp force-pushed the matmul branch from 1078247 to 2994659 Compare February 22, 2022 21:56

jeromekelleher force-pushed the matmul branch from 2994659 to 829431f Compare February 23, 2022 10:55

jeromekelleher marked this pull request as ready for review February 23, 2022 10:56

jeromekelleher force-pushed the matmul branch from 829431f to aae0a30 Compare February 23, 2022 13:07

petrelharp force-pushed the matmul branch from aae0a30 to 8487f73 Compare July 26, 2022 19:11

brieuclehmann mentioned this pull request Jul 7, 2023

Tests and documentation for genetic_relatedness_weighted #2785

Merged

petrelharp and others added 4 commits July 13, 2023 09:20

matmul plumbing

13d8977

Add tests for weighted genetic relatedness.

e130576

Fix memory bug.

d5d24e1

Initialise total_weights to 0 and avoid an extra loop through arrays. Fix compile error

jeromekelleher force-pushed the matmul branch from 039a3b8 to 6c144d8 Compare July 13, 2023 08:21

jeromekelleher closed this Jul 13, 2023

"matrix multiplication" statistic #1246

"matrix multiplication" statistic #1246

Uh oh!

Conversation

petrelharp commented Mar 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jeromekelleher commented Mar 16, 2021

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Mar 16, 2021

Uh oh!

brieuclehmann commented May 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brieuclehmann commented May 29, 2021

Uh oh!

petrelharp commented May 30, 2021

Uh oh!

brieuclehmann commented Jun 3, 2021

Uh oh!

jeromekelleher commented Jun 3, 2021

Uh oh!

petrelharp commented Jun 3, 2021

Uh oh!

jeromekelleher commented Jun 3, 2021

Uh oh!

jeromekelleher commented Jun 3, 2021

Uh oh!

petrelharp commented Jun 3, 2021

Uh oh!

brieuclehmann commented Jun 3, 2021

Uh oh!

jeromekelleher commented Jun 3, 2021

Uh oh!

jeromekelleher commented Jun 4, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Jun 4, 2021

Uh oh!

jeromekelleher commented Oct 4, 2021

Uh oh!

petrelharp commented Oct 7, 2021

Uh oh!

jeromekelleher commented Dec 8, 2021

Uh oh!

brieuclehmann commented Feb 22, 2022

Uh oh!

petrelharp commented Feb 22, 2022

Uh oh!

jeromekelleher commented Feb 23, 2022

Uh oh!

brieuclehmann commented Jul 26, 2022

Uh oh!

petrelharp commented Jul 26, 2022

Uh oh!

jeromekelleher commented Jul 27, 2022

Uh oh!

jeromekelleher commented Jul 13, 2023

Uh oh!

Uh oh!

petrelharp commented Mar 16, 2021 •

edited

Loading

codecov bot commented Mar 16, 2021 •

edited

Loading

brieuclehmann commented May 29, 2021 •

edited

Loading

jeromekelleher commented Jun 4, 2021 •

edited

Loading