divmat tree by tree #2736

jeromekelleher · 2023-04-12T22:51:15Z

This is a simple implementation of the divergence matrix, based on the infrastructure developed in
#2710.

The idea is that we do the very straighforward and naive thing of going pairwise through the samples,
but compute the MRCAs using a constant time algorithm. It's slightly faster than the current AVL
tree based approach in #2710, while using far less memory. Another advantage is that this implementation
is simple to parallelise by adding left and right coordinate arguments
and taking advantage of the new quick seeking in #2661. I think this is probably
more important in practise now than good algorithms, as we'll need to throw lots of threads
at these calculations whatever happens.

I think we'll probably want to keep this implementation around as an option in the long run anyway,
since it's likely to be faster than more complex algorithms for small n, and it uses very little
memory (which may also be important in practise).

So, I vote we push ahead with fleshing out the API for this?

jeromekelleher · 2023-04-17T10:10:38Z

This is ready for a look @petrelharp - most of the stuff is implemented I think, except for the samples argument (which is trivial). The threading stuff is mostly not profiled, so I don't know how well it performs in practise (the basic "split the genome by tree" one works well though). I'm making threading stuff reasonably general with the goal of applying it to the stats API later - all we'd need in principle is a way of limiting the general stats calculations to a contiguous slice of the genome, and it would then mostly apply directly.

Questions:

This implementation is coming up with a different answer when we have trees with multiple roots. Do we special-case to a slower implementation that finds the local root when the MRCA is null?
Should we call this branch_divergence_matrix for clarity, or should we also think about implementing the "site" version of this also (not sure how much work that is though)?

jeromekelleher · 2023-04-22T10:14:25Z

I think this is more-or-less done now @petrelharp - the only thing I'm not sure about is the
semantics for multiroot trees. Shouldn't the divergence between disconnected subtrees be
infinity, rather than the sum of the local distances to root?

I guess that would cause problems for tree sequences that contain missing data though (say
a snipped out centromere), as the divergence matrix would just be infinity everywhere then.

codecov · 2023-04-22T13:48:22Z

Codecov Report

Merging #2736 (da0ea6e) into main (3d4fc51) will increase coverage by 0.01%.
The diff coverage is 91.21%.

❗ Current head da0ea6e differs from pull request most recent head 7b11fae. Consider uploading reports for the commit 7b11fae to get more accurate results

@@            Coverage Diff             @@
##             main    #2736      +/-   ##
==========================================
+ Coverage   89.83%   89.84%   +0.01%     
==========================================
  Files          30       30              
  Lines       28624    29026     +402     
  Branches     5590     5669      +79     
==========================================
+ Hits        25713    26079     +366     
- Misses       1655     1676      +21     
- Partials     1256     1271      +15

Flag	Coverage Δ
c-tests	`86.30% <88.99%> (+0.05%)`	⬆️
lwt-tests	`80.13% <ø> (ø)`
python-c-tests	`67.04% <53.00%> (-0.10%)`	⬇️
python-tests	`99.01% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
c/tskit/trees.c	`90.44% <88.99%> (-0.16%)`	⬇️
python/_tskitmodule.c	`88.40% <96.00%> (+0.05%)`	⬆️
python/tskit/trees.py	`98.65% <100.00%> (+0.03%)`	⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d4fc51...7b11fae. Read the comment docs.

jeromekelleher · 2023-04-22T22:32:41Z

Tests failing because of #2745

jeromekelleher · 2023-04-23T21:12:32Z

Aha, turns out there isn't a problem at all, it's just a subtle numerical precision problem around 0.

Ready for a look now I think!

petrelharp · 2023-05-04T17:03:36Z

Notes from the meeting now:

mode="site": we should add the mode argument, and if "site" it should do just the same thing, but measure branch lengths by counting the number of mutations on them.
individuals: we'd like to obtain the indiv x indiv matrix, so should add this argument, but in a later PR? (And, note that the easy way to do this is postprocessed, by numpy; however, the indiv-by-indiv matrix is 4x smaller, so if we did push the operation down all the way we'd have a not-insignificant reduction in memory.)
samples or sample_sets? (tldr; proposal is keep API as-is): Other stats methods use sample_sets; we could follow this, and special-case so that if sample sets is a flat list of IDs then it computes the n x n matrix between those samples. However, we think maybe this is the wrong way to go, because if the API follows the other stats like this, people will want to use this method like sample_sets = [ts.samples(population=0), ts.samples(population=1)], which would be horribly inefficient if there's lots of samples per population and we're just getting the reduced matrix by numpy as mentioned above. Also, we can't think of use cases besides (a) samples = <list of sample IDs>, (b) samples = <list of individuals>, and (c) samples = <list of large groups of samples, eg, populations>; the proposal is that people doing (c) would use ts.divergence( ) instead of this function, even though that does something slightly different in the mode="site" case. (Yes, this will lead to confusion, but confusion is inevitable.)

jeromekelleher · 2023-05-04T18:37:03Z

I think having sample_sets, samples and individuals as mutually exclusive parameters that do what we want would be fine? Specifying a list of n singleton sets is quite annoying and ineffecient, and having another parameter to me is better than overloading sample_sets.

So, keep the API as is for now, but ultimately implement following the standard sample sets approach in C, and then add support in python for individuals and sample_sets?

jeromekelleher · 2023-06-21T13:16:55Z

How does this look as a basic definition of the site-mode version @petrelharp? If we agree on this as what we're trying to compute, I'll have a think about how to do things a bit more efficiently (maintaining cumulative mutation counts per node should be possible so that we can use the efficient MRCA calculation, but will involve writing a full-fat incremental algorithm).

def rootward_path(tree, u, v):
    while u != v:
        yield u
        u = tree.parent(u)


def site_divergence_matrix_naive(ts, windows=None, samples=None):
    windows_specified = windows is not None
    windows = [0, ts.sequence_length] if windows is None else windows
    num_windows = len(windows) - 1
    samples = ts.samples() if samples is None else samples

    n = len(samples)
    D = np.zeros((num_windows, n, n))
    tree = tskit.Tree(ts)
    for i in range(num_windows):
        left = windows[i]
        right = windows[i + 1]
        tree.seek(left)
        # Iterate over the trees in this window
        while tree.interval.left < right and tree.index != -1:
            span_left = max(tree.interval.left, left)
            span_right = min(tree.interval.right, right)
            mutations_per_node = collections.Counter()
            for site in tree.sites():
                if span_left <= site.position < span_right:
                    for mutation in site.mutations:
                        mutations_per_node[mutation.node] += 1
            for j in range(n):
                u = samples[j]
                for k in range(j + 1, n):
                    v = samples[k]
                    w = tree.mrca(u, v)
                    if w != tskit.NULL:
                        wu = w
                        wv = w
                    else:
                        wu = local_root(tree, u)
                        wv = local_root(tree, v)
                    du = sum(mutations_per_node[x] for x in rootward_path(tree, u, wu))
                    dv = sum(mutations_per_node[x] for x in rootward_path(tree, v, wv))
                    # NOTE: we're just accumulating the raw mutation counts, not
                    # multiplying by span
                    D[i, j, k] += du + dv
            tree.next()
        # Fill out symmetric triangle in the matrix
        for j in range(n):
            for k in range(j + 1, n):
                D[i, k, j] = D[i, j, k]
    if not windows_specified:
        D = D[0]
    return D

jeromekelleher · 2023-07-06T09:17:50Z

I pushed through the changes here for the site mode divmat @petrelharp, and I think we're almost API complete enough to go for the initial merge.

Annoyingly, the basic tree-by-tree site mode version that I have here is much slower than I thought it would be, and we need a different algorithm. In some quick testing, we currently have the branch mode divmat is at least 10X faster than the general-stat based method, whereas the site mode is at least 10X slower.

I don't really know why this is exactly - it might just be that doing the MRCAs by traversal really is that much slower than using Schieber-Vishkin. But it's clear we'll need to do some more work here.

What I'd suggest is that we fill out the testing here on this much, to make sure we're all happy that this really is what we want to compute, and then revisit after this PR is merged, with a better algorithm for site-mode. Are you happy with this? Is the current API a good forwards-compatible starting point (pending the change of default mode to "stat")?

I'll go ahead and finalise the testing if so.

python/tests/test_divmat.py

c/tskit/trees.c

petrelharp

I see no issues - LGTM!

jeromekelleher · 2023-07-06T14:06:52Z

OK great, I'll ping you when it's ready for final review.

jeromekelleher · 2023-07-07T11:12:09Z

I think this is ready for a final look and merge. Would you mind having a quick scan please @benjeffery?

benjeffery

Looks great!
I note that numba is mentioned but not used - was the intent to use it in the test suite?

benjeffery · 2023-07-10T15:54:02Z

c/tskit/trees.c

+    }
+
+    if ((options & TSK_STAT_POLARISED) || (options & TSK_STAT_SPAN_NORMALISE)) {
+        /* TODO better error */


Was this a TODO for this PR or future?

I've added the specific errors.

jeromekelleher · 2023-07-11T08:13:01Z

I note that numba is mentioned but not used - was the intent to use it in the test suite?

Well spotted. It's handy to have a version of this code lying around that you can run numba on directly, but I don't think we need to bother with adding numba support here for functions in the test suite.

I've added a comment to clarify

jeromekelleher · 2023-07-11T08:46:11Z

I've tied up a few loose ends, this is ready to go now.

Also opened #2791 to track the missing support for string-window specs.

Implement the basic version of the divergence matrix operation using tree-by-tree algorithms, and provide interface for parallelising along the genome.

jeromekelleher force-pushed the divmat-tree-by-tree branch 2 times, most recently from f274999 to 8d7a42f Compare April 16, 2023 23:18

jeromekelleher force-pushed the divmat-tree-by-tree branch from 8d7a42f to e8a3e0d Compare April 21, 2023 23:04

jeromekelleher mentioned this pull request Apr 22, 2023

Error in handling final empty tree in stats API #2745

Closed

jeromekelleher requested a review from petrelharp April 23, 2023 21:33

jeromekelleher marked this pull request as ready for review April 23, 2023 21:33

jeromekelleher force-pushed the divmat-tree-by-tree branch 2 times, most recently from e323e05 to cf5269f Compare June 21, 2023 13:08

jeromekelleher force-pushed the divmat-tree-by-tree branch from cf5269f to a5f9756 Compare July 5, 2023 15:28

petrelharp reviewed Jul 6, 2023

View reviewed changes

python/tests/test_divmat.py Outdated Show resolved Hide resolved

petrelharp reviewed Jul 6, 2023

View reviewed changes

c/tskit/trees.c Show resolved Hide resolved

petrelharp approved these changes Jul 6, 2023

View reviewed changes

jeromekelleher force-pushed the divmat-tree-by-tree branch from da0ea6e to 7b11fae Compare July 7, 2023 10:36

jeromekelleher added this to the Python 0.5.5 milestone Jul 7, 2023

This was referenced Jul 7, 2023

divergence matrix #201

Closed

Memory consumption of tree sequence statistics #647

Closed

Improved algorithm for site-mode divergence matrix #2779

Closed

Add support for individuals to divergence_matrix #2780

Open

Document divergence_matrix #2781

Open

jeromekelleher force-pushed the divmat-tree-by-tree branch from 7b11fae to 741164e Compare July 7, 2023 11:12

This was referenced Jul 7, 2023

new (?) statistic to efficiently calculate shared times between every pair of samples at every tree #1619

Closed

Add parallelism to stats API #2784

Open

Divergence matrix #2710

Closed

benjeffery approved these changes Jul 10, 2023

View reviewed changes

jeromekelleher force-pushed the divmat-tree-by-tree branch from 741164e to 21be777 Compare July 11, 2023 08:45

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 11, 2023

Divergence matrix tree-by-tree algorithms

f84f5e3

Implement the basic version of the divergence matrix operation using tree-by-tree algorithms, and provide interface for parallelising along the genome.

jeromekelleher force-pushed the divmat-tree-by-tree branch from 21be777 to f84f5e3 Compare July 11, 2023 09:19

mergify bot merged commit a3b095f into tskit-dev:main Jul 11, 2023
21 checks passed

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

divmat tree by tree #2736

divmat tree by tree #2736

jeromekelleher commented Apr 12, 2023

jeromekelleher commented Apr 17, 2023

jeromekelleher commented Apr 22, 2023

codecov bot commented Apr 22, 2023 •

edited

Loading

jeromekelleher commented Apr 22, 2023

jeromekelleher commented Apr 23, 2023

petrelharp commented May 4, 2023

jeromekelleher commented May 4, 2023

jeromekelleher commented Jun 21, 2023

jeromekelleher commented Jul 6, 2023 •

edited

Loading

petrelharp left a comment

jeromekelleher commented Jul 6, 2023

jeromekelleher commented Jul 7, 2023

benjeffery left a comment

benjeffery Jul 10, 2023

jeromekelleher Jul 11, 2023

jeromekelleher commented Jul 11, 2023

jeromekelleher commented Jul 11, 2023

divmat tree by tree #2736

divmat tree by tree #2736

Conversation

jeromekelleher commented Apr 12, 2023

jeromekelleher commented Apr 17, 2023

jeromekelleher commented Apr 22, 2023

codecov bot commented Apr 22, 2023 • edited Loading

Codecov Report

jeromekelleher commented Apr 22, 2023

jeromekelleher commented Apr 23, 2023

petrelharp commented May 4, 2023

jeromekelleher commented May 4, 2023

jeromekelleher commented Jun 21, 2023

jeromekelleher commented Jul 6, 2023 • edited Loading

petrelharp left a comment

Choose a reason for hiding this comment

jeromekelleher commented Jul 6, 2023

jeromekelleher commented Jul 7, 2023

benjeffery left a comment

Choose a reason for hiding this comment

benjeffery Jul 10, 2023

Choose a reason for hiding this comment

jeromekelleher Jul 11, 2023

Choose a reason for hiding this comment

jeromekelleher commented Jul 11, 2023

jeromekelleher commented Jul 11, 2023

codecov bot commented Apr 22, 2023 •

edited

Loading

jeromekelleher commented Jul 6, 2023 •

edited

Loading