Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

divmat tree by tree #2736

Merged
merged 1 commit into from
Jul 11, 2023
Merged

Conversation

jeromekelleher
Copy link
Member

This is a simple implementation of the divergence matrix, based on the infrastructure developed in
#2710.

The idea is that we do the very straighforward and naive thing of going pairwise through the samples,
but compute the MRCAs using a constant time algorithm. It's slightly faster than the current AVL
tree based approach in #2710, while using far less memory. Another advantage is that this implementation
is simple to parallelise by adding left and right coordinate arguments
and taking advantage of the new quick seeking in #2661. I think this is probably
more important in practise now than good algorithms, as we'll need to throw lots of threads
at these calculations whatever happens.

I think we'll probably want to keep this implementation around as an option in the long run anyway,
since it's likely to be faster than more complex algorithms for small n, and it uses very little
memory (which may also be important in practise).

So, I vote we push ahead with fleshing out the API for this?

@jeromekelleher jeromekelleher force-pushed the divmat-tree-by-tree branch 2 times, most recently from f274999 to 8d7a42f Compare April 16, 2023 23:18
@jeromekelleher
Copy link
Member Author

This is ready for a look @petrelharp - most of the stuff is implemented I think, except for the samples argument (which is trivial). The threading stuff is mostly not profiled, so I don't know how well it performs in practise (the basic "split the genome by tree" one works well though). I'm making threading stuff reasonably general with the goal of applying it to the stats API later - all we'd need in principle is a way of limiting the general stats calculations to a contiguous slice of the genome, and it would then mostly apply directly.

Questions:

  • This implementation is coming up with a different answer when we have trees with multiple roots. Do we special-case to a slower implementation that finds the local root when the MRCA is null?
  • Should we call this branch_divergence_matrix for clarity, or should we also think about implementing the "site" version of this also (not sure how much work that is though)?

@jeromekelleher
Copy link
Member Author

I think this is more-or-less done now @petrelharp - the only thing I'm not sure about is the
semantics for multiroot trees. Shouldn't the divergence between disconnected subtrees be
infinity, rather than the sum of the local distances to root?

I guess that would cause problems for tree sequences that contain missing data though (say
a snipped out centromere), as the divergence matrix would just be infinity everywhere then.

@codecov
Copy link

codecov bot commented Apr 22, 2023

Codecov Report

Merging #2736 (da0ea6e) into main (3d4fc51) will increase coverage by 0.01%.
The diff coverage is 91.21%.

❗ Current head da0ea6e differs from pull request most recent head 7b11fae. Consider uploading reports for the commit 7b11fae to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2736      +/-   ##
==========================================
+ Coverage   89.83%   89.84%   +0.01%     
==========================================
  Files          30       30              
  Lines       28624    29026     +402     
  Branches     5590     5669      +79     
==========================================
+ Hits        25713    26079     +366     
- Misses       1655     1676      +21     
- Partials     1256     1271      +15     
Flag Coverage Δ
c-tests 86.30% <88.99%> (+0.05%) ⬆️
lwt-tests 80.13% <ø> (ø)
python-c-tests 67.04% <53.00%> (-0.10%) ⬇️
python-tests 99.01% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
c/tskit/trees.c 90.44% <88.99%> (-0.16%) ⬇️
python/_tskitmodule.c 88.40% <96.00%> (+0.05%) ⬆️
python/tskit/trees.py 98.65% <100.00%> (+0.03%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d4fc51...7b11fae. Read the comment docs.

@jeromekelleher
Copy link
Member Author

Tests failing because of #2745

@jeromekelleher
Copy link
Member Author

Aha, turns out there isn't a problem at all, it's just a subtle numerical precision problem around 0.

Ready for a look now I think!

@jeromekelleher jeromekelleher marked this pull request as ready for review April 23, 2023 21:33
@petrelharp
Copy link
Contributor

Notes from the meeting now:

  • mode="site": we should add the mode argument, and if "site" it should do just the same thing, but measure branch lengths by counting the number of mutations on them.
  • individuals: we'd like to obtain the indiv x indiv matrix, so should add this argument, but in a later PR? (And, note that the easy way to do this is postprocessed, by numpy; however, the indiv-by-indiv matrix is 4x smaller, so if we did push the operation down all the way we'd have a not-insignificant reduction in memory.)
  • samples or sample_sets? (tldr; proposal is keep API as-is): Other stats methods use sample_sets; we could follow this, and special-case so that if sample sets is a flat list of IDs then it computes the n x n matrix between those samples. However, we think maybe this is the wrong way to go, because if the API follows the other stats like this, people will want to use this method like sample_sets = [ts.samples(population=0), ts.samples(population=1)], which would be horribly inefficient if there's lots of samples per population and we're just getting the reduced matrix by numpy as mentioned above. Also, we can't think of use cases besides (a) samples = <list of sample IDs>, (b) samples = <list of individuals>, and (c) samples = <list of large groups of samples, eg, populations>; the proposal is that people doing (c) would use ts.divergence( ) instead of this function, even though that does something slightly different in the mode="site" case. (Yes, this will lead to confusion, but confusion is inevitable.)

@jeromekelleher
Copy link
Member Author

I think having sample_sets, samples and individuals as mutually exclusive parameters that do what we want would be fine? Specifying a list of n singleton sets is quite annoying and ineffecient, and having another parameter to me is better than overloading sample_sets.

So, keep the API as is for now, but ultimately implement following the standard sample sets approach in C, and then add support in python for individuals and sample_sets?

@jeromekelleher jeromekelleher force-pushed the divmat-tree-by-tree branch 2 times, most recently from e323e05 to cf5269f Compare June 21, 2023 13:08
@jeromekelleher
Copy link
Member Author

How does this look as a basic definition of the site-mode version @petrelharp? If we agree on this as what we're trying to compute, I'll have a think about how to do things a bit more efficiently (maintaining cumulative mutation counts per node should be possible so that we can use the efficient MRCA calculation, but will involve writing a full-fat incremental algorithm).

def rootward_path(tree, u, v):
    while u != v:
        yield u
        u = tree.parent(u)


def site_divergence_matrix_naive(ts, windows=None, samples=None):
    windows_specified = windows is not None
    windows = [0, ts.sequence_length] if windows is None else windows
    num_windows = len(windows) - 1
    samples = ts.samples() if samples is None else samples

    n = len(samples)
    D = np.zeros((num_windows, n, n))
    tree = tskit.Tree(ts)
    for i in range(num_windows):
        left = windows[i]
        right = windows[i + 1]
        tree.seek(left)
        # Iterate over the trees in this window
        while tree.interval.left < right and tree.index != -1:
            span_left = max(tree.interval.left, left)
            span_right = min(tree.interval.right, right)
            mutations_per_node = collections.Counter()
            for site in tree.sites():
                if span_left <= site.position < span_right:
                    for mutation in site.mutations:
                        mutations_per_node[mutation.node] += 1
            for j in range(n):
                u = samples[j]
                for k in range(j + 1, n):
                    v = samples[k]
                    w = tree.mrca(u, v)
                    if w != tskit.NULL:
                        wu = w
                        wv = w
                    else:
                        wu = local_root(tree, u)
                        wv = local_root(tree, v)
                    du = sum(mutations_per_node[x] for x in rootward_path(tree, u, wu))
                    dv = sum(mutations_per_node[x] for x in rootward_path(tree, v, wv))
                    # NOTE: we're just accumulating the raw mutation counts, not
                    # multiplying by span
                    D[i, j, k] += du + dv
            tree.next()
        # Fill out symmetric triangle in the matrix
        for j in range(n):
            for k in range(j + 1, n):
                D[i, k, j] = D[i, j, k]
    if not windows_specified:
        D = D[0]
    return D

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Jul 6, 2023

I pushed through the changes here for the site mode divmat @petrelharp, and I think we're almost API complete enough to go for the initial merge.

Annoyingly, the basic tree-by-tree site mode version that I have here is much slower than I thought it would be, and we need a different algorithm. In some quick testing, we currently have the branch mode divmat is at least 10X faster than the general-stat based method, whereas the site mode is at least 10X slower.

I don't really know why this is exactly - it might just be that doing the MRCAs by traversal really is that much slower than using Schieber-Vishkin. But it's clear we'll need to do some more work here.

What I'd suggest is that we fill out the testing here on this much, to make sure we're all happy that this really is what we want to compute, and then revisit after this PR is merged, with a better algorithm for site-mode. Are you happy with this? Is the current API a good forwards-compatible starting point (pending the change of default mode to "stat")?

I'll go ahead and finalise the testing if so.

Copy link
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no issues - LGTM!

@jeromekelleher
Copy link
Member Author

OK great, I'll ping you when it's ready for final review.

@jeromekelleher
Copy link
Member Author

I think this is ready for a final look and merge. Would you mind having a quick scan please @benjeffery?

Copy link
Member

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!
I note that numba is mentioned but not used - was the intent to use it in the test suite?

c/tskit/trees.c Outdated
}

if ((options & TSK_STAT_POLARISED) || (options & TSK_STAT_SPAN_NORMALISE)) {
/* TODO better error */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a TODO for this PR or future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added the specific errors.

@jeromekelleher
Copy link
Member Author

I note that numba is mentioned but not used - was the intent to use it in the test suite?

Well spotted. It's handy to have a version of this code lying around that you can run numba on directly, but I don't think we need to bother with adding numba support here for functions in the test suite.

I've added a comment to clarify

@jeromekelleher
Copy link
Member Author

I've tied up a few loose ends, this is ready to go now.

Also opened #2791 to track the missing support for string-window specs.

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 11, 2023
Implement the basic version of the divergence matrix operation using
tree-by-tree algorithms, and provide interface for parallelising along
the genome.
@mergify mergify bot merged commit a3b095f into tskit-dev:main Jul 11, 2023
21 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jul 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants