randomized svd draft #3008
base: main
Conversation
@petrelharp Here's the code.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%    -2.75%
==========================================
  Files          29       11      -18
  Lines       31986    24666    -7320
  Branches     6192     4556    -1636
==========================================
- Hits        28730    21478    -7252
+ Misses       1859     1824      -35
+ Partials     1397     1364      -33
python/tskit/trees.py
Outdated
x = individual_idx_sparray(ts.num_individuals, cols).dot(x)
x = sample_individual_sparray(ts).dot(x)
x = ts.genetic_relatedness_vector(W=x, windows=windows, mode="branch", centre=False)
x = sample_individual_sparray(ts).T.dot(x)
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
I think this assumes that all individuals' nodes are samples. Note that we can use the nodes argument to genetic_relatedness_vector to get an arbitrary list of (possibly non-sample) nodes; why not just use that? So, I think we can do something like this:
ij = np.vstack([[n, k] for k, i in enumerate(individuals) for n in self.individual(i).nodes])
sample_list = ij[:, 0]
indiv_index = ij[:, 1]
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
x = np.bincount(indiv_index, x)
This also gets rid of the scipy.sparse.
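The bincount collapse in the suggestion above can be sanity-checked with a toy example (all values below are made up for illustration):

```python
import numpy as np

# Hypothetical setup: 4 nodes belonging to 2 individuals (2 nodes each).
indiv_index = np.array([0, 0, 1, 1])           # individual index for each node
node_values = np.array([1.0, 2.0, 3.0, 4.0])   # per-node statistic values

# Sum node-level values into individual-level values.
indiv_values = np.bincount(indiv_index, weights=node_values)
print(indiv_values)  # [3. 7.]
```

The `weights` argument is what makes `np.bincount` act as a sparse "sum by group" here, replacing the indicator-matrix dot products.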
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
should instead be
x = ts.genetic_relatedness_vector(W=x[indiv_index], ..., nodes=sample_list)
to expand the array of individuals to an array of nodes, I think?
hah, yes - good catch!
This looks great! Very elegant. I think probably we ought to include a So, how about the signature is like
and:
Note that we could be getting PCs for non-sample nodes (since individuals' nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?
python/tskit/trees.py
Outdated
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
x = self.genetic_relatedness_vector(W=x[sample_individuals], windows=windows, mode="branch", centre=False, nodes=samples)
bincount_fn = lambda w: np.bincount(sample_individuals, w)
x = np.apply_along_axis(bincount_fn, axis=0, arr=x)  # I think it should be axis=1, but axis=0 gives the correct values why?
The matvec is sometimes GRM * matrix, so x is often a matrix rather than a vector. np.bincount only works for 1-dimensional weights, so I used np.apply_along_axis and a lambda to vectorize np.bincount.
The comment I left after # looks like mostly a convention issue in that function. When axis=0, the columns are separately retrieved from the array; when axis=1, the rows are retrieved.
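The axis convention can be verified on a small made-up array: with axis=0, each 1-D slice along axis 0 (i.e. each column) is handed to the function, which is exactly what the sample-to-individual collapse needs:

```python
import numpy as np

# Toy example: 4 sample rows, 2 component columns; samples 0,1 belong to
# individual 0 and samples 2,3 to individual 1 (indices are made up).
sample_individuals = np.array([0, 0, 1, 1])
x = np.arange(8.0).reshape(4, 2)  # shape (num_samples, num_components)

bincount_fn = lambda w: np.bincount(sample_individuals, weights=w)
# axis=0 passes each column (a 1-D slice *along* axis 0) to bincount_fn,
# collapsing samples to individuals separately within each component.
y = np.apply_along_axis(bincount_fn, axis=0, arr=x)
print(y)  # [[ 2.  4.]
          #  [10. 12.]]
```

So the "confusing" behaviour is just that `axis` names the axis being consumed, not the axis being iterated over.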
Agree, that seems confusing but maybe makes sense after all?
python/tskit/trees.py
Outdated
individuals: np.ndarray = None,
centre: bool = True,
windows: list = None,
random_state: np.random.Generator = None,
Usually we just pass in a seed; any objections to doing that instead?
I changed the option from random_state to random_seed, following msprime.
Ah, sorry - one more thing - does this work with I think the way to do the windows would be something like
Basically - get it to work in the case where
A simple test case for the
Because of the randomness of the algorithm, the correlation is not exactly 1, though it's very close (around 0.99995623).
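A minimal standalone version of such a correlation check, using a random symmetric matrix as a stand-in for a tree-sequence GRM (sizes, iteration counts, and thresholds here are illustrative, not from the PR):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a GRM: a random symmetric positive semi-definite matrix.
B = rng.normal(size=(50, 50))
grm = B @ B.T

# Exact leading eigenvector from a full decomposition.
_, vecs = np.linalg.eigh(grm)
exact = vecs[:, -1]

# Randomized approximation: power iterations on a random sketch,
# then a small Rayleigh-Ritz eigenproblem on the sketched subspace.
Y = rng.normal(size=(50, 10))
for _ in range(7):
    Y, _ = np.linalg.qr(grm @ Y)
U, _, _ = np.linalg.svd(Y.T @ grm @ Y)
approx = Y @ U[:, 0]

# Eigenvectors are defined only up to sign, so compare |correlation|;
# it should be very close to, but not exactly, 1.
corr = abs(np.corrcoef(exact, approx)[0, 1])
```

This mirrors the structure of the proposed test: compare the randomized result against a direct decomposition and assert the correlation is near 1 rather than exact.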
I just noticed that
Check results for two windows.
python/tskit/trees.py
Outdated
@@ -8593,138 +8593,188 @@ def genetic_relatedness_vector(
        return out

    def pca(
        self,
I rearranged these to better match other methods (e.g., windows always comes first, so I had it first after n_components).
python/tskit/trees.py
Outdated
API partially adopted from `scikit-learn`:
https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html
self,
n_components: int = 10,
perhaps we should not have a default?
python/tskit/trees.py
Outdated
def _rand_pow_range_finder(
    operator: Callable,
linting complains about Callable for some reason
python/tskit/trees.py
Outdated
x = np.apply_along_axis(bincount_fn, axis=0, arr=x)
x = x - x.mean(axis=0) if centre else x  # centering within index in cols
x = x - x.mean(axis=0) if centre else x  # centering within index in cols

return x

def _genetic_relatedness_vector_node(
same: automatic linting
Okay; here I've made a good start at the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to something like 1e-4 and they'd pass, but I think this is flagging a bigger issue: how do we tell we're getting good answers? I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase
We'd like this to be not a can of worms; I think our goal is to have something that is good enough, and forwards-compatible with an improved method in the future. Notes:
TODO:
Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple, so that the option of returning more information about convergence etc. is open.
There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algo 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD). I like Jerome's idea to return a class instead of the result. There's an intermediate matrix
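For reference, the non-adaptive fixed-rank range finder with power iteration (in the spirit of Halko et al.) can be sketched as below; this is an illustrative reimplementation with made-up names, not the PR's `_rand_pow_range_finder`:

```python
import numpy as np

def rand_power_range_finder(matvec, dim, rank, depth, rng):
    """Orthonormal basis approximating the range of a symmetric operator.

    matvec maps a (dim, k) array to A @ that array; depth is the number of
    power iterations. (Sketch only; argument names are hypothetical.)
    """
    Q = rng.normal(size=(dim, rank))
    for _ in range(depth + 1):
        # Multiply then re-orthonormalize; QR each round keeps this stable.
        Q, _ = np.linalg.qr(matvec(Q))
    return Q

# Usage on a small symmetric PSD test matrix.
rng = np.random.default_rng(1)
B = rng.normal(size=(30, 30))
A = B @ B.T
Q = rand_power_range_finder(lambda x: A @ x, dim=30, rank=5, depth=4, rng=rng)
# Q has orthonormal columns spanning (approximately) the top-5 eigenspace.
```

The adaptive variant (Algo 4.2) instead grows the basis until an a posteriori error estimate falls below tolerance; the fixed-rank version above is the one that matches a `n_components` + `iterated_power` style API.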
Not a class object yet, but added random_sketch to the input/output.
…tion, it omits return in the end of the function
Re the result object, I'd imagined something like

@dataclasses.dataclass
class PcaResult:
    descriptive_name1: np.ndarray  # Or whatever type hints we can get to work
    descriptive_name2...
Now,
A user can continuously improve their estimate through Q.
If the first round did
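That warm-start idea might look like the following sketch, assuming the saved range sketch Q simply replaces the random initial matrix and is refined by further power iterations (the helper name and API here are hypothetical, not the PR's):

```python
import numpy as np

def refine_sketch(matvec, Q, depth):
    # Extra power iterations starting from an existing orthonormal sketch.
    # (Hypothetical helper; the PR's range_sketch handling may differ.)
    for _ in range(depth):
        Q, _ = np.linalg.qr(matvec(Q))
    return Q

rng = np.random.default_rng(0)
B = rng.normal(size=(40, 40))
A = B @ B.T  # stand-in for a GRM

Q0, _ = np.linalg.qr(rng.normal(size=(40, 6)))    # first-round sketch
Q1 = refine_sketch(lambda x: A @ x, Q0, depth=3)  # continued estimate

def residual(Q):
    # Frobenius norm of the part of A missed by projecting onto span(Q).
    return np.linalg.norm(A - Q @ (Q.T @ A))
```

Under this assumption, each extra round can only sharpen the subspace estimate, so passing Q back in lets a user trade more matvecs for more accuracy without restarting.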
This looks great, but I would suggest we break the nested functions out to the module level rather than embedding them in the TreeSequence class. The function is currently too long, and it's not clear what needs to be embedded within the function because it's using the namespace, vs what's in there just because. It would be nice to be able to test the bits of this individually, and putting them at the module level will make that possible.
Certainly the return class should be defined at the module level and added to the Sphinx documentation so that it can be linked to.
@@ -8637,8 +8637,11 @@ def pca(
    be automatically generated. Valid random seeds must be between 1 and
    :math:`2^{32} - 1`.
:param np.ndarray range_sketch: Sketch matrix for each window. Default is None.
:return: A tuple (U, D, Q) of ndarrays, with the principal component loadings in U
    and the principal values in D. Q is the range sketch array for each window.
:return: A class object with attributes U, D, Q and E.
I know this is probably heresy, but could we put names on these variables? Like:

@dataclasses.dataclass
class PcaResult:
    """
    The result of a call to TreeSequence.pca() capturing the output values
    and algorithm convergence details.
    """

    loadings: np.ndarray
    """
    The principal component loadings.
    """

    values: np.ndarray
    """
    The principal component values.
    """

    range_sketch: np.ndarray
    """
    The range sketch. See XXX for details?
    """

    error_bound: np.ndarray
    """
    ...
    """

Then, the return would be

:return: An instance of :class:`PcaResult` encapsulating the principal components, loadings and algorithm convergence details.

(A "class object" is something specific, which this isn't)
Sounds good. Where should I place the functions and the new class? At the end of trees.py?
Put them at the end of trees.py for now. We should probably make a new module for some of this statsy stuff, but let's not bother for now.
Note: the functions should not be public, I would think. (the class, yes)
Is there a rationale behind these choices? I wonder if it has something to do with maintenance etc.
Essentially maintenance, yes, but also freedom to change things in the future
Minor nitpick about code organisation!
python/tskit/trees.py
Outdated
error_bound = D[-1] * (1 + error_factor)
return U[:, :rank], D[:rank], V[:rank], Q, error_bound

def _genetic_relatedness_vector_individual(
These are best seen as methods of the TreeSequence class, because they have the first argument as a tree sequence. I'd refactor as

def _genetic_relatedness_vector_individual(self, arr, indices, mode...):
    ...

def _genetic_relatedness_vector_node(self, ...):
    ...
They are now tree sequence methods. SVD functions were moved into the pca() function. Only the class definition is left outside.
Description
A draft of randomized principal component analysis (PCA) using TreeSequence.genetic_relatedness_vector. The implementation contains scipy.sparse, which should eventually be removed. This part of the code is only used when collapsing a #sample * #sample GRM into a #individual * #individual matrix; therefore, it will not be difficult to replace with pure numpy.
The API was partially taken from scikit-learn.
To add some details: iterated_power is the number of power iterations in the range finder of the randomized algorithm. The error of the SVD decreases exponentially as a function of this number. The effect of power iteration is profound when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience. indices specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.
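The effect of iterated_power on a slowly decaying spectrum can be demonstrated on a synthetic matrix (a made-up stand-in for a GRM; the helper below is illustrative, not the PR's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Symmetric matrix with a slowly decaying eigenvalue spectrum.
n = 60
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
eigvals = 1.0 / np.sqrt(np.arange(1, n + 1))  # slow decay
A = (V * eigvals) @ V.T

def rank_k_error(A, k, iterated_power, seed):
    # Frobenius error of a randomized rank-k projection of A.
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(A.shape[0], k))
    for _ in range(iterated_power + 1):
        Q, _ = np.linalg.qr(A @ Q)
    return np.linalg.norm(A - Q @ (Q.T @ A))

err_q0 = rank_k_error(A, 5, 0, seed=0)  # no power iterations
err_q5 = rank_k_error(A, 5, 5, seed=0)  # five power iterations
```

With the same starting sketch, the error after five power iterations is smaller than with none; the gap is largest precisely when, as here, the spectrum decays slowly.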