randomized svd draft #3008

Draft · hanbin973 wants to merge 23 commits into main
Conversation

hanbin973 (Author)

Description

A draft of randomized principal component analysis (PCA) using TreeSequence.genetic_relatedness_vector. The implementation uses scipy.sparse, which should eventually be removed. This part of the code is only used when collapsing a (num samples × num samples) GRM into a (num individuals × num individuals) matrix, so it will not be difficult to replace with pure numpy.

The API was partially taken from scikit-learn.

To add some details, iterated_power is the number of power iterations in the range finder of the randomized algorithm. The error of the SVD decreases exponentially as a function of this number. The effect of power iteration is pronounced when the eigenspectrum of the matrix decays slowly, which seems to be the case for tree sequence GRMs in my experience. (A sketch of the range finder follows.)
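For context, a minimal sketch of a randomized range finder with power iteration, in the spirit of Halko et al. (2011); this is an illustration with made-up names, not the PR's implementation:

import numpy as np

def rand_pow_range_finder(matvec, dim, rank, iterated_power, rng):
    # matvec(X) computes A @ X for a symmetric operator A (e.g. a GRM)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, rank)))  # random test basis
    for _ in range(iterated_power):
        Q, _ = np.linalg.qr(matvec(Q))  # one power pass, then re-orthonormalise
    return Q  # approximate orthonormal basis for the range of A

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 100))
A = A @ A.T  # symmetric PSD stand-in for a GRM
Q = rand_pow_range_finder(lambda X: A @ X, 100, rank=12, iterated_power=5, rng=rng)
approx_eigvals = np.linalg.eigvalsh(Q.T @ A @ Q)[::-1]  # small projected problem

Each extra power pass costs one operator-matrix product but sharpens Q when the spectrum decays slowly.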

indices specifies the individuals to be included in the PCA, although decreasing the number of individuals does not meaningfully reduce the amount of computation.

@hanbin973 (Author)

@petrelharp Here's the code.


codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.07%. Comparing base (76ab046) to head (e81db15).
Report is 44 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3008      +/-   ##
==========================================
- Coverage   89.82%   87.07%   -2.75%     
==========================================
  Files          29       11      -18     
  Lines       31986    24666    -7320     
  Branches     6192     4556    -1636     
==========================================
- Hits        28730    21478    -7252     
+ Misses       1859     1824      -35     
+ Partials     1397     1364      -33     
Flag            Coverage Δ
c-tests         86.69% <ø> (ø)
lwt-tests       80.78% <ø> (ø)
python-c-tests  89.05% <ø> (ø)
python-tests    ?

Flags with carried forward coverage won't be shown.

see 18 files with indirect coverage changes

Comment on lines 8711 to 8715
x = individual_idx_sparray(ts.num_individuals, cols).dot(x)
x = sample_individual_sparray(ts).dot(x)
x = ts.genetic_relatedness_vector(W=x, windows=windows, mode="branch", centre=False)
x = sample_individual_sparray(ts).T.dot(x)
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
@petrelharp (Contributor) commented Oct 4, 2024:

I think this assumes that all individuals' nodes are samples. Note that we can use the nodes argument to genetic_relatedness_vector to get an arbitrary list of (possibly non-sample) nodes; why not just use that?

So, I think we can do something like this:

ij = np.vstack([[n, k] for k, i in enumerate(individuals) for n in self.individual(i).nodes])
sample_list = ij[:, 0]
indiv_index = ij[:, 1]
x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)
x = np.bincount(indiv_index, x)

This also gets rid of the scipy.sparse.
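A toy version of that indexing pattern (stand-ins only; nodes_of plays the role of self.individual(i).nodes):

import numpy as np

nodes_of = {0: [0, 1], 1: [2, 3], 2: [4, 5]}  # two nodes per (diploid) individual
individuals = [0, 2]  # individuals to include

ij = np.vstack([[n, k] for k, i in enumerate(individuals) for n in nodes_of[i]])
sample_list = ij[:, 0]  # node IDs to pass via the nodes argument: [0 1 4 5]
indiv_index = ij[:, 1]  # output row each node contributes to: [0 0 1 1]

x = np.array([1.0, 2.0, 3.0, 4.0])  # per-node values (one weight column)
print(np.bincount(indiv_index, weights=x))  # per-individual sums: [3. 7.]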

@hanbin973 (Author) replied:

x = ts.genetic_relatedness_vector(W=x, ..., nodes=sample_list)

should instead be

x = ts.genetic_relatedness_vector(W=x[indiv_index], ..., nodes=sample_list)

to expand the array over individuals to an array over nodes, I think?

@petrelharp (Contributor) replied:

hah, yes - good catch!

@petrelharp (Contributor)

This looks great! Very elegant. I think probably we ought to include a samples argument, though? For consistency, but also since the tree sequence represents phased data, and so it's actually informative to look at the PCs of maternally- and paternally-inherited chromosomes separately.

So, how about the signature is like

def pca(samples=None, individuals=None, ...)

and:

  • the default is equivalent to samples=ts.samples(), individuals=None
  • you can't have both samples and individuals specified
  • if individuals is a list of individual IDs then it does as in the code currently
  • otherwise, it just skips the "sum over individuals" step

Note that we could be getting PCs for non-sample nodes (since individual's nodes need not be samples); I haven't thought through whether the values you get are correct or informative. My guess is that maybe they are? But we need a "user beware" note for this?
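A rough sketch of that dispatch (argument handling only, using the names proposed above; not the final API):

def resolve_pca_nodes(ts, samples=None, individuals=None):
    if samples is not None and individuals is not None:
        raise ValueError("specify at most one of samples and individuals")
    if individuals is not None:
        # individual IDs -> their nodes; results are summed per individual later
        return [n for i in individuals for n in ts.individual(i).nodes]
    # default: samples=ts.samples(); the "sum over individuals" step is skipped
    return ts.samples() if samples is None else samples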

@petrelharp petrelharp marked this pull request as draft October 4, 2024 01:46
x = individual_idx_sparray(ts.num_individuals, rows).T.dot(x)
x = self.genetic_relatedness_vector(W=x[sample_individuals], windows=windows, mode="branch", centre=False, nodes=samples)
bincount_fn = lambda w: np.bincount(sample_individuals, w)
x = np.apply_along_axis(bincount_fn, axis=0, arr=x) # I think it should be axis=1, but axis=0 gives the correct values why?
@hanbin973 (Author) replied:

The matvec is sometimes GRM * matrix, so x is often a matrix rather than a vector. np.bincount only works with 1-dimensional weights, so I used np.apply_along_axis and a lambda to vectorize np.bincount.
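For illustration, a toy version of that vectorisation (shapes assumed: one row per node, one column per projection vector):

import numpy as np

sample_individuals = np.array([0, 0, 1, 1, 2, 2])  # node -> individual map
x = np.arange(12, dtype=float).reshape(6, 2)  # (num_nodes, num_vectors)

# np.bincount takes only 1-D weights; with axis=0, np.apply_along_axis hands
# each *column* of arr to the function, collapsing every vector separately
bincount_fn = lambda w: np.bincount(sample_individuals, weights=w)
out = np.apply_along_axis(bincount_fn, axis=0, arr=x)
print(out.shape)  # (3, 2): one row per individual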

@hanbin973 (Author) added:

The comment I left after the # looks like mostly a convention issue in that function. When axis=0, the columns are retrieved from the array one by one; when axis=1, the rows are.

@petrelharp (Contributor) replied:

Agree, that seems confusing but maybe makes sense after all?

individuals: np.ndarray = None,
centre: bool = True,
windows: list = None,
random_state: np.random.Generator = None,
@petrelharp (Contributor) commented:

usually we just pass in a seed; any objections to doing that instead?

@hanbin973 (Author) replied:

I changed the option from random_state to random_seed following msprime.

@petrelharp (Contributor)

Ah, sorry - one more thing - does this work with windows? (It looks like not?)

I think the way to do the windows would be something like

drop_windows = windows is None
if drop_windows:
    windows = [0, self.sequence_length]

# then do stuff; with these windows, genetic_relatedness will always return
# an array whose first dimension is "window", so you can operate on each
# slice separately

if drop_windows:
    # get rid of the first (window) dimension in the output, e.g. out = out[0]

Basically - get it to work in the case where windows are specified (ie not None) and then we can get it to have the right behavior.

@hanbin973 (Author)

A simple test case for the windows feature.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=1e6,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)
seq_length = ts.sequence_length

U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[:,idx]), np.corrcoef(U[1][:,idx], U1[:,idx])

Because of the randomness of the algorithm, the correlation is not exactly 1, though it is very close (around 0.99996).

@hanbin973 (Author)

I just noticed that centre doesn't work with the nodes option. The new commit fixes this problem.

@hanbin973 (Author)

Check results for two windows.

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")
seq_length = 1e6
ts = msprime.sim_ancestry(
    samples={"A": 500, "B": 500},
    sequence_length=seq_length,
    recombination_rate=3e-8,
    demography=demography, 
    random_seed=12)

# for individuals
U, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(individuals=np.asarray([i.id for i in ts.individuals()]), iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

# for nodes
U, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2, seq_length])
U0, _ = ts.pca(iterated_power=5, random_seed=1, windows=[0, seq_length/2])
U1, _ = ts.pca(iterated_power=5, random_seed=1, windows=[seq_length/2, seq_length])

idx = 0 # idx is the idx-th principal component
# correlation instead of allclose because PCA is rotation symmetric
np.corrcoef(U[0][:,idx], U0[0][:,idx]), np.corrcoef(U[1][:,idx], U1[0][:,idx])

@@ -8593,138 +8593,188 @@ def genetic_relatedness_vector(
return out

def pca(
self,
@petrelharp (Contributor) commented:

I rearranged these to better match other methods (e.g., windows always comes first, so I put it first, after n_components)

API partially adopted from `scikit-learn`:
https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html
self,
n_components: int = 10,
@petrelharp (Contributor) commented:

perhaps we should not have a default?

def _rand_pow_range_finder(
operator: Callable,
@petrelharp (Contributor) commented:

linting complains about Callable for some reason

x = np.apply_along_axis(bincount_fn, axis=0, arr=x)
x = x - x.mean(axis=0) if centre else x  # centering within index in cols

return x

def _genetic_relatedness_vector_node(
@petrelharp (Contributor) commented:

same: automatic linting

@petrelharp (Contributor)

Okay; here I've made a good start at the tests. I think everything is working fine; the tests are not passing because (I think) of numerical tolerance. I could just set the numerical tolerance to, like, 1e-4 and they'd pass, but I think this is flagging a bigger issue: how do we tell we're getting good answers?

I ask because currently the tests pass (at default tolerances) for small numbers of samples but not for 30 samples; if I increase iterated_power then they do pass. So the current defaults seem okay for 30 samples, but I worry that at 10,000 samples they might be very bad.

If u, s is one of the output vector-value pairs, can we compare s * u to W @ u and translate the result into some measure of tolerance? The not-very-good options I can think of are

  1. add a tolerance argument and we iterate until this tolerance is reached
  2. return an estimate of error; perhaps we should return some kind of structured output that contains information about convergence as well?

We'd like this not to be a can of worms; I think our goal is to have something that is good enough, and forwards-compatible with an improved method in the future.
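For what it's worth, a sketch of that check on a toy operator (not the GRM; a relative residual, so the scale of W cancels):

import numpy as np

def relative_residual(matvec, u, s):
    # ||W @ u - s * u|| / s is small when (s, u) is an accurate pair
    return np.linalg.norm(matvec(u) - s * u) / s

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 30))
W = A @ A.T  # symmetric PSD stand-in
vals, vecs = np.linalg.eigh(W)
u, s = vecs[:, -1], vals[-1]  # exact top pair, so the residual is ~1e-15
print(relative_residual(lambda v: W @ v, u, s))

A tolerance argument could then mean: iterate until this residual falls below the threshold for every requested component.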

Notes:

  • For testing, I use np.linalg.svd, and (of course) had to mess around so that we'd correctly deal with the indeterminacy of sign and also the indeterminacy of ordering when singular values are equal.
  • Perhaps we should call it num_components rather than n_components to match elsewhere?
  • I did some rearranging of how the operator in _rand_svd was defined to avoid lint complaining (but was unsuccessful, so I inserted the NOQA)
  • Some of the changes were due to automatic linting; apologies for the noise there.

TODO:

  • add tests that include samples and individuals arguments
  • test on some weirder tree sequences (that's easy)

@jeromekelleher (Member)

Just a quick note that I'd be very much in favour of returning a dataclass here rather than a tuple so that the option of returning more information about convergence etc is open.

@hanbin973 (Author)

There's an adaptive rangefinder algorithm described in Halko et al. (https://arxiv.org/pdf/0909.4061, Algorithm 4.2). I don't see it implemented in scikit-learn (https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD). Halko et al. also give an error estimation formula with an overhead similar to the main routine.

I like Jerome's idea of returning a class instead of a bare result. There's an intermediate matrix Q (the approximate range of the original matrix) that accounts for most of the computation, and storing it is useful in many cases. For example, one can apply additional power iterations to Q, rather than starting from scratch, to improve precision.
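A sketch of that reuse (hypothetical helper; the point is just that Q carries the expensive state between calls):

import numpy as np

def extend_sketch(matvec, Q, extra_power):
    # continue the power iteration from a stored sketch Q: each pass costs one
    # operator-matrix product plus a QR re-orthonormalisation
    for _ in range(extra_power):
        Q, _ = np.linalg.qr(matvec(Q))
    return Q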

@hanbin973 (Author)

Not a class object yet, but I added range_sketch to the input/output.

@jeromekelleher (Member)

Re the result object, I'd imagined something like

@dataclasses.dataclass
class PcaResult:
    descriptive_name1: np.ndarray # Or whatever type hints we can get to work
    descriptive_name2...

@hanbin973 (Author)

Now, pca() returns a dataclass of the following form:

@dataclass
class PCAResult:
    U: np.ndarray
    D: np.ndarray
    Q: np.ndarray
    E: np.ndarray

U and D are as before. Q is the range sketch matrix used as the approximate orthonormal basis of the GRM; computing it is the only expensive part of the algorithm, involving the GRM * matrix operations. E holds the error bounds for the singular values. Both Q and E will have different values for each window, if windows are present.

A user can continuously improve their estimate through Q. pca now has a range_sketch: np.ndarray = None option that accepts Q from a previous round of pca. This can be done like:

pca_result = ts.pca( ... )
pca_result_round_2 = ts.pca( ..., range_sketch = pca_result.Q, ...)

If the first round did q power iterations and the second round does p additional power iterations, the result of the second round reflects q+p total iterations. By adding power iterations in successive rounds, one can improve accuracy without rerunning the whole process from scratch.

@jeromekelleher (Member) left a comment:

This looks great, but I would suggest we break the nested functions out to the module level rather than embedding them in the TreeSequence class. The function is currently too long, and it's not clear what needs to be embedded within it because it uses the enclosing namespace, versus what's in there just because. It would be nice to be able to test the pieces of this individually, and putting them at the module level will make that possible.

Certainly the return class should be defined at the module level and added to the Sphinx documentation so that it can be linked to.

@@ -8637,8 +8637,11 @@ def pca(
be automatically generated. Valid random seeds must be between 1 and
:math:`2^{32} - 1`.
:param np.ndarray range_sketch: Sketch matrix for each window. Default is None.
:return: A tuple (U, D, Q) of ndarrays, with the principal component loadings in U
and the principal values in D. Q is the range sketch array for each window.
:return: A class object with attributes U, D, Q and E.
@jeromekelleher (Member) commented:

I know this is probably heresy, but could we put names on these variables? Like:

@dataclasses.dataclass
class PcaResult:
    """
    The result of a call to TreeSequence.pca(), capturing the output values
    and algorithm convergence details.
    """
    loadings: np.ndarray
    """
    The principal component loadings.
    """
    values: np.ndarray
    """
    The principal component values.
    """
    range_sketch: np.ndarray
    """
    The range sketch. See XXX for details?
    """
    error_bound: np.ndarray
    """
    ...
    """

Then, the return would be

:return: An instance of :class:`PcaResult` encapsulating the principal components, loadings and algorithm
    convergence details.

(A "class object" is something specific, which this isn't)

@hanbin973 (Author) commented Oct 17, 2024:

Sounds good. Where should I place the functions and the new class? At the end of trees.py?

@jeromekelleher (Member) replied:

Put them at the end of trees.py for now. We should probably make a new module for some of this statsy stuff, but let's not bother for now.

@petrelharp (Contributor) commented:

Note: the functions should not be public, I would think. (the class, yes)

@hanbin973 (Author) replied:

Is there a rationale behind these choices? I wonder if it has something to do with maintenance, etc.

@jeromekelleher (Member) replied:

Essentially maintenance, yes, but also freedom to change things in the future

@jeromekelleher (Member) left a comment:

Minor nitpick about code organisation!

error_bound = D[-1] * (1 + error_factor)
return U[:, :rank], D[:rank], V[:rank], Q, error_bound

def _genetic_relatedness_vector_individual(
@jeromekelleher (Member) commented:

These are best seen as methods of the TreeSequence class, because their first argument is a tree sequence. I'd refactor as

    def _genetic_relatedness_vector_individual(self, arr, indices, mode...):
          ...
    def _genetic_relatedness_vector_node(self, ...)
         ...

@hanbin973 (Author) replied:

They are now tree sequence methods. SVD functions were moved into the pca() function. Only the class definition is left outside.
