Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ts-PCA performance is slow compared scikit-allel #1743

Open
brieuclehmann opened this issue Sep 29, 2021 · 4 comments
Open

ts-PCA performance is slow compared scikit-allel #1743

brieuclehmann opened this issue Sep 29, 2021 · 4 comments
Assignees

Comments

@brieuclehmann
Copy link
Contributor

brieuclehmann commented Sep 29, 2021

Building on #898 and using the 'matrix multiplication' in WIP #1246 (i.e. genetic_relatedness_weighted), we're trying to implement PCA for tskit. This appears to be working 🎉 but is rather slow compared to scikit-allel. See the following code for a small reprex, where scikit-allel is approximately 20 times faster than our current tskit implementation.

import itertools
import time

import allel
import msprime
import numpy as np
import stdpopsim
import tskit
from scipy.sparse.linalg import eigsh
from scipy.sparse.linalg import LinearOperator

# Get realistic simulation parameters
species = stdpopsim.get_species("HomSap")
chrom = species.genome.get_chromosome("chr20")
gm = None
model = species.get_demographic_model("OutOfAfrica_3G09")
engine = stdpopsim.get_engine("msprime")
n_pop = model.num_populations

# Set parameters
seed = 1
n_ind = 64
mult = 0.05
mut_rate = 1e-8
recomb_rate=chrom.recombination_rate
    
n_tot = n_pop * n_ind
n_hap = 2 * n_ind
samples = model.get_samples(n_hap, n_hap, n_hap)

recomb_map = msprime.RecombinationMap.uniform_map(
    chrom.length * mult, recomb_rate
)

contig = stdpopsim.Contig(
    recombination_map=recomb_map, mutation_rate=mut_rate, genetic_map=gm
)

# Simulate tree sequence
ts = engine.simulate(model, contig, samples)

sample_sets = [(2 * i, (2 * i) + 1) for i in range(n_tot)]
W_samples = np.array(
    [[float(u in A) for A in sample_sets] for u in ts.samples()]
)
indexes = [(i, n_tot) for i in range(len(sample_sets))]

def mat_mul_stat(a):
    W = np.c_[W_samples, W_samples @ a]
    return ts.genetic_relatedness_weighted(
        W, indexes=indexes, mode="site", span_normalise=False
    )

# Linear operator
start_time = time.time()
A = LinearOperator((n_tot, n_tot), matvec=mat_mul_stat)
eigval_linop, eigvec_linop = eigsh(A)
linop_pc = (A @ eigvec_linop[:, ::-1]) / np.sqrt(eigval_linop[::-1])
end_time = time.time()
time_linop = end_time - start_time

# Direct genotype matrix
haps = allel.HaplotypeArray(ts.genotype_matrix())
gns = haps.to_genotypes(ploidy=2).to_n_ref()
start_time = time.time()
allel_pc = allel.pca(gns, n_components=6, scaler=None)[0]
end_time = time.time()
time_allel = end_time - start_time

print(time_linop)
# 5.768216848373413
print(time_allel)
# 0.2641010284423828
@petrelharp
Copy link
Contributor

This may be related to #647, since as written the code is maintaining a (num nodes) x (num samples) matrix as it iterates over the trees.

@jeromekelleher
Copy link
Member

I just tried this out but realised we need to get #1246 merged

@brieuclehmann
Copy link
Contributor Author

Ah apologies, I should have said - I was running this off the branch in #1246!

@jeromekelleher
Copy link
Member

jeromekelleher commented Oct 4, 2021

No worries @brieuclehmann! I just did some profiling, and the majority of the time is spent in tsk_treeseq_general_stat, specifically at this line. I think it's pretty clear the issue is what @petrelharp mentioned above, in that we're maintaining too much information as we move along the trees.

What are out options here for doing less work per node @petrelharp? Would it worth trying to cast this as a function of IBD segments, so we can see if that approach is at least potentially faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

3 participants