Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GRCCA gives quite different results with original implementation #163

Open
JohannesWiesner opened this issue Feb 10, 2023 · 9 comments
Open

Comments

@JohannesWiesner
Copy link
Contributor

JohannesWiesner commented Feb 10, 2023

Hi James, similar to #107, I implemented a test to check if cca_zoo.models.GRCCA gives similar results as the original implementation. But the feature weights are quite dissimilar especially for the much smaller y input array:

image

Here's the testing code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Sanity check if cca-zoo's implementation of GRCCA gives the same result
as the original implementation'

Requires an environment with:
- cca-zoo 
- r-base (conda install -c conda-forge r-base)
- r-devtools (conda install -c conda-forge r-devtools)
- RCCA (installed via: devtools::install_github('https://github.com/ElenaTuzhilina/RCCA')
- rpy2 (conda install -c conda-forge rpy2)

Useful links:
- https://github.com/murraylab/gemmr/blob/master/gemmr/estimators/r_estimators.py
- https://github.com/Teekuningas/sparsecca/blob/master/tests/test_compare_pmd_to_r.py
    

@author: johannes.wiesner

"""


import numpy as np
rng = np.random.RandomState(42)
from cca_zoo.models import GRCCA
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()
import matplotlib.pyplot as plt

###############################################################################
# Get original data  and parameters as in original example
###############################################################################

robjects.r(
'''
library(RCCA) 
data(X)
data(Y)
n.groups = 5
group1 = rep(1:n.groups,rep(ncol(X)/n.groups,n.groups))
group2 = rep(1,ncol(Y))
lambda1 = 1000
lambda2 = 0
mu1 = 100
mu2 = 0
''')

X = np.array(robjects.globalenv['X'])
y = np.array(robjects.globalenv['Y'])
group1 = np.array(robjects.globalenv['group1'])
group2 = np.array(robjects.globalenv['group2'])
lambda1 = robjects.globalenv['lambda1'][0]
lambda2 = robjects.globalenv['lambda2'][0]
mu1 = robjects.globalenv['mu1'][0]
mu2 = robjects.globalenv['mu2'][0]

###############################################################################
# Run original and cca-zoo implementation and return coefficients for X
###############################################################################

def test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng):

    # run original implementation
    RCCA = importr('RCCA')
    results_R = RCCA.GRCCA(X=X,
                           Y=y,
                           group1=group1,
                           group2=group2,
                           lambda1=lambda1,
                           lambda2=lambda2,
                           mu1=mu1,
                           mu2=mu2)
    
    # save original coefficients
    xcoefs = results_R.rx2('x.coefs')
    ycoefs = results_R.rx2('y.coefs')
    
    # save number of components because in the orignal implementation number
    # of components is always set to maximum using n.comp = min(ncol(X), ncol(Y), nrow(X))
    n_comp = results_R.rx2('n.comp')[0]
    
    # re-run with cca-zoo implementation
    # TODO: I think original implementation does not standardize?
    estimator = GRCCA(latent_dims=n_comp,
                      c=[lambda1,lambda2],
                      mu=[mu1,mu2],
                      random_state=rng,
                      scale=False,
                      centre=False)
    
    # in python, things start with 0, not with 1
    feature_groups = [np.int64(group1)-1,np.int64(group2)-1]
    estimator.fit([X,y],feature_groups=feature_groups)
    
    # save replicated coefficients
    X_weights,y_weights = estimator.weights
    
    return xcoefs,X_weights,ycoefs,y_weights

xcoefs,X_weights,ycoefs,y_weights = test(X,y,group1,group2,lambda1,lambda2,mu1,mu2,rng)

###############################################################################
# Plot deviations
###############################################################################

for org,rep in [[xcoefs,X_weights],[ycoefs,y_weights]]:

    # plot distribution of deviations (ignore zeros for plotting purposes)
    plt.figure()
    deviations = np.abs(org)-np.abs(rep)
    deviations = deviations.flatten()
    deviations = deviations[deviations != 0]
    plt.hist(deviations)
    

###############################################################################
# Throw error if not close
###############################################################################

assert(np.allclose(np.abs(xcoefs),np.abs(X_weights)))
assert(np.allclose(np.abs(ycoefs),np.abs(y_weights)))
@jameschapman19
Copy link
Owner

Looks like a scaling thing a bit like #162 - the correlations of the weights are >0.90 or higher. I'm pushing an update that will ensure that for cca_zoo models wXXw=n where n is the number of samples.

@jameschapman19
Copy link
Owner

image
image

putting weights on the same scale

@jameschapman19
Copy link
Owner

Would be good to get a bit closer though I'll take another look at the implementation.

@jameschapman19
Copy link
Owner

ok think I've found the source of the error. Will let you know the results soon.

@jameschapman19
Copy link
Owner

OK the source of this is basically a misalignment between RCCA lambda and cca_zoo c.

lambda gets added to the covariance matrices but c slides between 0 regularisation and 1 maximal regularisation (all ones on the diagonal).

@JohannesWiesner
Copy link
Contributor Author

Would it make sense to include a testing folder somewhere in cca_zoo that checks original R-implementations against cca_zoos doppelgängers? I noticed that it's actually quite handy to just be able to import R-packages with RCCA = importr('RCCA'). Let me know if you're interested. I don't know anything about CI, but at least some folders for semi-automatic tests maybe?

@jameschapman19
Copy link
Owner

Yeah definitely! Will almost certainly add this one in somewhere. Obviously there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

@jameschapman19
Copy link
Owner

jameschapman19 commented Feb 10, 2023

I've used variants of your script to check that the source of the difference is lambda/c and convinced myself.

PRCCA with no regularisation (and fewer features than samples) everything matches

GRCCA with very high regularisation also matches

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

@JohannesWiesner
Copy link
Contributor Author

Yeah definitely! Will almost certainly add this one in somewhere. Obviously, there's some sharp edges (e.g. here where the parameterizations don't quite match) but no harm in having them

Let me know if I can help with a PR!

GRCCA with no regularisation doesn't match but it is because of line 91 in https://github.com/ElenaTuzhilina/RCCA/blob/master/R/GRCCA.R where PRCCA is called with lambda=1 regardless of what the user inputs. This means it's not really possible to match cca_zoo and RCCA up in this instance. It seems a bit odd that they do this I think it's possible its a bug in RCCA.

Ah okay. Do you still have to update cca_zoo.models.GRCCA then or is the deviation purely stemming from this bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants