Applying model to new data #115

Open
gdbeck opened this issue Jan 16, 2021 · 2 comments

Comments

@gdbeck

gdbeck commented Jan 16, 2021

Hi there, this looks like a great package. I'm particularly interested in the ability to fit LRMs to datasets with missing data (or in my case, outliers that need to be masked). I have a quick question that may be pretty basic, but an answer would help me apply the code to my own data. Apologies if I've missed something in the documentation. I'm also fairly new to Julia.

If I fit a PCA model to a set of training data A (following your example):

loss        = QuadLoss()
r           = ZeroReg()
n_comp      = 1
glrm        = GLRM(A,loss,r,r,n_comp)
X,Y,ch      = fit!(glrm)

how do I then apply the same model to a new set of data B? I would like to keep X fixed and obtain new values Y_b that give the best fit of X to B. That is, I would like to project the observations in B onto the PCA components found from A.

There are other PCA packages in Julia that will do this (e.g., the reconstruct function in MultivariateStats), but they don't seem to be able to handle missing data or sparse arrays.

Thanks in advance! Any help is appreciated!

@mihirparadkar
Collaborator

Hi!

I want to first clarify the intent of the question. Let's say A is a matrix (or DataFrame/sparse matrix) of m rows by n columns. The GLRM (assuming real-valued or boolean-valued data for simplicity) produces a matrix X of m rows by k columns, and a matrix Y of k rows by n columns, where k is the rank.

It sounds like you have another dataset B, of size p rows by n columns. B's projection onto the PCA components from A would be a matrix of size p rows by k columns. In PCA with no missing values and centered data, this would be a matrix multiplication (B * Y' * <a diagonal matrix>). However, that formula is only correct for a quadratic (least-squares) loss, so it does not carry over to a GLRM in general.
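To make the fully observed quadratic case concrete, here is a minimal sketch using only the LinearAlgebra standard library. The matrices are small made-up stand-ins (Y for the components fitted on A, B for the new data), and it assumes the rows of Y are orthogonal, which is what makes Y * Y' diagonal. It is not LowRankModels code.

```julia
using LinearAlgebra

# Hypothetical stand-in for the k x n component matrix fitted on A (k = 2, n = 5).
# Its two rows are orthogonal, so Y * Y' is diagonal.
Y = [1.0 0.0 1.0 0.0 1.0;
     0.0 1.0 0.0 1.0 0.0]

# Hypothetical new data B (p x n), built from known coordinates so the
# result can be checked by eye.
X_true = [1.0 2.0; 3.0 4.0; 0.0 -1.0; 2.0 0.5]
B = X_true * Y

# Least-squares projection of the rows of B onto the rows of Y:
# minimizes norm(B - X_b * Y) over X_b when every entry of B is observed.
G = Y * Y'               # k x k Gram matrix, diagonal here
X_b = B * Y' * inv(G)    # p x k coordinates of B in the component basis

# The same answer from a generic least-squares solve of Y' * X_b' = B'.
X_ls = Matrix((Matrix(Y') \ Matrix(B'))')
```

Once entries are missing, or the loss is not quadratic, this closed form no longer applies, which is why a fit with Y held fixed is needed instead.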

With LowRankModels, the easiest way to do this is to fit another GLRM while holding Y constant. You can do this like so:

loss       = QuadLoss()   # Or whatever loss you chose before
r_x        = ZeroReg()    # Or whichever regularizer you desired on X
r_y        = [FixedLatentFeaturesConstraint(Y[:, i]) for i=1:size(Y, 2)]
n_comp     = 1
glrm_b     = GLRM(B, loss, r_x, r_y, n_comp)
X_b, Y, ch = fit!(glrm_b)

If you want to calculate a new Y matrix instead of a new X matrix, keep r_y as whatever you used for r, and define r_x = [FixedLatentFeaturesConstraint(X[:, i]) for i=1:size(X, 2)].
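For intuition about what a fit with Y held fixed works out to when the loss is quadratic, the regularizer on X is zero, and some entries of B are masked, here is a rough sketch using only the LinearAlgebra standard library. The helper project_rows and the Boolean mask are hypothetical names for illustration, not part of LowRankModels, and the package's own solver is iterative rather than this closed-form solve. Each row of B is regressed on the columns of Y where that row is observed.

```julia
using LinearAlgebra

# Hypothetical helper (not a LowRankModels function): for each row of B, find
# the coordinates x minimizing the squared error against the fixed components
# Y (k x n), using only the entries where mask is true.
function project_rows(B::AbstractMatrix, Y::AbstractMatrix, mask::AbstractMatrix{Bool})
    p, k = size(B, 1), size(Y, 1)
    X_b = zeros(p, k)
    for i in 1:p
        cols = findall(mask[i, :])       # observed columns in row i
        Yobs = Matrix(Y[:, cols]')       # (number observed) x k design matrix
        X_b[i, :] = Yobs \ B[i, cols]    # least-squares solve for row i
    end
    return X_b
end

# Made-up example: one entry of B is an outlier and is masked out.
Y = [1.0 0.0 1.0 0.0 1.0;
     0.0 1.0 0.0 1.0 0.0]
X_true = [1.0 2.0; 3.0 4.0]
B = X_true * Y
B[1, 3] = 100.0                # corrupt one entry
mask = trues(size(B))
mask[1, 3] = false             # treat the corrupted entry as missing

X_b = project_rows(B, Y, mask)
```

Because the corrupted entry is excluded, X_b matches X_true in this toy case. With a different loss or a nonzero regularizer on X there is generally no such closed form, and refitting the GLRM with Y fixed is the way to go.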

@gdbeck
Author

gdbeck commented Jan 16, 2021

That works perfectly! Thank you very much for your help, and for replying so quickly!
