Can I obtain my original data shape? #247

Open
leeleavitt opened this issue Mar 6, 2024 · 3 comments

Comments

@leeleavitt

I would like to use Harmony to normalize my data, but I need the data in its original shape for another part of my analytical pipeline.

Harmony takes as input principal components ($PC$), and outputs corrected principal components ($PC'$).

All applications of Harmony I've seen take only the top $k$ principal components. Since I need the original data structure, I would input all principal components, provided my assumptions below are accurate.

The general approach I am considering is to create my principal components using singular value decomposition (SVD).

$$A = U S V^T$$

Where $U$ and $V$ are orthogonal matrices, and $S$ is a diagonal matrix containing the singular values.

I assume that $U S$ represents the full set of principal components $PC$. Through Harmony normalization, $PC$ is transformed into $PC'$.

Harmony normalizes all principal components

$PC \rightarrow \text{Harmony} \rightarrow PC'$

I then assume

$PC' \equiv U' S' \equiv (U S)'$

I then reconstruct my data in its original shape, now normalized:

$(U S)' V^T = A'$

Is this valid?
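The proposal above can be sketched with NumPy as follows. This is a minimal illustration on toy data, not a statement about how Harmony works internally: `correct_scores` is a hypothetical stand-in for the Harmony step (it is not part of the Harmony API), and the matrix sizes are arbitrary. The sketch only shows that corrected scores $PC'$ map back to the original shape via $PC' V^T$, and that uncorrected scores reproduce $A$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 observations (rows) by 4 features (columns); sizes are arbitrary.
A = rng.normal(size=(6, 4))

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# All principal-component scores, PC = U S (broadcasting scales column j of U by s[j]).
PC = U * s


def correct_scores(pc):
    """Hypothetical stand-in for the Harmony step (NOT the Harmony API).

    It only subtracts the column means, so that the sketch has some
    change in score space to propagate back to the original space.
    """
    return pc - pc.mean(axis=0, keepdims=True)


PC_corrected = correct_scores(PC)

# Map the corrected scores back to the original shape: A' = PC' V^T
A_corrected = PC_corrected @ Vt

# Sanity check: the uncorrected scores reproduce A (up to rounding).
A_roundtrip = PC @ Vt
```

If the data were centered or scaled before the SVD, those steps would also need to be undone after the reconstruction to return to the original units.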

@hongchengyao
Contributor

Hi @leeleavitt, thanks for using Harmony! Please correct me if I have misunderstood: I think you want to use Harmony to batch-correct the original data (count matrix or log-normalized data) rather than the PCs, i.e., 1) convert the original data to PCs, 2) run Harmony on the PCs to get corrected PCs, and 3) convert the corrected PCs back to the original data format (count matrix or log-normalized data).

@leeleavitt
Author

Yes, exactly.

@hongchengyao
Contributor

Hi @leeleavitt, first, I think what you proposed is doable and valid in terms of the equations, but there are two main issues with this idea.

  1. Computational feasibility. I'm not sure about the size of your input matrix (number of genes by number of cells), especially the number of genes. Harmony is optimized for a small number of input PCs (usually just 20), so including all PCs (i.e., as many as there are genes) may make it substantially slower and consume much more memory than it was designed for.

  2. Batch-correction performance at the level of the original data format. Although it is possible to convert the corrected PCs back to the original data format, Harmony has never been tested in this scenario, and we can't promise anything about its performance, especially for downstream analyses such as differentially expressed gene (DEG) analysis.
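To give a rough sense of scale for the first point, here is a back-of-the-envelope estimate of the size of the PC matrix alone. The cell and gene counts are hypothetical, the helper `dense_matrix_gib` is made up for this illustration, and the figure ignores whatever additional memory Harmony allocates internally.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Size in GiB of a dense matrix stored as 8-byte floats by default."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_cells = 100_000  # hypothetical dataset size
n_genes = 20_000   # hypothetical; upper bound on the number of PCs here

top20_gib = dense_matrix_gib(n_cells, 20)         # typical Harmony input
all_pcs_gib = dense_matrix_gib(n_cells, n_genes)  # every PC kept

# The embedding grows linearly with the number of PCs.
ratio = all_pcs_gib / top20_gib

print(f"top 20 PCs: {top20_gib:.3f} GiB, all PCs: {all_pcs_gib:.1f} GiB, ratio: {ratio:.0f}x")
```

With these assumed sizes the input alone goes from a few megabytes to well over ten gibibytes, before any runtime cost is considered.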

Of course, nothing prevents you from using Harmony this way. My suggestion would be to reduce the number of PCs as much as possible if you run into computational problems. Depending on your purpose, approximating with the top N PCs may not be too bad.
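One way to judge how much is lost with the top N PCs is to reconstruct the data from a truncated SVD and measure the error. The sketch below uses toy data and a hypothetical helper, `reconstruct_top_n`; in a real pipeline the Harmony-corrected scores would take the place of `U[:, :n] * s[:n]`. For the uncorrected case, the Frobenius norm of the residual equals the norm of the discarded singular values, so the singular values alone indicate how good the approximation can be.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 observations by 10 features; sizes are arbitrary.
A = rng.normal(size=(50, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def reconstruct_top_n(U, s, Vt, n):
    """Rebuild the data matrix from the first n components only."""
    return (U[:, :n] * s[:n]) @ Vt[:n, :]


n = 5
A_approx = reconstruct_top_n(U, s, Vt, n)

# Residual of the rank-n approximation (Frobenius norm) ...
err = np.linalg.norm(A - A_approx)
# ... matches the norm of the singular values that were dropped.
expected_err = np.sqrt(np.sum(s[n:] ** 2))

# Share of the total squared singular values kept by the top n components.
retained = np.sum(s[:n] ** 2) / np.sum(s ** 2)
```

The approximation keeps the original shape for any `n`; only the amount of retained signal changes.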
