
Logistic PCA and PPMI-based methods? #36

Open
BradKML opened this issue Nov 19, 2022 · 3 comments
Labels
question A question regarding usage, implementation etc

Comments

@BradKML

BradKML commented Nov 19, 2022

I am currently awaiting datasets in a "liked items by user" format, where certain items are similar in nature.
There are a few ways of reducing dimensionality:

What are the trade-offs and characteristics of each method? Are there other methods suited to a large number of binary data columns?

@erdogant
Owner

erdogant commented Nov 21, 2022

Which method to use depends on the research question, but if you start with exploration, an unsupervised approach is always a good starting point. Try the clusteval package, and make sure to use an appropriate metric for binary data, such as Hamming distance.
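As a minimal sketch of that starting point on hypothetical data (using scipy's hierarchical clustering directly so the Hamming-metric choice is explicit; clusteval wraps a similar workflow):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy binary "liked items by user" matrix: rows = users, columns = items.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 10))

# Hamming distance is the fraction of items on which two users disagree,
# which is appropriate for strictly binary columns.
D = pdist(X, metric="hamming")

# Complete-linkage hierarchical clustering on the Hamming distances.
Z = linkage(D, method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # one cluster id (1..3) per user
```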

Or you can use hypergeometric tests to find significantly overlapping features. In that case, try the HNet library. More details can be found in this blog.
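The core of that approach can be sketched with scipy alone (the counts below are hypothetical): given two binary features, a hypergeometric test asks whether the number of samples where both are 1 exceeds what random overlap would produce.

```python
from scipy.stats import hypergeom

# Hypothetical counts: N samples total, feature A is 1 in K of them,
# feature B is 1 in n of them, and both are 1 in k of them.
N, K, n, k = 1000, 120, 90, 35

# Under the null, the overlap follows a hypergeometric distribution;
# the survival function gives P(overlap >= k).
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"P(overlap >= {k}) = {p_value:.2e}")
```

The expected overlap here is K*n/N = 10.8, so an observed overlap of 35 yields a very small p-value, flagging the pair as significantly co-occurring.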

Perhaps SVD analysis is more appropriate than PCA (this is optional in the pca library). Or indeed, as you suggest, logistic PCA.
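A rough illustration of the PCA-vs-SVD distinction on toy binary data, sketched here with scikit-learn rather than the pca library: PCA centers each column before decomposing, while truncated SVD factorizes the raw 0/1 matrix directly, which also lets it run on large sparse matrices.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Hypothetical binary design matrix: 100 samples x 30 yes/no columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 30)).astype(float)

# PCA subtracts the column means first; TruncatedSVD does not,
# so their components can differ noticeably on skewed binary data.
pca = PCA(n_components=5).fit(X)
svd = TruncatedSVD(n_components=5).fit(X)

print(pca.explained_variance_ratio_.sum())
print(svd.explained_variance_ratio_.sum())
```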

@BradKML
Author

BradKML commented Nov 22, 2022

For some clarity: I attempted to run the Python implementation of Logistic PCA, but it crashed twice while I was experimenting with a VES performance-vs-personality study, which contains Yes/No personality questions. Maybe the native implementation uses too much memory. "Significant overlapping features" is one of the things I am seeking with PCA-like methods, but the data is strictly binary.

Q: why is cluster evaluation useful in a binary-data dimensionality reduction + feature selection + regression task?

@BradKML
Author

BradKML commented Nov 22, 2022

Also, secondary discovery:

  • Some say that MCA (multiple correspondence analysis) is well suited to binary and categorical data, but if so, how can it be turned into factor models? https://github.com/MaxHalford/prince
  • "Correlation Explanation" has been used for bioinformatics data, which is often binary. However, it tends to behave like ICA, albeit similar to PCA in that it does not assume data independence. https://github.com/gregversteeg/CorEx
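To make the MCA idea concrete, here is a minimal numpy-only sketch on hypothetical binary data (not prince's implementation): each binary column is expanded into an indicator table, and a correspondence-analysis-style SVD of the standardized residuals yields factor-like coordinates per sample.

```python
import numpy as np

# Hypothetical data: 50 samples x 6 binary (yes/no) columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 6))

# Indicator (complete disjunctive) table: each binary column becomes
# two columns, one per level (value == 0, value == 1).
Z = np.hstack(
    [np.column_stack([X[:, j] == 0, X[:, j] == 1]) for j in range(X.shape[1])]
).astype(float)

P = Z / Z.sum()                     # correspondence matrix (sums to 1)
r = P.sum(axis=1, keepdims=True)    # row masses
c = P.sum(axis=0, keepdims=True)    # column masses
S = (P - r @ c) / np.sqrt(r @ c)    # standardized residuals

U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * s) / np.sqrt(r)   # principal coordinates for the rows
print(row_coords[:, :2].shape)      # first two MCA-like factors per sample
```

The row coordinates play the role of factor scores, which is one answer to how MCA "turns into" a factor model.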

@erdogant erdogant added the question A question regarding usage, implementation etc label Jan 8, 2023