
Logistic PCA and PPMI-based methods? #36

Open
BradKML opened this issue Nov 19, 2022 · 3 comments
Labels
question A question regarding usage, implementation etc

Comments

@BradKML

BradKML commented Nov 19, 2022

I am currently awaiting datasets in a "liked items by user" format, where certain items are similar in nature.
There are a few ways of reducing dimensionality:

What are the trade-offs and characteristics of each method? Are there other methods suited to a large number of binary data columns?

@erdogant
Owner

erdogant commented Nov 21, 2022

Which method to use depends on the research question, but if you start with exploration, an unsupervised approach is always a good starting point. Try the clusteval package, and make sure to use an appropriate metric for binary data, such as Hamming distance.
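As a minimal sketch of that starting point on hypothetical data (using scipy's hierarchical clustering directly so the Hamming-metric choice is explicit; clusteval wraps a similar workflow):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy binary "liked items by user" matrix: rows = users, columns = items.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 10))

# Hamming distance is the fraction of items on which two users disagree,
# which is appropriate for strictly binary columns.
D = pdist(X, metric="hamming")

# Complete-linkage hierarchical clustering on the Hamming distances.
Z = linkage(D, method="complete")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # one cluster id (1..3) per user
```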

Or you can use hypergeometric tests to find significantly overlapping features. In that case, try the HNet library. More details can be found in this blog.
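The core of that approach can be sketched with scipy alone (the counts below are hypothetical): given two binary features, a hypergeometric test asks whether the number of samples where both are 1 exceeds what random overlap would produce.

```python
from scipy.stats import hypergeom

# Hypothetical counts: N samples total, feature A is 1 in K of them,
# feature B is 1 in n of them, and both are 1 in k of them.
N, K, n, k = 1000, 120, 90, 35

# Under the null, the overlap follows a hypergeometric distribution;
# the survival function gives P(overlap >= k).
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"P(overlap >= {k}) = {p_value:.2e}")
```

The expected overlap here is K*n/N = 10.8, so an observed overlap of 35 yields a very small p-value, flagging the pair as significantly co-occurring.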

Perhaps SVD analysis is more appropriate than PCA (this is optional in the pca library). Or indeed, as you suggest, logistic PCA.
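A rough illustration of the PCA-vs-SVD distinction on toy binary data, sketched here with scikit-learn rather than the pca library: PCA centers each column before decomposing, while truncated SVD factorizes the raw 0/1 matrix directly, which also lets it run on large sparse matrices.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# Hypothetical binary design matrix: 100 samples x 30 yes/no columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 30)).astype(float)

# PCA subtracts the column means first; TruncatedSVD does not,
# so their components can differ noticeably on skewed binary data.
pca = PCA(n_components=5).fit(X)
svd = TruncatedSVD(n_components=5).fit(X)

print(pca.explained_variance_ratio_.sum())
print(svd.explained_variance_ratio_.sum())
```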

@BradKML
Author

BradKML commented Nov 22, 2022

For some clarity: I attempted to run the Python implementation of Logistic PCA, but it crashed twice while I was experimenting with a VES performance-vs-personality study, which contains Yes/No personality questions. Maybe the native implementation uses too much memory. "Significant overlapping features" is one of the things I am seeking with PCA-like methods, but the data is strictly binary.

Q: why is cluster evaluation useful in a binary-data dimensionality reduction + feature selection + regression task?

@BradKML
Author

BradKML commented Nov 22, 2022

Also, secondary discovery:

  • Some say that MCA (multiple correspondence analysis) is well suited to binary and categorical data, but if so, how can it be turned into factor models? https://github.com/MaxHalford/prince
  • "Correlation Explanation" has been used for bioinformatics data, which is often binary. However, it tends to behave like ICA, albeit similar to PCA in that it does not assume data independence. https://github.com/gregversteeg/CorEx
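To make the MCA idea concrete, here is a minimal numpy-only sketch on hypothetical binary data (not prince's implementation): each binary column is expanded into an indicator table, and a correspondence-analysis-style SVD of the standardized residuals yields factor-like coordinates per sample.

```python
import numpy as np

# Hypothetical data: 50 samples x 6 binary (yes/no) columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 6))

# Indicator (complete disjunctive) table: each binary column becomes
# two columns, one per level (value == 0, value == 1).
Z = np.hstack(
    [np.column_stack([X[:, j] == 0, X[:, j] == 1]) for j in range(X.shape[1])]
).astype(float)

P = Z / Z.sum()                     # correspondence matrix (sums to 1)
r = P.sum(axis=1, keepdims=True)    # row masses
c = P.sum(axis=0, keepdims=True)    # column masses
S = (P - r @ c) / np.sqrt(r @ c)    # standardized residuals

U, s, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * s) / np.sqrt(r)   # principal coordinates for the rows
print(row_coords[:, :2].shape)      # first two MCA-like factors per sample
```

The row coordinates play the role of factor scores, which is one answer to how MCA "turns into" a factor model.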

@erdogant erdogant added the question A question regarding usage, implementation etc label Jan 8, 2023