-
Could you share your full code for what you are trying to run? It will give me more insight into what is happening under the hood. Running PCA to reduce the dimensionality somewhat (but not too much!) before running UMAP is a good alternative. It allows for a speed-up whilst minimizing the reduction in accuracy you might otherwise see. Also note that you can simply run UMAP on 1 million documents instead, as I highly doubt it would need the full 1.8 million documents to learn a good mapping from the high- to the low-dimensional space.
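In case a concrete starting point helps, below is a minimal sketch of chaining PCA and UMAP as the dimensionality reduction step in BERTopic. The component counts and UMAP parameters are placeholders rather than tuned recommendations, and `docs` and `embeddings` stand in for your own chunks and precomputed Llama vectors.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from umap import UMAP
from bertopic import BERTopic

# Compress the high-dimensional Llama embeddings with PCA first,
# then let UMAP map the compressed vectors down to a handful of components.
dim_reducer = make_pipeline(
    PCA(n_components=50),                  # "somewhat, but not too much"
    UMAP(n_neighbors=15, n_components=5,
         min_dist=0.0, metric="cosine",
         low_memory=True),                 # slower neighbor search, smaller footprint
)

# BERTopic accepts any model with .fit and .transform as its umap_model,
# which an sklearn Pipeline provides.
topic_model = BERTopic(umap_model=dim_reducer)

# `docs` is your list of chunks; `embeddings` the matching precomputed array.
topics, probs = topic_model.fit_transform(docs, embeddings)
```

The same sketch combines with the subsampling idea above: fitting on a subset of the documents first and then transforming the remainder keeps the peak memory during fitting lower.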
-
Hi,
First of all, thanks a lot for this package and for maintaining it so well, @MaartenGr.
I am trying to use my Llama 3.1 8B embeddings to cluster my sentence/paragraph-level chunks. I have about 1.8 million chunks and am trying to run this on a machine with 220GB RAM, using precomputed embeddings. Over the last week I have tried all of the tips in the FAQ and in the other threads to get it to run, except partial fit, since I don't want to trade off performance. It works up to about 1 million documents, but beyond that memory explodes when fitting with the default UMAP. For the record, I tried steps 1-3 in the FAQ, min_df, min_cluster, merge_models, etc. cuML is also not an option for me.
Since the Llama embeddings are very high-dimensional, I saw on a UMAP thread that it is sometimes recommended to run PCA followed by UMAP. I wanted to ask whether you have tried this with BERTopic and whether it yields good results. Alternatively, is there another way I can reduce the memory usage?
Thanks a lot in advance.