-
Could you share your full code for what you are trying to run? It will give me more insight into what is happening under the hood. Running PCA to reduce the dimensionality somewhat (but not too much!) before running UMAP is a good alternative. It allows for a speed-up whilst minimizing the reduction in accuracy you might otherwise see. Also note that you can simply run UMAP on 1 million documents instead, as I highly doubt it would need the full 1.8 million documents to learn a good mapping from the high- to the low-dimensional space.
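In case a concrete starting point helps, below is a minimal sketch of chaining PCA and UMAP as the dimensionality reduction step in BERTopic. The component counts and UMAP parameters are placeholders rather than tuned recommendations, and `docs` and `embeddings` stand in for your own chunks and precomputed Llama vectors.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from umap import UMAP
from bertopic import BERTopic

# Compress the high-dimensional Llama embeddings with PCA first,
# then let UMAP map the compressed vectors down to a handful of components.
dim_reducer = make_pipeline(
    PCA(n_components=50),                  # "somewhat, but not too much"
    UMAP(n_neighbors=15, n_components=5,
         min_dist=0.0, metric="cosine",
         low_memory=True),                 # slower neighbor search, smaller footprint
)

# BERTopic accepts any model with .fit and .transform as its umap_model,
# which an sklearn Pipeline provides.
topic_model = BERTopic(umap_model=dim_reducer)

# `docs` is your list of chunks; `embeddings` the matching precomputed array.
topics, probs = topic_model.fit_transform(docs, embeddings)
```

The same sketch combines with the subsampling idea above: fitting on a subset of the documents first and then transforming the remainder keeps the peak memory during fitting lower.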
-
Hi,
First of all, thanks a lot for this package and for maintaining it so well, @MaartenGr.
I am trying to use my Llama 3.1 8B embeddings to cluster my sentence/paragraph-level chunks. I have about 1.8 million chunks and am trying to run this on a machine with 220GB RAM, using precomputed embeddings. Over the last week I have tried all of the tips in the FAQ and in the other threads to get it to run, except partial fit, since I don't want to trade off performance. It works up to about 1 million documents, but beyond that memory explodes when fitting with the default UMAP. For the record, I tried steps 1-3 in the FAQ, min_df, min_cluster, merge_models, etc. cuML is also not an option for me.
Since the Llama embeddings are very high-dimensional, I saw on a UMAP thread that it is sometimes recommended to run PCA followed by UMAP. I wanted to ask whether you have tried this with BERTopic and whether it yields good results. Alternatively, is there another way I can reduce the memory usage?
Thanks a lot in advance.