Retrieve embeddings for selected documents (and re-cluster) #1825

morrisseyj · 2024-02-20T23:33:56Z

I am interested in selecting a subset of topics (or, more likely, a single topic) and then re-running the clustering process on the embeddings of those documents. To this end, I am trying to identify a function/method that will extract the embeddings for each doc, so that I can build an additional model on this subset of embeddings. However, I can't seem to find one.

I can achieve this result as follows:

# ...
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# Grab the docs
docs = [...]  # list of docs

# Create the embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Initialize the other models
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=52)
hdbscan_model = HDBSCAN(min_cluster_size=150, metric="euclidean", cluster_selection_method="eom", prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))

#Create the topic model
topic_model = BERTopic(
    embedding_model=embedding_model, 
    umap_model=umap_model, 
    hdbscan_model=hdbscan_model, 
    vectorizer_model=vectorizer_model, 
    top_n_words = 10,
    verbose = True
)

#Fit the model
topics, probs = topic_model.fit_transform(docs, embeddings)

#Look through the topics and find the one i want to look at more closely - topic 16
document_topics = topic_model.get_document_info(docs)

# Subset the docs and embeddings by the topic
is_topic_16 = (document_topics["Topic"] == 16).to_numpy()
topic_16_docs = document_topics.loc[is_topic_16, "Document"].tolist()
topic_16_embeddings = embeddings[is_topic_16]

# Re-fit on the subset; fit_transform takes the documents first, then the embeddings
topics, probs = topic_model.fit_transform(topic_16_docs, topic_16_embeddings)

This is manageable, but I was wondering whether I was missing a function that did something like topic_model.get_document_embeddings(docs) and returned a dataframe of topics and embeddings. That, or a method to pull the embedding for each doc.

If not, I can, of course, write a function for this.

MaartenGr · 2024-02-21T11:47:51Z

Maintainer

All underlying embedding models have the .embed function as shown here so you can just use that one. Something like topic_model.embedding_model.embed().

4 replies

morrisseyj Feb 21, 2024
Author

Perhaps I am misunderstanding, but this looks like it is generating the embeddings again, as opposed to retrieving them. This makes it very slow (and impossible on a large dataset).

When I run

topic_model.embedding_model.embed(docs) #Using the example above

It takes about 2 mins to complete (on a relatively small dataset).

Am I missing something here?

MaartenGr Feb 22, 2024
Maintainer

Ah right, my apologies! I thought you wanted to calculate them. It is not possible to retrieve the embeddings per topic directly from BERTopic since those are not saved internally. Saving them would result in a model that also serves as a database, which is what you typically want to prevent.

Indeed, it should be straightforward to write a small function to extract topics related to certain embeddings.
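
Since the model does not keep the embeddings, one option is to cache the array yourself so it only has to be computed once. The following is a minimal sketch under that assumption, not part of BERTopic: `load_or_compute_embeddings`, `fake_encode`, and the file name are hypothetical, and `encode` stands for any callable that maps a list of documents to a 2-D array (for example `SentenceTransformer(...).encode`). The sketch does not detect whether `docs` changed since the cache was written.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_or_compute_embeddings(path, docs, encode):
    """Return embeddings cached at `path`; compute and save them on a miss."""
    path = Path(path)
    if path.exists():
        # Cache hit: read the array back instead of re-encoding the documents.
        return np.load(path)
    embeddings = np.asarray(encode(docs))
    np.save(path, embeddings)
    return embeddings


# Stand-in encoder so the sketch runs without downloading a model.
calls = []


def fake_encode(texts):
    calls.append(len(texts))
    return np.array([[float(len(t)), 1.0] for t in texts])


docs = ["a", "bb", "ccc"]
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "embeddings.npy"
    first = load_or_compute_embeddings(cache_file, docs, fake_encode)   # encodes
    second = load_or_compute_embeddings(cache_file, docs, fake_encode)  # loads
```

The cached array stays row-aligned with `docs`, so it can be passed to `fit_transform` as the `embeddings` argument and sliced later by topic.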

morrisseyj Feb 26, 2024
Author

OK, great. Thanks so much. Really appreciate all the work on this set of libraries.

peterforberg Feb 2, 2025

Hello, I was wondering if you ever wrote a function to extract topics related to certain embeddings @morrisseyj. I have stumbled upon this thread attempting to accomplish the same thing.

morrisseyj · 2025-02-03T16:45:22Z

Author

In the end, I think I just did it the laborious way described above rather than writing a function, as I didn't have to do it very many times. The function should be straightforward, however, based on the code provided. Something like this (note that I haven't tested it):

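def subset_topic(topic_model, docs, embeddings, topics: list):
    # Table of documents and their assigned topics, in the same order as docs
    document_topics = topic_model.get_document_info(docs)
    # Boolean mask over the documents assigned to any of the requested topics
    mask = document_topics["Topic"].isin(topics).to_numpy()
    topic_docs = document_topics.loc[mask, "Document"].tolist()
    topic_embeddings = embeddings[mask]
    return topic_docs, topic_embeddings

# Use the function (docs and embeddings are the ones the model was fitted on)
topic_16_docs, topic_16_embed = subset_topic(topic_model, docs, embeddings, [16])
# fit_transform takes the documents first, then the embeddings
topics, probs = topic_model.fit_transform(topic_16_docs, topic_16_embed)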
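
An alternative that skips the dataframe is to work from the `topics` list that `fit_transform` returns, which has one entry per document in the same order as `docs` and the embeddings. This is a minimal, hedged sketch: `indices_for_topics` is a hypothetical helper, and the toy values stand in for real model output. The commented lines assume `embeddings` is a NumPy array and show fitting a fresh `BERTopic` instance, since calling `fit_transform` again on the same model refits it and replaces the original topics.

```python
def indices_for_topics(topics, wanted):
    """Positions of the documents whose assigned topic is in `wanted`."""
    wanted = set(wanted)
    return [i for i, topic in enumerate(topics) if topic in wanted]


# Toy assignments standing in for the `topics` list returned by fit_transform.
topics = [0, 16, -1, 16, 3]
docs = ["d0", "d1", "d2", "d3", "d4"]

idx = indices_for_topics(topics, [16])
subset_docs = [docs[i] for i in idx]

# With a NumPy array of embeddings, the same positions select the rows:
# subset_embeddings = embeddings[idx]
# sub_model = BERTopic(...)  # a fresh model, so the original fit is kept
# sub_topics, sub_probs = sub_model.fit_transform(subset_docs, subset_embeddings)
```

Keeping the positions rather than a mask also makes it easy to map sub-topic assignments back to rows of the original dataset.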

1 reply

peterforberg Feb 3, 2025

Thank you!
