nareto/transformertopic

transformertopic

Topic Modeling using sentence embeddings. This procedure works very well: in practice it almost always produces sensible topics and (from a practical point of view) renders all LDA variants obsolete. See also my blog post about COVID topics for an example of how this can be used.

This is my own implementation of the procedure described here by Maarten Grootendorst, who also has his own implementation available here. Thanks for this brilliant idea!

I wanted to code it myself, and I added the features marked with a ⭐, which as far as I know are not available in Grootendorst's implementation.

Features:

  • Compute topic modeling
  • Compute dynamic topic modeling ("trends" here)
  • ⭐ Assign topics on sentence rather than document level
  • ⭐ Experiment with different dimension reducers
  • ⭐ Experiment with different ways to generate a wordcloud from a topic
  • ⭐ Infer topics of new batches of docs without retraining

How it works

In the following, the words "cluster" and "topic" are used interchangeably. Please note that in classic Topic Modeling procedures (e.g. those based on LDA) each document is a probability distribution over topics. In this sense, the procedure presented here can be seen as a special case in which these distributions are always degenerate and concentrate all the probability on a single index.
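
To make the "degenerate distribution" remark concrete, here is a minimal sketch. It is not part of transformertopic, and the function name `one_hot_distribution` is made up for illustration. It turns a hard cluster label into a probability vector over topics that puts all its mass on one index.

```python
def one_hot_distribution(topic_index, n_topics):
    """Return a probability vector that puts all mass on topic_index."""
    if not 0 <= topic_index < n_topics:
        raise ValueError("topic_index must be in [0, n_topics)")
    return [1.0 if i == topic_index else 0.0 for i in range(n_topics)]


# Hypothetical hard assignments, e.g. cluster labels for three sentences.
labels = [2, 0, 2]
distributions = [one_hot_distribution(label, n_topics=3) for label in labels]
```

An LDA-style model would instead return vectors such as `[0.1, 0.2, 0.7]`, with mass spread over several topics.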

The procedure is:

  1. split paragraphs into sentences
  2. compute sentence embeddings (using sentence transformers)
  3. compute dimension reduction of these embeddings (with umap, pacmap, tsne or pca)
  4. cluster them with HDBSCAN
  5. for each topic compute a "cluster representator": a dictionary with words as keys and ranks as values (using tfidf, textrank or kmaxoids [1])
  6. use the cluster representators to compute wordclouds for each topic
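
Steps 1 and 5 can be sketched with the standard library alone. The code below is only an illustration under stated assumptions, not the code used in this package. The helper names `split_sentences` and `tfidf_representators` are hypothetical, the sentence splitter is a naive regex, and the cluster labels are hard-coded to stand in for the output of steps 2 to 4 (which need sentence transformers, a dimension reducer and HDBSCAN). The scoring treats each cluster as one big document, which is one common way to apply tf-idf to clusters. The values here are scores rather than ranks.

```python
import math
import re
from collections import Counter, defaultdict


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def tfidf_representators(sentences, labels):
    """Map each cluster label to a {word: score} dictionary.

    Term frequency is counted inside a cluster; inverse document
    frequency is computed across clusters.
    """
    words_per_cluster = defaultdict(Counter)
    for sentence, label in zip(sentences, labels):
        words_per_cluster[label].update(re.findall(r"[a-z']+", sentence.lower()))

    n_clusters = len(words_per_cluster)
    # Number of clusters in which each word occurs at least once.
    cluster_freq = Counter()
    for counts in words_per_cluster.values():
        cluster_freq.update(counts.keys())

    representators = {}
    for label, counts in words_per_cluster.items():
        total = sum(counts.values())
        representators[label] = {
            word: (count / total) * math.log(1 + n_clusters / cluster_freq[word])
            for word, count in counts.items()
        }
    return representators


text = "Cars need fuel. Electric cars need charging. Cats sleep a lot. Cats chase mice."
sentences = split_sentences(text)
# Hypothetical cluster labels, standing in for the output of steps 2 to 4.
labels = [0, 0, 1, 1]
reps = tfidf_representators(sentences, labels)
top_words = {label: max(scores, key=scores.get) for label, scores in reps.items()}
```

A wordcloud for a topic (step 6) can then be drawn from the corresponding `{word: score}` dictionary.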

Installation

pip install -U transformertopic

Usage

See also test.py.

Choose a reducer

from transformertopic.dimensionReducers import PacmapEmbeddings, UmapEmbeddings, TsneEmbeddings
#reducer = PacmapEmbeddings()
#reducer = TsneEmbeddings()
reducer = UmapEmbeddings(umapNNeighbors=13)

Init and run the model

from transformertopic import TransformerTopic
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=20)
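# pandasDf is a pandas DataFrame holding the documents; here its date and text columns are named 'date' and 'coref_text'
tt.train(documentsDataFrame=pandasDf, dateColumn='date', textColumn='coref_text', copyOtherColumns=True)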
print(f"Found {tt.nTopics} topics")
print(tt.df.info())

If you want to use different embeddings, you can pass the SentenceTransformer model name via the stEmbeddings init argument to TransformerTopic.

Show sizes of largest topics

N = 10
topNtopics = tt.showTopicSizes(N)

Choose a cluster representator and show wordclouds for the biggest topics

from transformertopic.clusterRepresentators import TextRank, Tfidf, KMaxoids
representator = Tfidf()
# representator = TextRank()
tt.showWordclouds(topNtopics, clusterRepresentator=representator)

Show the frequency of topics over time (dynamic topic modeling, or "trends"):

tt.showTopicTrends()
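
The idea behind a trend can be sketched as counting how often each topic occurs per time bucket. This is a standard-library illustration only: how `showTopicTrends` actually buckets or normalizes the data is not described here, so the monthly buckets, the raw counts and the function name `topic_counts_per_month` are all assumptions.

```python
from collections import Counter
from datetime import date


def topic_counts_per_month(dated_labels):
    """Count how often each topic occurs in each (year, month) bucket."""
    counts = Counter()
    for day, topic in dated_labels:
        counts[((day.year, day.month), topic)] += 1
    return counts


# Hypothetical (date, topic) pairs, one per sentence.
dated_labels = [
    (date(2020, 3, 1), 0),
    (date(2020, 3, 15), 0),
    (date(2020, 3, 20), 1),
    (date(2020, 4, 2), 1),
]
trend = topic_counts_per_month(dated_labels)
```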

Show the topics whose cluster representation has "car" among its top 75 words:

tt.searchForWordInTopics("car", topNWords=75)
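
Conceptually, such a search only needs the cluster representators. The sketch below shows the idea on hand-made data; it is not the package's implementation, and `topics_containing` and the example scores are hypothetical.

```python
def topics_containing(word, representators, top_n_words):
    """Return the topics whose top_n_words highest-scoring words include word."""
    matches = []
    for topic, scores in representators.items():
        top = sorted(scores, key=scores.get, reverse=True)[:top_n_words]
        if word in top:
            matches.append(topic)
    return matches


# Hypothetical cluster representators: {topic: {word: score}}.
representators = {
    0: {"car": 0.9, "fuel": 0.5, "road": 0.2},
    1: {"cat": 0.8, "mouse": 0.4, "car": 0.1},
}
found = topics_containing("car", representators, top_n_words=2)
```

With `top_n_words=2` only topic 0 matches; raising it to 3 also picks up topic 1, where "car" has a low score.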

Footnotes

  1. my own implementation, see kmaxoids.py
