Adapted BERTopic pipeline for Topic Modeling the arXiv dataset

This repository contains a complete workflow for Topic Modeling the entire arXiv dataset. It uses an adapted BERTopic pipeline and also includes:

preprocessing with nltk
an SQLite database to save results
label generation with Llama
trend analysis with statsmodels
visualization with dash/plotly

The process was designed to run locally and completed successfully on a laptop with 16 GB of RAM and an 8 GB Nvidia graphics card under Windows 10. Computation speed still needs further optimization.

Setup for the complete workflow

Clone the repository
Install Python 3.10+ for your platform from https://www.python.org/downloads/
Building hdbscan and llama-cpp-python requires a C++ compiler:
Linux: gcc or clang
Windows: Visual Studio Build Tools or MinGW
macOS: Xcode
To use your GPU for the SentenceTransformer, install PyTorch with CUDA from https://pytorch.org/get-started/locally/
Install llama-cpp-python for your hardware from https://pypi.org/project/llama-cpp-python/
Install the remaining dependencies from requirements.txt
```
pip install -r requirements.txt
```
Download the nltk punkt model
```
python -m nltk.downloader punkt
```
Choose a suitable Sentence Transformer and LLM and, if necessary, update the values for "embeddings_model" and "llm_model" in config.json accordingly.
If you intend to use the default config, download the models from https://huggingface.co/BAAI/bge-base-en-v1.5/tree/main and https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main into the appropriate subfolders. For the Sentence Transformer, you can instead put only the model identifier in the config to have it downloaded automatically from Hugging Face. Note, however, that this causes an online check for updates every time you create embeddings.
Register with Kaggle and download the latest arXiv dataset from https://www.kaggle.com/datasets/Cornell-University/arxiv into the "input" subfolder
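
Before running the first step, it can help to confirm that the configured model entries point at something that exists on disk. The following is a minimal sketch and not part of the repository. It assumes that "embeddings_model" and "llm_model" are top-level string entries in config.json, and the helper name check_model_paths is hypothetical.

```python
import json
from pathlib import Path


def check_model_paths(config_path):
    """Report, per model key, whether its value exists as a local path.

    Assumption: "embeddings_model" and "llm_model" are top-level string
    entries in config.json. A value that is not a local path can still be
    valid, e.g. a Hugging Face identifier for the Sentence Transformer.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    result = {}
    for key in ("embeddings_model", "llm_model"):
        value = config.get(key)
        if value is None:
            result[key] = "missing from config"
            continue
        # Relative values are resolved against the folder holding config.json.
        candidate = config_path.parent / value
        result[key] = "found locally" if candidate.exists() else "not a local path"
    return result


# Example call from the repository root:
#     for key, status in check_model_paths("config.json").items():
#         print(f"{key}: {status}")
```

A result of "not a local path" for the Sentence Transformer is expected if you chose to store only the model identifier in the config.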

Setup for the visualization of precomputed results only

Clone the repository
Install Python 3.10+ for your platform from https://www.python.org/downloads/
Install the Python modules dash and statsmodels
```
pip install dash statsmodels
```
Download and unzip the trimmed-down database (without abstracts, embeddings and subsets) from the releases
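
To check that the unzipped database is readable before starting the visualization, you can list its tables with Python's built-in sqlite3 module. This is a minimal sketch under assumptions: the helper name list_tables is hypothetical, and the path in the example call assumes the trimmed-down database uses the same location as the full one (arxiv.db in the "database" folder).

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in an SQLite database file, sorted."""
    # Open read-only, so that a wrong path raises an error instead of
    # silently creating a new, empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# Example call (the path is an assumption):
#     print(list_tables("database/arxiv.db"))
```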

Usage

The process is split up into several steps to allow intermediate evaluation of the individual results and some variations in the execution.

Step 1 creates the SQLite database and imports the cleaned, filtered and transformed features from the arXiv snapshot
Step 2 converts the abstracts into sentence embeddings
Step 3 trains subsets of the dataset, based on arXiv categories/archives
Step 4 trains the main model
Step 4b creates heatmaps, bar charts and hierarchical representations of the topics
Step 5 (re)generates topic labels
Step 6 assigns additional papers to existing topics with the original UMAP & HDBSCAN models
Step 7 assigns outlier papers to existing topics using a BERTopic approximation
Step 8 creates a new model of higher hierarchy by merging the topics of an existing model, based on cluster similarity
Step 9 computes the necessary statistics for all papers over all months of the dataset
Step 10 visualizes the topic trends

Of these, only steps 1, 2, 4 and 9 are strictly necessary before visualization is possible. Steps 3, 6 and 7 are needed if you do not have enough RAM to train on the whole dataset at once (hint: 16 GB is not enough). Step 8 is optional, but recommended for further evaluating the clusters. Step 5 is meant for cases where you are not satisfied with some of the initial labels generated by the LLM.
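
The choice of steps described above can be sketched as a small helper that returns the scripts to run. This is an illustration only: the function names plan_steps and run_steps and their flags are hypothetical, and running the steps in numerical order is an assumption. The script names are those found in the repository root.

```python
import subprocess
import sys

# Script names as they appear in the repository root.
STEP_SCRIPTS = {
    "1": "Step01_Preprocess.py",
    "2": "Step02_Create_Embeddings.py",
    "3": "Step03_Train_Subsets.py",
    "4": "Step04_Train_Models.py",
    "4b": "Step04b_Evaluate.py",
    "5": "Step05_Regenerate_Labels.py",
    "6": "Step06_Transform_Papers.py",
    "7": "Step07_Distribute_Papers.py",
    "8": "Step08_Agglomerate.py",
    "9": "Step09_Compute_Stats.py",
    "10": "Step10_Visualize.py",
}


def plan_steps(low_memory=False, evaluate=False,
               regenerate_labels=False, merge_topics=False):
    """Return the scripts to run, in numerical order (an assumption)."""
    steps = ["1", "2"]              # always required
    if low_memory:
        steps.append("3")           # train subsets when RAM is limited
    steps.append("4")               # always required
    if evaluate:
        steps.append("4b")
    if regenerate_labels:
        steps.append("5")
    if low_memory:
        steps += ["6", "7"]         # assign the remaining and outlier papers
    if merge_topics:
        steps.append("8")
    steps += ["9", "10"]            # statistics, then visualization
    return [STEP_SCRIPTS[step] for step in steps]


def run_steps(scripts):
    """Run each script with the current interpreter, stopping on failure."""
    for script in scripts:
        subprocess.run([sys.executable, script], check=True)
```

For example, plan_steps(low_memory=True) returns steps 1, 2, 3, 4, 6, 7, 9 and 10. Since each step is meant to be evaluated before moving on, running the scripts one at a time is likely the better fit for a first pass.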

Configuration

All steps are configured via the config.json file, such that they can be run by simply starting the respective python file.

Notable configurations include:

pre_year_min: the minimum year for papers from the snapshot to be included in your model
embeddings_precision: precision used for the SentenceTransformer encoding, according to https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html
Use 'ubinary' for quantized embeddings.
train_models: settings for the selection of papers in step 4
If you want to train all papers at once, set "model_filter" to "None".
Otherwise you can choose between "archives", "outliers" or "hierarchy" and set "percent_of_papers" to the desired percentage of papers to be selected proportionally.
agg_models: settings for step 8
parent_model: the previously trained model to use
max_cluster_distance: the maximum distance between topics to be merged together in the resulting model
bert_params: various parameters for the different BERTopic components
See the documentation for BERTopic, UMAP and HDBSCAN to understand their respective use.
Notably, the adapted BERTopic pipeline allows you to define a whole range of min_cluster_sizes and min_sample_sizes to be used by HDBSCAN, and the one with the highest DBCV score will be chosen automatically. Also, a new hyperparameter "hdbscan_min_clusters" was added, which allows setting a minimum number of resulting clusters when using the ranges.
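
The selection over HDBSCAN parameter ranges described above can be pictured roughly as follows. This is a simplified sketch and not the repository's implementation: select_hdbscan_params is a hypothetical name, and fit_and_score is a stand-in for fitting HDBSCAN with one parameter pair and returning its DBCV score together with the number of clusters found.

```python
from itertools import product


def select_hdbscan_params(min_cluster_sizes, min_samples_values,
                          fit_and_score, min_clusters=0):
    """Pick the parameter pair with the highest DBCV score.

    fit_and_score(min_cluster_size, min_samples) must return a tuple
    (dbcv_score, n_clusters). Pairs that yield fewer than min_clusters
    clusters are skipped. Returns None if no pair qualifies.
    """
    best = None
    for min_cluster_size, min_samples in product(min_cluster_sizes,
                                                 min_samples_values):
        score, n_clusters = fit_and_score(min_cluster_size, min_samples)
        if n_clusters < min_clusters:
            continue
        if best is None or score > best["dbcv"]:
            best = {
                "min_cluster_size": min_cluster_size,
                "min_samples": min_samples,
                "dbcv": score,
                "n_clusters": n_clusters,
            }
    return best
```

The min_clusters argument plays the role that "hdbscan_min_clusters" has in the config: without it, the highest score may belong to a parameter pair that produces only a handful of very coarse clusters.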

You can add any number of additional entries to train_models, agg_models (and bert_params accordingly) to train models in batches. Set "generate_labels" to false in the bert_params to exclude the LLM and improve performance for hyperparameter tuning.
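
To give an idea of what the 'ubinary' setting for embeddings_precision does to the embeddings, here is a pure-Python sketch of the idea. It assumes the behaviour documented for Sentence Transformers, where each value is thresholded at zero and eight resulting bits are packed into one unsigned byte. The function names are hypothetical, and the library itself works on NumPy arrays rather than lists.

```python
def pack_ubinary(embedding):
    """Threshold each value at 0 and pack 8 bits per byte, first bit highest."""
    bits = [1 if value > 0 else 0 for value in embedding]
    # Pad with zero bits at the end up to a multiple of 8.
    bits += [0] * (-len(bits) % 8)
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return packed


def hamming_distance(packed_a, packed_b):
    """Count the differing bits between two packed embeddings of equal length."""
    return sum(bin(a ^ b).count("1") for a, b in zip(packed_a, packed_b))


example = [0.3, -0.1, 0.8, 0.0, -0.5, 0.2, 0.9, -0.7]
print(pack_ubinary(example))  # bits 10100110 -> [166]
```

With the default bge-base-en-v1.5 model, which produces 768-dimensional embeddings, this reduces each embedding to 96 bytes.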

FAQ

You're using quantized embeddings and encounter an error in method "search_closure" when trying to load a pickled BERTopic model?
Until the pynndescent module gets an official update, you'll have to apply the fix from lmcinnes/pynndescent#240 manually. Download https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py and use it to replace the file in your installation that throws the error.
You want to update your model with a new version of the arXiv dataset?
Simply run steps 1 and 2 again to import new papers and create their embeddings. This will not overwrite your existing data. Then run steps 6 and/or 7 to assign the new papers to the existing topics, and finally recompute the statistics with step 9.
You want to restart the whole process?
Delete arxiv.db in the "database" folder as well as snapshot_update_date.txt in the "input" folder.
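
The restart described in the last answer can also be scripted. This is a minimal sketch, with reset_workflow as a hypothetical helper name. It deletes only the two files named above and leaves the downloaded snapshot and models untouched.

```python
from pathlib import Path


def reset_workflow(repo_root="."):
    """Delete the database and the snapshot date marker, if present.

    Returns the list of files that were actually removed.
    """
    root = Path(repo_root)
    targets = [
        root / "database" / "arxiv.db",
        root / "input" / "snapshot_update_date.txt",
    ]
    removed = []
    for target in targets:
        if target.is_file():
            target.unlink()
            removed.append(target)
    return removed


# Example call from the repository root:
#     for path in reset_workflow():
#         print(f"removed {path}")
```

Note that this permanently removes all imported papers, embeddings and results stored in the database, so keep a copy of arxiv.db if you may need them again.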

Disclaimer

No part of the code or its documentation was created by or with the help of artificial intelligence.

