This repository provides an extensive workflow for topic modeling the entire arXiv dataset. It uses an adapted BERTopic pipeline and also includes:
- preprocessing with nltk
- an SQLite database to save results
- label generation with Llama
- trend analysis with statsmodels
- visualization with dash/plotly
The process is designed to run locally and has been run successfully on a laptop with 16 GB of RAM and an 8 GB Nvidia graphics card under Windows 10. Further optimizations are still needed to improve the computation speed, though.
- Clone the repository
- Install Python 3.10+ for your platform from https://www.python.org/downloads/
- For building hdbscan and llama, a C++ compiler is required:
  - Linux: gcc or clang
  - Windows: Visual Studio Build Tools or MinGW
  - macOS: Xcode
- To utilize your GPU for the SentenceTransformer, install pytorch+CUDA from https://pytorch.org/get-started/locally/
- Install Llama-cpp-python for your hardware from https://pypi.org/project/llama-cpp-python/
- Install the rest of the dependencies from the requirements.txt
pip install -r requirements.txt
- Download the nltk punkt model
python -m nltk.downloader punkt
- Choose a suitable Sentence Transformer and LLM, and if necessary update the values for "embeddings_model" and "llm_model" in the config.json accordingly.
  If you intend to use the default config, the models from https://huggingface.co/BAAI/bge-base-en-v1.5/tree/main and https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/tree/main need to be downloaded into the appropriate subfolders. For the Sentence Transformer you can instead put only the model identifier in the config to have it downloaded automatically from Hugging Face. Note, however, that this leads to an online check for updates every time you create embeddings.
- Register with Kaggle and download the latest arXiv dataset from https://www.kaggle.com/datasets/Cornell-University/arxiv into the "input" subfolder
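
Before starting a run, the two model settings mentioned above can be checked with a small sketch like the following. It assumes that "embeddings_model" and "llm_model" are top-level string entries in config.json and that relative paths are resolved against the folder containing the config; both are assumptions, and the helper names `exists_locally` and `check_models` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def exists_locally(value: str, base_dir: Path) -> bool:
    """Return True if the configured value points to an existing file or folder."""
    candidate = Path(value)
    if not candidate.is_absolute():
        # Assumption: relative paths are relative to the config file's folder.
        candidate = base_dir / candidate
    return candidate.exists()


def check_models(config_path: Path) -> dict[str, bool]:
    """Report for each model setting whether it was found on disk."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: both keys are top-level strings in config.json.
    keys = ("embeddings_model", "llm_model")
    return {key: exists_locally(config[key], config_path.parent) for key in keys}
```

A value of `False` for "embeddings_model" is not necessarily an error, since a Hugging Face model identifier is also allowed there. For "llm_model" it most likely means the GGUF file still has to be downloaded.
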
- Clone the repository
- Install Python 3.10+ for your platform from https://www.python.org/downloads/
- Install python modules dash and statsmodels
pip install dash statsmodels
- Download and unzip a trimmed-down database (without abstracts, embeddings and subsets) from the releases
The process is split up into several steps to allow intermediate evaluation of the individual results and some variations in the execution.
- Step 1 creates the SQLite database to then import the cleaned, filtered and transformed features from the arXiv snapshot
- Step 2 converts the abstracts into sentence embeddings
- Step 3 trains subsets of the dataset, based on arXiv categories/archives
- Step 4 trains the main model
- Step 4b creates heatmaps, barcharts and hierarchical representations of the topics
- Step 5 (re)generates topic labels
- Step 6 assigns additional papers to existing topics with the original UMAP & HDBSCAN models
- Step 7 assigns outlier papers to existing topics using a BERTopic approximation
- Step 8 creates a new model of higher hierarchy by merging the topics of an existing model, based on cluster similarity
- Step 9 computes the necessary statistics for all papers over all months of the dataset
- Step 10 visualizes the topic trends
Of these, only steps 1, 2, 4 and 9 are strictly necessary before the visualization is possible. Steps 3, 6 and 7 are needed if you do not have enough RAM to train the whole dataset at once (hint: 16 GB is not enough). Step 8 is optional, but recommended to further evaluate the clusters. Step 5 is meant to be used if you're not pleased with some of the initial labels generated by the LLM.
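
The rules above for choosing steps can be written down as a short sketch. The function `steps_to_run` and its parameters are hypothetical and only encode the paragraph above; the repository itself is run by starting the individual step files. Step 4b (charts) is left out because it is not required for any later step.

```python
def steps_to_run(enough_ram: bool, merge_topics: bool = True, relabel: bool = False) -> list[str]:
    """Return the step numbers to run, in order, ending with the visualization."""
    steps = ["1", "2"]           # import the snapshot, create embeddings
    if not enough_ram:
        steps.append("3")        # train subsets per category/archive
    steps.append("4")            # train the main model
    if relabel:
        steps.append("5")        # regenerate unsatisfying topic labels
    if not enough_ram:
        steps += ["6", "7"]      # assign remaining papers and outliers
    if merge_topics:
        steps.append("8")        # optional: merge topics into a higher hierarchy
    steps += ["9", "10"]         # compute statistics, then visualize
    return steps
```

For example, `steps_to_run(enough_ram=True, merge_topics=False)` yields the minimal path of steps 1, 2, 4, 9 and 10.
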
All steps are configured via the config.json file, such that they can be run by simply starting the respective python file.
Notable configurations include:
- pre_year_min: the minimum year for papers from the snapshot to be included in your model
- embeddings_precision: precision used for the SentenceTransformer encoding, according to https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html
  Use 'ubinary' for quantized embeddings.
- train_models: settings for the selection of papers in step 4
  If you want to train all papers at once, set "model_filter" to "None".
  Otherwise you can choose between "archives", "outliers" or "hierarchy" and set "percent_of_papers" to the desired percentage of papers to be selected proportionally.
- agg_models: settings for step 8
  - parent_model: the previously trained model to use
  - max_cluster_distance: the maximum distance between topics to be merged together in the resulting model
- bert_params: various parameters for the different BERTopic components
  See the documentation for BERTopic, UMAP and HDBSCAN to understand their respective use.
  Notably, the adapted BERTopic pipeline allows you to define a whole range of min_cluster_sizes and min_sample_sizes to be used by HDBSCAN, and the combination with the highest DBCV score will be chosen automatically. Also, a new hyperparameter "hdbscan_min_clusters" was added, which allows setting a minimum number of resulting clusters when using the ranges.

You can add any number of additional entries to train_models, agg_models (and bert_params accordingly) to train models in batches. Set "generate_labels" to false in the bert_params to exclude the LLM and improve performance for hyperparameter tuning.
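
The range-based HDBSCAN selection described above can be illustrated with a minimal sketch. It is not the repository's implementation: the `Candidate` class, the `pick_best` function and the scores in the usage example are hypothetical, and the sketch assumes that each parameter combination has already been clustered and scored with DBCV.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Result of one HDBSCAN run for a given parameter combination."""
    min_cluster_size: int
    min_samples: int
    n_clusters: int
    dbcv: float


def pick_best(candidates: list[Candidate], min_clusters: int = 0) -> Optional[Candidate]:
    """Pick the candidate with the highest DBCV score that has enough clusters."""
    # Drop runs that produced fewer clusters than requested
    # (the role played by "hdbscan_min_clusters" in the text above).
    eligible = [c for c in candidates if c.n_clusters >= min_clusters]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.dbcv)
```

With made-up results `Candidate(10, 5, 120, 0.31)`, `Candidate(50, 5, 8, 0.47)` and `Candidate(25, 10, 40, 0.39)`, calling `pick_best` without a minimum picks the run with 8 clusters. With `min_clusters=20` it picks the run with 40 clusters instead, because the highest-scoring run no longer qualifies.
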
- You're using quantized embeddings and encounter an error in method "search_closure" when trying to load a pickled BERTopic model?
  Until the pynndescent module gets an official update, you'll have to manually apply the fix from lmcinnes/pynndescent#240. Download https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py and replace the installed pynndescent_.py that throws the error with it.
- You want to update your model with a new version of the arXiv dataset?
  Simply run steps 1 and 2 again to import new papers and create their embeddings. This will not overwrite your existing data. Then run steps 6 and/or 7 to assign the new papers to the existing topics, and finally recompute the statistics with step 9.
- You want to restart the whole process?
  Delete arxiv.db in the "database" folder as well as snapshot_update_date.txt in the "input" folder.
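
The restart described in the last answer can also be scripted. The following is a minimal sketch that assumes both folders sit directly under the repository root; the function name `reset_pipeline` is hypothetical and not part of the repository.

```python
from pathlib import Path


def reset_pipeline(project_root: Path) -> list[Path]:
    """Delete the database and the snapshot marker; return the files that were removed."""
    targets = [
        project_root / "database" / "arxiv.db",
        project_root / "input" / "snapshot_update_date.txt",
    ]
    removed = []
    for target in targets:
        # Skip files that are already gone, so the function is safe to call twice.
        if target.is_file():
            target.unlink()
            removed.append(target)
    return removed
```

Calling `reset_pipeline(Path("."))` from the repository root removes both files if they exist. This deletes all imported papers and results, so back up arxiv.db first if you might need it again.
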
No part of the code or its documentation was created by or with the help of artificial intelligence.