This repository contains an implementation of Latent Dirichlet Allocation (LDA) with collapsed Gibbs sampling in Python, by Cole Juracek and Pierre Gardan.
Poetry is the recommended way to install this project. Run:
poetry install
This installs the project's requirements into a new virtual environment.
Project installation can also be done with pip. Run:
pip install -r requirements.txt
It is recommended to install the project requirements into a project-specific virtual environment.
Beyond the project dependencies, the language model used for parsing and a list of stopwords must also be downloaded. Run:
make install
This handles the download steps.
TODO
Examples on three datasets can be found within collapsed_lda/examples/:
- 20 Newsgroups (via scikit-learn)
- Reuters-21578
- NIPS dataset
Data used for these examples can be found in the top-level data/ directory where appropriate.
In comparisons, our algorithm is compared against existing implementations that use different inference methods, such as sklearn (variational Bayesian inference) and gensim (PLDA).
To run the test suite, run the following command from inside the project's virtual environment:
python -m pytest tests/
The sampler file contains the actual function for the Gibbs sampler. The utility file contains functions that prepare the data into usable tokens and titles, while the inference file consists of a single function that prints the top words.
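To make the roles of the sampler and the top-words function concrete, here is a minimal, standard-library-only sketch of collapsed Gibbs sampling for LDA. It is not the code in this repository: the function names `collapsed_gibbs_lda` and `top_words`, the hyperparameter defaults, and the toy corpus are all hypothetical, and documents are assumed to be already tokenized into integer word ids.

```python
import random


def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Hypothetical sketch. docs is a list of lists of word ids.

    Returns (doc-topic counts, topic-word counts).
    """
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # tokens in doc d assigned to topic k
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # times word w is assigned to topic k
    n_k = [0] * n_topics                                 # total tokens assigned to topic k
    z = []                                               # topic assignment of every token

    # Random initial assignment of a topic to each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional, up to a constant:
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw


def top_words(n_kw, vocab, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:n]]
        for row in n_kw
    ]


# Toy corpus (hypothetical): word ids index into vocab.
vocab = ["ball", "bat", "pitch", "vote", "bill", "senate"]
docs = [[0, 1, 2, 0, 2], [1, 0, 2, 1], [3, 4, 5, 3], [4, 5, 3, 4, 5]]
n_dk, n_kw = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=len(vocab))
for k, words in enumerate(top_words(n_kw, vocab)):
    print(f"topic {k}: {words}")
```

The sampler keeps only count tables, which is what "collapsed" refers to: the topic and document distributions are integrated out, and each token's topic is resampled from its conditional given all other assignments.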