Author: Sandra Mickwitz
Institution: Università Cattolica del Sacro Cuore
The aim of this project is to train a Latent Dirichlet Allocation (LDA) model to analyze and understand the topics in the ChiLit Corpus from mahlberg-lab/corpora.
A key research question during model training was:
What is the ideal number of topics?
The goal was to strike a balance: enough topics to avoid overly broad themes, but not so many that the topics become too fine-grained to interpret.
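One common way to approach this question is to train one model per candidate topic count and compare a quality score such as coherence or diversity. The sketch below is a generic illustration, not the procedure used in this project. The names `sweep_topic_counts` and `best_topic_count`, and the `train` and `score` callables, are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# A topic is represented here as its list of top words.
Topics = List[List[str]]


def sweep_topic_counts(
    candidates: Iterable[int],
    train: Callable[[int], Topics],
    score: Callable[[Topics], float],
) -> Dict[int, float]:
    """Train one model per candidate topic count and record its score.

    `train` and `score` are placeholders (assumptions): `train(k)` should
    return the top words of a model fitted with k topics, and `score`
    should map those topics to a single number where higher is better.
    """
    return {k: score(train(k)) for k in candidates}


def best_topic_count(scores: Dict[int, float]) -> int:
    """Return the topic count with the highest score.

    Ties are broken in favour of the smaller topic count, which is an
    arbitrary choice made for this sketch.
    """
    if not scores:
        raise ValueError("no scores to compare")
    return min(scores, key=lambda k: (-scores[k], k))
```

In practice, `train` would wrap a real model fit and `score` a real metric. A single score rarely settles the question, so inspecting the topics by hand remains useful.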
This notebook uses the ChiLit Corpus, which is available from the mahlberg-lab/corpora repository.
Additionally, a notebook is available for preprocessing the corpus.
Metric | Unprocessed Files | Processed Files | Why the Change? |
---|---|---|---|
Total Tokens | 5,404,761 | 2,191,892 | Stopwords & punctuation removed |
Unique Tokens | 70,728 | 51,673 | Reduced by about 27%, far less than total tokens; more meaningful words remain |
Lexical Richness | 0.0131 | 0.0236 | Fewer redundant words, unique word ratio increased |
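The lexical richness values in the table are consistent with a simple type-token ratio, that is, unique tokens divided by total tokens. A minimal sketch of that calculation follows. The function name `lexical_richness` is illustrative and is not taken from the preprocessing notebook.

```python
from typing import Iterable


def lexical_richness(tokens: Iterable[str]) -> float:
    """Return the type-token ratio: unique tokens / total tokens."""
    token_list = list(tokens)
    if not token_list:
        raise ValueError("cannot compute lexical richness of an empty token list")
    return len(set(token_list)) / len(token_list)


# Toy example (not corpus data): 4 unique tokens out of 6 in total.
sample = ["the", "cat", "and", "the", "dog", "and"]
print(round(lexical_richness(sample), 4))
```

Applied to the counts in the table, `70728 / 5404761` rounds to 0.0131 and `51673 / 2191892` rounds to 0.0236.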
This project uses OCTIS (Optimizing and Comparing Topic models Is Simple), a Python library designed to facilitate topic model training and evaluation. It provides:
- Preprocessing tools to clean and prepare text data.
- Multiple topic modeling algorithms, including:
  - LDA (Latent Dirichlet Allocation)
  - ProdLDA (Product of Experts LDA)
  - ETM (Embedded Topic Model)
- Evaluation metrics such as:
  - Topic Diversity
  - Topic Coherence
- Hyperparameter optimization for tuning the number of topics.
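To make the Topic Diversity metric concrete, the sketch below implements its usual definition in plain Python: the share of unique words among the top-k words of all topics. It is an illustration written for this README, not OCTIS's own code. The function name `topic_diversity` and its parameters are assumptions.

```python
from typing import List, Sequence


def topic_diversity(topics: Sequence[List[str]], topk: int = 10) -> float:
    """Fraction of unique words among the top-`topk` words of every topic.

    A value near 1 means the topics share few top words; a value near 0
    means the topics are largely redundant.
    """
    if not topics:
        raise ValueError("at least one topic is required")
    unique_words = set()
    for topic in topics:
        if len(topic) < topk:
            raise ValueError("every topic needs at least `topk` words")
        unique_words.update(topic[:topk])
    return len(unique_words) / (topk * len(topics))


# Toy topics (not model output): the word "child" appears in both.
toy_topics = [
    ["child", "mother", "home"],
    ["child", "ship", "sea"],
]
print(topic_diversity(toy_topics, topk=3))
```

Here five of the six top words are distinct, so the score is 5/6. With fully disjoint topics the score would be 1.0.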
For more details, visit the official repository.
@inproceedings{terragni2020octis,
title={{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
author={Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
year={2021},
booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.eacl-demos.31",
pages = "263--270",
}
The GLARE 19th Century Children’s Literature Corpus is part of CLiC!
📌 Blog post: ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC
- Developed as part of the MC-Project GLARE (Exploring Gender in Children’s Literature from a Cognitive Corpus Stylistic Perspective).
- Directed by Anna Čermáková and M. Mahlberg.
- 71 books (35 by female, 36 by male authors).
- 38 authors (14 female, 24 male).
- Published between 1826 and 1911.
- 4,480,386 words in total.
- Source: Project Gutenberg (public domain).
Čermáková, A. (2018, 14 February).
ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC
Retrieved from University of Birmingham Blog.
📢 Feedback and contributions are welcome!
Feel free to submit issues or pull requests.