This repository features two Jupyter Notebooks: one for text preprocessing and the other showcasing topic modeling using the OCTIS library with Latent Dirichlet Allocation (LDA) applied to the ChiLit Corpus. Dive into the README for detailed instructions, citations, and links to the datasets and libraries used in this project.

smicksan/TopicModelingChiLitCorpus

Topic Modeling with LDA on the ChiLit Corpus

Author: Sandra Mickwitz
Institution: Università Cattolica del Sacro Cuore

Introduction

The aim of this project is to train a Latent Dirichlet Allocation (LDA) model to analyze and understand the topics in the ChiLit Corpus from mahlberg-lab/corpora.

A key research question during model training was:
What is the ideal number of topics?
The goal was to find a balance between not too many fine-grained topics and not too few overly broad topics.

ChiLit Corpus Description

This project uses the ChiLit Corpus, which is available from the mahlberg-lab/corpora repository.

Additionally, a notebook is available for preprocessing the corpus.
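
As a rough illustration of the kind of cleaning summarized in the statistics below (stopword and punctuation removal), here is a minimal sketch. It is not the code from the preprocessing notebook: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical and chosen only for demonstration.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# a real pipeline would use a much larger list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "it"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = []
    for raw in text.lower().split():
        # string.punctuation covers ASCII punctuation only; typographic
        # quotes and dashes would need extra handling.
        token = raw.strip(string.punctuation)
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens


sample = "Alice was beginning to get very tired of sitting by her sister on the bank."
print(preprocess(sample))
```

Removing high-frequency function words in this way is what typically causes a large drop in total tokens while leaving most of the vocabulary intact.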

Corpus Statistics

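| Metric           | Unprocessed Files | Processed Files | Why the Change?                                    |
|------------------|-------------------|-----------------|----------------------------------------------------|
| Total Tokens     | 5,404,761         | 2,191,892       | Stopwords & punctuation removed                    |
| Unique Tokens    | 70,728            | 51,673          | Slight reduction, but more meaningful words remain |
| Lexical Richness | 0.0131            | 0.0236          | Fewer redundant words, unique word ratio increased |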
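
The lexical richness values are consistent with a simple type-token ratio (unique tokens divided by total tokens). The sketch below reproduces them from the counts in the table; the function name `lexical_richness` is an assumption made for illustration, not a name taken from the notebooks.

```python
def lexical_richness(total_tokens, unique_tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return unique_tokens / total_tokens


# Counts taken from the Corpus Statistics table.
unprocessed = lexical_richness(5_404_761, 70_728)
processed = lexical_richness(2_191_892, 51_673)

print(f"Unprocessed: {unprocessed:.4f}")  # Unprocessed: 0.0131
print(f"Processed:   {processed:.4f}")    # Processed:   0.0236
```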

Topic Modeling with OCTIS

This project uses OCTIS (Optimizing and Comparing Topic Models Is Simple), a Python library for training, evaluating, and optimizing topic models.

OCTIS provides:

  • Preprocessing tools to clean and prepare text data.
  • Multiple topic modeling algorithms, including:
    • LDA (Latent Dirichlet Allocation)
    • ProdLDA (Product of Experts LDA)
    • ETM (Embedded Topic Model)
  • Evaluation metrics such as:
    • Topic Diversity
    • Topic Coherence
  • Hyperparameter optimization (based on Bayesian optimization), for example to tune the number of topics.
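
To make the Topic Diversity metric listed above more concrete, here is a plain-Python sketch of how it is commonly defined: the share of unique words among the top-k words of all topics. This illustrates the idea only and is not OCTIS's own implementation; the function name and the toy topics are hypothetical.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics.

    A value near 1 means the topics share few top words; a value near 0
    means the topics are largely redundant.
    """
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        raise ValueError("no topic words given")
    return len(set(top_words)) / len(top_words)


# Toy topics, each given as a ranked list of its top words.
topics = [
    ["king", "queen", "castle"],
    ["ship", "sea", "captain"],
    ["king", "sea", "garden"],
]
score = topic_diversity(topics, topk=3)
print(f"{score:.3f}")  # 0.778 (7 unique words out of 9)
```

A model with too many topics tends to repeat top words across topics, which lowers this score. That is one reason diversity is useful alongside coherence when choosing the number of topics.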

For more details, visit the official OCTIS repository (MIND-Lab/OCTIS on GitHub).
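
One way to compare candidate topic counts with OCTIS is sketched below. It follows the usage shown in the OCTIS documentation (`Dataset`, `LDA`, `Coherence`, `TopicDiversity`), but it is not the code from this repository's notebook. The function name `sweep_topic_counts`, the candidate counts, and the dataset folder are assumptions. The folder is assumed to be in the custom-dataset layout OCTIS expects (a `corpus.tsv` and a `vocabulary.txt`).

```python
def sweep_topic_counts(dataset_dir, topic_counts=(10, 20, 30), topk=10):
    """Train one LDA model per topic count and collect its scores.

    Returns a dict mapping each topic count to its NPMI coherence and
    topic diversity. The default topic counts are placeholders.
    """
    # Imported here so this file can be loaded without OCTIS installed.
    from octis.dataset.dataset import Dataset
    from octis.evaluation_metrics.coherence_metrics import Coherence
    from octis.evaluation_metrics.diversity_metrics import TopicDiversity
    from octis.models.LDA import LDA

    dataset = Dataset()
    dataset.load_custom_dataset_from_folder(dataset_dir)

    coherence = Coherence(texts=dataset.get_corpus(), topk=topk, measure="c_npmi")
    diversity = TopicDiversity(topk=topk)

    results = {}
    for k in topic_counts:
        output = LDA(num_topics=k).train_model(dataset)
        results[k] = {
            "coherence": coherence.score(output),
            "diversity": diversity.score(output),
        }
    return results


# Example call (hypothetical path):
# scores = sweep_topic_counts("path/to/processed_chilit")
# for k, s in scores.items():
#     print(k, s["coherence"], s["diversity"])
```

Looking at coherence and diversity side by side for each candidate is a simple manual alternative to OCTIS's built-in optimizer and keeps the trade-off between overly broad and overly fine-grained topics visible.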

OCTIS Citation

Paper: https://www.aclweb.org/anthology/2021.eacl-demos.31

@inproceedings{terragni2020octis,
    title={{OCTIS}: Comparing and Optimizing Topic Models is Simple!},
    author={Terragni, Silvia and Fersini, Elisabetta and Galuzzi, Bruno Giovanni and Tropeano, Pietro and Candelieri, Antonio},
    year={2021},
    booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations},
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-demos.31",
    pages = "263--270",
}

Short Information About the Complete ChiLit Corpus

The GLARE 19th Century Children’s Literature Corpus is part of CLiC!
📌 Blog post: ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC

Corpus Overview

  • Developed as part of the MC-Project GLARE (Exploring Gender in Children’s Literature from a Cognitive Corpus Stylistic Perspective).
  • Directed by Anna Čermáková and M. Mahlberg.
  • 71 books (35 by female, 36 by male authors).
  • 38 authors (14 female, 24 male).
  • Published between 1826 and 1911.
  • 4,480,386 words in total.
  • Source: Project Gutenberg (public domain).

Reference

Čermáková, A. (2018, 14 February).
ChiLit: the GLARE 19th Century Children’s Literature Corpus in CLiC
Retrieved from University of Birmingham Blog.

📢 Feedback and contributions are welcome!
Feel free to submit issues or pull requests.
