📚 Roget's Thesaurus in the 21st Century

Python 3.8+ | OpenAI | Jupyter Notebook | License: MIT

An investigation into how modern machine learning techniques align with Roget's classical thesaurus categorization.

✨ Features

  1. Data Collection & Processing

    • Web scraping of Roget's Thesaurus
    • Hierarchical parsing of classes, divisions, sections, and terms
    • Custom preprocessing preserving linguistic nuances
  2. Word Embeddings

    • OpenAI's text-embedding-3-large model
    • Parallel processing for efficient embedding generation
    • Chunked storage system for large-scale embeddings
  3. Unsupervised Learning

    • Clustering analysis at both class and division/section levels
    • Comparison between discovered clusters and Roget's classification
    • Hungarian algorithm for optimal cluster matching
  4. Supervised Classification

    • Two-level prediction models (class and division/section)
    • Neural networks, Random Forest and XGBoost implementations
    • Performance evaluation with multiple metrics
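The cluster-matching step listed under Unsupervised Learning can be illustrated with a small, standard-library-only sketch. It is not the repository's code: the function name `match_clusters` and the toy labels are hypothetical, and it finds the optimal one-to-one assignment by brute force over permutations rather than with the Hungarian algorithm itself. For a handful of clusters both give the same optimum; at larger scale, `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.

```python
from collections import Counter
from itertools import permutations


def match_clusters(true_labels, cluster_ids):
    """Map each cluster id to a distinct true label, maximizing total overlap."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("this sketch assumes no more clusters than labels")
    # Contingency counts: how many items of each label fall in each cluster.
    overlap = Counter(zip(cluster_ids, true_labels))
    best_score, best_mapping = -1, {}
    # Brute force over one-to-one assignments (a stand-in for the Hungarian
    # algorithm; only practical for a small number of clusters).
    for chosen in permutations(labels, len(clusters)):
        score = sum(overlap[(c, l)] for c, l in zip(clusters, chosen))
        if score > best_score:
            best_score, best_mapping = score, dict(zip(clusters, chosen))
    # Second value: fraction of items whose cluster maps to their true label.
    return best_mapping, best_score / len(true_labels)


# Hypothetical toy data: three Roget class names vs. arbitrary cluster ids.
true_labels = ["Space", "Space", "Matter", "Matter", "Volition", "Volition"]
cluster_ids = [2, 2, 0, 0, 1, 0]
mapping, accuracy = match_clusters(true_labels, cluster_ids)
```

Once clusters are mapped to classes this way, the matched accuracy gives one simple measure of how closely the discovered clusters follow Roget's classification.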

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/marsidmali/Roget-s-Thesaurus-in-the-21st-Century.git
cd Roget-s-Thesaurus-in-the-21st-Century
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
    • Copy .env.example to .env
    • Add your OpenAI API key
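
The environment-variable step above can be sanity-checked from Python before running the notebooks. The sketch below is an illustration only: it assumes the `.env` file uses plain `KEY=VALUE` lines and that the key is stored under `OPENAI_API_KEY` (the variable the OpenAI Python client reads by default); the contents of `.env.example` are not shown here, so the actual name may differ. The helper names `load_env_file` and `require_api_key` are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a dotenv-style file into os.environ."""
    env_path = Path(path)
    if not env_path.exists():
        return False
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Keep any value already exported in the shell.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))
    return True


def require_api_key(name="OPENAI_API_KEY"):
    """Return the key, or fail early with a readable message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return key
```

Calling `load_env_file()` followed by `require_api_key()` at the top of a notebook surfaces a missing key immediately, rather than partway through embedding generation.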

📁 Project Structure

Roget-s-Thesaurus-in-the-21st-Century/
│
├── notebooks/           # Jupyter notebooks
│   ├── rogets_thesaurus_analysis.ipynb
│   └── assigment_3.ipynb
├── utils/               # Utility functions
│   ├── thesaurus_parser.py
│   └── parallelized_embeddings_fetcher.py
├── embeddings/          # Embeddings storage
├── requirements.txt     # Dependencies
└── README.md            # Documentation

🚀 Usage

  1. Launch Jupyter Notebook:
jupyter notebook
  2. Open Roget's Thesaurus in the 21st Century.ipynb
  3. Run all cells to perform the analysis

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
