📚 Roget's Thesaurus in the 21st Century

Python 3.8+ | OpenAI | Jupyter Notebook | License: MIT

An investigation into how modern machine learning techniques align with Roget's classical thesaurus categorization.

✨ Features

  1. Data Collection & Processing

    • Web scraping of Roget's Thesaurus
    • Hierarchical parsing of classes, divisions, sections, and terms
    • Custom preprocessing preserving linguistic nuances
  2. Word Embeddings

    • OpenAI's text-embedding-3-large model
    • Parallel processing for efficient embedding generation
    • Chunked storage system for large-scale embeddings
  3. Unsupervised Learning

    • Clustering analysis at both class and division/section levels
    • Comparison between discovered clusters and Roget's classification
    • Hungarian algorithm for optimal cluster matching
  4. Supervised Classification

    • Two-level prediction models (class and division/section)
    • Neural networks, Random Forest and XGBoost implementations
    • Performance evaluation with multiple metrics
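The cluster-matching step listed under Unsupervised Learning can be sketched as follows. This is a minimal illustration, assuming NumPy and SciPy are available and using SciPy's `linear_sum_assignment` as the Hungarian solver; the function name `match_clusters` and the toy labels are hypothetical and are not taken from the notebooks.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clusters(true_labels, cluster_labels):
    """Pair each cluster with one class so that total overlap is maximal."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # contingency[i, j] = number of items in cluster i that belong to class j
    contingency = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, cluster in enumerate(clusters):
        for j, cls in enumerate(classes):
            contingency[i, j] = np.sum(
                (cluster_labels == cluster) & (true_labels == cls)
            )
    # Hungarian algorithm: one-to-one pairing with the largest summed overlap
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    mapping = {clusters[r].item(): classes[c].item() for r, c in zip(rows, cols)}
    accuracy = float(contingency[rows, cols].sum() / len(true_labels))
    return mapping, accuracy


# Toy data (hypothetical): three classes, cluster ids arbitrarily permuted
true_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_labels = np.array([2, 2, 0, 0, 0, 0, 1, 1, 1])
mapping, accuracy = match_clusters(true_labels, cluster_labels)
print(mapping, round(accuracy, 3))
```

Because cluster ids from an unsupervised model are arbitrary, the one-to-one mapping is what makes a direct comparison with Roget's classes meaningful; the matched accuracy is the share of items whose cluster maps to their own class.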

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/marsidmali/Roget-s-Thesaurus-in-the-21st-Century.git
cd Roget-s-Thesaurus-in-the-21st-Century
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Set up environment variables:
    • Copy .env.example to .env
    • Add your OpenAI API key

📁 Project Structure

Roget-s-Thesaurus-in-the-21st-Century/
│
├── notebooks/           # Jupyter notebooks
│   ├── rogets_thesaurus_analysis.ipynb
│   └── assigment_3.ipynb
├── utils/               # Utility functions
│   ├── thesaurus_parser.py
│   └── parallelized_embeddings_fetcher.py
├── embeddings/          # Embeddings storage
├── requirements.txt     # Dependencies
└── README.md            # Documentation

🚀 Usage

  1. Launch Jupyter Notebook:
jupyter notebook
  2. Open Roget's Thesaurus in the 21st Century.ipynb
  3. Run all cells to perform the analysis

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.