# Roget's Thesaurus in the 21st Century

An investigation into how modern machine learning techniques align with Roget's classical thesaurus categorization.

## Data Collection & Processing
- Web scraping of Roget's Thesaurus
- Hierarchical parsing of classes, divisions, sections, and terms
- Custom preprocessing preserving linguistic nuances (see the parsing sketch below)
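
A minimal sketch of the scraping-and-parsing step, assuming the Project Gutenberg plain-text edition and illustrative regexes; the project's actual logic lives in `utils/thesaurus_parser.py` and likely differs:

```python
import re

import requests

# Assumed source: the Project Gutenberg plain-text edition of Roget's Thesaurus.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/22/pg22.txt"

def fetch_thesaurus_text(url: str = GUTENBERG_URL) -> str:
    """Download the raw thesaurus text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def parse_hierarchy(text: str) -> dict:
    """Group SECTION headings under their parent CLASS headings."""
    hierarchy = {}
    current_class = None
    for line in text.splitlines():
        if re.match(r"\s*CLASS\s+[IVX]+", line):
            current_class = line.strip()
            hierarchy[current_class] = []
        elif re.match(r"\s*SECTION\s+[IVX]+", line) and current_class is not None:
            hierarchy[current_class].append(line.strip())
    return hierarchy
```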

## Word Embeddings
- OpenAI's text-embedding-3-large model
- Parallel processing for efficient embedding generation
- Chunked storage system for large-scale embeddings (sketched below)
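
A minimal sketch of the parallel, chunked embedding flow, assuming the official `openai` Python client (v1+); the batch size, worker count, and `embeddings/chunk_*.npy` layout are illustrative stand-ins for what `utils/parallelized_embeddings_fetcher.py` actually does:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env

def embed_batch(batch):
    """Embed one batch of terms with text-embedding-3-large."""
    response = client.embeddings.create(model="text-embedding-3-large", input=batch)
    return [item.embedding for item in response.data]

def embed_terms(terms, batch_size=100, batches_per_chunk=10):
    """Embed terms in parallel batches, then store results as chunked .npy files."""
    Path("embeddings").mkdir(exist_ok=True)
    batches = [terms[i:i + batch_size] for i in range(0, len(terms), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(embed_batch, batches))  # order-preserving
    # Write every `batches_per_chunk` batches to a separate file to bound memory use.
    for i in range(0, len(results), batches_per_chunk):
        chunk = [vec for batch in results[i:i + batches_per_chunk] for vec in batch]
        np.save(f"embeddings/chunk_{i // batches_per_chunk:04d}.npy", np.array(chunk))
```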

## Unsupervised Learning
- Clustering analysis at both class and division/section levels
- Comparison between discovered clusters and Roget's classification
- Hungarian algorithm for optimal cluster matching (see the sketch below)
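
A minimal sketch of the matching step at the class level: cluster the embeddings, build a cluster-vs-class overlap matrix, and let SciPy's Hungarian solver (`linear_sum_assignment`) pick the best one-to-one mapping. The function name, KMeans settings, and integer-encoded labels are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def best_match_accuracy(embeddings, true_labels, n_classes):
    """Cluster embeddings, then align cluster ids to Roget classes via the Hungarian algorithm.

    Assumes `true_labels` are integers in [0, n_classes).
    """
    predicted = KMeans(n_clusters=n_classes, n_init=10, random_state=42).fit_predict(embeddings)
    # Contingency matrix: overlap[i, j] = items in cluster i belonging to class j.
    overlap = np.zeros((n_classes, n_classes), dtype=int)
    for pred, true in zip(predicted, true_labels):
        overlap[pred, true] += 1
    # The Hungarian algorithm maximizes total overlap (minimize the negated counts).
    rows, cols = linear_sum_assignment(-overlap)
    return overlap[rows, cols].sum() / len(true_labels)
```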

## Supervised Classification
- Two-level prediction models (class and division/section)
- Neural networks, Random Forest and XGBoost implementations
- Performance evaluation with multiple metrics (an example setup is sketched below)
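
A minimal sketch of one prediction level (Roget class) using the Random Forest variant; the split, hyperparameters, and metric set are illustrative, and the neural-network and XGBoost models follow the same pattern:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def evaluate_class_level(embeddings, class_labels):
    """Train a class-level classifier on the embeddings and report standard metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, class_labels, test_size=0.2, stratify=class_labels, random_state=42
    )
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)
    # Accuracy plus per-class precision, recall, and F1 in one report.
    print(classification_report(y_test, model.predict(X_test)))
```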

## Installation

- Clone the repository:
  ```bash
  git clone https://github.com/marsidmali/Roget-s-Thesaurus-in-the-21st-Century.git
  cd Roget-s-Thesaurus-in-the-21st-Century
  ```
- Create and activate a virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables:
  - Copy `.env.example` to `.env`
  - Add your OpenAI API key to `.env`

## Project Structure

```
Roget-s-Thesaurus-in-the-21st-Century/
│
├── notebooks/                               # Jupyter notebooks
│   ├── rogets_thesaurus_analysis.ipynb
│   └── assigment_3.ipynb
├── utils/                                   # Utility functions
│   ├── thesaurus_parser.py
│   └── parallelized_embeddings_fetcher.py
├── embeddings/                              # Embeddings storage
├── requirements.txt                         # Dependencies
└── README.md                                # Documentation
```

## Usage

- Launch Jupyter Notebook:
  ```bash
  jupyter notebook
  ```
- Open `Roget's Thesaurus in the 21st Century.ipynb`
- Run all cells to perform the analysis

## License

This project is licensed under the MIT License - see the LICENSE file for details.