# Roget's Thesaurus in the 21st Century

An investigation into how modern machine learning techniques align with Roget's classical thesaurus categorization.

## Data Collection & Processing
- Web scraping of Roget's Thesaurus
- Hierarchical parsing of classes, divisions, sections, and terms
- Custom preprocessing preserving linguistic nuances (see the parsing sketch below)
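
A minimal sketch of the scraping-and-parsing step, assuming the Project Gutenberg plain-text edition and illustrative regexes; the project's actual logic lives in `utils/thesaurus_parser.py` and likely differs:

```python
import re

import requests

# Assumed source: the Project Gutenberg plain-text edition of Roget's Thesaurus.
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/22/pg22.txt"

def fetch_thesaurus_text(url: str = GUTENBERG_URL) -> str:
    """Download the raw thesaurus text."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

def parse_hierarchy(text: str) -> dict:
    """Group SECTION headings under their parent CLASS headings."""
    hierarchy = {}
    current_class = None
    for line in text.splitlines():
        if re.match(r"\s*CLASS\s+[IVX]+", line):
            current_class = line.strip()
            hierarchy[current_class] = []
        elif re.match(r"\s*SECTION\s+[IVX]+", line) and current_class is not None:
            hierarchy[current_class].append(line.strip())
    return hierarchy
```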

## Word Embeddings
- OpenAI's text-embedding-3-large model
- Parallel processing for efficient embedding generation
- Chunked storage system for large-scale embeddings (sketched below)
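
A minimal sketch of the parallel, chunked embedding flow, assuming the official `openai` Python client (v1+); the batch size, worker count, and `embeddings/chunk_*.npy` layout are illustrative stand-ins for what `utils/parallelized_embeddings_fetcher.py` actually does:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment / .env

def embed_batch(batch):
    """Embed one batch of terms with text-embedding-3-large."""
    response = client.embeddings.create(model="text-embedding-3-large", input=batch)
    return [item.embedding for item in response.data]

def embed_terms(terms, batch_size=100, batches_per_chunk=10):
    """Embed terms in parallel batches, then store results as chunked .npy files."""
    Path("embeddings").mkdir(exist_ok=True)
    batches = [terms[i:i + batch_size] for i in range(0, len(terms), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(embed_batch, batches))  # order-preserving
    # Write every `batches_per_chunk` batches to a separate file to bound memory use.
    for i in range(0, len(results), batches_per_chunk):
        chunk = [vec for batch in results[i:i + batches_per_chunk] for vec in batch]
        np.save(f"embeddings/chunk_{i // batches_per_chunk:04d}.npy", np.array(chunk))
```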

## Unsupervised Learning
- Clustering analysis at both class and division/section levels
- Comparison between discovered clusters and Roget's classification
- Hungarian algorithm for optimal cluster matching (see the sketch below)
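
A minimal sketch of the matching step at the class level: cluster the embeddings, build a cluster-vs-class overlap matrix, and let SciPy's Hungarian solver (`linear_sum_assignment`) pick the best one-to-one mapping. The function name, KMeans settings, and integer-encoded labels are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def best_match_accuracy(embeddings, true_labels, n_classes):
    """Cluster embeddings, then align cluster ids to Roget classes via the Hungarian algorithm.

    Assumes `true_labels` are integers in [0, n_classes).
    """
    predicted = KMeans(n_clusters=n_classes, n_init=10, random_state=42).fit_predict(embeddings)
    # Contingency matrix: overlap[i, j] = items in cluster i belonging to class j.
    overlap = np.zeros((n_classes, n_classes), dtype=int)
    for pred, true in zip(predicted, true_labels):
        overlap[pred, true] += 1
    # The Hungarian algorithm maximizes total overlap (minimize the negated counts).
    rows, cols = linear_sum_assignment(-overlap)
    return overlap[rows, cols].sum() / len(true_labels)
```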

## Supervised Classification
- Two-level prediction models (class and division/section)
- Neural networks, Random Forest and XGBoost implementations
- Performance evaluation with multiple metrics (an example setup is sketched below)
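
A minimal sketch of one prediction level (Roget class) using the Random Forest variant; the split, hyperparameters, and metric set are illustrative, and the neural-network and XGBoost models follow the same pattern:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def evaluate_class_level(embeddings, class_labels):
    """Train a class-level classifier on the embeddings and report standard metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, class_labels, test_size=0.2, stratify=class_labels, random_state=42
    )
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X_train, y_train)
    # Accuracy plus per-class precision, recall, and F1 in one report.
    print(classification_report(y_test, model.predict(X_test)))
```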

## Installation

- Clone the repository:
  ```bash
  git clone https://github.com/marsidmali/Roget-s-Thesaurus-in-the-21st-Century.git
  cd Roget-s-Thesaurus-in-the-21st-Century
  ```
- Create and activate a virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables:
  - Copy `.env.example` to `.env`
  - Add your OpenAI API key to `.env`

## Project Structure

```
Roget-s-Thesaurus-in-the-21st-Century/
│
├── notebooks/                               # Jupyter notebooks
│   ├── rogets_thesaurus_analysis.ipynb
│   └── assigment_3.ipynb
├── utils/                                   # Utility functions
│   ├── thesaurus_parser.py
│   └── parallelized_embeddings_fetcher.py
├── embeddings/                              # Embeddings storage
├── requirements.txt                         # Dependencies
└── README.md                                # Documentation
```

## Usage

- Launch Jupyter Notebook:
  ```bash
  jupyter notebook
  ```
- Open `Roget's Thesaurus in the 21st Century.ipynb`
- Run all cells to perform the analysis

## License

This project is licensed under the MIT License - see the LICENSE file for details.