This repository contains Python scripts for generating embeddings from book summaries using the voyageai API and performing nearest neighbors analysis on the generated embeddings. The project aims to semantically analyze book summaries to find similarities and recommend books based on content.
Before you begin, ensure you have met the following requirements:
- Python 3.6+
- pandas
- numpy
- scikit-learn
- nltk
- A valid API key from voyageai
- Obtain an API key from voyageai. You will need to sign up for an account and subscribe to a plan that suits your needs.
- Once you have your API key, open `embeddings.py` and locate the following line: `client = voyageai.Client(api_key="")`
- Replace the empty string with your API key: `client = voyageai.Client(api_key="YOUR_API_KEY_HERE")`
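
If you would rather not store the key in the script itself, one option is to read it from an environment variable. The following is a minimal sketch, not part of the repository: the helper name `load_api_key` and the variable name `VOYAGE_API_KEY` are assumptions chosen for illustration.

```python
import os


def load_api_key(env_var="VOYAGE_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the function name and the default variable name are
    hypothetical; adapt them to your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# In embeddings.py, the client line could then become:
#     client = voyageai.Client(api_key=load_api_key())
```

This keeps the key out of version control, at the cost of having to export the variable in your shell before running the scripts.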
- `embeddings.py`: This script processes a dataset of book summaries to generate embeddings using the voyageai API. It includes data cleaning, token counting, and embedding generation.
- `semantic_scores.py`: After generating embeddings, this script loads them and uses the k-Nearest Neighbors algorithm to find and analyze the closest summaries based on their semantic similarity.
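
To illustrate the nearest-neighbors step, here is a small sketch using scikit-learn's `NearestNeighbors` on toy vectors. It is not the code from `semantic_scores.py`: the titles, the 3-dimensional vectors, the helper `nearest_summaries`, and the choice of cosine distance are all assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 3-dimensional vectors standing in for real summary embeddings.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
titles = ["Book A", "Book B", "Book C", "Book D"]  # hypothetical titles


def nearest_summaries(vectors, k=1):
    """Return distances and indices of the k closest other rows per row."""
    # Ask for k + 1 neighbors because each row is its own nearest neighbor.
    model = NearestNeighbors(n_neighbors=k + 1, metric="cosine")
    model.fit(vectors)
    distances, indices = model.kneighbors(vectors)
    # Drop the first column, which is the row itself at distance ~0.
    return distances[:, 1:], indices[:, 1:]


distances, indices = nearest_summaries(embeddings, k=1)
for i, title in enumerate(titles):
    j = indices[i, 0]
    print(f"{title} -> {titles[j]} (cosine distance {distances[i, 0]:.3f})")
```

Cosine distance is a common choice for text embeddings because it compares direction rather than magnitude, but the metric the script actually uses may differ.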
- Place your dataset in the same directory as the scripts, or update the file paths in the scripts to point to where your dataset is located.
- Run `embeddings.py` first to generate embeddings for your dataset. This will create a new CSV file with the embeddings included: `python embeddings.py`
- After generating the embeddings, run `semantic_scores.py` to perform the nearest neighbors analysis: `python semantic_scores.py`
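
Because the embeddings travel between the two scripts through a CSV file, each vector has to be serialized to text and parsed back. The sketch below shows one way this round trip could work with pandas and JSON. The column names `title` and `embedding` and the JSON encoding are assumptions; the format the scripts actually write may differ.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical data: two summaries with tiny 3-dimensional embeddings.
df = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Serialize each vector to a JSON string so it fits in a single CSV cell.
df["embedding"] = df["embedding"].apply(json.dumps)

# An in-memory buffer stands in for the CSV file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Load the CSV and parse the strings back into a 2-D NumPy array.
loaded = pd.read_csv(buffer)
matrix = np.array(loaded["embedding"].apply(json.loads).tolist())
print(matrix.shape)
```

The resulting array has one row per summary and can be passed directly to a nearest-neighbors model.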
Please note that the scripts folder has no test coverage. These scripts were written for one-time use, to process a single dataset for a single analysis, so unit and integration tests were not added.