This project's origin is here. In this project, we use Weaviate to perform semantic search on podcast transcripts. The text is vectorized with the OpenAI text2vec transformer module. Once all of the data is vectorized and stored, it can be searched semantically.
Vectorization module: sentence-transformers/multi-qa-distilbert-cos-v1.
Note: if this doesn't work, try sentence-transformers/msmarco-distilroberta-base-v2
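To make "semantic search" concrete, here is a minimal sketch of ranking by cosine similarity, which the "cos" in the model name suggests it is tuned for. The three-dimensional vectors and the episode names are invented for illustration; real embeddings come from the vectorization module and have far more dimensions, and Weaviate performs this ranking internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors, made up for this example (not real transcript embeddings).
docs = {
    "episode_a": [0.9, 0.1, 0.0],
    "episode_b": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

for name, score in rank(query, docs):
    print(f"{name}: {score:.3f}")
```

The transcript whose vector points in nearly the same direction as the query vector is listed first, regardless of whether the two texts share any exact words.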
(TODO: Add demo video)
Before you can run the project, you need to have Docker, Docker Compose, and Python installed on your machine. Follow the instructions below to install the prerequisites:
- For Windows and Mac:
  - Download and install Docker Desktop from Docker's official website.
- For Linux:
  - Run the following commands in your terminal:
    sudo apt-get update
    sudo apt-get install docker-ce docker-ce-cli containerd.io
- For Windows and Mac:
  - Docker Compose is included with Docker Desktop.
- For Linux:
  - Run the following command in your terminal:
    sudo apt install docker-compose
- Download and install the latest version of Python from Python's official website.
- Verify the installation by running the following command in your terminal:
python --version
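As an optional sanity check, the prerequisites above could be verified with a short script like the following sketch. The tool names are taken from the commands used in this guide; the script only checks that each executable is on your PATH, not that it works or is the right version.

```python
import shutil
import sys

# Executables used by the commands in this guide.
REQUIRED_TOOLS = ["docker", "docker-compose", "pip"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is what you want.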
- Install virtualenv (if not already installed):
pip install virtualenv
- Create a virtual environment: navigate to the directory where you want to create it, then run:
virtualenv <name_of_virtualenv>
- Activate the virtual environment. On Windows, run:
.\<name_of_virtualenv>\Scripts\activate
On macOS and Linux, run:
source <name_of_virtualenv>/bin/activate
- Install Python requirements:
pip install -r requirements.txt
- Export OpenAI API Key:
export OPENAI_APIKEY=<your_openai_api_key>
- Start up Weaviate:
docker-compose up -d
Once completed, Weaviate is running on http://localhost:8080.
- Run
python import.py
to import the transcripts into Weaviate.
- The data is now stored in the Weaviate instance. You can experiment with it using a Python notebook or a Python file.
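As a starting point for experimenting, the sketch below checks that Weaviate is ready and then runs a `nearText` query through Weaviate's GraphQL endpoint, using only the Python standard library. The class name `PodcastTranscript` and the property `title` are assumptions made for illustration; replace them with whatever names import.py actually creates. The search phrase is likewise just an example.

```python
import json
import urllib.error
import urllib.request

WEAVIATE_URL = "http://localhost:8080"  # address from the steps above


def is_ready(base_url=WEAVIATE_URL, timeout=5):
    """Return True if Weaviate's readiness endpoint answers with HTTP 200."""
    try:
        url = f"{base_url}/v1/.well-known/ready"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def build_near_text_query(class_name, properties, concept, limit=3):
    """Build a GraphQL Get query that ranks objects by similarity to `concept`."""
    fields = " ".join(properties)
    # json.dumps quotes and escapes the search text as a GraphQL string.
    return (
        f"{{ Get {{ {class_name}("
        f"nearText: {{concepts: [{json.dumps(concept)}]}}, limit: {int(limit)}"
        f") {{ {fields} _additional {{ distance }} }} }} }}"
    )


def search(concept, class_name, properties, limit=3, base_url=WEAVIATE_URL):
    """POST the query to the GraphQL endpoint and return the matching objects."""
    query = build_near_text_query(class_name, properties, concept, limit)
    request = urllib.request.Request(
        f"{base_url}/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        body = json.load(resp)
    if "errors" in body:
        raise RuntimeError(body["errors"])
    return body["data"]["Get"][class_name]


if __name__ == "__main__":
    if not is_ready():
        print("Weaviate is not reachable at", WEAVIATE_URL)
    else:
        # Hypothetical class and property names; use the ones import.py defines.
        for hit in search("open source licensing", "PodcastTranscript", ["title"]):
            print(hit)
```

Each returned object carries the requested properties plus `_additional.distance`, where a smaller distance means a closer semantic match. The official Weaviate Python client offers the same query in a more convenient form; plain HTTP is used here only to keep the example free of extra dependencies.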
300 Podcast transcripts from Changelog