Skip to content

Commit

Permalink
Vector Search with Kai (#97)
Browse files Browse the repository at this point in the history
* Vector Search with Kai
  • Loading branch information
spandanartw authored May 22, 2024
1 parent 06dce57 commit c12a253
Show file tree
Hide file tree
Showing 2 changed files with 362 additions and 0 deletions.
8 changes: 8 additions & 0 deletions notebooks/vector-search-with-kai/meta.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
[meta]
title = "Vector Search with Kai"
description = """
Run Vector Search using MongoDB clients and power GenAI usecases for your MongoDB applications """
icon = "radar"
tags = ["mongo", "embeddings", "vector", "genai", "kai", "starter"]
destinations = ["spaces"]
minimum_tier="standard"
354 changes: 354 additions & 0 deletions notebooks/vector-search-with-kai/notebook.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,354 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "aa202b34-20c7-4021-a142-d959a81d2cfd",
"metadata": {},
"source": [
"<div id=\"singlestore-header\" style=\"display: flex; background-color: rgba(255, 182, 176, 0.25); padding: 5px;\">\n",
" <div id=\"icon-image\" style=\"width: 90px; height: 90px;\">\n",
" <img width=\"100%\" height=\"100%\" src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/radar.png\" />\n",
" </div>\n",
" <div id=\"text\" style=\"padding: 5px; margin-left: 10px;\">\n",
" <div id=\"badge\" style=\"display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%\">SingleStore Notebooks</div>\n",
" <h1 style=\"font-weight: 500; margin: 8px 0 0 4px;\">Vector Search with Kai</h1>\n",
" </div>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"id": "b1eef9c6-fcc6-4d1c-b280-102eea62a5ec",
"metadata": {},
"source": [
"## Vector Search with Kai"
]
},
{
"cell_type": "markdown",
"id": "078b13da-9901-4be5-b472-25c8cd8e01c4",
"metadata": {},
"source": [
"In this notebook, we load a dataset into a collection, create a vector index and perform vector searches using Kai in a way that is compatible with MongoDB clients and applications"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "2b25990e-8ef4-4759-b828-98d8fbe092f0",
"metadata": {},
"outputs": [],
"source": [
"!pip install datasets --quiet"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b305676b-f253-45f9-8b97-d9c5074eae6f",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pprint\n",
"import time\n",
"import concurrent.futures\n",
"import datasets\n",
"from pymongo import MongoClient\n",
"from datasets import load_dataset\n",
"from bson import json_util"
]
},
{
"cell_type": "markdown",
"id": "8dd06cce-e988-47b3-88f0-82833b8111a9",
"metadata": {},
"source": [
"### 1. Initializing a pymongo client"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "4a41b5af-0c80-4d4b-bad3-503cb8ee91a5",
"metadata": {},
"outputs": [],
"source": [
"current_database = %sql SELECT DATABASE() as CurrentDatabase\n",
"DB = current_database[0][0]\n",
"COLLECTION = 'wiki_embeddings'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4c68f060-4f40-4d3d-939e-b0d524514245",
"metadata": {},
"outputs": [],
"source": [
"# Using the environment variable that holds the kai endpoint\n",
"client = MongoClient(connection_url_kai)\n",
"collection = client[DB][COLLECTION]"
]
},
{
"cell_type": "markdown",
"id": "383015eb-9d56-40f5-9fd3-fec01934fcdf",
"metadata": {},
"source": [
"### 2. Create a collection and load the dataset"
]
},
{
"cell_type": "markdown",
"id": "3d9261a0-28fe-474c-b89e-1c237b2681f9",
"metadata": {},
"source": [
"It is recommended that you create a collection with the embedding field as a top level column for optimized utilization of storage. The name of the column should be the name of the field holding the embedding"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "c88d16fe-0746-40f2-abcc-4d3c3036165e",
"metadata": {},
"outputs": [],
"source": [
"client[DB].create_collection(COLLECTION,\n",
" columns=[{ 'id': \"emb\", 'type': \"VECTOR(768) NOT NULL\" }],\n",
");"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f76d2205-3e98-4969-be80-db59bd64467d",
"metadata": {},
"outputs": [],
"source": [
"# Using the \"wikipedia-22-12-simple-embeddings\" dataset from Hugging Face\n",
"dataset = load_dataset(\"Cohere/wikipedia-22-12-simple-embeddings\", split=\"train\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ca0f4601-ca6a-4e87-9939-2bef27c60c42",
"metadata": {},
"outputs": [],
"source": [
"DB_SIZE = 50000 #Currently loading 50k documents to the collection, can go to a max of 485,859 for this dataset\n",
"insert_data = []\n",
"insert_count = 0\n",
"# Iterate through the dataset and prepare the documents for insertion\n",
"# The script below ingests 1000 records into the database at a time\n",
"for item in dataset:\n",
" if insert_count >= DB_SIZE:\n",
" break\n",
" # Convert the dataset item to MongoDB document format\n",
" doc_item = json_util.loads(json_util.dumps(item))\n",
" insert_data.append(doc_item)\n",
"\n",
" # Insert in batches of 1000 documents\n",
" if len(insert_data) == 1000:\n",
" collection.insert_many(insert_data)\n",
" insert_count += 1000\n",
" print(f\"{insert_count} of {DB_SIZE} records ingested\")\n",
" insert_data = []\n",
"\n",
"\n",
"# Insert any remaining documents\n",
"if len(insert_data) > 0:\n",
" collection.insert_many(insert_data)\n",
" print(\"Data Ingested\")"
]
},
{
"cell_type": "markdown",
"id": "cefa9acd-3779-4dcc-9117-8b12a27abae2",
"metadata": {},
"source": [
"A sample document from the collection"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "7595312b-88ce-485d-9582-9c7e9ece3e11",
"metadata": {},
"outputs": [],
"source": [
"sample_doc = collection.find_one()\n",
"pprint.pprint(sample_doc, compact=True)"
]
},
{
"cell_type": "markdown",
"id": "9876f080-e500-43df-afa5-04ca31b83524",
"metadata": {},
"source": [
"### 3. Create a vector Index"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d81aeb45-5bdb-429d-8710-10d4c25f2941",
"metadata": {},
"outputs": [],
"source": [
"client[DB].command({\n",
" 'createIndexes': COLLECTION,\n",
" 'indexes': [{\n",
" 'key': {'emb': 'vector'},\n",
" 'name': 'vector_index',\n",
" 'kaiSearchOptions': {\"index_type\":\"AUTO\", \"metric_type\": \"EUCLIDEAN_DISTANCE\", \"dimensions\": 768}\n",
" }],\n",
"})"
]
},
{
"cell_type": "markdown",
"id": "5d44f05f-9316-4934-b856-213dbb540fa4",
"metadata": {},
"source": [
"Selecting the query embedding from the sample_doc selected above"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "007e376b-e969-479c-97fc-a85ac1c50b56",
"metadata": {},
"outputs": [],
"source": [
"# input vector\n",
"query_vector = sample_doc['emb']"
]
},
{
"cell_type": "markdown",
"id": "e36426ec-8fc9-45d8-83b4-a261c3d9c8bf",
"metadata": {},
"source": [
"### 4. Perform a vector search"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "0f9f9126-12a9-41cc-9f93-5309202fea98",
"metadata": {},
"outputs": [],
"source": [
"def execute_kai_search(query_vector):\n",
" pipeline = [\n",
" {\n",
" '$vectorSearch': {\n",
" \"index\": \"vector_index\",\n",
" \"path\": \"emb\",\n",
" \"queryVector\": query_vector,\n",
" \"numCandidates\": 20,\n",
" \"limit\": 3,\n",
" }\n",
" },\n",
" {\n",
" '$project': {\n",
" '_id':1,\n",
" 'text': 1,\n",
" }\n",
" }\n",
" ]\n",
" results = collection.aggregate(pipeline)\n",
" return list(results)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "69af232c-5c24-49ce-a28e-b982ab12b696",
"metadata": {},
"outputs": [],
"source": [
"execute_kai_search(query_vector)"
]
},
{
"cell_type": "markdown",
"id": "a1cbe6a2-3bef-4727-8920-e6af2801477e",
"metadata": {},
"source": [
"Running concurrent vector search queries"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "32972b4e-880d-4084-8a58-05b8787005a8",
"metadata": {},
"outputs": [],
"source": [
"num_concurrent_queries = 250\n",
"start_time = time.time()\n",
"\n",
"with concurrent.futures.ThreadPoolExecutor(max_workers=num_concurrent_queries) as executor:\n",
" futures = [executor.submit(execute_kai_search, query_vector) for _ in range(num_concurrent_queries)]\n",
" concurrent.futures.wait(futures)\n",
"\n",
"end_time = time.time()\n",
"print(f\"Executed {num_concurrent_queries} concurrent queries.\")\n",
"print(f\"Total execution time: {end_time - start_time} seconds\")\n",
"\n",
"for f in futures:\n",
" if f.exception() is not None:\n",
" print(f.exception())\n",
"failed_count = sum(1 for f in futures if f.exception() is not None)\n",
"print(f\"Failed queries: {failed_count}\")"
]
},
{
"cell_type": "markdown",
"id": "a26070f4-7bfb-4ee7-8f48-d74e80c9966e",
"metadata": {},
"source": [
"This shows the Kai can create vector indexes instantaneously and perform a large number of concurrent vector search queries surpassing MongoDB Atlas Vector Search capabilities"
]
},
{
"cell_type": "markdown",
"id": "8cfd1834-f238-4845-aa80-bf2bf82dd221",
"metadata": {},
"source": [
"<div id=\"singlestore-footer\" style=\"background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px\"></div>\n",
"<div><img src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png\" style=\"padding: 0px; margin: 0px; height: 24px\"/></div>"
]
}
],
"metadata": {
"jupyterlab": {
"notebooks": {
"version_major": 6,
"version_minor": 4
}
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

0 comments on commit c12a253

Please sign in to comment.