-
Notifications
You must be signed in to change notification settings - Fork 52
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing
2 changed files
with
362 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
[meta] | ||
title = "Vector Search with Kai" | ||
description = """ | ||
Run Vector Search using MongoDB clients and power GenAI usecases for your MongoDB applications """ | ||
icon = "radar" | ||
tags = ["mongo", "embeddings", "vector", "genai", "kai", "starter"] | ||
destinations = ["spaces"] | ||
minimum_tier="standard" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,354 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "aa202b34-20c7-4021-a142-d959a81d2cfd", | ||
"metadata": {}, | ||
"source": [ | ||
"<div id=\"singlestore-header\" style=\"display: flex; background-color: rgba(255, 182, 176, 0.25); padding: 5px;\">\n", | ||
" <div id=\"icon-image\" style=\"width: 90px; height: 90px;\">\n", | ||
" <img width=\"100%\" height=\"100%\" src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/radar.png\" />\n", | ||
" </div>\n", | ||
" <div id=\"text\" style=\"padding: 5px; margin-left: 10px;\">\n", | ||
" <div id=\"badge\" style=\"display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%\">SingleStore Notebooks</div>\n", | ||
" <h1 style=\"font-weight: 500; margin: 8px 0 0 4px;\">Vector Search with Kai</h1>\n", | ||
" </div>\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "b1eef9c6-fcc6-4d1c-b280-102eea62a5ec", | ||
"metadata": {}, | ||
"source": [ | ||
"## Vector Search with Kai" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "078b13da-9901-4be5-b472-25c8cd8e01c4", | ||
"metadata": {}, | ||
"source": [ | ||
"In this notebook, we load a dataset into a collection, create a vector index and perform vector searches using Kai in a way that is compatible with MongoDB clients and applications" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "2b25990e-8ef4-4759-b828-98d8fbe092f0", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install datasets --quiet" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "b305676b-f253-45f9-8b97-d9c5074eae6f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"import pprint\n", | ||
"import time\n", | ||
"import concurrent.futures\n", | ||
"import datasets\n", | ||
"from pymongo import MongoClient\n", | ||
"from datasets import load_dataset\n", | ||
"from bson import json_util" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8dd06cce-e988-47b3-88f0-82833b8111a9", | ||
"metadata": {}, | ||
"source": [ | ||
"### 1. Initializing a pymongo client" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "4a41b5af-0c80-4d4b-bad3-503cb8ee91a5", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"current_database = %sql SELECT DATABASE() as CurrentDatabase\n", | ||
"DB = current_database[0][0]\n", | ||
"COLLECTION = 'wiki_embeddings'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "4c68f060-4f40-4d3d-939e-b0d524514245", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Using the environment variable that holds the kai endpoint\n", | ||
"client = MongoClient(connection_url_kai)\n", | ||
"collection = client[DB][COLLECTION]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "383015eb-9d56-40f5-9fd3-fec01934fcdf", | ||
"metadata": {}, | ||
"source": [ | ||
"### 2. Create a collection and load the dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "3d9261a0-28fe-474c-b89e-1c237b2681f9", | ||
"metadata": {}, | ||
"source": [ | ||
"It is recommended that you create a collection with the embedding field as a top level column for optimized utilization of storage. The name of the column should be the name of the field holding the embedding" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"id": "c88d16fe-0746-40f2-abcc-4d3c3036165e", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client[DB].create_collection(COLLECTION,\n", | ||
" columns=[{ 'id': \"emb\", 'type': \"VECTOR(768) NOT NULL\" }],\n", | ||
");" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "f76d2205-3e98-4969-be80-db59bd64467d", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Using the \"wikipedia-22-12-simple-embeddings\" dataset from Hugging Face\n", | ||
"dataset = load_dataset(\"Cohere/wikipedia-22-12-simple-embeddings\", split=\"train\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "ca0f4601-ca6a-4e87-9939-2bef27c60c42", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"DB_SIZE = 50000 #Currently loading 50k documents to the collection, can go to a max of 485,859 for this dataset\n", | ||
"insert_data = []\n", | ||
"insert_count = 0\n", | ||
"# Iterate through the dataset and prepare the documents for insertion\n", | ||
"# The script below ingests 1000 records into the database at a time\n", | ||
"for item in dataset:\n", | ||
" if insert_count >= DB_SIZE:\n", | ||
" break\n", | ||
" # Convert the dataset item to MongoDB document format\n", | ||
" doc_item = json_util.loads(json_util.dumps(item))\n", | ||
" insert_data.append(doc_item)\n", | ||
"\n", | ||
" # Insert in batches of 1000 documents\n", | ||
" if len(insert_data) == 1000:\n", | ||
" collection.insert_many(insert_data)\n", | ||
" insert_count += 1000\n", | ||
" print(f\"{insert_count} of {DB_SIZE} records ingested\")\n", | ||
" insert_data = []\n", | ||
"\n", | ||
"\n", | ||
"# Insert any remaining documents\n", | ||
"if len(insert_data) > 0:\n", | ||
" collection.insert_many(insert_data)\n", | ||
" print(\"Data Ingested\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cefa9acd-3779-4dcc-9117-8b12a27abae2", | ||
"metadata": {}, | ||
"source": [ | ||
"A sample document from the collection" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "7595312b-88ce-485d-9582-9c7e9ece3e11", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"sample_doc = collection.find_one()\n", | ||
"pprint.pprint(sample_doc, compact=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "9876f080-e500-43df-afa5-04ca31b83524", | ||
"metadata": {}, | ||
"source": [ | ||
"### 3. Create a vector Index" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"id": "d81aeb45-5bdb-429d-8710-10d4c25f2941", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"client[DB].command({\n", | ||
" 'createIndexes': COLLECTION,\n", | ||
" 'indexes': [{\n", | ||
" 'key': {'emb': 'vector'},\n", | ||
" 'name': 'vector_index',\n", | ||
" 'kaiSearchOptions': {\"index_type\":\"AUTO\", \"metric_type\": \"EUCLIDEAN_DISTANCE\", \"dimensions\": 768}\n", | ||
" }],\n", | ||
"})" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5d44f05f-9316-4934-b856-213dbb540fa4", | ||
"metadata": {}, | ||
"source": [ | ||
"Selecting the query embedding from the sample_doc selected above" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"id": "007e376b-e969-479c-97fc-a85ac1c50b56", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# input vector\n", | ||
"query_vector = sample_doc['emb']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e36426ec-8fc9-45d8-83b4-a261c3d9c8bf", | ||
"metadata": {}, | ||
"source": [ | ||
"### 4. Perform a vector search" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"id": "0f9f9126-12a9-41cc-9f93-5309202fea98", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def execute_kai_search(query_vector):\n", | ||
" pipeline = [\n", | ||
" {\n", | ||
" '$vectorSearch': {\n", | ||
" \"index\": \"vector_index\",\n", | ||
" \"path\": \"emb\",\n", | ||
" \"queryVector\": query_vector,\n", | ||
" \"numCandidates\": 20,\n", | ||
" \"limit\": 3,\n", | ||
" }\n", | ||
" },\n", | ||
" {\n", | ||
" '$project': {\n", | ||
" '_id':1,\n", | ||
" 'text': 1,\n", | ||
" }\n", | ||
" }\n", | ||
" ]\n", | ||
" results = collection.aggregate(pipeline)\n", | ||
" return list(results)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"id": "69af232c-5c24-49ce-a28e-b982ab12b696", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"execute_kai_search(query_vector)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a1cbe6a2-3bef-4727-8920-e6af2801477e", | ||
"metadata": {}, | ||
"source": [ | ||
"Running concurrent vector search queries" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"id": "32972b4e-880d-4084-8a58-05b8787005a8", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"num_concurrent_queries = 250\n", | ||
"start_time = time.time()\n", | ||
"\n", | ||
"with concurrent.futures.ThreadPoolExecutor(max_workers=num_concurrent_queries) as executor:\n", | ||
" futures = [executor.submit(execute_kai_search, query_vector) for _ in range(num_concurrent_queries)]\n", | ||
" concurrent.futures.wait(futures)\n", | ||
"\n", | ||
"end_time = time.time()\n", | ||
"print(f\"Executed {num_concurrent_queries} concurrent queries.\")\n", | ||
"print(f\"Total execution time: {end_time - start_time} seconds\")\n", | ||
"\n", | ||
"for f in futures:\n", | ||
" if f.exception() is not None:\n", | ||
" print(f.exception())\n", | ||
"failed_count = sum(1 for f in futures if f.exception() is not None)\n", | ||
"print(f\"Failed queries: {failed_count}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "a26070f4-7bfb-4ee7-8f48-d74e80c9966e", | ||
"metadata": {}, | ||
"source": [ | ||
"This shows the Kai can create vector indexes instantaneously and perform a large number of concurrent vector search queries surpassing MongoDB Atlas Vector Search capabilities" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8cfd1834-f238-4845-aa80-bf2bf82dd221", | ||
"metadata": {}, | ||
"source": [ | ||
"<div id=\"singlestore-footer\" style=\"background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px\"></div>\n", | ||
"<div><img src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png\" style=\"padding: 0px; margin: 0px; height: 24px\"/></div>" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"jupyterlab": { | ||
"notebooks": { | ||
"version_major": 6, | ||
"version_minor": 4 | ||
} | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |