Vector Search with Kai (#97)

* Vector Search with Kai
singlestore-labs · May 22, 2024 · c12a253 · c12a253
1 parent 06dce57
commit c12a253
Show file tree

Hide file tree

Showing 2 changed files with 362 additions and 0 deletions.
diff --git a/notebooks/vector-search-with-kai/meta.toml b/notebooks/vector-search-with-kai/meta.toml
@@ -0,0 +1,8 @@
+[meta]
+title = "Vector Search with Kai"
+description = """
+    Run Vector Search using MongoDB clients and power GenAI usecases for your MongoDB applications """
+icon = "radar"
+tags =  ["mongo", "embeddings", "vector", "genai", "kai", "starter"]
+destinations = ["spaces"]
+minimum_tier="standard"
diff --git a/notebooks/vector-search-with-kai/notebook.ipynb b/notebooks/vector-search-with-kai/notebook.ipynb
@@ -0,0 +1,354 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "aa202b34-20c7-4021-a142-d959a81d2cfd",
+      "metadata": {},
+      "source": [
+        "<div id=\"singlestore-header\" style=\"display: flex; background-color: rgba(255, 182, 176, 0.25); padding: 5px;\">\n",
+        "    <div id=\"icon-image\" style=\"width: 90px; height: 90px;\">\n",
+        "        <img width=\"100%\" height=\"100%\" src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/radar.png\" />\n",
+        "    </div>\n",
+        "    <div id=\"text\" style=\"padding: 5px; margin-left: 10px;\">\n",
+        "        <div id=\"badge\" style=\"display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%\">SingleStore Notebooks</div>\n",
+        "        <h1 style=\"font-weight: 500; margin: 8px 0 0 4px;\">Vector Search with Kai</h1>\n",
+        "    </div>\n",
+        "</div>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "b1eef9c6-fcc6-4d1c-b280-102eea62a5ec",
+      "metadata": {},
+      "source": [
+        "## Vector Search with Kai"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "078b13da-9901-4be5-b472-25c8cd8e01c4",
+      "metadata": {},
+      "source": [
+        "In this notebook, we load a dataset into a collection, create a vector index and perform vector searches using Kai in a way that is compatible with MongoDB clients and applications"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "id": "2b25990e-8ef4-4759-b828-98d8fbe092f0",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "!pip install datasets --quiet"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "id": "b305676b-f253-45f9-8b97-d9c5074eae6f",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import os\n",
+        "import pprint\n",
+        "import time\n",
+        "import concurrent.futures\n",
+        "import datasets\n",
+        "from pymongo import MongoClient\n",
+        "from datasets import load_dataset\n",
+        "from bson import json_util"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "8dd06cce-e988-47b3-88f0-82833b8111a9",
+      "metadata": {},
+      "source": [
+        "### 1. Initializing a pymongo client"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "id": "4a41b5af-0c80-4d4b-bad3-503cb8ee91a5",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "current_database = %sql SELECT DATABASE() as CurrentDatabase\n",
+        "DB = current_database[0][0]\n",
+        "COLLECTION = 'wiki_embeddings'"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "id": "4c68f060-4f40-4d3d-939e-b0d524514245",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Using the environment variable that holds the kai endpoint\n",
+        "client = MongoClient(connection_url_kai)\n",
+        "collection = client[DB][COLLECTION]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "383015eb-9d56-40f5-9fd3-fec01934fcdf",
+      "metadata": {},
+      "source": [
+        "### 2. Create a collection and load the dataset"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "3d9261a0-28fe-474c-b89e-1c237b2681f9",
+      "metadata": {},
+      "source": [
+        "It is recommended that you create a collection with the embedding field as a top level column for optimized utilization of storage. The name of the column should be the name of the field holding the embedding"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "id": "c88d16fe-0746-40f2-abcc-4d3c3036165e",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "client[DB].create_collection(COLLECTION,\n",
+        "  columns=[{ 'id': \"emb\", 'type': \"VECTOR(768) NOT NULL\" }],\n",
+        ");"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "id": "f76d2205-3e98-4969-be80-db59bd64467d",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Using the \"wikipedia-22-12-simple-embeddings\" dataset from Hugging Face\n",
+        "dataset = load_dataset(\"Cohere/wikipedia-22-12-simple-embeddings\", split=\"train\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "id": "ca0f4601-ca6a-4e87-9939-2bef27c60c42",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "DB_SIZE = 50000 #Currently loading 50k documents to the collection, can go to a max of 485,859 for this dataset\n",
+        "insert_data = []\n",
+        "insert_count = 0\n",
+        "# Iterate through the dataset and prepare the documents for insertion\n",
+        "# The script below ingests 1000 records into the database at a time\n",
+        "for item in dataset:\n",
+        "    if insert_count >= DB_SIZE:\n",
+        "        break\n",
+        "    # Convert the dataset item to MongoDB document format\n",
+        "    doc_item = json_util.loads(json_util.dumps(item))\n",
+        "    insert_data.append(doc_item)\n",
+        "\n",
+        "    # Insert in batches of 1000 documents\n",
+        "    if len(insert_data) == 1000:\n",
+        "        collection.insert_many(insert_data)\n",
+        "        insert_count += 1000\n",
+        "        print(f\"{insert_count} of {DB_SIZE} records ingested\")\n",
+        "        insert_data = []\n",
+        "\n",
+        "\n",
+        "# Insert any remaining documents\n",
+        "if len(insert_data) > 0:\n",
+        "    collection.insert_many(insert_data)\n",
+        "    print(\"Data Ingested\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "cefa9acd-3779-4dcc-9117-8b12a27abae2",
+      "metadata": {},
+      "source": [
+        "A sample document from the collection"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 8,
+      "id": "7595312b-88ce-485d-9582-9c7e9ece3e11",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "sample_doc = collection.find_one()\n",
+        "pprint.pprint(sample_doc, compact=True)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "9876f080-e500-43df-afa5-04ca31b83524",
+      "metadata": {},
+      "source": [
+        "### 3. Create a vector Index"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 9,
+      "id": "d81aeb45-5bdb-429d-8710-10d4c25f2941",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "client[DB].command({\n",
+        "    'createIndexes': COLLECTION,\n",
+        "    'indexes': [{\n",
+        "        'key': {'emb': 'vector'},\n",
+        "        'name': 'vector_index',\n",
+        "        'kaiSearchOptions': {\"index_type\":\"AUTO\", \"metric_type\": \"EUCLIDEAN_DISTANCE\", \"dimensions\": 768}\n",
+        "    }],\n",
+        "})"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "5d44f05f-9316-4934-b856-213dbb540fa4",
+      "metadata": {},
+      "source": [
+        "Selecting the query embedding from the sample_doc selected above"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 10,
+      "id": "007e376b-e969-479c-97fc-a85ac1c50b56",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# input vector\n",
+        "query_vector = sample_doc['emb']"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "e36426ec-8fc9-45d8-83b4-a261c3d9c8bf",
+      "metadata": {},
+      "source": [
+        "### 4. Perform a vector search"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 11,
+      "id": "0f9f9126-12a9-41cc-9f93-5309202fea98",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "def execute_kai_search(query_vector):\n",
+        "    pipeline = [\n",
+        "        {\n",
+        "            '$vectorSearch': {\n",
+        "                \"index\": \"vector_index\",\n",
+        "                \"path\": \"emb\",\n",
+        "                \"queryVector\": query_vector,\n",
+        "                \"numCandidates\": 20,\n",
+        "                \"limit\": 3,\n",
+        "            }\n",
+        "        },\n",
+        "        {\n",
+        "            '$project': {\n",
+        "               '_id':1,\n",
+        "               'text': 1,\n",
+        "            }\n",
+        "        }\n",
+        "    ]\n",
+        "    results = collection.aggregate(pipeline)\n",
+        "    return list(results)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 12,
+      "id": "69af232c-5c24-49ce-a28e-b982ab12b696",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "execute_kai_search(query_vector)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a1cbe6a2-3bef-4727-8920-e6af2801477e",
+      "metadata": {},
+      "source": [
+        "Running concurrent vector search queries"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 13,
+      "id": "32972b4e-880d-4084-8a58-05b8787005a8",
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "num_concurrent_queries = 250\n",
+        "start_time = time.time()\n",
+        "\n",
+        "with concurrent.futures.ThreadPoolExecutor(max_workers=num_concurrent_queries) as executor:\n",
+        "    futures = [executor.submit(execute_kai_search, query_vector) for _ in range(num_concurrent_queries)]\n",
+        "    concurrent.futures.wait(futures)\n",
+        "\n",
+        "end_time = time.time()\n",
+        "print(f\"Executed {num_concurrent_queries} concurrent queries.\")\n",
+        "print(f\"Total execution time: {end_time - start_time} seconds\")\n",
+        "\n",
+        "for f in futures:\n",
+        "    if f.exception() is not None:\n",
+        "        print(f.exception())\n",
+        "failed_count = sum(1 for f in futures if f.exception() is not None)\n",
+        "print(f\"Failed queries: {failed_count}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "a26070f4-7bfb-4ee7-8f48-d74e80c9966e",
+      "metadata": {},
+      "source": [
+        "This shows the Kai can create vector indexes instantaneously and perform a large number of concurrent vector search queries surpassing MongoDB Atlas Vector Search capabilities"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "8cfd1834-f238-4845-aa80-bf2bf82dd221",
+      "metadata": {},
+      "source": [
+        "<div id=\"singlestore-footer\" style=\"background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px\"></div>\n",
+        "<div><img src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png\" style=\"padding: 0px; margin: 0px; height: 24px\"/></div>"
+      ]
+    }
+  ],
+  "metadata": {
+    "jupyterlab": {
+      "notebooks": {
+        "version_major": 6,
+        "version_minor": 4
+      }
+    },
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.6"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 5
+}