From 621d24b9733a601fb7f86aaf7225b4429d1c501d Mon Sep 17 00:00:00 2001
From: Jael Gu <mengjia.gu@zilliz.com>
Date: Thu, 27 Jun 2024 04:05:21 +0800
Subject: [PATCH] Update notebook: MilvusIndexDemo (#14403)

---
 .../vector_stores/MilvusIndexDemo.ipynb       | 236 +++++++++++++-----
 1 file changed, 179 insertions(+), 57 deletions(-)

diff --git a/docs/docs/examples/vector_stores/MilvusIndexDemo.ipynb b/docs/docs/examples/vector_stores/MilvusIndexDemo.ipynb
index f6146bd26ab6b..3dbd63065d919 100644
--- a/docs/docs/examples/vector_stores/MilvusIndexDemo.ipynb
+++ b/docs/docs/examples/vector_stores/MilvusIndexDemo.ipynb
@@ -15,22 +15,31 @@
    "id": "0b692c73",
    "metadata": {},
    "source": [
-    "# Milvus Vector Store"
+    "# Milvus Vector Store\n",
+    "\n",
+    "This guide demonstrates how to build a Retrieval-Augmented Generation (RAG) system using LlamaIndex and Milvus.\n",
+    "\n",
+    "The RAG system combines a retrieval system with a generative model to generate new text based on a given prompt. The system first retrieves relevant documents from a corpus using a vector similarity search engine like Milvus, and then uses a generative model to generate new text based on the retrieved documents.\n",
+    "\n",
+    "[Milvus](https://milvus.io/) is the world's most advanced open-source vector database, built to power embedding similarity search and AI applications.\n",
+    "\n",
+    "In this notebook we are going to show a quick demo of using the MilvusVectorStore."
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
-   "id": "1e7787c2",
+   "id": "f81e2c81",
    "metadata": {},
    "source": [
-    "In this notebook we are going to show a quick demo of using the MilvusVectorStore. "
+    "## Before you begin\n",
+    "\n",
+    "### Install dependencies"
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
-   "id": "f81e2c81",
+   "id": "0d0e46d8",
    "metadata": {},
    "source": [
     "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
@@ -56,23 +65,30 @@
     "%pip install llama-index"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "eab0d1a3",
+   "metadata": {},
+   "source": [
+    "This notebook uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md), which requires `pymilvus>=2.4.2`:"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "47264e32",
+   "id": "7661b098",
    "metadata": {},
    "outputs": [],
    "source": [
-    "import logging\n",
-    "import sys\n",
-    "\n",
-    "# Uncomment to see debug logs\n",
-    "# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)\n",
-    "# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))\n",
-    "\n",
-    "from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document\n",
-    "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
-    "import textwrap"
+    "%pip install \"pymilvus>=2.4.2\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70cc8c56",
+   "metadata": {},
+   "source": [
+    "> If you are using Google Colab, you may need to **restart the runtime** to enable the newly installed dependencies (click the \"Runtime\" menu at the top of the screen and select \"Restart session\" from the dropdown menu)."
    ]
   },
   {
@@ -82,7 +98,8 @@
    "metadata": {},
    "source": [
     "### Setup OpenAI\n",
-    "Lets first begin by adding the openai api key. This will allow us to access openai for embeddings and to use chatgpt."
+    "\n",
+    "Let's begin by adding the OpenAI API key. This will allow us to access ChatGPT."
    ]
   },
   {
@@ -103,7 +120,9 @@
    "id": "a3d4e638",
    "metadata": {},
    "source": [
-    "Download Data"
+    "### Prepare data\n",
+    "\n",
+    "You can download sample data with the following commands:"
    ]
   },
   {
@@ -113,8 +132,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "! mkdir -p 'data/paul_graham/'\n",
-    "! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'"
+    "! mkdir -p 'data/'\n",
+    "! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'\n",
+    "! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/uber_2021.pdf'"
    ]
   },
   {
@@ -123,8 +143,10 @@
    "id": "59ff935d",
    "metadata": {},
    "source": [
+    "## Getting Started\n",
+    "\n",
     "### Generate our data\n",
-    "With our LLM set, lets start using the Milvus Index. As a first example, lets generate a document from the file found in the `data/paul_graham/` folder. In this folder there is a single essay from Paul Graham titled `What I Worked On`. To generate the documents we will use the SimpleDirectoryReader."
+    "As a first example, let's generate a document from the file `paul_graham_essay.txt`. It is a single essay by Paul Graham titled `What I Worked On`. To generate the documents, we will use the `SimpleDirectoryReader`."
    ]
   },
   {
@@ -137,13 +159,17 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Document ID: 11c3a6fe-799e-4e40-8122-2339936c2722\n"
+      "Document ID: 95f25e4d-f270-4650-87ce-006d69d82033\n"
      ]
     }
    ],
    "source": [
+    "from llama_index.core import SimpleDirectoryReader\n",
+    "\n",
     "# load documents\n",
-    "documents = SimpleDirectoryReader(\"./data/paul_graham/\").load_data()\n",
+    "documents = SimpleDirectoryReader(\n",
+    "    input_files=[\"./data/paul_graham_essay.txt\"]\n",
+    ").load_data()\n",
     "\n",
     "print(\"Document ID:\", documents[0].doc_id)"
    ]
@@ -155,20 +181,8 @@
    "metadata": {},
    "source": [
     "### Create an index across the data\n",
-    "Now that we have a document, we can can create an index and insert the document. For the index we will use a GPTMilvusIndex. GPTMilvusIndex takes in a few arguments:\n",
     "\n",
-    "- `uri (str, optional)`: The URI to connect to, comes in the form of \"https://address:port\" if using Milvus or Zilliz Cloud service, or \"path/to/local/milvus.db\" if using a lite local Milvus. Defaults to \"./milvus_llamaindex.db\".\n",
-    "- `token (str, optional)`: The token for log in. Empty if not using rbac, if using rbac it will most likely be \"username:password\". Defaults to \"\".\n",
-    "- `collection_name (str, optional)`: The name of the collection where data will be stored. Defaults to \"llamalection\".\n",
-    "- `dim (int, optional)`: The dimension of the embeddings. If it is not provided, collection creation will be done on first insert. Defaults to None.\n",
-    "- `embedding_field (str, optional)`: The name of the embedding field for the collection, defaults to DEFAULT_EMBEDDING_KEY.\n",
-    "- `doc_id_field (str, optional)`: The name of the doc_id field for the collection, defaults to DEFAULT_DOC_ID_KEY.\n",
-    "- `similarity_metric (str, optional)`: The similarity metric to use, currently supports IP and L2. Defaults to \"IP\".\n",
-    "- `consistency_level (str, optional)`: Which consistency level to use for a newly created collection. Defaults to \"Strong\".\n",
-    "- `overwrite (bool, optional)`: Whether to overwrite existing collection with same name. Defaults to False.\n",
-    "- `text_key (str, optional)`: What key text is stored in in the passed collection. Used when bringing your own collection. Defaults to None.\n",
-    "- `index_config (dict, optional)`: The configuration used for building the Milvus index. Defaults to None.\n",
-    "- `search_config (dict, optional)`: The configuration used for searching the Milvus index. Note that this must be compatible with the index type specified by index_config. Defaults to None.\n",
+    "Now that we have a document, we can create an index and insert the document.\n",
     "\n",
     "> Please note that **Milvus Lite** requires `pymilvus>=2.4.2`."
    ]
@@ -180,8 +194,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Create an index over the documnts\n",
-    "from llama_index.core import StorageContext\n",
+    "# Create an index over the documents\n",
+    "from llama_index.core import VectorStoreIndex, StorageContext\n",
+    "from llama_index.vector_stores.milvus import MilvusVectorStore\n",
     "\n",
     "\n",
     "vector_store = MilvusVectorStore(\n",
@@ -193,6 +208,17 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a75a5773",
+   "metadata": {},
+   "source": [
+    "> Notes on the parameters of `MilvusVectorStore`:\n",
+    "> - Setting the `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n",
+    "> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `uri`.\n",
+    "> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud."
+   ]
+  },
   {
    "attachments": {},
    "cell_type": "markdown",
@@ -213,19 +239,14 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "The author learned about programming on early computers like the IBM 1401 using Fortran, the\n",
-      "limitations of early computing technology, the transition to microcomputers, and the excitement of\n",
-      "having a personal computer like the TRS-80. Additionally, the author explored different academic\n",
-      "paths, initially planning to study philosophy but eventually switching to AI due to a lack of\n",
-      "interest in philosophy courses. Later on, the author pursued art education, attending RISD and the\n",
-      "Accademia di Belli Arti in Florence, where they encountered a different approach to teaching art.\n"
+      "The author learned that philosophy courses in college were boring to him, leading him to switch his focus to studying AI.\n"
      ]
     }
    ],
    "source": [
     "query_engine = index.as_query_engine()\n",
-    "response = query_engine.query(\"What did the author learn?\")\n",
-    "print(textwrap.fill(str(response), 100))"
+    "res = query_engine.query(\"What did the author learn?\")\n",
+    "print(res)"
    ]
   },
   {
@@ -238,14 +259,15 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Dealing with the stress and challenges related to managing Hacker News was a difficult moment for\n",
-      "the author.\n"
+      "The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in her losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.\n"
      ]
     }
    ],
    "source": [
-    "response = query_engine.query(\"What was a hard moment for the author?\")\n",
-    "print(textwrap.fill(str(response), 100))"
+    "res = query_engine.query(\n",
+    "    \"What challenges did the disease pose for the author?\"\n",
+    ")\n",
+    "print(res)"
    ]
   },
   {
@@ -267,11 +289,14 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Res: The author is the individual who created the content or work in question.\n"
+      "The author is the individual who created the context information.\n"
      ]
     }
    ],
    "source": [
+    "from llama_index.core import Document\n",
+    "\n",
+    "\n",
     "vector_store = MilvusVectorStore(\n",
     "    uri=\"./milvus_demo.db\", dim=1536, overwrite=True\n",
     ")\n",
@@ -282,7 +307,7 @@
     ")\n",
     "query_engine = index.as_query_engine()\n",
     "res = query_engine.query(\"Who is the author?\")\n",
-    "print(\"Res:\", res)"
+    "print(res)"
    ]
   },
   {
@@ -304,7 +329,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Res: The number is ten.\n"
+      "The number is ten.\n"
      ]
     }
    ],
@@ -318,26 +343,123 @@
     ")\n",
     "query_engine = index.as_query_engine()\n",
     "res = query_engine.query(\"What is the number?\")\n",
-    "print(\"Res:\", res)"
+    "print(res)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "e5287c2d",
+   "id": "56ac3375-371b-4e5f-bac9-8124b6871429",
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Res: Paul Graham\n"
+      "Paul Graham\n"
      ]
     }
    ],
    "source": [
     "res = query_engine.query(\"Who is the author?\")\n",
-    "print(\"Res:\", res)"
+    "print(res)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "154ec1b3",
+   "metadata": {},
+   "source": [
+    "## Metadata filtering\n",
+    "\n",
+    "We can restrict results to specific sources by filtering on metadata. The following example loads all documents from the directory and then filters them by their metadata."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2a845c5d-f10b-4fba-9cd2-e62871f836f3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n",
+    "\n",
+    "# Load both documents downloaded earlier\n",
+    "documents_all = SimpleDirectoryReader(\"./data/\").load_data()\n",
+    "\n",
+    "vector_store = MilvusVectorStore(\n",
+    "    uri=\"./milvus_demo.db\", dim=1536, overwrite=True\n",
+    ")\n",
+    "storage_context = StorageContext.from_defaults(vector_store=vector_store)\n",
+    "index = VectorStoreIndex.from_documents(documents_all, storage_context=storage_context)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8de85343",
+   "metadata": {},
+   "source": [
+    "We want to retrieve documents only from the file `uber_2021.pdf`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9f9bcb5-43de-4983-b754-a822ac7b5278",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The disease posed challenges related to the adverse impact on the business and operations, including reduced demand for Mobility offerings globally, affecting travel behavior and demand. Additionally, the pandemic led to driver supply constraints, impacted by concerns regarding COVID-19, with uncertainties about when supply levels would return to normal. The rise of the Omicron variant further affected travel, resulting in advisories and restrictions that could adversely impact both driver supply and consumer demand for Mobility offerings.\n"
+     ]
+    }
+   ],
+   "source": [
+    "filters = MetadataFilters(\n",
+    "    filters=[ExactMatchFilter(key=\"file_name\", value=\"uber_2021.pdf\")]\n",
+    ")\n",
+    "query_engine = index.as_query_engine(filters=filters)\n",
+    "res = query_engine.query(\n",
+    "    \"What challenges did the disease pose for the author?\"\n",
+    ")\n",
+    "\n",
+    "print(res)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "92e95db9",
+   "metadata": {},
+   "source": [
+    "We get a different result when retrieving from the file `paul_graham_essay.txt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0f134a35-dbd3-49d8-b7d8-48bdd2349701",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "The disease posed challenges for the author as it affected his mother's health, leading to a stroke caused by colon cancer. This resulted in his mother losing her balance and needing to be placed in a nursing home. The author and his sister were determined to help their mother get out of the nursing home and back to her house.\n"
+     ]
+    }
+   ],
+   "source": [
+    "filters = MetadataFilters(\n",
+    "    filters=[ExactMatchFilter(key=\"file_name\", value=\"paul_graham_essay.txt\")]\n",
+    ")\n",
+    "query_engine = index.as_query_engine(filters=filters)\n",
+    "res = query_engine.query(\n",
+    "    \"What challenges did the disease pose for the author?\"\n",
+    ")\n",
+    "\n",
+    "print(res)"
    ]
   }
  ],