From fa20d940beb6aa33b821dfaeaef5226712c7ddea Mon Sep 17 00:00:00 2001 From: ChengZi Date: Thu, 17 Oct 2024 14:43:27 +0800 Subject: [PATCH] siliconflow Signed-off-by: ChengZi --- ...uild_RAG_with_milvus_and_siliconflow.ipynb | 529 ++++++++++++++++++ 1 file changed, 529 insertions(+) create mode 100644 bootcamp/tutorials/integration/build_RAG_with_milvus_and_siliconflow.ipynb diff --git a/bootcamp/tutorials/integration/build_RAG_with_milvus_and_siliconflow.ipynb b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_siliconflow.ipynb new file mode 100644 index 000000000..157854ca5 --- /dev/null +++ b/bootcamp/tutorials/integration/build_RAG_with_milvus_and_siliconflow.ipynb @@ -0,0 +1,529 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + } + }, + "source": [ + "\"Open \n", + " \"GitHub" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Build RAG with Milvus and SiliconFlow\n", + "\n", + "[SiliconFlow](https://siliconflow.cn/) is committed to building a scalable, standardized, and high-performance AI infrastructure platform.\n", + "SiliconCloud is one of the flagship offerings from SiliconFlow, described as a Model as a Service (MaaS) platform. It provides a comprehensive environment for deploying various AI models, including large language models (LLMs) and embedding models. SiliconCloud aggregates numerous open-source models, enabling users to easily access and utilize these resources without the need for extensive infrastructure setup.\n", + "\n", + "In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Milvus and SiliconFlow. \n", + "\n", + "\n", + "## Preparation\n", + "### Dependencies and Environment" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "! 
pip install --upgrade pymilvus openai requests tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "> If you are using Google Colab, you may need to **restart the runtime** to enable the newly installed dependencies (click on the \"Runtime\" menu at the top of the screen, and select \"Restart session\" from the dropdown menu)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "SiliconFlow provides an OpenAI-style API. Log in to its official website, obtain an [API key](https://docs.siliconflow.cn/quickstart), and set it as the environment variable `SILICON_FLOW_API_KEY`." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": false, + "jupyter": { + "outputs_hidden": false + }, + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "os.environ[\"SILICON_FLOW_API_KEY\"] = \"***********\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the data\n", + "\n", + "We use the FAQ pages from the [Milvus Documentation 2.4.x](https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip) as the private knowledge in our RAG, which is a good data source for a simple RAG pipeline.\n", + "\n", + "Download the zip file and extract the documents to the folder `milvus_docs`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "shellscript" + } + }, + "outputs": [], + "source": [ + "! wget https://github.com/milvus-io/milvus-docs/releases/download/v2.4.6-preview/milvus_docs_2.4.x_en.zip\n", + "! unzip -q milvus_docs_2.4.x_en.zip -d milvus_docs" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We load all markdown files from the folder `milvus_docs/en/faq`. 
For each document, we simply split the file content on \"# \", which roughly separates the main sections of the markdown file." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "from glob import glob\n", + "\n", + "text_lines = []\n", + "\n", + "for file_path in glob(\"milvus_docs/en/faq/*.md\", recursive=True):\n", + "    with open(file_path, \"r\") as file:\n", + "        file_text = file.read()\n", + "\n", + "    text_lines += file_text.split(\"# \")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Prepare the Embedding Model\n", + "\n", + "We initialize a client to prepare the embedding model. SiliconFlow provides an OpenAI-style API, so the same `openai` client, pointed at the SiliconFlow base URL, can call both the embedding model and the LLM." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "from openai import OpenAI\n", + "\n", + "siliconflow_client = OpenAI(\n", + "    api_key=os.environ[\"SILICON_FLOW_API_KEY\"], base_url=\"https://api.siliconflow.cn/v1\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define a function to generate text embeddings using the client. We use the `BAAI/bge-large-en-v1.5` model as an example." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "def emb_text(text):\n", + "    return (\n", + "        siliconflow_client.embeddings.create(input=text, model=\"BAAI/bge-large-en-v1.5\")\n", + "        .data[0]\n", + "        .embedding\n", + "    )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Generate a test embedding and print its dimension and first few elements." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1024\n", + "[0.011475468054413795, 0.02982141077518463, 0.0038535362109541893, 0.035921916365623474, -0.0159175843000412, -0.014918108470737934, -0.018094222992658615, -0.002937349723652005, 0.030917132273316383, 0.03390815854072571]\n" + ] + } + ], + "source": [ + "test_embedding = emb_text(\"This is a test\")\n", + "embedding_dim = len(test_embedding)\n", + "print(embedding_dim)\n", + "print(test_embedding[:10])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load data into Milvus" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create the Collection" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "from pymilvus import MilvusClient\n", + "\n", + "milvus_client = MilvusClient(uri=\"./milvus_demo.db\")\n", + "\n", + "collection_name = \"my_rag_collection\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "collapsed": false + }, + "source": [ + "> Regarding the arguments of `MilvusClient`:\n", + "> - Setting the `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.\n", + "> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `uri`.\n", + "> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check if the collection already exists and drop it if it does." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "if milvus_client.has_collection(collection_name):\n", + "    milvus_client.drop_collection(collection_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Create a new collection with specified parameters. \n", + "\n", + "If we don't specify any field information, Milvus will automatically create a default `id` field for the primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "milvus_client.create_collection(\n", + "    collection_name=collection_name,\n", + "    dimension=embedding_dim,\n", + "    metric_type=\"IP\", # Inner product distance\n", + "    consistency_level=\"Strong\", # Strong consistency level\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Insert data\n", + "Iterate through the text lines, create embeddings, and then insert the data into Milvus.\n", + "\n", + "The `text` field used here is not defined in the collection schema. It will be automatically added to the reserved JSON dynamic field, which can be treated as a normal field at a high level." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Creating embeddings: 100%|██████████| 72/72 [00:04<00:00, 16.97it/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "{'insert_count': 72, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71], 'cost': 0}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from tqdm import tqdm\n", + "\n", + "data = []\n", + "\n", + "for i, line in enumerate(tqdm(text_lines, desc=\"Creating embeddings\")):\n", + " data.append({\"id\": i, \"vector\": emb_text(line), \"text\": line})\n", + "\n", + "milvus_client.insert(collection_name=collection_name, data=data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build RAG\n", + "\n", + "### Retrieve data for a query\n", + "\n", + "Let's specify a frequent question about Milvus." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "question = \"How is data stored in milvus?\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Search for the question in the collection and retrieve the semantic top-3 matches." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "search_res = milvus_client.search(\n", + " collection_name=collection_name,\n", + " data=[\n", + " emb_text(question)\n", + " ], # Use the `emb_text` function to convert the question to an embedding vector\n", + " limit=3, # Return top 3 results\n", + " search_params={\"metric_type\": \"IP\", \"params\": {}}, # Inner product distance\n", + " output_fields=[\"text\"], # Return the text field\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's take a look at the search results of the query\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\n", + " [\n", + " \" Where does Milvus store data?\\n\\nMilvus deals with two types of data, inserted data and metadata. \\n\\nInserted data, including vector data, scalar data, and collection-specific schema, are stored in persistent storage as incremental log. Milvus supports multiple object storage backends, including [MinIO](https://min.io/), [AWS S3](https://aws.amazon.com/s3/?nc1=h_ls), [Google Cloud Storage](https://cloud.google.com/storage?hl=en#object-storage-for-companies-of-all-sizes) (GCS), [Azure Blob Storage](https://azure.microsoft.com/en-us/products/storage/blobs), [Alibaba Cloud OSS](https://www.alibabacloud.com/product/object-storage-service), and [Tencent Cloud Object Storage](https://www.tencentcloud.com/products/cos) (COS).\\n\\nMetadata are generated within Milvus. Each Milvus module has its own metadata that are stored in etcd.\\n\\n###\",\n", + " 0.833885133266449\n", + " ],\n", + " [\n", + " \"How does Milvus flush data?\\n\\nMilvus returns success when inserted data are loaded to the message queue. However, the data are not yet flushed to the disk. 
Then Milvus' data node writes the data in the message queue to persistent storage as incremental logs. If `flush()` is called, the data node is forced to write all data in the message queue to persistent storage immediately.\\n\\n###\",\n", + " 0.812842607498169\n", + " ],\n", + " [\n", + " \"Does the query perform in memory? What are incremental data and historical data?\\n\\nYes. When a query request comes, Milvus searches both incremental data and historical data by loading them into memory. Incremental data are in the growing segments, which are buffered in memory before they reach the threshold to be persisted in storage engine, while historical data are from the sealed segments that are stored in the object storage. Incremental data and historical data together constitute the whole dataset to search.\\n\\n###\",\n", + " 0.7714196443557739\n", + " ]\n", + "]\n" + ] + } + ], + "source": [ + "import json\n", + "\n", + "retrieved_lines_with_distances = [\n", + "    (res[\"entity\"][\"text\"], res[\"distance\"]) for res in search_res[0]\n", + "]\n", + "print(json.dumps(retrieved_lines_with_distances, indent=4))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Use LLM to get a RAG response\n", + "\n", + "Convert the retrieved documents into a string format." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "context = \"\\n\".join(\n", + "    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Define the system and user prompts for the language model. The user prompt is assembled from the documents retrieved from Milvus." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "SYSTEM_PROMPT = \"\"\"\n", + "Human: You are an AI assistant. 
You are able to find answers to the questions from the contextual passage snippets provided.\n", + "\"\"\"\n", + "USER_PROMPT = f\"\"\"\n", + "Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.\n", + "<context>\n", + "{context}\n", + "</context>\n", + "<question>\n", + "{question}\n", + "</question>\n", + "\"\"\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `deepseek-ai/DeepSeek-V2.5` model provided by SiliconCloud to generate a response based on the prompts." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "In Milvus, data is stored in two main categories: inserted data and metadata.\n", + "\n", + "- **Inserted Data**: This includes vector data, scalar data, and collection-specific schema, which are stored in persistent storage as incremental logs. Milvus supports various object storage backends such as MinIO, AWS S3, Google Cloud Storage (GCS), Azure Blob Storage, Alibaba Cloud OSS, and Tencent Cloud Object Storage (COS).\n", + "\n", + "- **Metadata**: This is generated within Milvus, with each module having its own metadata stored in etcd, a distributed key-value store.\n" + ] + } + ], + "source": [ + "response = siliconflow_client.chat.completions.create(\n", + "    model=\"deepseek-ai/DeepSeek-V2.5\",\n", + "    messages=[\n", + "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n", + "        {\"role\": \"user\", \"content\": USER_PROMPT},\n", + "    ],\n", + ")\n", + "print(response.choices[0].message.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Great! We have successfully built a RAG pipeline with Milvus and SiliconFlow." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file
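
A note on the chunking cell in this patch: `file_text.split("# ")` also splits inside `##`/`###` headings and at any `# ` in running text, which is why the retrieved passages in the notebook output end with a stray `###`. Below is a minimal sketch of a heading-aware alternative. `split_markdown_sections` is a hypothetical helper, not part of the patch, and it assumes the FAQ files use ATX-style headings (one to six `#` characters followed by a space at the start of a line). It would still split on `# ` comment lines inside fenced code blocks, so treat it as a starting point only.

```python
import re
from typing import List


def split_markdown_sections(text: str) -> List[str]:
    """Split markdown text into sections, one per heading line.

    Hypothetical helper (not part of the patch): each returned section
    starts with its own heading, so no heading markers are left dangling
    at the end of the previous chunk.
    """
    # Split at the start of every line that begins with 1-6 '#' and a space.
    # The lookahead keeps the heading text inside the following section.
    parts = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    # Drop empty or whitespace-only pieces (e.g. the piece before the first heading).
    return [part.strip() for part in parts if part.strip()]


sample = "# Title\nintro\n\n### Q1?\nA1 uses C# code\n\n### Q2?\nA2\n"
for section in split_markdown_sections(sample):
    print(repr(section))
```

In the notebook, this could replace the `split("# ")` call inside the file loop, for example `text_lines += split_markdown_sections(file_text)`; the number of inserted rows would then differ from the 72 shown in the recorded output.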