diff --git a/authors.yaml b/authors.yaml index a3b7744228..5940a72366 100644 --- a/authors.yaml +++ b/authors.yaml @@ -402,6 +402,11 @@ daisyshe-oai: website: "https://www.linkedin.com/in/daisysheng/" avatar: "https://avatars.githubusercontent.com/u/212609991?v=4" +vyavdoshenko: + name: "Volodymyr Yavdoshenko" + website: "https://www.linkedin.com/in/volodymyr-yavdoshenko/" + avatar: "https://avatars.githubusercontent.com/u/41993419?v=4" + dkundel-openai: name: "Dominik Kundel" website: "https://www.linkedin.com/in/dominik-kundel/" diff --git a/examples/vector_databases/README.md b/examples/vector_databases/README.md index ebbb8fee0e..5bce945480 100644 --- a/examples/vector_databases/README.md +++ b/examples/vector_databases/README.md @@ -12,6 +12,7 @@ Each provider has their own named directory, with a standard notebook to introdu - [Azure AI Search](https://learn.microsoft.com/azure/search/search-get-started-vector) - [Azure SQL Database](https://learn.microsoft.com/azure/azure-sql/database/ai-artificial-intelligence-intelligent-applications?view=azuresql) - [Chroma](https://docs.trychroma.com/getting-started) +- [Dragonfly](https://www.dragonflydb.io/docs/) - [Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html) - [Hologres](https://www.alibabacloud.com/help/en/hologres/latest/procedure-to-use-hologres) - [Kusto](https://learn.microsoft.com/en-us/azure/data-explorer/web-query-data) diff --git a/examples/vector_databases/dragonfly/dragonfly-hybrid-query-examples.ipynb b/examples/vector_databases/dragonfly/dragonfly-hybrid-query-examples.ipynb new file mode 100644 index 0000000000..63023114c8 --- /dev/null +++ b/examples/vector_databases/dragonfly/dragonfly-hybrid-query-examples.ipynb @@ -0,0 +1,867 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "cb1537e6", + "metadata": {}, + "source": [ + "# Running Hybrid VSS Queries with Dragonfly and OpenAI\n", + "\n", + "This notebook provides an introduction to using 
Dragonfly as a vector database with OpenAI embeddings and running hybrid queries that combine VSS and lexical search. Dragonfly is a scalable, real-time database that can be used as a vector database. The Dragonfly Query and Search capability allows you to index and search for vectors in Dragonfly. This notebook will show you how to use the Dragonfly Query and Search to index and search for vectors created by using the OpenAI API and stored in Dragonfly.\n", + "\n", + "Hybrid queries combine vector similarity with traditional Dragonfly Query and Search filtering capabilities on GEO, NUMERIC, TAG or TEXT data simplifying application code. A common example of a hybrid query in an e-commerce use case is to find items visually similar to a given query image limited to items available in a GEO location and within a price range." + ] + }, + { + "cell_type": "markdown", + "id": "f1a618c5", + "metadata": {}, + "source": [ + "## Prerequisites\n", + "\n", + "Before we start this project, we need to set up the following:\n", + "\n", + "* Start Dragonfly\n", + "* Install libraries\n", + "* Get your [OpenAI API key](https://platform.openai.com/api-keys)" + ] + }, + { + "cell_type": "markdown", + "id": "d4860798", + "metadata": {}, + "source": [ + "## Start Dragonfly\n", + "\n", + "```bash\n", + "$ docker run -d -p 6379:6379 --name df docker.dragonflydb.io/dragonflydb/dragonfly\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "b9babafe", + "metadata": {}, + "source": [ + "## Install Requirements" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2b04113f", + "metadata": {}, + "outputs": [], + "source": [ + "! 
pip install -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "id": "36fe86f4", + "metadata": {}, + "source": [ + "## Prepare your OpenAI API key\n", + "\n", + "The OpenAI API key is used to generate embeddings for the product data and the search queries.\n", + "\n", + "If you don't have an OpenAI API key, you can get one from [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys).\n", + "\n", + "Once you have your key, add it to your environment variables as `OPENAI_API_KEY` using one of the following approaches:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "88be138c", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import openai\n", + "\n", + "# Set your OpenAI API key here\n", + "# Option 1: Set environment variable\n", + "# os.environ[\"OPENAI_API_KEY\"] = \"your-api-key-here\"\n", + "\n", + "# Option 2: Set directly in openai\n", + "# openai.api_key = \"your-api-key-here\"\n", + "\n", + "# Option 3: Use getpass for interactive input\n", + "import getpass\n", + "if not os.getenv(\"OPENAI_API_KEY\"):\n", + "    openai.api_key = getpass.getpass(\"OpenAI API Key:\")\n", + "else:\n", + "    openai.api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "    print(\"OPENAI_API_KEY loaded from environment\")" + ] + }, + { + "cell_type": "markdown", + "id": "97fefe4c", + "metadata": {}, + "source": [ + "## Load data\n", + "\n", + "In this section we'll load and clean an e-commerce dataset. We'll generate embeddings using OpenAI and use this data to create an index in Dragonfly and then search for similar vectors." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9fbebe0d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [
+ "<class 'pandas.core.frame.DataFrame'>\n",
+ "Index: 1978 entries, 0 to 1998\n",
+ "Data columns (total 10 columns):\n",
+ " #   Column              Non-Null Count  Dtype \n",
+ "---  ------              --------------  ----- \n",
+ " 0   id                  1978 non-null   int64 \n",
+ " 1   gender              1978 non-null   object\n",
+ " 2   masterCategory      1978 non-null   object\n",
+ " 3   subCategory         1978 non-null   object\n",
+ " 4   articleType         1978 non-null   object\n",
+ " 5   baseColour          1978 non-null   object\n",
+ " 6   season              1978 non-null   object\n",
+ " 7   year                1978 non-null   int64 \n",
+ " 8   usage               1978 non-null   object\n",
+ " 9   productDisplayName  1978 non-null   object\n",
+ "dtypes: int64(2), object(8)\n",
+ "memory usage: 170.0+ KB\n"
+ ] + }, + { + "data": { + "text/html": [
+ "<table>\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>id</th><th>gender</th><th>masterCategory</th><th>subCategory</th><th>articleType</th><th>baseColour</th><th>season</th><th>year</th><th>usage</th><th>productDisplayName</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>15970</td><td>Men</td><td>Apparel</td><td>Topwear</td><td>Shirts</td><td>Navy Blue</td><td>Fall</td><td>2011</td><td>Casual</td><td>Turtle Check Men Navy Blue Shirt</td></tr>\n",
+ "    <tr><th>1</th><td>39386</td><td>Men</td><td>Apparel</td><td>Bottomwear</td><td>Jeans</td><td>Blue</td><td>Summer</td><td>2012</td><td>Casual</td><td>Peter England Men Party Blue Jeans</td></tr>\n",
+ "    <tr><th>2</th><td>59263</td><td>Women</td><td>Accessories</td><td>Watches</td><td>Watches</td><td>Silver</td><td>Winter</td><td>2016</td><td>Casual</td><td>Titan Women Silver Watch</td></tr>\n",
+ "    <tr><th>3</th><td>21379</td><td>Men</td><td>Apparel</td><td>Bottomwear</td><td>Track Pants</td><td>Black</td><td>Fall</td><td>2011</td><td>Casual</td><td>Manchester United Men Solid Black Track Pants</td></tr>\n",
+ "    <tr><th>4</th><td>53759</td><td>Men</td><td>Apparel</td><td>Topwear</td><td>Tshirts</td><td>Grey</td><td>Summer</td><td>2012</td><td>Casual</td><td>Puma Men Grey T-shirt</td></tr>\n",
+ "  </tbody>\n",
+ "</table>"
+ ], + "text/plain": [
+ "      id gender masterCategory subCategory  articleType baseColour  season  \\\n",
+ "0  15970    Men        Apparel     Topwear       Shirts  Navy Blue    Fall   \n",
+ "1  39386    Men        Apparel  Bottomwear        Jeans       Blue  Summer   \n",
+ "2  59263  Women    Accessories     Watches      Watches     Silver  Winter   \n",
+ "3  21379    Men        Apparel  Bottomwear  Track Pants      Black    Fall   \n",
+ "4  53759    Men        Apparel     Topwear      Tshirts       Grey  Summer   \n",
+ "\n",
+ "   year   usage                            productDisplayName  \n",
+ "0  2011  Casual               Turtle Check Men Navy Blue Shirt  \n",
+ "1  2012  Casual             Peter England Men Party Blue Jeans  \n",
+ "2  2016  Casual                       Titan Women Silver Watch  \n",
+ "3  2011  Casual  Manchester United Men Solid Black Track Pants  \n",
+ "4  2012  Casual                          Puma Men Grey T-shirt  "
+ ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from typing import List\n", + "import openai\n", + "\n", + "EMBEDDING_MODEL = \"text-embedding-3-small\"\n", + "\n", + "# Simple embedding function compatible with openai==0.28.1\n", + "def get_embeddings(list_of_text: List[str], model=\"text-embedding-3-small\") -> List[List[float]]:\n", + "    assert len(list_of_text) <= 2048, \"The batch size should not be larger than 2048.\"\n", + "    # replace newlines, which can negatively affect performance.\n", + "    list_of_text = [text.replace(\"\\n\", \" \") for text in list_of_text]\n", + "    data = openai.Embedding.create(input=list_of_text, model=model)[\"data\"]\n", + "    return [d[\"embedding\"] for d in data]\n", + "\n", + "# load in data and clean data types and drop null rows\n", + "df = pd.read_csv(\"../../data/styles_2k.csv\", on_bad_lines='skip')\n", + "df.dropna(inplace=True)\n", + "df[\"year\"] = df[\"year\"].astype(int)\n", + "df.info()\n", + "\n", + "# print dataframe\n", + "n_examples = 5\n", + "df.head(n_examples)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "3ce1ec50", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [
+ "<class 'pandas.core.frame.DataFrame'>\n",
+ "Index: 1978 entries, 0 to 1998\n",
+ "Data columns (total 11 columns):\n",
+ " #   Column              Non-Null Count  Dtype \n",
+ "---  ------              --------------  ----- \n",
+ " 0   product_id          1978 non-null   int64 \n",
+ " 1   gender              1978 non-null   object\n",
+ " 2   masterCategory      1978 non-null   object\n",
+ " 3   subCategory         1978 non-null   object\n",
+ " 4   articleType         1978 non-null   object\n",
+ " 5   baseColour          1978 non-null   object\n",
+ " 6   season              1978 non-null   object\n",
+ " 7   year                1978 non-null   int64 \n",
+ " 8   usage               1978 non-null   object\n",
+ " 9   productDisplayName  1978 non-null   object\n",
+ " 10  product_text        1978 non-null   object\n",
+ "dtypes: int64(2), object(9)\n",
+ "memory usage: 185.4+ KB\n"
+ ] + } + ], + "source": [ + "df[\"product_text\"] = df.apply(lambda row: f\"name {row['productDisplayName']} category {row['masterCategory']} subcategory {row['subCategory']} color {row['baseColour']} gender {row['gender']}\".lower(), axis=1)\n", + "df.rename({\"id\":\"product_id\"}, inplace=True, axis=1)\n", + "\n", + "df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "13859ab5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# check out one of the texts we will use to create semantic embeddings\n", + "df[\"product_text\"][0]" + ] + }, + { + "cell_type": "markdown", + "id": "91df4d5b", + "metadata": {}, + "source": [ + "## Connect to Dragonfly\n", + "\n", + "Now that our Dragonfly process is running, we can connect to it. 
We will use the default host and port for the Dragonfly database, which is `localhost:6379`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "cc662c1b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import redis\n", + "from redis.commands.search.indexDefinition import (\n", + "    IndexDefinition,\n", + "    IndexType\n", + ")\n", + "from redis.commands.search.query import Query\n", + "from redis.commands.search.field import (\n", + "    TagField,\n", + "    NumericField,\n", + "    TextField,\n", + "    VectorField\n", + ")\n", + "\n", + "# Connect to Dragonfly\n", + "df_client = redis.Redis(\n", + "    host=\"localhost\",\n", + "    port=6379,\n", + "    password=\"\"\n", + ")\n", + "\n", + "df_client.ping()" + ] + }, + { + "cell_type": "markdown", + "id": "7d3dac3c", + "metadata": {}, + "source": [ + "## Creating a Search Index in Dragonfly\n", + "\n", + "The cells below show how to specify and create a search index in Dragonfly. We will:\n", + "\n", + "1. Set some constants for defining our index, such as the distance metric and the index name\n", + "2. Define the index schema with fields\n", + "3. Create the index" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "f894b911", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Constants\n", + "INDEX_NAME = \"product_embeddings\" # name of the search index\n", + "PREFIX = \"doc\" # prefix for the document keys\n", + "DISTANCE_METRIC = \"L2\" # distance metric for the vectors (e.g. COSINE, IP, L2)\n", + "NUMBER_OF_VECTORS = len(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "15db8380", + "metadata": {}, + "outputs": [], + "source": [ + "# Define fields for each of the columns in the dataset\n", + "name = TextField(name=\"productDisplayName\")\n", + "category = TagField(name=\"masterCategory\")\n", + "articleType = TagField(name=\"articleType\")\n", + "gender = TagField(name=\"gender\")\n", + "season = TagField(name=\"season\")\n", + "year = NumericField(name=\"year\")\n", + "text_embedding = VectorField(\"product_vector\",\n", + "    \"FLAT\", {\n", + "        \"TYPE\": \"FLOAT32\",\n", + "        \"DIM\": 1536,\n", + "        \"DISTANCE_METRIC\": DISTANCE_METRIC,\n", + "        \"INITIAL_CAP\": NUMBER_OF_VECTORS,\n", + "    }\n", + ")\n", + "fields = [name, category, articleType, gender, season, year, text_embedding]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "3658693c", + "metadata": {}, + "outputs": [], + "source": [ + "# Check if index exists\n", + "try:\n", + "    df_client.ft(INDEX_NAME).info()\n", + "    print(\"Index already exists\")\n", + "except redis.exceptions.ResponseError:\n", + "    # The index does not exist yet, so create it\n", + "    df_client.ft(INDEX_NAME).create_index(\n", + "        fields = fields,\n", + "        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)\n", + "    )" + ] + }, + { + "cell_type": "markdown", + "id": "775c15b4", + "metadata": {}, + "source": [ + "## Generate OpenAI Embeddings and Load Documents into the Index\n", + "\n", + "Now that we have a search index, we can load documents into it. We will use the dataframe containing the styles dataset loaded previously. In Dragonfly, either the HASH or JSON data types can be used to store documents. We will use the HASH data type in this example. The cells below show how to get OpenAI embeddings for the different products and load documents into the index." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "852cff45", + "metadata": {}, + "outputs": [], + "source": [ + "# Use OpenAI get_embeddings batch requests to speed up embedding creation\n", + "def embeddings_batch_request(documents: pd.DataFrame):\n", + "    records = documents.to_dict(\"records\")\n", + "    print(\"Records to process: \", len(records))\n", + "    product_vectors = []\n", + "    docs = []\n", + "    batchsize = 1000\n", + "\n", + "    for idx,doc in enumerate(records,start=1):\n", + "        # collect the text to embed\n", + "        docs.append(doc[\"product_text\"])\n", + "        if idx % batchsize == 0:\n", + "            product_vectors += get_embeddings(docs, EMBEDDING_MODEL)\n", + "            docs.clear()\n", + "            print(\"Vectors processed \", len(product_vectors), end='\\r')\n", + "    # embed the final partial batch, if any\n", + "    if docs:\n", + "        product_vectors += get_embeddings(docs, EMBEDDING_MODEL)\n", + "    print(\"Vectors processed \", len(product_vectors), end='\\r')\n", + "    return product_vectors" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "0d791186", + "metadata": {}, + "outputs": [], + "source": [ + "def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):\n", + "    product_vectors = embeddings_batch_request(documents)\n", + "    records = documents.to_dict(\"records\")\n", + "    batchsize = 500\n", + "\n", + "    pipe = client.pipeline()\n", + "    for idx,doc in enumerate(records,start=1):\n", + "        key = f\"{prefix}:{str(doc['product_id'])}\"\n", + "\n", + "        # convert the embedding (list of floats) to a float32 byte string\n", + "        text_embedding = np.array((product_vectors[idx-1]), dtype=np.float32).tobytes()\n", + "\n", + "        # store the byte vector in the document under the indexed vector field\n", + "        doc[\"product_vector\"] = text_embedding\n", + "\n", + "        pipe.hset(key, mapping = doc)\n", + "        if idx % batchsize == 0:\n", + "            pipe.execute()\n", + "    pipe.execute()" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5bfaeafa", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Records to process: 1978\n", + "Loaded 1978 documents in Dragonfly search index with name: product_embeddings\n", + "CPU times: user 546 ms, sys: 143 ms, total: 689 ms\n", + "Wall time: 5.29 s\n" + ] + } + ], + "source": [ + "%%time\n", + "index_documents(df_client, PREFIX, df)\n", + "print(f\"Loaded {df_client.info()['db0']['keys']} documents in Dragonfly search index with name: {INDEX_NAME}\")" + ] + }, + { + "cell_type": "markdown", + "id": "46050ca9", + "metadata": {}, + "source": [ + "## Simple Vector Search Queries with OpenAI Query Embeddings\n", + "\n", + "Now that we have a search index with documents loaded into it, we can run search queries. The function below runs a search query and returns the results. We will use it to run a few queries that show how to use Dragonfly as a vector database." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "b044aa93", + "metadata": {}, + "outputs": [], + "source": [ + "def search_df(\n", + "    df_client: redis.Redis,\n", + "    user_query: str,\n", + "    index_name: str = \"product_embeddings\",\n", + "    vector_field: str = \"product_vector\",\n", + "    return_fields: list = [\"productDisplayName\", \"masterCategory\", \"gender\", \"season\", \"year\", \"vector_score\"],\n", + "    hybrid_fields = \"*\",\n", + "    k: int = 20,\n", + "    print_results: bool = True,\n", + ") -> List[dict]:\n", + "\n", + "    # Use OpenAI to create embedding vector from user query\n", + "    embedded_query = openai.Embedding.create(input=user_query,\n", + "                                             model=\"text-embedding-3-small\",\n", + "                                             )[\"data\"][0]['embedding']\n", + "\n", + "    # Prepare the Query\n", + "    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'\n", + "    query = (\n", + "        Query(base_query)\n", + "        .return_fields(*return_fields)\n", + "        .sort_by(\"vector_score\")\n", + "        .paging(0, k)\n", + "        .dialect(2)\n", + "    )\n", + "    params_dict = {\"vector\": np.array(embedded_query).astype(dtype=np.float32).tobytes()}\n", + "\n", + "    # perform vector search\n", + "    results = df_client.ft(index_name).search(query, params_dict)\n", + "    if print_results:\n", + "        for i, product in enumerate(results.docs):\n", + "            # turn the returned distance into a score (higher means more similar)\n", + "            score = 1 - float(product.vector_score)\n", + "            print(f\"{i}. {product.productDisplayName} (Score: {round(score, 3)})\")\n", + "    return results.docs" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "7e2025f6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Locomotive Men Washed Blue Jeans (Score: 0.205)\n", + "1. Locomotive Men Washed Blue Jeans (Score: 0.205)\n", + "2. French Connection Men Blue Jeans (Score: 0.181)\n", + "3. John Players Men Blue Jeans (Score: 0.178)\n", + "4. Denizen Women Blue Jeans (Score: 0.17)\n", + "5. Lee Men Blue Chicago Fit Jeans (Score: 0.159)\n", + "6. Lee Men Blue Chicago Fit Jeans (Score: 0.159)\n", + "7. Peter England Men Party Blue Jeans (Score: 0.156)\n", + "8. Levis Kids Blue Solid Jean (Score: 0.145)\n", + "9. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.144)\n" + ] + } + ], + "source": [ + "# Execute a simple vector search in Dragonfly\n", + "results = search_df(df_client, 'man blue jeans', k=10)" + ] + }, + { + "cell_type": "markdown", + "id": "2007be48", + "metadata": {}, + "source": [ + "## Hybrid Queries with Dragonfly\n", + "\n", + "The previous example showed how to run vector search queries. In this section, we will show how to combine vector search with other fields for hybrid search. In the example below, we will combine vector search with full text search." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "2c81fbb7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Locomotive Men Washed Blue Jeans (Score: 0.205)\n", + "1. Locomotive Men Washed Blue Jeans (Score: 0.205)\n", + "2. French Connection Men Blue Jeans (Score: 0.181)\n", + "3. John Players Men Blue Jeans (Score: 0.178)\n", + "4. 
Denizen Women Blue Jeans (Score: 0.17)\n", + "5. Lee Men Blue Chicago Fit Jeans (Score: 0.159)\n", + "6. Lee Men Blue Chicago Fit Jeans (Score: 0.159)\n", + "7. Peter England Men Party Blue Jeans (Score: 0.156)\n", + "8. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.144)\n", + "9. Lee Men Tino Blue Jeans (Score: 0.136)\n" + ] + } + ], + "source": [ + "# improve search quality by adding hybrid query for \"man blue jeans\" in the product vector combined with a phrase search for \"blue jeans\"\n", + "results = search_df(df_client,\n", + " \"man blue jeans\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='@productDisplayName:blue jeans'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "8a56633b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Basics Men White Slim Fit Striped Shirt (Score: -0.106)\n", + "1. ADIDAS Men's Slim Fit White T-shirt (Score: -0.126)\n", + "2. Basics Men Red Slim Fit Checked Shirt (Score: -0.135)\n", + "3. Basics Men Navy Slim Fit Checked Shirt (Score: -0.142)\n", + "4. Basics Men Blue Slim Fit Checked Shirt (Score: -0.143)\n", + "5. Basics Men Blue Slim Fit Checked Shirt (Score: -0.143)\n", + "6. Tokyo Talkies Women Navy Slim Fit Jeans (Score: -0.174)\n", + "7. Lee Rinse Navy Blue Slim Fit Jeans (Score: -0.177)\n" + ] + } + ], + "source": [ + "# hybrid query for shirt in the product vector and only include results with the phrase \"slim fit\" in the title\n", + "results = search_df(df_client,\n", + " \"shirt\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='@productDisplayName:slim fit'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "6c25ee8d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Q&Q Women Blue Watch (Score: -0.121)\n", + "1. 
Q&Q Men Silver-Toned Dial Analogue Watch Q252J404Y (Score: -0.123)\n", + "2. Q&Q Women White Dial Watch (Score: -0.126)\n", + "3. Citizen Men Silver Dial Watch (Score: -0.13)\n", + "4. Q&Q Unisex Blue Dial Watch (Score: -0.131)\n", + "5. Q&Q Men Black Dial Watch (Score: -0.133)\n", + "6. Q&Q Men Black Dial Watch (Score: -0.133)\n", + "7. Q&Q Men Black Dial Watch (Score: -0.133)\n", + "8. Q&Q Men Black Dial Watch (Score: -0.133)\n", + "9. Q&Q Men Black Dial Watch (Score: -0.133)\n" + ] + } + ], + "source": [ + "# hybrid query for watch in the product vector and only include results with the tag \"Accessories\" in the masterCategory field\n", + "results = search_df(df_client,\n", + " \"watch\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='@masterCategory:{Accessories}'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "2c0d11d8", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Red Tape Men Black Sandals (Score: 0.061)\n", + "1. Coolers Men Black Sandals (Score: 0.056)\n", + "2. Coolers Men Black Sandals (Score: 0.056)\n", + "3. Gliders Men Yellow Sandals (Score: 0.043)\n", + "4. Coolers Men Black Sports Sandals (Score: 0.041)\n", + "5. Rocia Women Casual Black Sandal (Score: 0.031)\n", + "6. Ganuchi Men Casual Black Sandals (Score: 0.031)\n", + "7. Rocia Women Maroon Sandals (Score: 0.029)\n", + "8. Rocia Women Maroon Sandals (Score: 0.029)\n", + "9. 
Rocia Women Black & Brown Sandals (Score: 0.027)\n" + ] + } + ], + "source": [ + "# hybrid query for sandals in the product vector and only include results within the 2011-2012 year range\n", + "results = search_df(df_client,\n", + " \"sandals\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='@year:[2011 2012]'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "7caad384", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Red Tape Men Black Sandals (Score: 0.086)\n", + "1. Coolers Men Black Sandals (Score: 0.06)\n", + "2. Coolers Men Black Sandals (Score: 0.06)\n", + "3. Enroute Teens Orange Sandals (Score: 0.058)\n", + "4. Enroute Teens Brown Sandals (Score: 0.052)\n", + "5. Rocia Women Brown Sandals (Score: 0.051)\n", + "6. Puma Women Purple Techno Sandals (Score: 0.05)\n", + "7. Coolers Men Black Sports Sandals (Score: 0.046)\n", + "8. Enroute Kids Girls Pink Sandals (Score: 0.044)\n", + "9. Ganuchi Men Casual Olive Sandals (Score: 0.043)\n" + ] + } + ], + "source": [ + "# hybrid query for sandals in the product vector and only include results within the 2011-2012 year range from the summer season\n", + "results = search_df(df_client,\n", + " \"blue sandals\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='(@year:[2011 2012] @season:{Summer})'\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "f1232d3c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0. Wrangler Women Black Belt (Score: 0.03)\n", + "1. Wrangler Men Leather Brown Belt (Score: -0.002)\n", + "2. Wrangler Men Green Striped Shirt (Score: -0.194)\n", + "3. Wrangler Men Griffith White Shirt (Score: -0.209)\n", + "4. Wrangler Men Purple Striped Shirt (Score: -0.214)\n", + "5. 
Wrangler Women Stella Green Shirt (Score: -0.245)\n" + ] + } + ], + "source": [ + "# hybrid query for a brown belt, filtering results by year (NUMERIC), specific article types (TAG), and a brand name (TEXT)\n", + "results = search_df(df_client,\n", + " \"brown belt\",\n", + " vector_field=\"product_vector\",\n", + " k=10,\n", + " hybrid_fields='(@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:\"Wrangler\")'\n", + " )" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.18" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/vector_databases/dragonfly/requirements.txt b/examples/vector_databases/dragonfly/requirements.txt new file mode 100644 index 0000000000..9e960f7d2d --- /dev/null +++ b/examples/vector_databases/dragonfly/requirements.txt @@ -0,0 +1,15 @@ +redis==4.6.0 +openai==0.28.1 +pandas>=2.0.1 +numpy>=1.20.3 +wget>=3.2 + +python-dotenv>=1.0.0 + +jupyter>=1.0.0 +jupyterlab>=3.0.0 + +scipy>=1.9.0 +scikit-learn>=1.1.0 +matplotlib>=3.5.0 +plotly>=5.0.0 diff --git a/registry.yaml b/registry.yaml index c7db22961b..41696195ee 100644 --- a/registry.yaml +++ b/registry.yaml @@ -2376,3 +2376,11 @@ - katiagg tags: - images + +- title: Running hybrid VSS queries with Dragonfly and OpenAI + path: examples/vector_databases/dragonfly/dragonfly-hybrid-query-examples.ipynb + date: 2025-07-28 + authors: + - vyavdoshenko + tags: + - embeddings