diff --git a/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/ecommerce_dense_sparse_project.ipynb b/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/ecommerce_dense_sparse_project.ipynb index 42386dae..b8d8a7b5 100644 --- a/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/ecommerce_dense_sparse_project.ipynb +++ b/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/ecommerce_dense_sparse_project.ipynb @@ -9,22 +9,42 @@ "source": [ "# **Lexical and Semantic Search with Elasticsearch**\n", "\n", - "In this example, you will explore various approaches to retrieving information using Elasticsearch, focusing specifically on text, lexical and semantic search.\n", + "In the following examples, we will explore various approaches to retrieving information using Elasticsearch - focusing specifically on full text search, semantic search, and a hybrid combination of both.\n", "\n", - "To accomplish this, this example demonstrate various search scenarios on a dataset generated to simulate e-commerce product information.\n", + "To accomplish this, this example demonstrates various search scenarios on a dataset generated to simulate e-commerce product information.\n", "\n", - "This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products.\n", + "This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products. 
\n", "\n", + "Here is a sample of an object from the dataset:\n", + "\n", + "```json\n", + " {\n", + " \"product\": \"Samsung 49-inch Curved Gaming Monitor\",\n", + " \"description\": \"is a curved gaming monitor with a high refresh rate and AMD FreeSync technology.\",\n", + " \"category\": \"Monitors\"\n", + "}\n", + "\n", + "```\n", + "\n", + "We will consume the dataset from a JSON file into Elasticsearch using modern consumption patterns. We will then perform a series of search operations to demonstrate the different search strategies.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "6370f2e4", + "metadata": {}, + "source": [ "## **🧰 Requirements**\n", "\n", "For this example, you will need:\n", "\n", - "- Python 3.6 or later\n", + "- Python 3.11 or later\n", "- The Elastic Python client\n", - "- Elastic 8.8 deployment or later, with 8GB memory machine learning node\n", - "- The Elastic Learned Sparse EncodeR model that comes pre-loaded into Elastic installed and started on your deployment\n", + "- Elastic 9.0 deployment or later on either a local, cloud, or serverless environment\n", + "\n", "\n", - "We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html), a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) is available." + "We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html). You can use a [free trial here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) to get started." 
] }, { @@ -38,7 +58,7 @@ "\n", "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", "\n", - "Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.\n" + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud Endpoint** and **Cloud API Key** to identify our deployment. These may be found within Kibana by following the instructions [here](https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys).\n" ] }, { @@ -50,19 +70,22 @@ }, "outputs": [], "source": [ - "!pip install elasticsearch==8.8 #Elasticsearch" + "%pip install elasticsearch pandas IPython -q" ] }, { - "cell_type": "code", - "execution_count": null, - "id": "8c36e9b5-8f2b-4734-9213-1350caa7f837", - "metadata": { - "id": "8c36e9b5-8f2b-4734-9213-1350caa7f837" - }, - "outputs": [], + "cell_type": "markdown", + "id": "38b734aa", + "metadata": {}, "source": [ - "pip -q install eland elasticsearch sentence_transformers transformers torch==1.11 # Eland Python Client" + "### Import the required packages\n", + "We will import the following packages:\n", + "- `Elasticsearch`: a client library for Elasticsearch actions\n", + "- `bulk`: a function to perform Elasticsearch actions in bulk\n", + "- `getpass`: a module for receiving Elasticsearch credentials via text prompt\n", + "- `json`: a module for reading and writing JSON data\n", + "- `pandas`, `display`, `Markdown`: for data visualization and markdown formatting\n", + "\n" ] }, { @@ -74,19 +97,19 @@ }, "outputs": [], "source": [ - "from elasticsearch import (\n", - " Elasticsearch,\n", - " helpers,\n", - ") # Import the Elasticsearch client and helpers module\n", - "from urllib.request import urlopen # library for opening URLs\n", + "# import the Elasticsearch client and bulk function\n", + "from elasticsearch import Elasticsearch\n", + "from elasticsearch.helpers import bulk\n", + "\n", + "# import getpass module to handle Auth input\n", + 
"import getpass\n", + "\n", + "# import json module to read JSON file of products\n", "import json # module for handling JSON data\n", - "from pathlib import Path # module for working with file paths\n", "\n", - "# Python client and toolkit for machine learning in Elasticsearch\n", - "from eland.ml.pytorch import PyTorchModel\n", - "from eland.ml.pytorch.transformers import TransformerModel\n", - "from elasticsearch.client import MlClient # Elastic module for ml\n", - "import getpass # handling password input" + "# display search results in a table\n", + "import pandas as pd\n", + "from IPython.display import display, Markdown" ] }, { @@ -96,13 +119,12 @@ "id": "ea1VkDBXJIQR" }, "source": [ - "Now we can instantiate the Python Elasticsearch client.\n", - "\n", - "First we prompt the user for their password and Cloud ID.\n", - "\n", - "🔐 NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal, or storing it in memory.\n", + "### 📚 Instantiating the Elasticsearch Client\n", "\n", - "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class." 
+ "First we prompt the user for their Elastic Endpoint URL and Elastic API Key.\n", + "Then we create a `client` object that instantiates an instance of the `Elasticsearch` class.\n", + "Lastly, we verify that our client is connected to our Elasticsearch instance by calling `client.ping()`.\n", + "> 🔐 *NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal.*" ] }, { @@ -114,16 +136,19 @@ }, "outputs": [], "source": [ - "# Found in the 'Manage Deployment' page\n", - "CLOUD_ID = getpass.getpass(\"Enter Elastic Cloud ID: \")\n", + "# endpoint for Elasticsearch instance\n", + "ELASTIC_ENDPOINT = getpass.getpass(\"Enter Elastic Endpoint: \")\n", "\n", - "# Password for the 'elastic' user generated by Elasticsearch\n", - "ELASTIC_PASSWORD = getpass.getpass(\"Enter Elastic password: \")\n", + "# Elastic API key for Elasticsearch\n", + "ELASTIC_API_KEY = getpass.getpass(\"Enter Elastic API Key: \")\n", "\n", - "# Create the client instance\n", + "# create the Elasticsearch client instance\n", "client = Elasticsearch(\n", - " cloud_id=CLOUD_ID, basic_auth=(\"elastic\", ELASTIC_PASSWORD), request_timeout=3600\n", - ")" + " hosts=[ELASTIC_ENDPOINT], api_key=ELASTIC_API_KEY, request_timeout=3600\n", + ")\n", + "\n", + "resp = client.ping()\n", + "print(f\"Connected to Elastic instance: {resp}\")" ] }, { @@ -133,9 +158,11 @@ "id": "BH-N6epTJarM" }, "source": [ - "## Setup emebdding model\n", + "## Prepare our embedding model workflow\n", "\n", - "Next we upload the all-mpnet-base-v2 embedding model into Elasticsearch and create an ingest pipeline with inference processors for text embedding and text expansion, using the description field for both. This field contains the description of each product." + "Next we ensure our embedding models are available in Elasticsearch. We will use Elastic's provided `e5_multilingual_small` and `elser_V2` models to provide dense and sparse vectoring, respectively. 
Using these models out of the box will ensure they are up-to-date and ready for integration with Elasticsearch.\n", + "\n", + "Other models may be uploaded and deployed using [Eland](https://www.elastic.co/docs/reference/elasticsearch/clients/eland) or integrated using the [inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-azureopenai) to connect to third-party models." ] }, { @@ -147,28 +174,51 @@ }, "outputs": [], "source": [ - "# Set the model name from Hugging Face and task type\n", - "# sentence-transformers model\n", - "hf_model_id = \"sentence-transformers/all-mpnet-base-v2\"\n", - "tm = TransformerModel(hf_model_id, \"text_embedding\")\n", + "# Declare models and endpoint names predeployed by Elastic\n", + "elser_model = \".elser_model_2_linux-x86_64\"\n", + "elser_endpoint = \".elser-2-elasticsearch\"\n", "\n", - "# set the modelID as it is named in Elasticsearch\n", - "es_model_id = tm.elasticsearch_model_id()\n", + "e5_model = \".multilingual-e5-small_linux-x86_64\"\n", + "e5_endpoint = \".multilingual-e5-small-elasticsearch\"\n", "\n", - "# Download the model from Hugging Face\n", - "tmp_path = \"models\"\n", - "Path(tmp_path).mkdir(parents=True, exist_ok=True)\n", - "model_path, config, vocab_path = tm.save(tmp_path)\n", + "# Define (model, endpoint) tuples to check\n", + "model_endpoint_pairs = [(elser_model, elser_endpoint), (e5_model, e5_endpoint)]\n", "\n", - "# Load the model into Elasticsearch\n", - "ptm = PyTorchModel(client, es_model_id)\n", - "ptm.import_model(\n", - " model_path=model_path, config_path=None, vocab_path=vocab_path, config=config\n", - ")\n", + "# Fetch all loaded models and endpoints once\n", + "models = client.ml.get_trained_models()\n", + "model_ids = {model[\"model_id\"]: model for model in models[\"trained_model_configs\"]}\n", + "endpoints = client.inference.get()\n", + "endpoint_ids = {\n", + " endpoint[\"inference_id\"]: endpoint for endpoint in 
endpoints[\"endpoints\"]\n", + "}\n", "\n", - "# Start the model\n", - "s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)\n", - "s.body" + "# Check each (model, endpoint) pair\n", + "for model_id, endpoint_id in model_endpoint_pairs:\n", + " print(f\"Checking Model: {model_id}\")\n", + " model = model_ids.get(model_id)\n", + " if model:\n", + " print(f\" Model ID: {model['model_id']}\")\n", + " print(f\" Description: {model.get('description', 'No description')}\")\n", + " print(f\" Version: {model.get('version', 'N/A')}\")\n", + " else:\n", + " print(\" Model not found or not loaded.\")\n", + " print(f\"Checking Endpoint: {endpoint_id}\")\n", + " endpoint = endpoint_ids.get(endpoint_id)\n", + " if endpoint:\n", + " print(f\" Inference Endpoint ID: {endpoint['inference_id']}\")\n", + " print(f\" Task Type: {endpoint['task_type']}\")\n", + " else:\n", + " print(\" Endpoint not found or not ready.\")\n", + " print(\"------\")" + ] + }, + { + "cell_type": "markdown", + "id": "80506477", + "metadata": {}, + "source": [ + "### Create an inference pipeline\n", + "This function will create an ingest pipeline with inference processors to use `ELSER` (sparse_vector) and `e5_multilingual_small` (dense_vector) to infer against data that will be ingested in the pipeline. This allows us to automatically generate embeddings for the product descriptions when they are indexed into Elasticsearch." 
] }, { @@ -180,34 +230,37 @@ }, "outputs": [], "source": [ - "# Creating an ingest pipeline with inference processors to use ELSER (sparse) and all-mpnet-base-v2 (dense) to infer against data that will be ingested in the pipeline.\n", - "\n", - "client.ingest.put_pipeline(\n", - " id=\"ecommerce-pipeline\",\n", + "index_pipeline = \"ecommerce-pipeline\"\n", + "resp = client.ingest.put_pipeline(\n", + " id=index_pipeline,\n", " processors=[\n", " {\n", " \"inference\": {\n", - " \"model_id\": \"elser_model\",\n", - " \"target_field\": \"ml\",\n", - " \"field_map\": {\"description\": \"text_field\"},\n", - " \"inference_config\": {\n", - " \"text_expansion\": { # text_expansion inference type (ELSER)\n", - " \"results_field\": \"tokens\"\n", + " \"model_id\": elser_endpoint, # inference endpoint ID\n", + " \"input_output\": [\n", + " {\n", + " \"input_field\": \"description\", # source field\n", + " \"output_field\": \"elser_description_vector\", # destination vector field\n", " }\n", - " },\n", + " ],\n", " }\n", " },\n", " {\n", " \"inference\": {\n", - " \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n", - " \"target_field\": \"description_vector\", # Target field for the inference results\n", - " \"field_map\": {\n", - " \"description\": \"text_field\" # Field matching our configured trained model input. 
Typically for NLP models, the field name is text_field.\n", - " },\n", + " \"model_id\": e5_endpoint, # inference endpoint ID\n", + " \"input_output\": [\n", + " {\n", + " \"input_field\": \"description\", # source field\n", + " \"output_field\": \"e5_description_vector\", # destination vector field\n", + " }\n", + " ],\n", + " \"inference_config\": {\"text_embedding\": {}},\n", " }\n", " },\n", " ],\n", - ")" + ")\n", + "\n", + "print(f\"ecommerce-pipeline created: {resp['acknowledged']}\")" ] }, { @@ -218,88 +271,84 @@ }, "source": [ "## Index documents\n", + "The `ecommerce-search` index we are creating will include fields to support dense and sparse vector storage and search. \n", + "\n", + "We define the `e5_description_vector` and the `elser_description_vector` fields to store the inference pipeline results. \n", "\n", - "Then, we create a source index to load `products-ecommerce.json`, this will be the `ecommerce` index and a destination index to extract the documents from the source and index these documents into the destination `ecommerce-search`.\n", + "The field type in `e5_description_vector` is a `dense_vector`. The `.e5_multilingual_small` model has an embedding size of 384, so the dimension of the vector (dims) is set to 384. \n", "\n", - "For the `ecommerce-search` index we add a field to support dense vector storage and search `description_vector.predicted_value`, this is the target field for inference results. The field type in this case is `dense_vector`, the `all-mpnet-base-v2` model has embedding_size of 768, so dims is set to 768. We also add a `rank_features` field type to support the text expansion output." + "We also add an `elser_description_vector` field type to support the `sparse_vector` output from our `.elser_model_2_linux-x86_64` model. No further configuration is needed for this field for our use case." 
] }, { "cell_type": "code", "execution_count": null, - "id": "6e115bd0-e758-44db-b5b9-96217af472c1", + "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418", "metadata": { - "id": "6e115bd0-e758-44db-b5b9-96217af472c1" + "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418" }, "outputs": [], "source": [ - "# Index to load products-ecommerce.json docs\n", + "# define the index name and mapping\n", + "commerce_index = \"ecommerce-search\"\n", + "mappings = {\n", + " \"properties\": {\n", + " \"product\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"description\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"category\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"elser_description_vector\": {\"type\": \"sparse_vector\"},\n", + " \"e5_description_vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 384,\n", + " \"index\": \"true\",\n", + " \"similarity\": \"cosine\",\n", + " },\n", + " \"e5_semantic_description_vector\": {\n", + " \"type\": \"semantic_text\",\n", + " \"inference_id\": e5_endpoint,\n", + " },\n", + " \"elser_semantic_description_vector\": {\"type\": \"semantic_text\"},\n", + " }\n", + "}\n", "\n", - "client.indices.create(\n", - " index=\"ecommerce\",\n", - " mappings={\n", - " \"properties\": {\n", - " \"product\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " \"description\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " \"category\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " }\n", - " },\n", - ")" + "\n", + "if client.indices.exists(index=commerce_index):\n", + " client.indices.delete(index=commerce_index)\n", + "resp = client.indices.create(\n", + " index=commerce_index,\n", + " mappings=mappings,\n", + ")\n", + "\n", + "print(f\"Index {commerce_index} created: 
{resp['acknowledged']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "88db9926", + "metadata": {}, + "source": [ + "### Attach Pipeline to Index\n", + "Let's connect our pipeline to the index. This updates the settings of our index to use the pipeline we previously defined as the default.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418", - "metadata": { - "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418" - }, + "id": "c4830b74", + "metadata": {}, "outputs": [], "source": [ - "# Reindex dest index\n", - "\n", - "INDEX = \"ecommerce-search\"\n", - "client.indices.create(\n", - " index=INDEX,\n", - " settings={\"index\": {\"number_of_shards\": 1, \"number_of_replicas\": 1}},\n", - " mappings={\n", - " # Saving disk space by excluding the ELSER tokens and the dense_vector field from document source.\n", - " # Note: That should only be applied if you are certain that reindexing will not be required in the future.\n", - " \"_source\": {\"excludes\": [\"ml.tokens\", \"description_vector.predicted_value\"]},\n", - " \"properties\": {\n", - " \"product\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " \"description\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " \"category\": {\n", - " \"type\": \"text\",\n", - " \"fields\": {\"keyword\": {\"type\": \"keyword\", \"ignore_above\": 256}},\n", - " },\n", - " \"ml.tokens\": { # The name of the field to contain the generated tokens.\n", - " \"type\": \"rank_features\" # ELSER output must be ingested into a field with the rank_features field type.\n", - " },\n", - " \"description_vector.predicted_value\": { # Inference results field, target_field.predicted_value\n", - " \"type\": \"dense_vector\",\n", - " \"dims\": 768, # The all-mpnet-base-v2 model has embedding_size of 768, so dims is set to 768.\n", - " 
\"index\": \"true\",\n", - " \"similarity\": \"dot_product\", # When indexing vectors for approximate kNN search, you need to specify the similarity function for comparing the vectors.\n", - " },\n", - " },\n", - " },\n", - ")" + "resp = client.indices.put_settings(\n", + " index=commerce_index,\n", + " body={\"default_pipeline\": index_pipeline},\n", + ")\n", + "print(f\"Pipeline set for {commerce_index}: {resp['acknowledged']}\")" ] }, { @@ -309,9 +358,9 @@ "id": "Vo-LKu8TOT5j" }, "source": [ - "## Load documents\n", + "### Load documents\n", "\n", - "Then we load `products-ecommerce.json` into the `ecommerce` index." + "We load the contents of`products-ecommerce.json` into the `ecommerce-search` index. We will use the `bulk` helper function to efficiently index our documents en masse. " ] }, { @@ -323,91 +372,102 @@ }, "outputs": [], "source": [ - "# dataset\n", - "\n", - "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/02c01b3450e8ddc72ccec85d559eee5280c185ac/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/products-ecommerce.json\" # json raw file - update the link here\n", - "\n", - "response = urlopen(url)\n", - "\n", - "# Load the response data into a JSON object\n", - "data_json = json.loads(response.read())\n", + "# Load the dataset\n", + "with open(\"products-ecommerce.json\", \"r\") as f:\n", + " data_json = json.load(f)\n", "\n", "\n", + "# helper function to create bulk indexing body\n", "def create_index_body(doc):\n", - " \"\"\"Generate the body for an Elasticsearch document.\"\"\"\n", + " doc[\"elser_semantic_description_vector\"] = doc[\"description\"]\n", + " doc[\"e5_semantic_description_vector\"] = doc[\"description\"]\n", + "\n", " return {\n", - " \"_index\": \"ecommerce\",\n", + " \"_index\": \"ecommerce-search\",\n", " \"_source\": doc,\n", " }\n", "\n", "\n", - "# Prepare the documents to be indexed\n", + "# prepare the documents array payload\n", "documents = [create_index_body(doc) for doc in 
data_json]\n", "\n", - "# Use helpers.bulk to index\n", - "helpers.bulk(client, documents)\n", - "\n", - "print(\"Done indexing documents into `ecommerce` index\")" + "# use bulk function to index\n", + "try:\n", + " print(\"Indexing documents...\")\n", + " resp = bulk(client, documents)\n", + " print(f\"Documents indexed successfully: {resp[0]}\")\n", + "except Exception as e:\n", + " print(f\"Error indexing documents: {e}\")" ] }, { "cell_type": "markdown", - "id": "3dShN9W4Opl8", + "id": "-qUXNuOvPDsI", "metadata": { - "id": "3dShN9W4Opl8" + "id": "-qUXNuOvPDsI" }, "source": [ - "## Reindex\n", + "## Text Analysis\n", + "The classic way documents are ranked for relevance by Elasticsearch based on a text query uses the Lucene implementation of the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) model, a **sparse model for lexical search**. This method follows the traditional approach for text search, looking for exact term matches.\n", "\n", - "Now we can reindex data from the `source` index `ecommerce` to the `dest` index `ecommerce-search` with the ingest pipeline `ecommerce-pipeline` we created.\n", + "To make this search possible, Elasticsearch converts **text field** data into a searchable format by performing text analysis.\n", "\n", - "After this step our `dest` index will have the fields we need to perform Semantic Search." + "**Text analysis** is performed by an [analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html), a set of rules to govern the process of extracting relevant tokens for searching. An analyzer must have exactly one [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). The tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words.) 
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5f51e460", + "metadata": {}, + "source": [ + "### Standard Analyzer\n", + "In the example below we are using the default analyzer, the standard analyzer, which works well for most use cases as it provides grammar-based tokenization. Tokenization enables matching on individual terms, but each token is still matched literally." ] }, { "cell_type": "code", "execution_count": null, - "id": "4297cb0b-ae2e-44f9-811d-27a41c43a858", + "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039", "metadata": { - "id": "4297cb0b-ae2e-44f9-811d-27a41c43a858" + "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039" }, "outputs": [], "source": [ - "# Reindex data from one index 'source' to another 'dest' with the 'ecommerce-pipeline' pipeline.\n", + "# Define the text to be analyzed\n", + "text = \"Comfortable furniture for a large balcony\"\n", "\n", - "client.reindex(\n", - " wait_for_completion=True,\n", - " source={\"index\": \"ecommerce\"},\n", - " dest={\"index\": \"ecommerce-search\", \"pipeline\": \"ecommerce-pipeline\"},\n", - ")" + "# Define the analyze request\n", + "request_body = {\"analyzer\": \"standard\", \"text\": text} # Standard Analyzer\n", + "\n", + "# Perform the analyze request\n", + "resp = client.indices.analyze(\n", + " analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n", + ")\n", + "\n", + "# Extract and display the analyzed tokens\n", + "standard_tokens = [token[\"token\"] for token in resp[\"tokens\"]]\n", + "print(\"Standard-analyzed Tokens:\", standard_tokens)" ] }, { "cell_type": "markdown", - "id": "-qUXNuOvPDsI", - "metadata": { - "id": "-qUXNuOvPDsI" - }, + "id": "fb75f526", + "metadata": {}, "source": [ - "## Text Analysis with Standard Analyzer" + "### Stop Analyzer\n", + "If you want to personalize your search experience you can choose a different built-in analyzer. 
For example, the stop analyzer breaks the text into tokens at any non-letter character and removes stop words." ] }, { "cell_type": "code", "execution_count": null, - "id": "829ae6e8-807d-4f0d-ada6-fee86748b91a", - "metadata": { - "id": "829ae6e8-807d-4f0d-ada6-fee86748b91a" - }, + "id": "3e3fdcff", + "metadata": {}, "outputs": [], "source": [ - "# Performs text analysis on a string and returns the resulting tokens.\n", - "\n", - "# Define the text to be analyzed\n", - "text = \"Comfortable furniture for a large balcony\"\n", - "\n", "# Define the analyze request\n", - "request_body = {\"analyzer\": \"standard\", \"text\": text} # Standard Analyzer\n", + "request_body = {\"analyzer\": \"stop\", \"text\": text}\n", "\n", "# Perform the analyze request\n", "response = client.indices.analyze(\n", @@ -415,45 +475,120 @@ ")\n", "\n", "# Extract and display the analyzed tokens\n", - "tokens = [token[\"token\"] for token in response[\"tokens\"]]\n", - "print(\"Analyzed Tokens:\", tokens)" + "stop_tokens = [token[\"token\"] for token in response[\"tokens\"]]\n", + "print(\"Stop-analyzed Tokens:\", stop_tokens)" ] }, { "cell_type": "markdown", - "id": "12u70NLmPyNV", - "metadata": { - "id": "12u70NLmPyNV" - }, + "id": "aba7fad6", + "metadata": {}, "source": [ - "## Text Analysis with Stop Analyzer" + "### Custom Analyzer\n", + "When the built-in analyzers do not fulfill your needs, you can create a [custom analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html),\n", + "which uses the appropriate combination of zero or more [character filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html), a [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) and zero or more [token filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html).\n", + "\n", + "In 
the below example that combines a character filter, a tokenizer, and token filters, the text will be lowercased by the [lowercase filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html) before being processed by the [ASCII folding token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html).\n", + "\n", + "> Note: you cannot pass a custom analyzer definition inline to analyze. Define the analyzer in your index settings, then reference it by name in the analyze call. For this reason we will create a temporary index to store the analyzer." ] }, { "cell_type": "code", "execution_count": null, - "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039", - "metadata": { - "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039" - }, + "id": "d44f3e2b", + "metadata": {}, "outputs": [], "source": [ - "# Performs text analysis on a string and returns the resulting tokens.\n", + "index_settings = {\n", + " \"settings\": {\n", + " \"analysis\": {\n", + " \"analyzer\": {\n", + " \"my_custom_analyzer\": {\n", + " \"type\": \"custom\",\n", + " \"tokenizer\": \"standard\",\n", + " \"char_filter\": [\"html_strip\"],\n", + " \"filter\": [\"lowercase\", \"asciifolding\"],\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", "\n", - "# Define the text to be analyzed\n", - "text = \"Comfortable furniture for a large balcony\"\n", + "custom_text = \"Čōmføřțǎble Fůrñíturę Fòr â ľarğe Bałcony\"\n", "\n", - "# Define the analyze request\n", - "request_body = {\"analyzer\": \"stop\", \"text\": text} # Stop Analyzer\n", + "# Create a temporary index with the custom analyzer\n", + "client.indices.create(index=\"temporary_index\", body=index_settings)\n", "\n", "# Perform the analyze request\n", - "response = client.indices.analyze(\n", - " analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n", + "resp = client.indices.analyze(\n", + " index=\"temporary_index\", analyzer=\"my_custom_analyzer\", text=custom_text\n", ")\n", 
"\n", "# Extract and display the analyzed tokens\n", - "tokens = [token[\"token\"] for token in response[\"tokens\"]]\n", - "print(\"Analyzed Tokens:\", tokens)" + "custom_tokens = [token[\"token\"] for token in resp[\"tokens\"]]\n", + "print(\"Custom Tokens:\", custom_tokens)\n", + "\n", + "# Delete the temporary index\n", + "client.indices.delete(index=\"temporary_index\")" + ] + }, + { + "cell_type": "markdown", + "id": "432620b6", + "metadata": {}, + "source": [ + "### Text Analysis Results\n", + "In the table below, we can observe that analyzers both included with Elasticsearch and custom made may be included with your search requests to improve the quality of your search results by reducing or refining the content being searched. Attention should be paid to your particular use case and the needs of your users." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c5d11cb", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Standard Token Analyzer\")\n", + "print(f\"Before: \\n{text}\")\n", + "print(f\"After: \\n{standard_tokens}\")\n", + "print(\"===================\")\n", + "print(\"Stop Token Analyzer\")\n", + "print(f\"Before: \\n{text}\")\n", + "print(f\"After: \\n{stop_tokens}\")\n", + "print(\"===================\")\n", + "print(\"Custom Token Analyzer\")\n", + "print(f\"Before: \\n{custom_text}\")\n", + "print(f\"After: \\n{custom_tokens}\")" + ] + }, + { + "cell_type": "markdown", + "id": "db4f86e3", + "metadata": {}, + "source": [ + "## Search \n", + "The remainder of this notebook will cover the following search types:\n", + "\n", + "\n", + "- Lexical Search\n", + "- Semantic Search \n", + " - ELSER Semantic Search (Sparse Vector)\n", + " - E5 Semantic Search (Dense Vector)\n", + " - ELSER Semantic Search with `semantic_text`\n", + " - E5 Semantic Search with `semantic_text`\n", + "- Hybrid Search\n", + " - E5 + Lexical (linear combination)\n", + " - E5 + Lexical (RRF)\n", + " - ELSER + Lexical (linear combination)\n", + 
" - ELSER + Lexical (RRF)\n", + "- ES|QL Search\n", + " - Semantic Search ES|QL\n", + " - ELSER ES|QL\n", + " - E5 ES|QL\n", + " - ELSER ES|QL with `semantic_text`\n", + " - E5 ES|QL with `semantic_text`\n", + " " ] }, { @@ -463,7 +598,8 @@ "id": "8G8MKcUvP0zs" }, "source": [ - "## Lexical Search" + "### Lexical Search\n", + "Our first search will be a straightforward BM25 text search within the description field. We are storing all of our results in a results_list for a final comparison at the end of the notebook. A convenience function to display the results is also defined." ] }, { @@ -475,9 +611,25 @@ }, "outputs": [], "source": [ - "# BM25\n", + "results_list = []\n", "\n", - "response = client.search(\n", + "\n", + "def print_search_results(search_results):\n", + " if not search_results:\n", + " print(\"No matches found\")\n", + " else:\n", + " for hit in search_results:\n", + " score = hit[\"_score\"]\n", + " product = hit[\"_source\"][\"product\"]\n", + " category = hit[\"_source\"][\"category\"]\n", + " description = hit[\"_source\"][\"description\"]\n", + " print(\n", + " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", + " )\n", + "\n", + "\n", + "# Regular BM25 (Lexical) Search\n", + "resp = client.search(\n", " size=2,\n", " index=\"ecommerce-search\",\n", " query={\n", @@ -488,20 +640,12 @@ " }\n", " }\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", - "hits = response[\"hits\"][\"hits\"]\n", "\n", - "if not hits:\n", - " print(\"No matches found\")\n", - "else:\n", - " for hit in hits:\n", - " score = hit[\"_score\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "lexical_search_results = 
resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"lexical_search\": lexical_search_results})\n", + "print_search_results(lexical_search_results)" ] }, { @@ -511,7 +655,7 @@ "id": "xiywcf_-P39a" }, "source": [ - "## Semantic Search with Dense Vector" + "### Semantic Search with Dense Vector" ] }, { @@ -523,33 +667,26 @@ }, "outputs": [], "source": [ - "# KNN\n", - "\n", "response = client.search(\n", " index=\"ecommerce-search\",\n", " size=2,\n", " knn={\n", - " \"field\": \"description_vector.predicted_value\",\n", + " \"field\": \"e5_description_vector\",\n", " \"k\": 50, # Number of nearest neighbors to return as top hits.\n", " \"num_candidates\": 500, # Number of nearest neighbor candidates to consider per shard. Increasing num_candidates tends to improve the accuracy of the final k results.\n", " \"query_vector_builder\": { # Object indicating how to build a query_vector. kNN search enables you to perform semantic search by using a previously deployed text embedding model.\n", " \"text_embedding\": {\n", - " \"model_id\": \"sentence-transformers__all-mpnet-base-v2\", # Text embedding model id\n", + " \"model_id\": \".multilingual-e5-small-elasticsearch\", # Text embedding model id\n", " \"model_text\": \"Comfortable furniture for a large balcony\", # Query\n", " }\n", " },\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", "\n", - "for hit in response[\"hits\"][\"hits\"]:\n", - "\n", - " score = hit[\"_score\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "dense_semantic_search_results = response[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_semantic_search\": dense_semantic_search_results})\n", + 
"print_search_results(dense_semantic_search_results)" ] }, { @@ -559,7 +696,77 @@ "id": "QlWFdngRQFbv" }, "source": [ - "## Semantic Search with Sparse Vector" + "### Semantic Search with Sparse Vector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5475e21", + "metadata": {}, + "outputs": [], + "source": [ + "# Elastic Learned Sparse Encoder - ELSER\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": \".elser-2-elasticsearch\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "\n", + "sparse_semantic_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"sparse_semantic_search\": sparse_semantic_search_results})\n", + "print_search_results(sparse_semantic_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "3a2a5267", + "metadata": {}, + "source": [ + "### Semantic Search with `semantic_text` Type (ELSER)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d2fb926", + "metadata": {}, + "outputs": [], + "source": [ + "# Elastic Learned Sparse Encoder - ELSER\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"semantic\": {\n", + " \"field\": \"elser_semantic_description_vector\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "elser_semantic_text_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"elser_semantic_text_search\": sparse_semantic_search_results})\n", + "print_search_results(elser_semantic_text_search_results)" + ] + }, + { + "cell_type": 
"markdown", + "id": "1df079f3", + "metadata": {}, + "source": [ + "### Semantic Search with `semantic_text` Type (e5)" ] }, { @@ -573,28 +780,75 @@ "source": [ "# Elastic Learned Sparse Encoder - ELSER\n", "\n", - "response = client.search(\n", + "resp = client.search(\n", " index=\"ecommerce-search\",\n", " size=2,\n", " query={\n", - " \"text_expansion\": {\n", - " \"ml.tokens\": {\n", - " \"model_id\": \"elser_model\",\n", - " \"model_text\": \"Comfortable furniture for a large balcony\",\n", - " }\n", + " \"semantic\": {\n", + " \"field\": \"e5_semantic_description_vector\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", " }\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", "\n", - "for hit in response[\"hits\"][\"hits\"]:\n", + "e5_semantic_text_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"e5_semantic_text_search\": e5_semantic_text_search_results})\n", + "print_search_results(e5_semantic_text_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "6b5016f3", + "metadata": {}, + "source": [ + "### Hybrid Search - BM25 + `semantic_text` Type" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c174fc71", + "metadata": {}, + "outputs": [], + "source": [ + "# BM25 + semantic_text (RRF)\n", + "top_k = 2\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " retriever={\n", + " \"rrf\": {\n", + " \"retrievers\": [\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"match\": {\n", + " \"description\": \"A dining table and comfortable chairs for a large balcony\"\n", + " }\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"semantic\": {\n", + " \"field\": \"e5_semantic_description_vector\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " }\n", + " }\n", + " },\n", + " ],\n", + " \"rank_window_size\": 2,\n", + " 
\"rank_constant\": 20,\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", "\n", - " score = hit[\"_score\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "dense_rrf_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_rrf_search\": dense_rrf_search_results})\n", + "print_search_results(dense_rrf_search_results)" ] }, { @@ -604,7 +858,7 @@ "id": "kz9deDBYQJxr" }, "source": [ - "## Hybrid Search - BM25+KNN linear combination" + "### Hybrid Search - BM25 + Dense Vector linear combination" ] }, { @@ -617,8 +871,8 @@ "outputs": [], "source": [ "# BM25 + KNN (Linear Combination)\n", - "\n", - "response = client.search(\n", + "query = \"A dining table and comfortable chairs for a large balcony\"\n", + "resp = client.search(\n", " index=\"ecommerce-search\",\n", " size=2,\n", " query={\n", @@ -627,8 +881,8 @@ " {\n", " \"match\": {\n", " \"description\": {\n", - " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", - " \"boost\": 1, # You can adjust the boost value\n", + " \"query\": query,\n", + " \"boost\": 1,\n", " }\n", " }\n", " }\n", @@ -636,28 +890,23 @@ " }\n", " },\n", " knn={\n", - " \"field\": \"description_vector.predicted_value\",\n", - " \"k\": 50,\n", - " \"num_candidates\": 500,\n", - " \"boost\": 1, # You can adjust the boost value\n", + " \"field\": \"e5_description_vector\",\n", + " \"k\": 2,\n", + " \"num_candidates\": 20,\n", + " \"boost\": 1,\n", " \"query_vector_builder\": {\n", " \"text_embedding\": {\n", - " \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n", - " \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n", + " \"model_id\": 
\".multilingual-e5-small-elasticsearch\",\n", + " \"model_text\": query,\n", " }\n", " },\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", "\n", - "for hit in response[\"hits\"][\"hits\"]:\n", - "\n", - " score = hit[\"_score\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "dense_linear_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_linear_search\": dense_linear_search_results})\n", + "print_search_results(dense_linear_search_results)" ] }, { @@ -667,7 +916,9 @@ "id": "cybkWjmpQV8g" }, "source": [ - "## Hybrid Search - BM25+KNN RRF" + "### Hybrid Search - BM25 + Dense Vector Reverse Reciprocal Fusion (RRF)\n", + "\n", + "[Reciprocal rank fusion](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion) (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results." ] }, { @@ -680,52 +931,45 @@ "outputs": [], "source": [ "# BM25 + KNN (RRF)\n", - "# RRF functionality is in technical preview and may be changed or removed in a future release. 
The syntax will likely change before GA.\n", - "\n", - "response = client.search(\n", + "top_k = 2\n", + "resp = client.search(\n", " index=\"ecommerce-search\",\n", - " size=2,\n", - " query={\n", - " \"bool\": {\n", - " \"should\": [\n", + " retriever={\n", + " \"rrf\": {\n", + " \"retrievers\": [\n", " {\n", - " \"match\": {\n", - " \"description\": {\n", - " \"query\": \"A dining table and comfortable chairs for a large balcony\"\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"match\": {\n", + " \"description\": \"A dining table and comfortable chairs for a large balcony\"\n", + " }\n", " }\n", " }\n", - " }\n", - " ]\n", - " }\n", - " },\n", - " knn={\n", - " \"field\": \"description_vector.predicted_value\",\n", - " \"k\": 50,\n", - " \"num_candidates\": 500,\n", - " \"query_vector_builder\": {\n", - " \"text_embedding\": {\n", - " \"model_id\": \"sentence-transformers__all-mpnet-base-v2\",\n", - " \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n", - " }\n", - " },\n", - " },\n", - " rank={\n", - " \"rrf\": { # Reciprocal rank fusion\n", - " \"window_size\": 50, # This value determines the size of the individual result sets per query.\n", - " \"rank_constant\": 20, # This value determines how much influence documents in individual result sets per query have over the final ranked result set.\n", + " },\n", + " {\n", + " \"knn\": {\n", + " \"field\": \"e5_description_vector\",\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": e5_endpoint,\n", + " \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n", + " }\n", + " },\n", + " \"k\": 2,\n", + " \"num_candidates\": 20,\n", + " }\n", + " },\n", + " ],\n", + " \"rank_window_size\": 2,\n", + " \"rank_constant\": 20,\n", " }\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", "\n", - "for hit in response[\"hits\"][\"hits\"]:\n", - "\n", - " rank = 
hit[\"_rank\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nRank: {rank}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "dense_rrf_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_rrf_search\": dense_rrf_search_results})\n", + "print_search_results(dense_rrf_search_results)" ] }, { @@ -735,7 +979,7 @@ "id": "LyKI2Z-XQbI6" }, "source": [ - "## Hybrid Search - BM25+ELSER linear combination" + "### Hybrid Search - BM25 + Sparse Vector linear combination" ] }, { @@ -749,7 +993,7 @@ "source": [ "# BM25 + Elastic Learned Sparse Encoder (Linear Combination)\n", "\n", - "response = client.search(\n", + "resp = client.search(\n", " index=\"ecommerce-search\",\n", " size=2,\n", " query={\n", @@ -764,28 +1008,244 @@ " }\n", " },\n", " {\n", - " \"text_expansion\": {\n", - " \"ml.tokens\": {\n", - " \"model_id\": \"elser_model\",\n", - " \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n", - " \"boost\": 1, # You can adjust the boost value\n", - " }\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": elser_endpoint,\n", + " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", " }\n", " },\n", " ]\n", " }\n", " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "sparse_linear_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"sparse_linear_search\": sparse_linear_search_results})\n", + "print_search_results(sparse_linear_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "e3d5e4e9", + "metadata": {}, + "source": [ + "### Hybrid Search - BM25 + Sparse Vector Reciprocal Rank Fusion (RRF)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "199c5c60", 
+ "metadata": {}, + "outputs": [], + "source": [ + "# BM25 + ELSER (RRF)\n", + "top_k = 2\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " retriever={\n", + " \"rrf\": {\n", + " \"retrievers\": [\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"match\": {\n", + " \"description\": \"A dining table and comfortable chairs for a large balcony\"\n", + " }\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": elser_endpoint,\n", + " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", + " }\n", + " }\n", + " }\n", + " },\n", + " ],\n", + " \"rank_window_size\": 2,\n", + " \"rank_constant\": 20,\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", ")\n", "\n", - "for hit in response[\"hits\"][\"hits\"]:\n", + "print(resp[\"hits\"][\"hits\"])\n", + "sparse_rrf_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"sparse_rrf_search\": sparse_rrf_search_results})\n", + "print_search_results(sparse_rrf_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "f11de3ac", + "metadata": {}, + "source": [ + "### ES|QL Search\n", + "Elastic offers its own query language called ES|QL. ES|QL is a SQL-like query language that allows you to search and analyze data in Elasticsearch. Further information can be found in the [official documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/esql.html)." + ] + }, + { + "cell_type": "markdown", + "id": "9d1343a4", + "metadata": {}, + "source": [ + "#### Lexical Search with ES|QL\n", + "This demonstrates the lexical search capabilities of ES|QL using the match function. The function `MATCH` specifically searches for matches in a query string within a specified field. 
In the example below, we search for documents containing the phrase \"Comfortable furniture for a large balcony\" in the description field.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "91c3d193", + "metadata": {}, + "outputs": [], + "source": [ + "\"\"\" Convert search_results from es|ql to a dict with _source\n", + " and subproperties of score, description, category, and product \"\"\"\n", + "\n", + "\n", + "def normalize_results(search_results):\n", + " normalized_results = []\n", + " results = search_results.body[\"values\"]\n", + " for result in results:\n", + " new_result = {\"_source\": {}}\n", + " new_result[\"_score\"] = result[-1]\n", + " new_result[\"_source\"][\"product\"] = result[-2]\n", + " new_result[\"_source\"][\"category\"] = result[0]\n", + " new_result[\"_source\"][\"description\"] = result[1]\n", + " normalized_results.append(new_result)\n", + "\n", + " return normalized_results\n", + "\n", + "\n", + "esql_query = \"\"\"\n", + "FROM ecommerce-search METADATA _score\n", + "| WHERE match(description, \"Comfortable furniture for a large balcony\")\n", + "| SORT _score DESC\n", + "| LIMIT 2\n", + "\"\"\"\n", + "\n", + "resp = client.esql.query(query=esql_query)\n", + "esql_lexical_search_results = normalize_results(resp)\n", + "results_list.append({\"esql_lexical_search\": esql_lexical_search_results})\n", + "print_search_results(esql_lexical_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "dbbdb5f3", + "metadata": {}, + "source": [ + "#### Semantic Search with ES|QL\n", + "To perform a semantic search using ES|QL, use the `semantic_text` type for your query. This will run a similarity search based on the semantic meaning of the text, rather than the lexical (word-level) matching of the `text` type. 
As with the `semantic_text` queries issued through the Python client, the ES|QL query is simple to write and understand.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ba37cd7", + "metadata": {}, + "outputs": [], + "source": [ + "esql_query = \"\"\"\n", + "FROM ecommerce-search METADATA _score\n", + "| WHERE elser_semantic_description_vector:\"Comfortable furniture for a large balcony\"\n", + "| SORT _score DESC\n", + "| LIMIT 2\n", + "\"\"\"\n", + "\n", + "resp = client.esql.query(query=esql_query)\n", + "esql_semantic_search_results = normalize_results(resp)\n", + "results_list.append({\"esql_semantic_search\": esql_semantic_search_results})\n", + "print_search_results(esql_semantic_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "7b95f9b8", + "metadata": {}, + "source": [ + "## Compiled Results\n", + "Here are the results of the previous searches. We can see that all of the searches return approximately the same products." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1162a857", + "metadata": {}, + "outputs": [], + "source": [ + "# Flatten results for each search type, preserving insertion order\n", + "rows = []\n", + "for result in results_list:\n", + " search_type = list(result.keys())[0]\n", + " for doc in result[search_type]:\n", + " row = {\n", + " \"search_type\": search_type,\n", + " \"product\": doc[\"_source\"].get(\"product\"),\n", + " \"category\": doc[\"_source\"].get(\"category\"),\n", + " \"description\": doc[\"_source\"].get(\"description\"),\n", + " \"score\": doc.get(\"_score\"),\n", + " }\n", + " rows.append(row)\n", + "\n", + "# Create DataFrame without altering row order\n", + "df = pd.DataFrame(rows)\n", + "\n", + "# Get the unique search_types in order of appearance\n", + "ordered_search_types = []\n", + "for row in rows:\n", + " st = row[\"search_type\"]\n", + " if st not in ordered_search_types:\n", + " ordered_search_types.append(st)\n", + "\n", + "for search_type in ordered_search_types:\n", + " group = df[df[\"search_type\"] == search_type]\n", + " display(Markdown(f\"### {search_type.replace('_', ' ').title()}\"))\n", + " styled = (\n", + " group.drop(columns=\"search_type\")\n", + " .reset_index(drop=True)\n", + " .style.set_properties(\n", + " subset=[\"description\"],\n", + " **{\"white-space\": \"pre-wrap\", \"word-break\": \"break-word\"},\n", + " )\n", + " .hide(axis=\"index\") # For pandas >=1.4.0\n", + " )\n", + " display(styled)" + ] + }, + { + "cell_type": "markdown", + "id": "b08c83b6", + "metadata": {}, + "source": [ + "As can be seen in the results, the semantic search query provides more relevant results than the lexical search query. 
This is because the semantic search query uses a `semantic_text` field, which stores a learned vector representation of the text (sparse for ELSER, dense for E5), while the lexical search query uses the `description` field, which matches on the literal terms of the text. 
Nuances and context are better captured by the semantic search query, making it more effective for finding relevant results." + ] + }, + { + "cell_type": "markdown", + "id": "2b83cbe6", + "metadata": {}, + "source": [ + "# Conclusion\n", + "\n", + "It should be noted that while the semantic search query provides more relevant results, it is also more computationally expensive than the lexical search query. This is because the semantic search query requires the calculation of vector representations, which can be computationally intensive. \n", + "\n", + "Ultimately, it is recommended to use the `semantic_text` type when implementing semantic search for a few key reasons:\n", + "- Query structure is simple and easy to understand.\n", + "- Implementing the `semantic_text` type requires minimal changes to the index mapping and query.\n", + "- Setting up an ingest pipeline and inference endpoint is unnecessary.\n", + "\n", + "Using the `sparse_vector` and `dense_vector` types is more complex and requires additional setup, but can be useful in scenarios where semantic search needs to be customized beyond standard `semantic_text` search. This could be a change in the similarity algorithm, use of different vectorization models, or any necessary preprocessing steps. \n", "\n", - " score = hit[\"_score\"]\n", - " product = hit[\"_source\"][\"product\"]\n", - " category = hit[\"_source\"][\"category\"]\n", - " description = hit[\"_source\"][\"description\"]\n", - " print(\n", - " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", - " )" + "Hybrid search retains the power of both lexical and semantic search, allowing for a more flexible and effective search experience. With hybrid search, you can balance the trade-off between relevance and performance, making it a more practical choice for production environments. This should be considered the default approach for search." 
] } ], @@ -794,7 +1254,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "Python 3", "language": "python", "name": "python3" }, @@ -808,7 +1268,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.10" + "version": "3.11.6" } }, "nbformat": 4,