diff --git a/README.md b/README.md
index 1c3b5bac..e0b0f5f1 100644
--- a/README.md
+++ b/README.md
@@ -78,6 +78,7 @@ print(docs)
 > [!TIP]
 > All synchronous functions have corresponding asynchronous functions
+> PGVectorStore also supports Hybrid Search, which combines multiple search strategies to improve search results.
 
 ## ChatMessageHistory
diff --git a/examples/pg_vectorstore_how_to.ipynb b/examples/pg_vectorstore_how_to.ipynb
index bbdd7237..8b429fc8 100644
--- a/examples/pg_vectorstore_how_to.ipynb
+++ b/examples/pg_vectorstore_how_to.ipynb
@@ -686,6 +686,108 @@
     "1. For new records, added via `VectorStore` embeddings are automatically generated."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Hybrid Search Vector Store\n",
+    "\n",
+    "A Hybrid Search Vector Store combines multiple lookup strategies to provide more comprehensive and relevant search results. Specifically, it leverages both dense embedding vector search (for semantic similarity) and TSV (Text Search Vector) based keyword search (for lexical matching). This approach is particularly useful for applications that need efficient search over custom text and metadata, especially when a specialized embedding model isn't feasible or necessary.\n",
+    "\n",
+    "By integrating both semantic and lexical capabilities, hybrid search helps overcome the limitations of each individual method:\n",
+    "\n",
+    "* **Semantic Search**: Excellent for understanding the meaning of a query, even if the exact keywords aren't present. However, it can sometimes miss highly relevant documents that contain the precise keywords but have a slightly different semantic context.\n",
+    "\n",
+    "* **Keyword Search**: Highly effective for finding documents with exact keyword matches and is generally fast. Its weakness lies in its inability to understand synonyms, misspellings, or conceptual relationships.\n",
+    "\n",
+    "With a `HybridSearchConfig` provided, the `PGVectorStore` class can efficiently manage a hybrid search vector store using PostgreSQL as the backend, automatically handling the creation and population of the necessary TSV columns when possible.\n",
+    "\n",
+    "\n",
+    "Assume a pre-existing table in the PostgreSQL database, `products`, the same as in the section above, which stores product details for an e-commerce venture.\n",
+    "\n",
+    "Here is how this table is mapped to `PGVectorStore`:\n",
+    "\n",
+    "- **`id_column=\"product_id\"`**: The ID column uniquely identifies each row in the `products` table.\n",
+    "\n",
+    "- **`content_column=\"description\"`**: The `description` column contains text descriptions of each product. This text is used by the `embedding_service` to create vectors that go in `embedding_column` and represent the semantic meaning of each description.\n",
+    "\n",
+    "- **`embedding_column=\"embed\"`**: The `embed` column stores the vectors created from the product descriptions. These vectors are used to find products with similar descriptions.\n",
+    "\n",
+    "- **`metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"]`**: These columns are treated as metadata for each product. Metadata provides additional information about a product, such as its name, category, price, quantity available, SKU (Stock Keeping Unit), and an image URL. This information is useful for displaying product details in search results or for filtering and categorization.\n",
+    "\n",
+    "- **`metadata_json_column=\"metadata\"`**: The `metadata` column can store any additional information about the products in a flexible JSON format. This allows for storing varied and complex data that doesn't fit into the standard columns.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_postgres.v2 import PGVectorStore\n",
+    "from langchain_postgres.v2.hybrid_search_config import (\n",
+    "    HybridSearchConfig,\n",
+    "    reciprocal_rank_fusion,\n",
+    ")\n",
+    "\n",
+    "TABLE_NAME = \"hybrid_search_products\"\n",
+    "\n",
+    "hybrid_search_config = HybridSearchConfig(\n",
+    "    tsv_column=\"hybrid_description\",\n",
+    "    tsv_lang=\"pg_catalog.english\",\n",
+    "    fusion_function=reciprocal_rank_fusion,\n",
+    "    fusion_function_parameters={\n",
+    "        \"rrf_k\": 60,\n",
+    "        \"fetch_top_k\": 10,\n",
+    "    },\n",
+    ")\n",
+    "\n",
+    "# If a hybrid search config is provided during vector store table creation,\n",
+    "# the specified TSV column will be automatically created.\n",
+    "await pg_engine.ainit_vectorstore_table(\n",
+    "    table_name=TABLE_NAME,\n",
+    "    # schema_name=SCHEMA_NAME,\n",
+    "    vector_size=VECTOR_SIZE,\n",
+    "    id_column=\"product_id\",\n",
+    "    content_column=\"description\",\n",
+    "    embedding_column=\"embed\",\n",
+    "    metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n",
+    "    metadata_json_column=\"metadata\",\n",
+    "    hybrid_search_config=hybrid_search_config,\n",
+    "    store_metadata=True,\n",
+    ")\n",
+    "\n",
+    "\n",
+    "# If a hybrid search config is NOT provided during init_vectorstore_table (above),\n",
+    "# but only during PGVectorStore creation, the specified TSV column is not\n",
+    "# present, and TSV vectors are computed on the fly for each hybrid search.\n",
+    "vs_hybrid = await PGVectorStore.create(\n",
+    "    pg_engine,\n",
+    "    table_name=TABLE_NAME,\n",
+    "    # schema_name=SCHEMA_NAME,\n",
+    "    embedding_service=embedding,\n",
+    "    # Connect to existing VectorStore by customizing below column names\n",
+    "    id_column=\"product_id\",\n",
+    "    content_column=\"description\",\n",
+    "    embedding_column=\"embed\",\n",
+    "    metadata_columns=[\"name\", \"category\", \"price_usd\", \"quantity\", \"sku\", \"image_url\"],\n",
+    "    metadata_json_column=\"metadata\",\n",
+    "    hybrid_search_config=hybrid_search_config,\n",
+    ")\n",
+    "\n",
+    "# Optionally, create an index on the hybrid search (TSV) column\n",
+    "await vs_hybrid.aapply_hybrid_search_index()\n",
+    "\n",
+    "# Fetch product documents from the previously created store\n",
+    "docs = await custom_store.asimilarity_search(\"products\", k=5)\n",
+    "# Add data to the vector store as usual; this also populates the TSV values in tsv_column\n",
+    "await vs_hybrid.aadd_documents(docs)\n",
+    "\n",
+    "# Use hybrid search\n",
+    "hybrid_docs = await vs_hybrid.asimilarity_search(\"products\", k=5)\n",
+    "print(hybrid_docs)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},