docs: Create custom embeddings (#20398)
Guidelines on how to create custom embeddings

Co-authored-by: Chester Curme <[email protected]>
Showing 4 changed files with 233 additions and 32 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,222 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"id": "c160026f-aadb-4e9f-8642-b4a9e8479d77", | ||
"metadata": {}, | ||
"source": [ | ||
"# Custom Embeddings\n", | ||
"\n", | ||
"LangChain is integrated with many [3rd party embedding models](/docs/integrations/text_embedding/). In this guide we'll show you how to create a custom Embedding class, in case a built-in one does not already exist. Embeddings are critical in natural language processing applications as they convert text into a numerical form that algorithms can understand, thereby enabling a wide range of applications such as similarity search, text classification, and clustering.\n", | ||
"\n", | ||
"Implementing embeddings using the standard [Embeddings](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html) interface will allow your embeddings to be utilized in existing `LangChain` abstractions (e.g., as the embeddings powering a [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) or cached using [CacheBackedEmbeddings](/docs/how_to/caching_embeddings/)).\n", | ||
"\n", | ||
"## Interface\n", | ||
"\n", | ||
"The current `Embeddings` abstraction in LangChain is designed to operate on text data. Inputs are either a single string or a list of strings, and outputs are correspondingly a single vector or a list of vectors (lists of floats), where each vector represents\n", | ||
"an embedding of the input text into some n-dimensional space.\n", | ||
"\n", | ||
"Your custom embedding class should provide the following methods; only the first two are required:\n", | ||
"\n", | ||
"| Method/Property | Description | Required/Optional |\n", | ||
"|---------------------------------|----------------------------------------------------------------------------|-------------------|\n", | ||
"| `embed_documents(texts)` | Generates embeddings for a list of strings. | Required |\n", | ||
"| `embed_query(text)` | Generates an embedding for a single text query. | Required |\n", | ||
"| `aembed_documents(texts)` | Asynchronously generates embeddings for a list of strings. | Optional |\n", | ||
"| `aembed_query(text)` | Asynchronously generates an embedding for a single text query. | Optional |\n", | ||
"\n", | ||
"Implementing the required methods lets your embedding model plug into LangChain components that expect an `Embeddings` object. If you do not override the async methods, the base class falls back to a default implementation that calls the sync versions in an async executor.\n", | ||
"\n", | ||
"\n", | ||
":::note\n", | ||
"`Embeddings` do not currently implement the [Runnable](/docs/concepts/runnables/) interface and are also **not** instances of pydantic `BaseModel`.\n", | ||
":::\n", | ||
"\n", | ||
"### Embedding queries vs documents\n", | ||
"\n", | ||
"The `embed_query` and `embed_documents` methods are required. Both methods operate\n", | ||
"on string inputs. For legacy reasons, accessing the `Document.page_content` attribute is handled\n", | ||
"by the vector store that uses the embedding model, not by the embedding model itself.\n", | ||
"\n", | ||
"`embed_query` takes in a single string and returns a single embedding as a list of floats.\n", | ||
"If your model embeds queries differently from the underlying documents, you can\n", | ||
"implement that query-specific behavior in this method.\n", | ||
"\n", | ||
"`embed_documents` takes in a list of strings and returns a list of embeddings as a list of lists of floats.\n", | ||
"\n", | ||
":::note\n", | ||
"`embed_documents` takes in a list of plain text, not a list of LangChain `Document` objects. The name of this method\n", | ||
"may change in future versions of LangChain.\n", | ||
":::" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2162547f-4577-47e8-b12f-e9aa3c243797", | ||
"metadata": {}, | ||
"source": [ | ||
"## Implementation\n", | ||
"\n", | ||
"As an example, we'll implement a simple embeddings model that returns a constant vector. This model is for illustrative purposes only." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "6b838062-552c-43f8-94f8-d17e4ae4c221", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from typing import List\n", | ||
"\n", | ||
"from langchain_core.embeddings import Embeddings\n", | ||
"\n", | ||
"\n", | ||
"class ParrotLinkEmbeddings(Embeddings):\n", | ||
"    \"\"\"ParrotLink embedding model integration.\n", | ||
"\n", | ||
"    # TODO: Populate with relevant params.\n", | ||
"    Key init args:\n", | ||
"        model: str\n", | ||
"            Name of ParrotLink model to use.\n", | ||
"\n", | ||
"    See full list of supported init args and their descriptions in the params section.\n", | ||
"\n", | ||
"    # TODO: Replace with relevant init params.\n", | ||
"    Instantiate:\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            from langchain_parrot_link import ParrotLinkEmbeddings\n", | ||
"\n", | ||
"            embed = ParrotLinkEmbeddings(\n", | ||
"                model=\"...\",\n", | ||
"                # api_key=\"...\",\n", | ||
"                # other params...\n", | ||
"            )\n", | ||
"\n", | ||
"    Embed single text:\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            input_text = \"The meaning of life is 42\"\n", | ||
"            embed.embed_query(input_text)\n", | ||
"\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            # TODO: Example output.\n", | ||
"\n", | ||
"    Embed multiple texts:\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            input_texts = [\"Document 1...\", \"Document 2...\"]\n", | ||
"            embed.embed_documents(input_texts)\n", | ||
"\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            # TODO: Example output.\n", | ||
"\n", | ||
"    # TODO: Delete if native async isn't supported.\n", | ||
"    Async:\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            await embed.aembed_query(input_text)\n", | ||
"\n", | ||
"            # multiple:\n", | ||
"            # await embed.aembed_documents(input_texts)\n", | ||
"\n", | ||
"        .. code-block:: python\n", | ||
"\n", | ||
"            # TODO: Example output.\n", | ||
"\n", | ||
"    \"\"\"\n", | ||
"\n", | ||
"    def __init__(self, model: str):\n", | ||
"        self.model = model\n", | ||
"\n", | ||
"    def embed_documents(self, texts: List[str]) -> List[List[float]]:\n", | ||
"        \"\"\"Embed search docs.\"\"\"\n", | ||
"        # Toy behavior: every input maps to the same constant vector.\n", | ||
"        return [[0.5, 0.6, 0.7] for _ in texts]\n", | ||
"\n", | ||
"    def embed_query(self, text: str) -> List[float]:\n", | ||
"        \"\"\"Embed query text.\"\"\"\n", | ||
"        return self.embed_documents([text])[0]\n", | ||
"\n", | ||
"    # optional: add custom async implementations here\n", | ||
"    # you can also delete these, and the base class will\n", | ||
"    # use the default implementation, which calls the sync\n", | ||
"    # version in an async executor:\n", | ||
"\n", | ||
"    # async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n", | ||
"    #     \"\"\"Asynchronous Embed search docs.\"\"\"\n", | ||
"    #     ...\n", | ||
"\n", | ||
"    # async def aembed_query(self, text: str) -> List[float]:\n", | ||
"    #     \"\"\"Asynchronous Embed query text.\"\"\"\n", | ||
"    #     ..." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "47a19044-5c3f-40da-889a-1a1cfffc137c", | ||
"metadata": {}, | ||
"source": [ | ||
"### Let's test it 🧪" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "21c218fe-8f91-437f-b523-c2b6e5cf749e", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"[[0.5, 0.6, 0.7], [0.5, 0.6, 0.7]]\n", | ||
"[0.5, 0.6, 0.7]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"embeddings = ParrotLinkEmbeddings(\"test-model\")\n", | ||
"print(embeddings.embed_documents([\"Hello\", \"world\"]))\n", | ||
"print(embeddings.embed_query(\"Hello\"))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "de50f690-178e-4561-af98-14967b3c8501", | ||
"metadata": {}, | ||
"source": [ | ||
"## Contributing\n", | ||
"\n", | ||
"We welcome contributions of Embedding models to the LangChain code base.\n", | ||
"\n", | ||
"If you aim to contribute an embedding model for a new provider (e.g., with a new set of dependencies or SDK), we encourage you to publish your implementation in a separate `langchain-*` integration package. This will enable you to appropriately manage dependencies and version your package. Please refer to our [contributing guide](/docs/contributing/how_to/integrations/) for a walkthrough of this process." | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.4" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
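As a possible companion to the constant-vector example in the notebook above, here is a minimal sketch of a custom class whose output depends on the input text and which also overrides the optional async methods. `HashEmbeddings`, its `dim` parameter, and the SHA-256 scheme are illustrative assumptions and not part of this commit; the resulting vectors are deterministic but carry no semantic meaning. The `try`/`except` fallback is there only so the sketch runs without LangChain installed.

```python
import asyncio
import hashlib
import math
from typing import List

try:
    # Real base class, the one the notebook subclasses.
    from langchain_core.embeddings import Embeddings
except ImportError:
    # Assumption: placeholder so the sketch runs without LangChain installed.
    class Embeddings:  # hypothetical stand-in, not the real interface
        pass


class HashEmbeddings(Embeddings):
    """Hypothetical toy model: deterministic unit vectors derived from SHA-256."""

    def __init__(self, dim: int = 8):
        # A SHA-256 digest has 32 bytes, so at most 32 dimensions are available.
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def _embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map each byte into [-0.5, 0.5]; an integer byte never maps to exactly 0.
        raw = [b / 255.0 - 0.5 for b in digest[: self.dim]]
        norm = math.sqrt(sum(x * x for x in raw))
        return [x / norm for x in raw]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Required: one vector per input string."""
        return [self._embed(t) for t in texts]

    def embed_query(self, text: str) -> List[float]:
        """Required: a single vector for a single string."""
        return self._embed(text)

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        """Optional override: run the sync method in a worker thread."""
        return await asyncio.to_thread(self.embed_documents, texts)

    async def aembed_query(self, text: str) -> List[float]:
        """Optional override: run the sync method in a worker thread."""
        return await asyncio.to_thread(self.embed_query, text)


if __name__ == "__main__":
    emb = HashEmbeddings(dim=4)
    docs = emb.embed_documents(["Hello", "world"])
    print(len(docs), len(docs[0]))
    # The async and sync paths should agree for the same input.
    print(asyncio.run(emb.aembed_query("Hello")) == emb.embed_query("Hello"))
```

In a real integration the async overrides would typically call a provider's native async client; here they only delegate to the sync methods, which is roughly what the base class default already does according to the notebook's comments.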