diff --git a/.github/workflows/json-bundle.yml b/.github/workflows/json-bundle.yml index be95f1f5d..1b3812cb9 100644 --- a/.github/workflows/json-bundle.yml +++ b/.github/workflows/json-bundle.yml @@ -6,6 +6,7 @@ on: jobs: bundle: + if: ${{ github.repository == 'superlinked/VectorHub' }} runs-on: ubuntu-latest permissions: contents: 'read' diff --git a/README.md b/README.md index e330e45d7..df2ccc602 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ # VectorHub -VectorHub is a free and open-sourced learning hub for people interested in adding vector retrieval to their ML stack. On VectorHub you will find practical resources to help you - +[VectorHub](https://hub.superlinked.com) is a free and open-sourced learning hub for people interested in adding vector retrieval to their ML stack. On VectorHub you will find practical resources to help you - * Create MVPs with easy-to-follow learning materials * Solve use case specific challenges in vector retrieval @@ -11,14 +11,18 @@ VectorHub is a free and open-sourced learning hub for people interested in addin Read more about our philosophy in our [Manifesto](manifesto.md). -## Built With +## Tools by VectorHub +[Vector DB Comparison](https://vdbs.superlinked.com) is a free and open source tool from VectorHub to compare vector databases. It is created to outline the feature sets of different VDB solutions. Each of the features outlined has been verified to varying degrees. -* [Archbee](https://www.archbee.com/) - Frontend for markdown files ## Contributing Please read [CONTRIBUTING.md](https://hub.superlinked.com/contributing) for details on our code of conduct, and the process for submitting pull requests to us. +### Built With + +* [Archbee](https://www.archbee.com/) - Frontend for markdown files + ## License This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License][cc-by-nc-sa]. diff --git a/docs/home.md b/docs/home.md index 7751881d9..77e9a3ec1 100644 --- a/docs/home.md +++ b/docs/home.md @@ -43,6 +43,7 @@ Here are some examples from the community, more coming soon! Subscribe to be updated when new ones come out & check the blog section. +- [02/01 - Scaling RAG for Production](https://hub.superlinked.com/scaling-rag-for-production): How to go from working model to a production system with step-by-step instructions. - [01/25 - Improving RAG performance with Knowledge Graphs](use_cases/knowledge_graphs.md): Adding knowledge graph embeddings as contextual data to improve the performance of RAG. - [01/18 Representation Learning on Graph Structured Data](https://hub.superlinked.com/representation-learning-on-graph-structured-data): Understanding how combining KGEs and semantic embeddings can improve understanding of your solution. - [01/11 - VDB Feature Matrix](https://vdbs.superlinked.com/): Find the right Vector Database (VDB) for your use case. 
diff --git a/docs/summary.md b/docs/summary.md index 4094f1b75..825d6118b 100644 --- a/docs/summary.md +++ b/docs/summary.md @@ -21,6 +21,7 @@ - [Personalized Search](use_cases/personalized_search.md) - [Recommender Systems](use_cases/recommender_systems.md) - [Retrieval Augmented Generation](use_cases/retrieval_augmented_generation.md) + - [Scaling RAG for Production](use_cases/scaling_rag_for_production.md) - [Enhancing RAG with multiple agents](use_cases/multi_agent_rag.md) - [Embeddings on browser](use_cases/embeddings_on_browser.md) - [Answering Questions with Knowledge Graph Embeddings](use_cases/knowledge_graph_embedding.md) diff --git a/docs/tools/vdb_table/data/azureai.json b/docs/tools/vdb_table/data/azureai.json index 0e9caf3cf..9c81c16a5 100644 --- a/docs/tools/vdb_table/data/azureai.json +++ b/docs/tools/vdb_table/data/azureai.json @@ -20,9 +20,7 @@ }, "dev_languages": { "value": [ - "C#", - "C++", - "Java" + "c++" ], "source_url": "", "comment": "" @@ -151,4 +149,4 @@ "source_url": "https://learn.microsoft.com/azure/search/vector-search-how-to-generate-embeddings", "comment": "" } -} +} \ No newline at end of file diff --git a/docs/tools/vdb_table/data/meilisearch.json b/docs/tools/vdb_table/data/meilisearch.json index bb58c9f03..ecc8d2d9c 100644 --- a/docs/tools/vdb_table/data/meilisearch.json +++ b/docs/tools/vdb_table/data/meilisearch.json @@ -51,7 +51,7 @@ "comment": "" }, "sparse_vectors": { - "support": "full", + "support": "none", "source_url": "", "comment": "" }, diff --git a/docs/tools/vdb_table/data/rockset.json b/docs/tools/vdb_table/data/rockset.json index 5343a91fc..c4ad7b9b8 100644 --- a/docs/tools/vdb_table/data/rockset.json +++ b/docs/tools/vdb_table/data/rockset.json @@ -14,139 +14,139 @@ "comment": "" }, "license": { - "value": "", - "source_url": "", + "value": "Proprietary", + "source_url": "https://rockset.com/legal/terms-of-service/", "comment": "" }, "dev_languages": { "value": [ - "" + "c++" ], "source_url": "", "comment": "" }, "github_stars": 0, - "vector_launch_year": 0, + "vector_launch_year": 2023, "metadata_filter": { - "support": "", + "support": "full", "source_url": "", "comment": "" }, "hybrid_search": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://github.com/sofia099/search_autocomplete", "comment": "" }, "facets": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://docs.rockset.com/documentation/reference/aggregate-functions", "comment": "" }, "geo_search": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://docs.rockset.com/documentation/reference/geographic-functions", "comment": "" }, "multi_vec": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "sparse_vectors": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "bm25": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "full_text": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://docs.rockset.com/documentation/reference/text-search-functions", "comment": "" }, "embeddings_text": { - "support": "", - "source_url": "", + "support": "partial", + "source_url": "https://docs.rockset.com/documentation/reference/user-defined-functions", "comment": "" }, "embeddings_image": { - "support": "", - "source_url": "", + "support": "partial", + "source_url": "https://docs.rockset.com/documentation/reference/user-defined-functions", "comment": "" }, "embeddings_structured": { - "support": 
"", + "support": "none", "source_url": "", "comment": "" }, "rag": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "recsys": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "langchain": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://rockset.com/docs/langchain/", "comment": "" }, "llamaindex": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://rockset.com/docs/llama_index/", "comment": "" }, "managed_cloud": { "support": "full", - "source_url": "", + "source_url": "https://docs.rockset.com/documentation/docs/what-is-rockset", "comment": "" }, "pricing": { "value": "", - "source_url": "", + "source_url": "https://rockset.com/pricing/", "comment": "" }, "in_process": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "multi_tenancy": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://docs.rockset.com/documentation/docs/multitenancy", "comment": "" }, "disk_index": { - "support": "", - "source_url": "", + "support": "full", + "source_url": "https://rockset.com/blog/separate-compute-storage-rocksdb/", "comment": "" }, "ephemeral": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "sharding": { - "support": "", - "source_url": "", - "comment": "" + "support": "full", + "source_url": "https://rockset.com/whitepapers/rockset-concepts-designs-and-architecture", + "comment": "tech talk that includes sharding details: https://www.youtube.com/watch?v=trXiMHjP6a8" }, "doc_size": { - "bytes": 0, + "bytes": 41943040, "unlimited": false, - "source_url": "", + "source_url": "https://docs.rockset.com/documentation/docs/debugging#document-size-is-too-large", "comment": "" }, "vector_dims": { "value": 0, - "unlimited": false, + "unlimited": true, "source_url": "", "comment": "" } -} \ No newline at end of file +} diff --git a/docs/tools/vdb_table/data/usearch.json b/docs/tools/vdb_table/data/usearch.json index e4706d2bd..92e57133b 100644 --- a/docs/tools/vdb_table/data/usearch.json +++ b/docs/tools/vdb_table/data/usearch.json @@ -15,7 +15,7 @@ }, "license": { "value": "Apache-2.0", - "source_url": "", + "source_url": "https://github.com/unum-cloud/usearch/blob/main/LICENSE", "comment": "" }, "dev_languages": { @@ -25,26 +25,26 @@ "source_url": "", "comment": "" }, - "github_stars": 1061, + "github_stars": 1250, "vector_launch_year": 2023, "metadata_filter": { - "support": "", + "support": "partial", "source_url": "", - "comment": "" + "comment": "hybrid filtering techniques are available in some SDKs, like C++, but not in Python" }, "hybrid_search": { "support": "", "source_url": "", - "comment": "" + "comment": "hybrid filtering techniques are available in some SDKs, like C++, but not in Python" }, "facets": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "geo_search": { "support": "full", - "source_url": "", + "source_url": "https://ashvardanian.com/posts/abusing-vector-search/", "comment": "" }, "multi_vec": { @@ -58,24 +58,24 @@ "comment": "" }, "bm25": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "full_text": { - "support": "", + "support": "partial", "source_url": "", - "comment": "" + "comment": "with custom user-defined metrics on similar-length string until v3" }, "embeddings_text": { - "support": "", - "source_url": "", - "comment": "" + "support": "partial", + "source_url": "https://github.com/unum-cloud/uform", + "comment": "UForm 
embeddings as well as third-party solutions can be used with USearch" }, "embeddings_image": { - "support": "", - "source_url": "", - "comment": "" + "support": "partial", + "source_url": "https://github.com/unum-cloud/uform", + "comment": "UForm embeddings as well as third-party solutions can be used with USearch" }, "embeddings_structured": { "support": "", @@ -108,23 +108,23 @@ "comment": "" }, "pricing": { - "value": "", + "value": "none", "source_url": "", "comment": "" }, "in_process": { - "support": "none", - "source_url": "", - "comment": "" + "support": "full", + "source_url": "https://ashvardanian.com/posts/porting-cpp-library-to-ten-languages/", + "comment": "The whole library is compiled for the target language runtime to be embedded natively into apps in different languages" }, "multi_tenancy": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, "disk_index": { - "support": "", - "source_url": "", + "support": "partial", + "source_url": "https://github.com/unum-cloud/usearch/#serialization--serving-index-from-disk", "comment": "" }, "ephemeral": { @@ -133,7 +133,7 @@ "comment": "" }, "sharding": { - "support": "", + "support": "none", "source_url": "", "comment": "" }, diff --git a/docs/tools/vdb_table/data/vectara.json b/docs/tools/vdb_table/data/vectara.json index ceade6eed..5a9d67aff 100644 --- a/docs/tools/vdb_table/data/vectara.json +++ b/docs/tools/vdb_table/data/vectara.json @@ -20,7 +20,7 @@ }, "dev_languages": { "value": [ - "not known" + "" ], "source_url": "", "comment": "" diff --git a/docs/tools/vdb_table/vendor.schema.json b/docs/tools/vdb_table/vendor.schema.json index fa286978f..4b754171e 100644 --- a/docs/tools/vdb_table/vendor.schema.json +++ b/docs/tools/vdb_table/vendor.schema.json @@ -8,7 +8,7 @@ "links": {"allOf": [{"$ref": "#/$defs/links"}], "$comment": "About | Links | " }, "oss": {"allOf": [{"$ref": "#/$defs/featureWithSource"}], "$comment": "About | OSS | The code-base is open source and users can self-host it for free." }, "license": {"allOf": [{"$ref": "#/$defs/stringWithSource"}], "$comment": "About | License | The license the source code is released under." }, - "dev_languages": {"allOf": [{"$ref": "#/$defs/stringListWithSource"}], "$comment": "About | Dev Lang | The language the database is developed in." }, + "dev_languages": {"allOf": [{"$ref": "#/$defs/devLanguageListWithSource"}], "$comment": "About | Dev Lang | The language the database is developed in." }, "github_stars": {"type": "integer", "$comment": "About | GitHub ⭐ | The number of stars for the core product repository." }, "vector_launch_year": {"type": "integer", "$comment": "About | VSS Launch | The year of the first release for the vector search functionality." }, "metadata_filter": {"allOf": [{"$ref": "#/$defs/featureWithSource"}], "$comment": "Search | Filters | Metadata filtering support within vector search - allowing users to refine results based on additional contextual informatio and enhancing precision in search queries. Not to be confused with filters/faceting in Lucene based keyword search." 
}, @@ -69,14 +69,26 @@ "comment": {"type": "string"} } }, - "stringListWithSource": { - "$id": "stringListWithSource", + "devLanguageListWithSource": { + "$id": "devLanguageListWithSource", "type": "object", "properties": { "value": { "type": "array", "items": { - "type": "string" + "type": "string", + "enum": [ + "", + "python", + "c++", + "c", + "c#", + "go", + "java", + "rust", + "typescript", + "not known" + ] } }, "source_url": {"type": "string"}, diff --git a/docs/use_cases/scaling_rag_for_production.md b/docs/use_cases/scaling_rag_for_production.md index 01279a250..b9aabdf25 100644 --- a/docs/use_cases/scaling_rag_for_production.md +++ b/docs/use_cases/scaling_rag_for_production.md @@ -2,25 +2,23 @@ # Scaling RAG for Production -![](assets/use_cases/recommender_systems/cover.jpg) - Retrieval-augmented Generation (RAG) combines Large Language Models (LLMs) with external data to reduce the probability of machine hallucinations - AI-generated information that misrepresents underlying data or reality. When developing RAG systems, scalability is often an afterthought. This creates problems when moving from initial development to production. Having to manually adjust code while your application grows can get very costly and is prone to errors. -Our tutorial provides one example of **how you can develop a RAG pipeline with production workloads in mind from the start**, using the right tools - ones that are designed to scale. +Our tutorial provides an example of **how you can develop a RAG pipeline with production workloads in mind from the start**, using the right tools - ones that are designed to scale your application. ## Development vs. production -The goals and requirements of development and production are usually very different. This is particularly true for new technologies like Large Language Models (LLMs) and Retrieval-augmented Generation (RAG), where organizations prioritize rapid experimentation to test the waters before committing more resources. Once important stakeholders are convinced, the focus shifts from demonstrating that something _can create value_ to _actually creating value via production_. Until a system is productionized, its ROI is typically zero. +The goals and requirements of development and production are usually very different. This is particularly true for new technologies like Large Language Models (LLMs) and Retrieval-augmented Generation (RAG), where organizations prioritize rapid experimentation to test the waters before committing more resources. Once important stakeholders are convinced, the focus shifts from demonstrating an application's _potential for_ creating value to _actually_ creating value, via production. Until a system is productionized, its ROI is typically zero. **Productionizing**, in the context of [RAG systems](https://hub.superlinked.com/retrieval-augmented-generation), involves transitioning from a prototype or test environment to a **stable, operational state**, in which the system is readily accessible and reliable for remote end users, such as via URL - i.e., independent of the end user machine state. Productionizing also involves **scaling** the system to handle varying levels of user demand and traffic, ensuring consistent performance and availability. -Even though there is no ROI without productionizing, organizations often underesimate the hurdles involved. Productionizing is always a trade-off between performance and costs, and this is no different for Retrieval-augmented Generation (RAG) systems. 
The goal is to achieve a stable, operational, and scalable end product while keeping costs low. +Even though there is no ROI without productionizing, organizations often underestimate the hurdles involved in getting to an end product. Productionizing is always a trade-off between performance and costs, and this is no different for Retrieval-augmented Generation (RAG) systems. The goal is to achieve a stable, operational, and scalable end product while keeping costs low. Let's look more closely at the basic requirements of an [RAG system](https://hub.superlinked.com/retrieval-augmented-generation), before going in to the specifics of what you'll need to productionize it in a cost-effective but scalable way. ## The basics of RAG -Let’s review the most basic RAG workflow: +The most basic RAG workflow looks like this: 1. Submit a text query to an embedding model, which converts it into a semantically meaningful vector embedding. 2. Send the resulting query vector embedding to your document embeddings storage location - typically a [vector database](https://hub.superlinked.com/32-key-access-patterns#Ea74G). @@ -28,28 +26,26 @@ Let’s review the most basic RAG workflow: 4. Add the retrieved document chunks as context to the query vector embedding and send it to the LLM. 5. The LLM generates a response utilizing the retrieved context. -While RAG workflows can become significantly more complex, incorporating methods like metadata filtering and retrieval reranking, _all_ RAG systems must contain the components involved the basic workflow: an embedding model, a store for document and vector embeddings, a retriever, and a LLM. +While RAG workflows can become significantly more complex, incorporating methods like metadata filtering and retrieval reranking, _all_ RAG systems must contain the components involved in the basic workflow: an embedding model, a store for document and vector embeddings, a retriever, and an LLM. -But smart development, with productionization in mind, requires not just setting up our components in a functional way. We must also develop with cost-effective scalability in mind... For this we'll need not just these basic components, but the right tools for configuring a scalable RAG system.. +But smart development, with productionization in mind, requires more than just setting up your components in a functional way. You must also develop with cost-effective scalability in mind. For this you'll need not just these basic components, but more specifically the tools appropriate for configuring a scalable RAG system. ## Developing for scalability: the right tools -store for doucment and vector embeddings: -LLM library: Langchain's LCEL -productionizing framework for scaling: Ray - ### LLM library: LangChain -As of this writing, LangChain, while it has also been the subject of much criticism, is also arguably the most prominent LLM library. A lot of developers turn to Langchain to build Proof-of-Concepts (PoCs) and Minimum Viable Products (MVPs), or simply to experiment with new ideas. Whether one chooses LangChain or one of the other major LLM and RAG libraries - for example, LlamaIndex or Haystack, to name my other personal favorites - they can _all_ be used to productionize an RAG system. That is, all three have integrations for third-party libraries and providers that will handle the production requirements. Which one you choose to interface with your other components depends on the details of your existing tech stack and use case. 
+As of this writing, LangChain, while it has also been the subject of much criticism, is arguably the most prominent LLM library. A lot of developers turn to Langchain to build Proof-of-Concepts (PoCs) and Minimum Viable Products (MVPs), or simply to experiment with new ideas. Whether one chooses LangChain or one of the other major LLM and RAG libraries - for example, LlamaIndex or Haystack, to name our other personal favorites - they can _all_ be used to productionize an RAG system. That is, all three have integrations for third-party libraries and providers that will handle production requirements. Which one you choose to interface with your other components depends on the details of your existing tech stack and use case. For the purpose of this tutorial, we'll use part of the Langchain documentation, along with Ray. ### Scaling with Ray -Because our goal is to build a 1) simple, 2) scalable, _and_ 3) economically feasible option, not reliant on proprietary solutions, we have chosen to use [Ray](https://github.com/ray-project/ray), a Python framework for productionizing and scaling machine learning (ML) workloads. Ray is designed with a range of auto-scaling features, to seamlessly scale ML systems. It's also adaptable to both local environments and Kubernetes, efficiently managing all workload requirements. +Because our goal is to build a 1) simple, 2) scalable, _and_ 3) economically feasible option, not reliant on proprietary solutions, we have chosen to use [Ray](https://github.com/ray-project/ray), a Python framework for productionizing and scaling machine learning (ML) workloads. Ray is designed with a range of auto-scaling features that seamlessly scale ML systems. It's also adaptable to both local environments and Kubernetes, efficiently managing all workload requirements. **Ray permits us to keep our tutorial system simple, non-proprietary, and on our own network, rather than the cloud**. While LangChain, LlamaIndex, and Haystack libraries support cloud deployment for AWS, Azure, and GCP, the details of cloud deployment heavily depend on - and are therefore very particular to - the specific cloud provider you choose. These libraries also all contain Ray integrations to enable scaling. But **using Ray directly will provide us with more universally applicable insights**, given that the Ray integrations within LangChain, LlamaIndex, and Haystack are built upon the same underlying framework. +Now that we have our LLM library sorted, let's turn to data gathering and processing. + ## Data gathering and processing ### Gathering the data @@ -196,7 +192,7 @@ def extract_main_content(record): ``` -We can now use Ray's map() function to run this extraction process. Ray let’s us run multiple processes in parallel. +We can now use Ray's map() function to run this extraction process. Ray lets us run multiple processes in parallel. ```python # Extract content @@ -205,7 +201,7 @@ content_ds.count() ``` -Awesome! The results of the above extraction are our dataset. Because Ray datasets are optimized for performance at scale, and therefore productionization, they don't require us to make costly and error-prone adjustments to our code when our application grows. +Awesome! The results of the above extraction are our dataset. Because Ray datasets are optimized for scaled performance in production, they don't require us to make costly and error-prone adjustments to our code when our application grows. 
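To make the scaling claim above concrete, here is a rough, self-contained sketch of the same pattern; the records and the body of `extract_main_content` below are illustrative placeholders, not the tutorial's exact code. The same `map()` call runs unchanged whether it processes a hundred records on a laptop or millions on a cluster:

```python
import ray

# Illustrative stand-in records; in the tutorial these come from the scraped LangChain docs.
records = [
    {"path": f"docs/page_{i}.html", "html": "<main>Example content</main>"}
    for i in range(100)
]
ds = ray.data.from_items(records)

def extract_main_content(record: dict) -> dict:
    # Placeholder extraction logic: the real function parses the HTML and keeps only the main text.
    return {"source": record["path"], "text": record["html"]}

# Ray schedules the per-record work across available CPUs (or a cluster) automatically.
content_ds = ds.map(extract_main_content)
print(content_ds.count())
```

The same pattern extends to `map_batches()` for vectorized steps, which the tutorial relies on later when embedding the chunks.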
### Processing the data @@ -213,7 +209,7 @@ To process our dataset, our next three steps are **chunking, embedding, and inde **Chunking the data** -Chunking - splitting your documents into multiple smaller parts - is necessary to make your data meet the LLM’s context length limits, and helps keep contexts specific enough to remain relevant. Chunks need to be the right size. If your chunks are too small, the information retrieved may become too narrow to provide adequate query responses. The optimal chunk size will depend on your data, the models you use, and your use case. We will use a common chunking value here, one that has been used in a lot of applications. +Chunking - splitting your documents into multiple smaller parts - is necessary to make your data meet the LLM’s context length limits, and helps keep contexts specific enough to remain relevant. Chunks also shouldn't be too small; if they are, the information retrieved may become too narrow to provide adequate query responses. The optimal chunk size will depend on your data, the models you use, and your use case. We will use a common chunking value here, one that has been used in a lot of applications. Let’s define our text splitting logic first, using a standard text splitter from LangChain: @@ -249,7 +245,7 @@ Now that we've gathered and chunked our data scalably, we need to embed and inde **Embedding the data** -We use a pretrained model to create vector embeddings for both our data chunks and the query itself. By measuring the distance between the chunk embeddings and the query embedding, we can identify the most relevant, or "top-k," chunks. Of the various pretrained models, we'll use the popular 'bge-base-en-v1.5' model because, at the time of writing this tutorial, it ranks as the highest-performing model of its size on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). For convenience, we continue using LangChain: +We use a pretrained model to create vector embeddings for both our data chunks and the query itself. By measuring the distance between the chunk embeddings and the query embedding, we can identify the most relevant, or "top-k," chunks. Of the various pretrained models, we'll use the popular 'bge-base-en-v1.5' model, which, at the time of writing this tutorial, ranks as the highest-performing model of its size on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). For convenience, we continue using LangChain: ```python from langchain.embeddings import OpenAIEmbeddings @@ -290,7 +286,7 @@ embedded_chunks = chunks_ds.map_batches( **Indexing the data** -Now that our chunks are embedded, we need to **store** them somewhere. For the sake of this tutorial, we'll utilize Qdrant’s new in-memory feature, which lets us experiment with our code rapidly without needing to set up a fully-fledged instance. However, for deployment in a production environment, you should rely on more robust and scalable solutions — hosted either within your own network or by a third-party provider. Detailed guidance on setting up such solutions is beyond the scope of this tutorial. +Now that our chunks are embedded, we need to **store** them somewhere. For the sake of this tutorial, we'll utilize Qdrant’s new in-memory feature, which lets us experiment with our code rapidly without needing to set up a fully-fledged instance. 
However, for deployment in a production environment, you should rely on more robust and scalable solutions — hosted either within your own network or by a third-party provider. For example, to fully productionize, we would need to point to our Qdrant (or your preferred hosted vendor) instance instead of using it in-memory. Detailed guidance on self-hosted solutions, such as setting up a Kubernetes cluster, is beyond the scope of this tutorial. ```python from qdrant_client import QdrantClient @@ -305,7 +301,7 @@ client.recreate_collection( ) ``` -To perform the next processing step - storage - using Ray would require more than 2 CPU scores, making this tutorial incompatible with the free tier of Google Colab. Instead, then, we'll use pandas. Fortunately, Ray allows us to convert our dataset into a pandas DataFrame with a single line of code: +To perform the next processing step, storage, using Ray would require more than 2 CPU cores, making this tutorial incompatible with the free tier of Google Colab. Instead, then, we'll use pandas. Fortunately, Ray allows us to convert our dataset into a pandas DataFrame with a single line of code: ```python emb_chunks_df = embedded_chunks.to_pandas() @@ -388,7 +384,7 @@ def semantic_search(query, embedding_model, k): We're now very close to being able to field queries and retrieve answers! We've set up everything we need to query our LLM _at scale_. But before querying the model for a response, we want to first inform the query with our data, by **retrieving relevant context from our vector database and then adding it to the query**. -To do this, we use a simplified version of the generate.py script provided in Ray's [LLM repository](https://github.com/ray-project/llm-applications/blob/main/rag/generate.py). This simplified version is adapted to our code and leaves out a bunch of advanced retrieval techniques, such as reranking and hybrid search. We use gpt-3.5-turbo as our LLM and query it via the OpenAI API. +To do this, we use a simplified version of the generate.py script provided in Ray's [LLM repository](https://github.com/ray-project/llm-applications/blob/main/rag/generate.py). This version is adapted to our code and - to simplify and keep our focus on how to scale a basic RAG system - leaves out a bunch of advanced retrieval techniques, such as reranking and hybrid search. For our LLM, we use gpt-3.5-turbo, and query it via the OpenAI API. ```python from openai import OpenAI @@ -467,7 +463,7 @@ for content in response: print(content, end='', flush=True) ``` -However, to make using our application even more convenient, we simply and adapt Ray's official documentation to implement our workflow within a single QueryAgent class, which will will take care of all the steps we implemented above for us, including a few additional utility functions. +To **make using our application even more convenient**, we can simply adapt Ray's official documentation to **implement our workflow within a single QueryAgent class**, which bundles together and takes care of all of the steps we implemented above - retrieving embeddings, embedding the search query, performing vector search, processing the results, and querying the LLM to generate a response. Using this single class approach, we no longer need to sequentially call all of these functions, and can also include utility functions. (Specifically, `get_num_tokens` encodes our text and gets the number of tokens, to calculate the length of the input. 
To maintain our standard 50:50 ratio of space allocated to each of input and generation, we use a helper that takes `(text, max_context_length)` and trims the input text if it's too long.) ```python import tiktoken @@ -549,7 +545,7 @@ class QueryAgent: return result ``` -And this is how we can use the QueryAgent: +To embed our query and retrieve relevant vectors, and then generate a response, we run our QueryAgent as follows: ```python import json @@ -567,7 +563,7 @@ print(json.dumps(result, indent=2)) ## Serving our application -Our application is now running! Our final step is to serve it. Ray's [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) module makes this very straightforward. We use Ray Serve in combination with FastAPI and pydantic. The @serve.deployment decorator lets us define how many replicas and compute resources we want to use, and Ray’s autoscaling will handle the rest. Two Ray Serve decorators are all we need to modify our FastAPI application for production. +Our application is now running! Our last productionizing step is to serve it. Ray's [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) module makes this step very straightforward. We combine Ray Serve with FastAPI and pydantic. The @serve.deployment decorator lets us define how many replicas and compute resources we want to use, and Ray’s autoscaling will handle the rest. Two Ray Serve decorators are all we need to modify our FastAPI application for production. ```python import pickle @@ -614,7 +610,7 @@ class RayAssistantDeployment: return Response.parse_obj(result) ``` -And now we **deploy** our application: +Now, we're ready to **deploy** our application: ```python # Deploying our application with Ray Serve @@ -626,7 +622,7 @@ deployment = RayAssistantDeployment.bind( serve.run(deployment, route_prefix="/") ``` -Our FastAPI endpoint can now be queried like any other API, while Ray handles the workload automatically: +Our FastAPI endpoint can be queried like any other API, while Ray takes care of the workload automatically: ```python # Performing inference @@ -641,23 +637,23 @@ except: print(response.text) ``` -Wow! We've been on quite a journey. We gathered our data using Ray and some LangChain documentation, processed it by chunking, embedding, and indexing it, set up our retrieval and generation, and, finally, served our application using Ray Serve... +Wow! We've been on quite a journey. We gathered our data using Ray and some LangChain documentation, processed it by chunking, embedding, and indexing it, set up our retrieval and generation, and, finally, served our application using Ray Serve. Our tutorial has so far covered an example of how to develop scalably and economically - how to productionize from the very start of development. -But to fully productionize your application, you also need to maintain it. +Still, there is one last crucial step. ## Production is only the start: maintenance -Often, reaching production is viewed as the primary goal, while maintenance is overlooked. However, the reality is that maintaining an application is a continuous and important task. +To fully productionize any application, you also need to maintain it. And maintaining your application is a continuous task. -Regular assessment and improvement of your application are essential. This might include routinely updating your data to guarantee that your application has the latest information, or keeping an eye on performance to prevent any degradation. 
For smoother operations, integrating your workflows with CI/CD pipelines is recommended. +Maintenance involves regular assessment and improvement of your application. You may need to routinely update your dataset if your application relies on being current with real-world changes. And, of course, you should monitor application performance to prevent degradation. For smoother operations, we recommend integrating your workflows with CI/CD pipelines. -### Limitations +### Limitations and future discussion -There are are other critical aspects to consider that were outside of the scope of this article, but will be explored elsewhere: +Other critical aspects of productionizing at scale fall outside the scope of this article, but will be explored in future articles, including: - **Advanced Development** Pre-training, finetuning, prompt engineering and other in-depth development techniques -- **Evaluation** LLM Evaluation can get very tricky due to randomness and qualitative metrics, RAG also consits of multiple complex parts -- **Compliance** Adhering to data privacy laws and regulations, especially when handling personal or sensitive information. +- **Evaluation** Randomness, qualitative metrics, and the complex multi-part structure of RAG can make LLM evaluation difficult +- **Compliance** Adhering to data privacy laws and regulations, especially when handling personal or sensitive information --- ## Contributors
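As a closing reference for the serving section above, here is a minimal, self-contained sketch of the two-decorator Ray Serve pattern the tutorial describes. The class name, route, replica count, and resource numbers are illustrative assumptions rather than the tutorial's exact values:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from ray import serve

app = FastAPI()

class Query(BaseModel):
    question: str

class Answer(BaseModel):
    answer: str

# @serve.deployment declares replicas and per-replica resources; @serve.ingress wires in FastAPI.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
@serve.ingress(app)
class MinimalAssistant:
    @app.post("/query")
    def query(self, query: Query) -> Answer:
        # Placeholder for the real retrieve-then-generate logic (the tutorial's QueryAgent).
        return Answer(answer=f"You asked: {query.question}")

# Bind and run the deployment; Ray Serve handles routing and replica management from here.
serve.run(MinimalAssistant.bind(), route_prefix="/")
```

The tutorial's `RayAssistantDeployment` class follows this same two-decorator shape, with the retrieval and generation logic living inside the route handler.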