From 31134c229830e47c3b0d4499f22481cc705d866c Mon Sep 17 00:00:00 2001
From: svonava
Date: Sat, 3 Feb 2024 08:41:52 -0800
Subject: [PATCH] Revert "feat: add mdformat github action (#198)" (#200)

This reverts commit 485b11e79652206e8599a144bf0b4fa4f6b04e95.
---
 .github/workflows/mdformat.yml                |   24 -
 docs/building_blocks/readme.md                |   17 +-
 docs/contributing/markdown_formatting.md      |   10 +-
 docs/contributing/readme.md                   |   45 +-
 docs/contributing/style_guide.md              |  137 +-
 docs/tools/readme.md                          |    4 +-
 docs/tools/testpage.md                        |    2 +
 .../clustering_&_anomaly_detection.md         |    3 +-
 docs/use_cases/cybersecurity.md               |    3 +-
 docs/use_cases/embeddings_on_browser.md       |  204 +--
 docs/use_cases/fraud_&_safety.md              |    3 +-
 docs/use_cases/knowledge_graph_embedding.md   |  177 +-
 docs/use_cases/knowledge_graphs.md            |  263 +--
 docs/use_cases/multi_agent_rag.md             |  231 +--
 .../use_cases/node_representation_learning.md |  256 +--
 docs/use_cases/personalized_search.md         |  107 +-
 docs/use_cases/readme.md                      |   51 +-
 docs/use_cases/recommender_systems.md         |  101 +-
 .../retrieval_augmented_generation.md         |  142 +-
 docs/use_cases/scaling_rag_for_production.md  | 1418 ++++++++---------
 20 files changed, 1229 insertions(+), 1969 deletions(-)
 delete mode 100644 .github/workflows/mdformat.yml

diff --git a/.github/workflows/mdformat.yml b/.github/workflows/mdformat.yml
deleted file mode 100644
index 8adaa86dc..000000000
--- a/.github/workflows/mdformat.yml
+++ /dev/null
@@ -1,24 +0,0 @@
-name: Run mdformat on markdown files
-on:
-  workflow_run:
-    workflows:
-      - Validate vendor JSON files
-    types:
-      - completed
-
-jobs:
-  format:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-      - uses: actions/setup-python@v5
-        with:
-          python-version: '3.10'
-      - run: pip install mdformat
-      - run: mdformat docs/**/*.md --wrap 120
-      - name: Commit and push changes
-        uses: devops-infra/action-commit-push@master
-        with:
-          github_token: ${{ secrets.GITHUB_TOKEN }}
-          commit_message: Formatted Markdown files.
\ No newline at end of file diff --git a/docs/building_blocks/readme.md b/docs/building_blocks/readme.md index 85565c917..19a6a9cdd 100644 --- a/docs/building_blocks/readme.md +++ b/docs/building_blocks/readme.md @@ -1,21 +1,12 @@ # Building Blocks -Building blocks are the atomic units of creating a vector retrieval stack. If you want to create a vector retrieval -stack that's ready for production, you'll need to have a few key components in place. These include: +Building blocks are the atomic units of creating a vector retrieval stack. If you want to create a vector retrieval stack that's ready for production, you'll need to have a few key components in place. These include: -- Data sources: You can get your data from a variety of sources, including relational databases like PSQL and MySQL, - data pipeline tools like Kafka and GCP pub-sub, data warehouses like Snowflake and Databricks, and customer data - platforms like Segment. The goal here is to extract and connect your data so that it can be used in your vector stack. -- Vector computation: This involves turning your data into vectors using models from Huggingface or your own custom - models. You'll also need to know where to run these models and how to bring all of your computing infrastructure - together using tools like custom spark pipelines or products like Superlinked. The ultimate goal is to have - production-ready pipelines and models that are ready to go. -- Vector search & management: This is all about querying and retrieving vectors from Vector DBs like Weaviate and - Pinecone, or hybrid DBs like Redis and Postgres (with pgvector). You'll also need to use search tools like Elastic and - Vespa to rank your vectors. The goal is to make the vectors indexable and search for relevant vectors when needed. 
+- Data sources: You can get your data from a variety of sources, including relational databases like PSQL and MySQL, data pipeline tools like Kafka and GCP pub-sub, data warehouses like Snowflake and Databricks, and customer data platforms like Segment. The goal here is to extract and connect your data so that it can be used in your vector stack. +- Vector computation: This involves turning your data into vectors using models from Huggingface or your own custom models. You'll also need to know where to run these models and how to bring all of your computing infrastructure together using tools like custom spark pipelines or products like Superlinked. The ultimate goal is to have production-ready pipelines and models that are ready to go. +- Vector search & management: This is all about querying and retrieving vectors from Vector DBs like Weaviate and Pinecone, or hybrid DBs like Redis and Postgres (with pgvector). You'll also need to use search tools like Elastic and Vespa to rank your vectors. The goal is to make the vectors indexable and search for relevant vectors when needed. ## Contents - - [Data Sources](https://hub.superlinked.com/data-sources) - [Vector Compute](https://hub.superlinked.com/vector-compute) - [Vector Search & Management](https://hub.superlinked.com/vector-search) diff --git a/docs/contributing/markdown_formatting.md b/docs/contributing/markdown_formatting.md index e2cb3bfdb..14a14f2c9 100644 --- a/docs/contributing/markdown_formatting.md +++ b/docs/contributing/markdown_formatting.md @@ -2,8 +2,8 @@ ## Adding comments -If you want to add comments to your document that you don't want rendered to the VectorHub frontend, use the following -format in your markdown files. Make sure to create blank lines before and after your comment for the best results. +If you want to add comments to your document that you don't want rendered to the VectorHub frontend, use the following format in your markdown files. 
Make sure to create blank lines before and after your comment for the best results. + ```markdown [//]: # (your comment here) @@ -19,13 +19,11 @@ You can use [mermaid](http://mermaid.js.org/intro/) to create diagrams for your ## Adding Special blocks in archbee -Archbee supports special code, tabs, link blocks, callouts, and changelog blocks which can be found in -[their documentation](https://docs.archbee.com/editor-markdown-shortcuts). +Archbee supports special code, tabs, link blocks, callouts, and changelog blocks which can be found in [their documentation](https://docs.archbee.com/editor-markdown-shortcuts). ## Adding alt text and title to images -We encourage you to create alt text (for accessibility & SEO purposes) and a title (for explanability and readability) -for all images you add to a document. +We encourage you to create alt text (for accessibility & SEO purposes) and a title (for explanability and readability) for all images you add to a document. ```markdown ![Alt text](/path/to/img.jpg "Optional title") diff --git a/docs/contributing/readme.md b/docs/contributing/readme.md index a25799d90..6451da820 100644 --- a/docs/contributing/readme.md +++ b/docs/contributing/readme.md @@ -1,52 +1,47 @@ # Contributing -VectorHub is a learning hub that lives on its contributors. We are always looking for people to help others, especially -as we grow. You can contribute in many ways, either by creating new content or by letting us know if content needs -updating. +VectorHub is a learning hub that lives on its contributors. We are always looking for people to help others, especially as we grow. You can contribute in many ways, either by creating new content or by letting us know if content needs updating. ## How is VectorHub organised VectorHub's content is organized into three major areas: -1. Building Blocks: These cover the broad field of vector creation and retrieval. 
We take a step by step approach to - creating a vector stack: Data Sources -> Vector Compute -> Vector Search & Management. +1. Building Blocks: These cover the broad field of vector creation and retrieval. We take a step by step approach to creating a vector stack: Data Sources -> Vector Compute -> Vector Search & Management. -1. Blog: This is where contributors can share examples of things they have been working on, research and solutions to - problems they have encountered while working on Information Retrieval problems +2. Blog: This is where contributors can share examples of things they have been working on, research and solutions to problems they have encountered while working on Information Retrieval problems -1. Toolkit (coming soon): These are interesting apps, links, videos, tips, & tricks that aid in vector creation and - retrieval. +3. Toolkit (coming soon): These are interesting apps, links, videos, tips, & tricks that aid in vector creation and retrieval. ## How to contribute -[This loom](https://www.loom.com/share/aae75e4746f24453af0f3ae276f9ac56?sid=28db5254-f95f-48ae-8bf9-e13ed201bbce) -explains how to set up your contributing workflow. +[This loom](https://www.loom.com/share/aae75e4746f24453af0f3ae276f9ac56?sid=28db5254-f95f-48ae-8bf9-e13ed201bbce) explains how to set up your contributing workflow. To summarise: - 1. Fork the VectorHub repo -1. Push all commits to your fork in the appropriate section for your content -1. Open a PR to merge content from their fork to the remote repo (superlinked/vectorhub) +2. Push all commits to your fork in the appropriate section for your content +3. Open a PR to merge content from their fork to the remote repo (superlinked/vectorhub) When contributing an article please include the following at the start: - -1. One sentence to explain their topic / use case -1. One-two sentences on why your use case is valuable to the reader -1. 
A brief outline of what each section will discuss (can be bulletpointed) +1) One sentence to explain their topic / use case +2) One-two sentences on why your use case is valuable to the reader +3) A brief outline of what each section will discuss (can be bulletpointed) ## Get involved -We constantly release bounties looking for content contributions. Keep an eye out for items with bounty labels on our -GitHub. +We constantly release bounties looking for content contributions. Keep an eye out for items with bounty labels on our GitHub. ### Other ways you can get involved -::::link-array :::link-array-item{headerImage headerColor} -[Report an error/bug/typo](https://github.com/superlinked/VectorHub/issues) ::: +::::link-array +:::link-array-item{headerImage headerColor} +[Report an error/bug/typo](https://github.com/superlinked/VectorHub/issues) +::: :::link-array-item{headerImage headerColor} -[Create new or update existing content](https://github.com/superlinked/VectorHub) ::: :::: +[Create new or update existing content](https://github.com/superlinked/VectorHub) +::: +:::: -:::hint{type="info"} Thank you for your suggestions! If you think there is anything to improve on VectorHub, feel free -to contact us on arunesh@superlinked.com, or check our [GitHub repository](https://github.com/superlinked/VectorHub). +:::hint{type="info"} +Thank you for your suggestions! If you think there is anything to improve on VectorHub, feel free to contact us on arunesh\@superlinked.com, or check our [GitHub repository](https://github.com/superlinked/VectorHub). ::: diff --git a/docs/contributing/style_guide.md b/docs/contributing/style_guide.md index 1d74e0413..ab8863030 100644 --- a/docs/contributing/style_guide.md +++ b/docs/contributing/style_guide.md @@ -1,129 +1,139 @@ # Style Guide -VectorHub is a community-driven learning hub. Our style guide aims to help you share your thinking and work. We care -about grammar, but our priority is meaning. 
We want to generate productive conversation between community members. To -this end, we've written ten "commandments" outlining some dos and don'ts. Please read them before you start writing your -article. +VectorHub is a community-driven learning hub. Our style guide aims to help you share your thinking and work. We care about grammar, but our priority is meaning. We want to generate productive conversation between community members. To this end, we've written ten "commandments" outlining some dos and don'ts. Please read them before you start writing your article. ## VectorHub's Ten Commandments ### 1. Give value to your readers -:::hint Who are you writing for, and what problem are you solving? ::: +:::hint +Who are you writing for, and what problem are you solving? +::: -Ask yourself why your article is **valuable** to your readers. Set the context: say who your article is relevant to and -why. What **problem** of theirs are you helping to **solve**? e.g. Instead of this: "This latest software update is -packed with helpful enhancements." Write this: "Our November 15, 2023 update lets web developers optimize projects by -developing faster with simplified coding." +Ask yourself why your article is **valuable** to your readers. +Set the context: say who your article is relevant to and why. What **problem** of theirs are you helping to **solve**? +e.g. Instead of this: "This latest software update is packed with helpful enhancements." +Write this: "Our November 15, 2023 update lets web developers optimize projects by developing faster with simplified coding." Front load your first paragraph with **keywords**. ### 2. Be hierarchical -:::hint Organize your content like a building, and show us the blueprint first. ::: +:::hint +Organize your content like a building, and show us the blueprint first. +::: -**Give vectorhub editors a one-sentence-per-section outline showing us what you want to communicate. 
This is a huge -timesaver, and helps us make sure your article does what you want it to.** +**Give vectorhub editors a one-sentence-per-section outline showing us what you want to communicate. This is a huge timesaver, and helps us make sure your article does what you want it to.** -Give an overview of where you're going in your Introduction. In your article itself, add clear **headings** and -**subheadings** that give information and an overview to your readers. This will enable readers to navigate to relevant -sections and search engines to scan it. +Give an overview of where you're going in your Introduction. In your article itself, add clear **headings** and **subheadings** that give information and an overview to your readers. This will enable readers to navigate to relevant sections and search engines to scan it. Link (scroll-to-anchor) between concepts in your article (or to other articles on the platform) where it makes sense. ### 3. Be clear, substantive, and brief -:::hint Use simple language, short sentences, write only what's essential. 800-2500 words. ::: +:::hint +Use simple language, short sentences, write only what's essential. 800-2500 words. +::: **Length**: Articles must be **at least 800 but less than max 2500 words**. -Write clearly and concisely. Aim for -[crisp minimalism](https://learn.microsoft.com/en-us/style-guide/top-10-tips-style-voice). Use simple language wherever -possible; write like you speak. Be minimal. Less is more. Use **adjectives** only when they add value. Avoid -superlatives. +Write clearly and concisely. Aim for [crisp minimalism](https://learn.microsoft.com/en-us/style-guide/top-10-tips-style-voice). +Use simple language wherever possible; write like you speak. +Be minimal. Less is more. Use **adjectives** only when they add value. Avoid superlatives. -Get the **substance** down **first**. Fine-tuned, stylized prose can come later. First, complete your article in **point -form**. 
Make sure you've included everything that your article needs to make sense and convey value to your readers. +Get the **substance** down **first**. Fine-tuned, stylized prose can come later. First, complete your article in **point form**. Make sure you've included everything that your article needs to make sense and convey value to your readers. -Use common **abbreviations**. e.g. "apps" instead of "applications" Introduce unfamiliar abbreviations. e.g. "RAG -(Retrieval Augmented Generation)" +Use common **abbreviations**. +e.g. "apps" instead of "applications" +Introduce unfamiliar abbreviations. +e.g. "RAG (Retrieval Augmented Generation)" -Skip **periods** on headings and subheadings. Only use them in paragraphs and body text. Use last (Oxford) **commas** -for clarity. e.g. "We are programmers, data analysts, and web designers." +Skip **periods** on headings and subheadings. Only use them in paragraphs and body text. +Use last (Oxford) **commas** for clarity. +e.g. "We are programmers, data analysts, and web designers." Don't add extra **spaces** anywhere. One space between sentences and words. ### 4. Be visual -:::hint Save a thousand words. Use a picture. ::: +:::hint +Save a thousand words. Use a picture. +::: -Text only goes so far. Complement your words with **diagrams**, **graphs**, **charts**, **code snippets**, **images** -and any other visual tools that explain your work more efficiently. +Text only goes so far. Complement your words with **diagrams**, **graphs**, **charts**, **code snippets**, **images** and any other visual tools that explain your work more efficiently. ### 5. Be conversational, friendly, and use action verbs -:::hint Write how you speak. Be personal. In an active voice. ::: +:::hint +Write how you speak. Be personal. In an active voice. +::: Write with a **friendly** tone. Call your audience "you" and yourselves "we." -Use **contractions**. e.g. Instead of "We are," "it is," and "they are," use "We're," "it's", and "they're." 
+Use **contractions**. +e.g. Instead of "We are," "it is," and "they are," use "We're," "it's", and "they're." -As much as possible, use an **active voice** to explain events and relationships. e.g. Instead of: "Deep neural networks -are used by GPT to learn contextual embeddings." Write this: "GPT uses deep neural networks to learn contextual -embeddings." +As much as possible, use an **active voice** to explain events and relationships. +e.g. Instead of: "Deep neural networks are used by GPT to learn contextual embeddings." +Write this: "GPT uses deep neural networks to learn contextual embeddings." -Choose specific action verbs over generic ones. e.g. Instead of: "We made changes to the code to improve performance." +Choose specific action verbs over generic ones. +e.g. Instead of: "We made changes to the code to improve performance." Write this: "We optimized the code to boost performance." Use consistent verb tenses. ### 6. Cite (hyperlink) all sources -:::hint Give credit where credit is due. Always. ::: +:::hint +Give credit where credit is due. Always. +::: **Hyperlink to sources**, rather than including the source url. Don't use footnote or endnotes. -Make sure you have **permission** to reuse any images you include in your VectorHub article. Cite your own modified -versions of images owned by someone else as a "Modified version of" the original source. +Make sure you have **permission** to reuse any images you include in your VectorHub article. +Cite your own modified versions of images owned by someone else as a "Modified version of" the original source. -Cite sources for all graphics, images, direct quotations, and others' unique or patented ideas. If you're not sure -whether to cite it, you probably should. +Cite sources for all graphics, images, direct quotations, and others' unique or patented ideas. If you're not sure whether to cite it, you probably should. 
-Cite visual elements (i.e., figures) as follows: \[Title\], \[Author/Photographer/Artist\], \[Year\], \[Source\]. -Provide a hyperlink for the whole citation, pointing to the visual element source. e.g. -[Feed Recommendation Illustration, Arunesh Singh, 2023, superlinked.com.](https://superlinked.com) +Cite visual elements (i.e., figures) as follows: [Title], [Author/Photographer/Artist], [Year], [Source]. +Provide a hyperlink for the whole citation, pointing to the visual element source. +e.g. [Feed Recommendation Illustration, Arunesh Singh, 2023, superlinked.com.](https://superlinked.com) -Source citations go underneath visual elements. e.g. +Source citations go underneath visual elements. +e.g. ![Figure 1. Conceptual illustration of our approach, from Graph embeddings for movie visualization and recommendation. M. Vlachos, 2012, ResearchGate. https://www.researchgate.net/publication/290580162_Graph_embeddings_for_movie_visualization_and_recommendation/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ](assets/misc/Figure1-Conceptual_illustration_of_our_approach.png) -[Graph embeddings for movie visualization and recommendation. Figure 1. Conceptual illustration. M. Vlachos, 2012, ResearchGate.](https://www.researchgate.net/publication/290580162_Graph_embeddings_for_movie_visualization_and_recommendation/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ) +[Graph embeddings for movie visualization and recommendation. +Figure 1. Conceptual illustration. M. Vlachos, 2012, ResearchGate.](https://www.researchgate.net/publication/290580162_Graph_embeddings_for_movie_visualization_and_recommendation/download?_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6Il9kaXJlY3QiLCJwYWdlIjoiX2RpcmVjdCJ9fQ) -Punctuate _outside_ of links. e.g. "We explain our approach in more depth -[here](https://learn.microsoft.com/en-us/style-guide/top-10-tips-style-voice)." +Punctuate _outside_ of links. +e.g. 
"We explain our approach in more depth [here](https://learn.microsoft.com/en-us/style-guide/top-10-tips-style-voice)." ### 7. Edit and proofread -:::hint Nail down your logic. Step back, outline, and revise. ::: +:::hint +Nail down your logic. Step back, outline, and revise. +::: -As early as possible (when your article is still in point form), go through it, summarize each paragraph in one -(concise) sentence. Now, put all your summary sentences together and see if they tell a story. Is there a logical flow? -If not, rearrange, remove, or add content until the article makes sense. This exercise saves a lot of time and energy, -and ensures your headings and subheadings are accurate. +As early as possible (when your article is still in point form), go through it, summarize each paragraph in one (concise) sentence. Now, put all your summary sentences together and see if they tell a story. Is there a logical flow? If not, rearrange, remove, or add content until the article makes sense. This exercise saves a lot of time and energy, and ensures your headings and subheadings are accurate. Use a spell and grammar checker. ### 8. Technical terminology -:::hint It's an article about tech. Use relevant, familiar technical terms. ::: +:::hint +It's an article about tech. Use relevant, familiar technical terms. +::: -Use technology-specific terminology. If you think the thing you're describing is unfamiliar to most readers, it probably -is. Link to external resources that provide more in-depth explanations of terms you don't have space to explain in your -article. +Use technology-specific terminology. If you think the thing you're describing is unfamiliar to most readers, it probably is. Link to external resources that provide more in-depth explanations of terms you don't have space to explain in your article. ### 9. Accessibility -:::hint Hierarchize and tag. ::: +:::hint +Hierarchize and tag. 
+::: Improve accessibility by creating headings and subheadings that are clear and accurately descriptive. @@ -131,17 +141,18 @@ Include alt text for all images and graphics. ### 10. Inclusive language -:::hint The VectorHub community is diverse. Be non-biased and gender neutral. ::: +:::hint +The VectorHub community is diverse. Be non-biased and gender neutral. +::: -The VectorHub community is international and heterogeneous. Avoid words and phrases with negative connotations. Err on -the side of **caution** - if you think a term might be offensive, don't use it. +The VectorHub community is international and heterogeneous. Avoid words and phrases with negative connotations. Err on the side of **caution** - if you think a term might be offensive, don't use it. Avoid **stereotypes** and **biases**. -Use **gender neutral pronouns**. e.g. Instead of he/his/him or she/hers/her, use they/them/their. - -______________________________________________________________________ +Use **gender neutral pronouns**. +e.g. Instead of he/his/him or she/hers/her, use they/them/their. +--- ### Contributors - [Robert Turner](https://www.robertturner.co/copyedit) diff --git a/docs/tools/readme.md b/docs/tools/readme.md index fffc1bbd1..ac975cacf 100644 --- a/docs/tools/readme.md +++ b/docs/tools/readme.md @@ -1,5 +1,3 @@ # Toolbox -Toolbox is a collection of benchmarks, code snippets, summaries, and tricks that help you decide what's best for your -use case. These are the tools that we and our community use frequently. We curate them based on your input. Feel free to -share some tools that you use often or have created recently that will help the community. +Toolbox is a collection of benchmarks, code snippets, summaries, and tricks that help you decide what's best for your use case. These are the tools that we and our community use frequently. We curate them based on your input. Feel free to share some tools that you use often or have created recently that will help the community. 
diff --git a/docs/tools/testpage.md b/docs/tools/testpage.md index 99be9580e..d17251246 100644 --- a/docs/tools/testpage.md +++ b/docs/tools/testpage.md @@ -1,5 +1,7 @@ + # Test Page + Link here Description here diff --git a/docs/use_cases/clustering_&_anomaly_detection.md b/docs/use_cases/clustering_&_anomaly_detection.md index 765c9023c..35b4720d2 100644 --- a/docs/use_cases/clustering_&_anomaly_detection.md +++ b/docs/use_cases/clustering_&_anomaly_detection.md @@ -8,8 +8,7 @@ your content here -______________________________________________________________________ - +--- ## Contributors - [Your Name](you_social_handle.com) diff --git a/docs/use_cases/cybersecurity.md b/docs/use_cases/cybersecurity.md index c16c568f9..85e8651d9 100644 --- a/docs/use_cases/cybersecurity.md +++ b/docs/use_cases/cybersecurity.md @@ -8,8 +8,7 @@ your content here -______________________________________________________________________ - +--- ## Contributors - [Your Name](you_social_handle.com) diff --git a/docs/use_cases/embeddings_on_browser.md b/docs/use_cases/embeddings_on_browser.md index fa33770d9..62ce5d50b 100644 --- a/docs/use_cases/embeddings_on_browser.md +++ b/docs/use_cases/embeddings_on_browser.md @@ -8,113 +8,71 @@ ![Visual Summary of our Tutorial](../assets/use_cases/embeddings_on_browser/embeddings-browser-animation.gif) -______________________________________________________________________ - +--- ## Vector Embeddings, just for specialists? -Let's say you want to build an app that assesses the similarity of content using vector embeddings. You know a little -about what you'll need: first, obviously, a way of creating vector embeddings, maybe also some retrieval augmented -generation. But how do you operationalize your idea into a real-world application? Don't you require a substantial -hardware setup or expensive cloud APIs? Even if you had the requisite backend resources, who's going to develop and -configure them? 
Don't you also need highly specialized machine learning engineers or data scientists even to get -started? Don't you have to at least know Python? +Let's say you want to build an app that assesses the similarity of content using vector embeddings. You know a little about what you'll need: first, obviously, a way of creating vector embeddings, maybe also some retrieval augmented generation. But how do you operationalize your idea into a real-world application? Don't you require a substantial hardware setup or expensive cloud APIs? Even if you had the requisite backend resources, who's going to develop and configure them? Don't you also need highly specialized machine learning engineers or data scientists even to get started? Don't you have to at least know Python? Happily, the answer to all of these concerns is No. -**You can start building AI apps without having to learn a new programming language or adopt an entirely new set of -skills**. +**You can start building AI apps without having to learn a new programming language or adopt an entirely new set of skills**. -You don't require high-end equipment, or powerful GPUs. You _don't_ need ML and data science experts. Thanks to -pre-trained machine learning models, **you can create an intuitive component that generates and compares vector -embeddings right within your browser, on a local machine, tailored to your data**. You also don't require library -installations or complex configurations for end-users. You don't have to know Python; you can do it directly in -TypeScript. And you can start immediately. +You don't require high-end equipment, or powerful GPUs. You _don't_ need ML and data science experts. Thanks to pre-trained machine learning models, **you can create an intuitive component that generates and compares vector embeddings right within your browser, on a local machine, tailored to your data**. You also don't require library installations or complex configurations for end-users. 
You don't have to know Python; you can do it directly in TypeScript. And you can start immediately. -The following tutorial in creating a small-scale AI application demonstrates just how straightforward and efficient the -process can be. Though our component is a very specific use case, you can apply its basic approach to operationalizing -vector embeddings for all kinds of practial applications. +The following tutorial in creating a small-scale AI application demonstrates just how straightforward and efficient the process can be. Though our component is a very specific use case, you can apply its basic approach to operationalizing vector embeddings for all kinds of practial applications. Intrigued? Ready to start building? ## An app that generates, compares, and visualizes vector embeddings in your browser! -Our component takes input content, produces vector embeddings from it, assesses its parts - in our case, sentences - and -provides a user-friendly visual display of the results. And you can build it right within your web browser. +Our component takes input content, produces vector embeddings from it, assesses its parts - in our case, sentences - and provides a user-friendly visual display of the results. And you can build it right within your web browser. -In our tutorial, we will take some user input text, split it into sentences, and derive vector embeddings for each -sentence using TensorFlow.js. To assess the quality of our embeddings, we will generate a similarity matrix mapping the -distance between vectors as a colorful heatmap. Our component enables this by managing all the necessary state and UI -logic. +In our tutorial, we will take some user input text, split it into sentences, and derive vector embeddings for each sentence using TensorFlow.js. To assess the quality of our embeddings, we will generate a similarity matrix mapping the distance between vectors as a colorful heatmap. 
Our component enables this by managing all the necessary state and UI logic. Let's take a closer look at the our component's parts. ## Specific parts of our application 1. We import all necessary dependencies: React, Material-UI components, TensorFlow.js, and D3 (for color interpolation). -1. Our code defines a React functional component that generates sentence embeddings and visualizes their similarity - matrix in a user interface. -1. We declare various state variables using the **`useState`** hook, in order to manage user input, loading states, and - results. -1. The **`handleSimilarityMatrix`** function toggles the display of the similarity matrix, and calculates it when - necessary. -1. The **`handleGenerateEmbedding`** function is responsible for starting the sentence embedding generation process. It - splits the input sentences into individual sentences and triggers the **`embeddingGenerator`** function. -1. The **`calculateSimilarityMatrix`** is marked as a *memoized* function using the **`useCallback`** hook. It - calculates the similarity matrix based on sentence embeddings. -1. The **`embeddingGenerator`** is an asynchronous function that loads the Universal Sentence Encoder model and - generates sentence embeddings. -1. We use the **`useEffect`** hook to render the similarity matrix as a colorful canvas when **`similarityMatrix`** - changes. -1. The component's return statement defines the user interface, including input fields, buttons, and result displays. -1. The user input section includes a text area where the user can input sentences. -1. The embeddings output section displays the generated embeddings. -1. We provide two buttons. One generates the embeddings, and the other shows/hides the similarity matrix. -1. The code handles loading and model-loaded states, displaying loading indicators or model-loaded messages. -1. The similarity matrix section displays the colorful similarity matrix as a canvas when the user chooses to show it. +2. 
Our code defines a React functional component that generates sentence embeddings and visualizes their similarity matrix in a user interface. +3. We declare various state variables using the **`useState`** hook, in order to manage user input, loading states, and results. +4. The **`handleSimilarityMatrix`** function toggles the display of the similarity matrix, and calculates it when necessary. +5. The **`handleGenerateEmbedding`** function is responsible for starting the sentence embedding generation process. It splits the input sentences into individual sentences and triggers the **`embeddingGenerator`** function. +6. The **`calculateSimilarityMatrix`** is marked as a *memoized* function using the **`useCallback`** hook. It calculates the similarity matrix based on sentence embeddings. +7. The **`embeddingGenerator`** is an asynchronous function that loads the Universal Sentence Encoder model and generates sentence embeddings. +8. We use the **`useEffect`** hook to render the similarity matrix as a colorful canvas when **`similarityMatrix`** changes. +9. The component's return statement defines the user interface, including input fields, buttons, and result displays. +10. The user input section includes a text area where the user can input sentences. +11. The embeddings output section displays the generated embeddings. +12. We provide two buttons. One generates the embeddings, and the other shows/hides the similarity matrix. +13. The code handles loading and model-loaded states, displaying loading indicators or model-loaded messages. +14. The similarity matrix section displays the colorful similarity matrix as a canvas when the user chooses to show it. + ## Our encoder -The [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) is a pre-trained machine learning model built on -the transformer architecture. 
It creates context-aware representations for each word in a sentence, using the attention -mechanism - i.e., carefully considering the order and identity of all other words. The Encoder employs element-wise -summation to combine these word representations into a fixed-length sentence vector. To normalize these vectors, the -Encoder then divides them by the square root of the sentence length - to prevent shorter sentences from dominating -solely due to their brevity. +The [Universal Sentence Encoder](https://arxiv.org/pdf/1803.11175.pdf) is a pre-trained machine learning model built on the transformer architecture. It creates context-aware representations for each word in a sentence, using the attention mechanism - i.e., carefully considering the order and identity of all other words. The Encoder employs element-wise summation to combine these word representations into a fixed-length sentence vector. To normalize these vectors, the Encoder then divides them by the square root of the sentence length - to prevent shorter sentences from dominating solely due to their brevity. -The Encoder takes sentences or paragraphs of text as input, and outputs vectors that effectively capture the meaning of -the text. This lets us assess vector similarity (i.e., distance) - a result you can use in a wide variety of natural -language processing (NLP) tasks, including ours. +The Encoder takes sentences or paragraphs of text as input, and outputs vectors that effectively capture the meaning of the text. This lets us assess vector similarity (i.e., distance) - a result you can use in a wide variety of natural language processing (NLP) tasks, including ours. ### Encoder, Lite -For our application, we'll utilize a scaled-down and faster 'Lite' variant of the full model. The Lite model maintains -strong performance while demanding less computational power, making it ideal for deployment in client-side code, mobile -devices, or even directly within web browsers. 
And because the Lite variant doesn't require any kind of complex -installation or a dedicated GPU, it's more accessible to a broader range of users. +For our application, we'll utilize a scaled-down and faster 'Lite' variant of the full model. The Lite model maintains strong performance while demanding less computational power, making it ideal for deployment in client-side code, mobile devices, or even directly within web browsers. And because the Lite variant doesn't require any kind of complex installation or a dedicated GPU, it's more accessible to a broader range of users. ### Why a pre-trained model -The rationale behind pre-trained models is straightforward. Most NLP projects in research and industry contexts only -have access to relatively small training datasets. It's not feasible, then, to use data-hungry deep learning models. And -annotating more supervised training data is often prohibitively expensive. Here, **pre-trained models can fill the data -gap**. +The rationale behind pre-trained models is straightforward. Most NLP projects in research and industry contexts only have access to relatively small training datasets. It's not feasible, then, to use data-hungry deep learning models. And annotating more supervised training data is often prohibitively expensive. Here, **pre-trained models can fill the data gap**. -Many NLP projects employ pre-trained word embeddings like word2vec or GloVe, which transform individual words into -vectors. However, recent developments have shown that, on many tasks, **pre-trained sentence-level embeddings excel at -capturing higher level semantics** than word embeddings can. The Universal Sentence Encoder's fixed-length vector -embeddings are extremely effective for computing semantic similarity between sentences, with high scores in various -semantic textual similarity benchmarks. +Many NLP projects employ pre-trained word embeddings like word2vec or GloVe, which transform individual words into vectors. 
However, recent developments have shown that, on many tasks, **pre-trained sentence-level embeddings excel at capturing higher level semantics** than word embeddings can. The Universal Sentence Encoder's fixed-length vector embeddings are extremely effective for computing semantic similarity between sentences, with high scores in various semantic textual similarity benchmarks. + +Though our Encoder's sentence embeddings are pre-trained, they can also be fine-tuned for specific tasks, even when there isn't much task-specific training data. (If we needed, we could even make the encoder more versatile, supporting _multiple_ downstream tasks, by training it with multi-task learning.) -Though our Encoder's sentence embeddings are pre-trained, they can also be fine-tuned for specific tasks, even when -there isn't much task-specific training data. (If we needed, we could even make the encoder more versatile, supporting -_multiple_ downstream tasks, by training it with multi-task learning.) Okay, let's get started, using TypeScript. ## Our step-by-step tutorial ### Import modules - ```tsx import React, { FC, useState, useEffect, useCallback } from 'react'; import { @@ -137,8 +95,7 @@ import { interpolateGreens } from 'd3-scale-chromatic'; ### State variables to manage user input, loading state, and results -We use the **`useState`** hook in a React functional component to manage user input, track loading states of machine -learning models, and update the user interface with the results, including the similarity matrix visualization. +We use the **`useState`** hook in a React functional component to manage user input, track loading states of machine learning models, and update the user interface with the results, including the similarity matrix visualization. 
```tsx // State variables to manage user input, loading state, and results @@ -173,10 +130,7 @@ learning models, and update the user interface with the results, including the s ### Function to toggle the display of the similarity matrix -The **`handleSimilarityMatrix`** function is called in response to user input, toggling the display of a UI similarity -matrix - by changing the **`showSimilarityMatrix`** state variable. If the matrix was previously shown, the -**`handleSimilarityMatrix`** hides it by setting it to **`null`**. If the matrix wasn't shown, the -**`handleSimilarityMatrix`** calculates the matrix and sets it to display in the UI. +The **`handleSimilarityMatrix`** function is called in response to user input, toggling the display of a UI similarity matrix - by changing the **`showSimilarityMatrix`** state variable. If the matrix was previously shown, the **`handleSimilarityMatrix`** hides it by setting it to **`null`**. If the matrix wasn't shown, the **`handleSimilarityMatrix`** calculates the matrix and sets it to display in the UI. ```tsx // Toggles display of similarity matrix @@ -198,11 +152,7 @@ matrix - by changing the **`showSimilarityMatrix`** state variable. If the matri ### Function to generate sentence embeddings and populate state -The **`handleGenerateEmbedding`** function, called when a user clicks the "Generate Embedding" button, initiates the -process of generating sentence embeddings. It sets the **`modelComputing`** state variable to **`true`** to indicate -that the model is working, splits the user's input into individual sentences, updates the **`sentencesList`** state -variable with these sentences, and then calls the **`embeddingGenerator`** function to start generating embeddings based -on the individual sentences. +The **`handleGenerateEmbedding`** function, called when a user clicks the "Generate Embedding" button, initiates the process of generating sentence embeddings. 
It sets the **`modelComputing`** state variable to **`true`** to indicate that the model is working, splits the user's input into individual sentences, updates the **`sentencesList`** state variable with these sentences, and then calls the **`embeddingGenerator`** function to start generating embeddings based on the individual sentences. ```tsx // Generate embeddings for input sentences @@ -223,12 +173,9 @@ on the individual sentences. ### Function to calculate the similarity matrix for sentence embeddings -The **`calculateSimilarityMatrix`** function computes a similarity matrix for a set of sentences by comparing the -embeddings of each sentence with all other sentence embeddings. The matrix contains similarity scores for all possible -sentence pairs. You can use it to perform further visualization and analysis. +The **`calculateSimilarityMatrix`** function computes a similarity matrix for a set of sentences by comparing the embeddings of each sentence with all other sentence embeddings. The matrix contains similarity scores for all possible sentence pairs. You can use it to perform further visualization and analysis. -This function is memoized using the **`useCallback`** hook, which ensures that its behavior will remain consistent -across renders unless its dependencies change. +This function is memoized using the **`useCallback`** hook, which ensures that its behavior will remain consistent across renders unless its dependencies change. ```tsx // Calculates similarity matrix for sentence embeddings @@ -276,9 +223,7 @@ across renders unless its dependencies change. ### Function to generate sentence embeddings using the Universal Sentence Encoder -The **`embeddingGenerator`** function is called when the user clicks a "Generate Embedding" button. It loads the -Universal Sentence Encoder model, generates sentence embeddings for a list of sentences, and updates the component's -state with the results. It also handles potential errors. 
+The **`embeddingGenerator`** function is called when the user clicks a "Generate Embedding" button. It loads the Universal Sentence Encoder model, generates sentence embeddings for a list of sentences, and updates the component's state with the results. It also handles potential errors. ```tsx // Generate embeddings using Universal Sentence Encoder (Cer., et al., 2018) @@ -332,9 +277,7 @@ state with the results. It also handles potential errors. ### useEffect hook to render the similarity matrix as a colorful canvas -**`useEffect`** is triggered when the **`similarityMatrix`** or **`canvasSize`** changes. **`useEffect`** draws a -similarity matrix on an HTML canvas element. The matrix is represented as a grid of colored cells, with each color (hue) -determined by the similarity value among sentences. The resulting visualization is a dynamic part of the user interface. +**`useEffect`** is triggered when the **`similarityMatrix`** or **`canvasSize`** changes. **`useEffect`** draws a similarity matrix on an HTML canvas element. The matrix is represented as a grid of colored cells, with each color (hue) determined by the similarity value among sentences. The resulting visualization is a dynamic part of the user interface. ```tsx // Render similarity matrix as colored canvas @@ -380,9 +323,7 @@ determined by the similarity value among sentences. The resulting visualization ### User Input Section -This code represents UI fields where users can input multiple sentences. It includes a label, a multiline text input -field, and React state management to control and update the input, storing user-entered sentences in the **`sentences`** -state variable for further processing in the component. +This code represents UI fields where users can input multiple sentences. 
It includes a label, a multiline text input field, and React state management to control and update the input, storing user-entered sentences in the **`sentences`** state variable for further processing in the component. ```tsx {/* User Input Section */} @@ -406,8 +347,8 @@ state variable for further processing in the component. ### Embeddings Output Section -The UI embeddings output section displays the embeddings stored in the **`embeddings`** state variable, including a -label, and a multiline text output field. React state management lets you control and update the displayed content. +The UI embeddings output section displays the embeddings stored in the **`embeddings`** state variable, including a label, and a multiline text output field. React state management lets you control and update the displayed content. + ```tsx {/* Embeddings Output Section */} @@ -431,9 +372,7 @@ label, and a multiline text output field. React state management lets you contro ### Generate Embedding Button -The following code represents a raised, solid button in the UI that triggers the **`handleGenerateEmbedding`** function -to initiate the embedding generation process. The generate embedding button is initially disabled if there are no input -sentences (**`!sentences`**) or if the model is currently loading (**`modelLoading`**). +The following code represents a raised, solid button in the UI that triggers the **`handleGenerateEmbedding`** function to initiate the embedding generation process. The generate embedding button is initially disabled if there are no input sentences (**`!sentences`**) or if the model is currently loading (**`modelLoading`**). ```tsx {/* Generate Embedding Button */} @@ -451,11 +390,7 @@ sentences (**`!sentences`**) or if the model is currently loading (**`modelLoadi ### Model Indicator -This code deploys the values of the **`modelComputing`** and **`modelLoading`** state variables to control what's -displayed in the user interface. 
If **`modelComputing`** and **`modelLoading`** are **`true`**, a loading indicator is -displayed. If **`modelLoading`** is **`false`**, then the model is already loaded and we display a message indicating -this. This conditional rendering shows the user either a loading indicator or a model loaded message based on the status -of model loading and computing. +This code deploys the values of the **`modelComputing`** and **`modelLoading`** state variables to control what's displayed in the user interface. If **`modelComputing`** and **`modelLoading`** are **`true`**, a loading indicator is displayed. If **`modelLoading`** is **`false`**, then the model is already loaded and we display a message indicating this. This conditional rendering shows the user either a loading indicator or a model loaded message based on the status of model loading and computing. ```tsx {/* Display model loading or loaded message */} @@ -486,9 +421,7 @@ of model loading and computing. ### Similarity Matrix -The following code displays the similarity matrix in the user interface if the **`showSimilarityMatrix`** state variable -is **`true`**. This section of the UI includes a title, "Similarity Matrix," and a canvas element for rendering the -matrix. If **`false`**, the similarity matrix is hidden. +The following code displays the similarity matrix in the user interface if the **`showSimilarityMatrix`** state variable is **`true`**. This section of the UI includes a title, "Similarity Matrix," and a canvas element for rendering the matrix. If **`false`**, the similarity matrix is hidden. ```tsx {/* Similarity Matrix Section */} @@ -519,68 +452,49 @@ matrix. If **`false`**, the similarity matrix is hidden. }; ``` + ## The test drive: functionality & embedding quality -Before we launch our intuitive semantic search application into production, we should test it. Let's check its -functionality, and the quality of our model's vector embeddings. 
+Before we launch our intuitive semantic search application into production, we should test it. Let's check its functionality, and the quality of our model's vector embeddings. -Functionality is easy. We just run and test it. Checking embedding quality is a bit more complex. We are dealing with -arrays of 512 elements. How do we gauge their effectiveness? +Functionality is easy. We just run and test it. Checking embedding quality is a bit more complex. We are dealing with arrays of 512 elements. How do we gauge their effectiveness? -Here is where our **similarity matrix** comes into play. We employ the dot product between vectors for each pair of -sentences to discern their proximity or dissimilarity. To illustrate this, let's take two random pages from Wikipedia, -each containing different paragraphs. These two pages will provide us with a total of seven sentences for comparison. +Here is where our **similarity matrix** comes into play. We employ the dot product between vectors for each pair of sentences to discern their proximity or dissimilarity. To illustrate this, let's take two random pages from Wikipedia, each containing different paragraphs. These two pages will provide us with a total of seven sentences for comparison. -1. [The quick brown fox jumps over the lazy dog](https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog) +1) [The quick brown fox jumps over the lazy dog](https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog) -1. [Los Angeles Herald](https://en.wikipedia.org/wiki/Los_Angeles_Herald) +2) [Los Angeles Herald](https://en.wikipedia.org/wiki/Los_Angeles_Herald) ### Paragraph 1 input -> "The quick brown fox jumps over the lazy dog" is an English-language pangram – a sentence that contains all the -> letters of the alphabet at least once. 
The phrase is commonly used for touch-typing practice, testing typewriters and -> computer keyboards, displaying examples of fonts, and other applications involving text where the use of all letters -> in the alphabet is desired. +> "The quick brown fox jumps over the lazy dog" is an English-language pangram – a sentence that contains all the letters of the alphabet at least once. The phrase is commonly used for touch-typing practice, testing typewriters and computer keyboards, displaying examples of fonts, and other applications involving text where the use of all letters in the alphabet is desired. +> ### Paragraph 2 input -> The Los Angeles Herald or the Evening Herald was a newspaper published in Los Angeles in the late 19th and early 20th -> centuries. Founded in 1873 by Charles A. Storke, the newspaper was acquired by William Randolph Hearst in 1931. It -> merged with the Los Angeles Express and became an evening newspaper known as the Los Angeles Herald-Express. A 1962 -> combination with Hearst's morning Los Angeles Examiner resulted in its final incarnation as the evening Los Angeles -> Herald-Examiner. +> The Los Angeles Herald or the Evening Herald was a newspaper published in Los Angeles in the late 19th and early 20th centuries. Founded in 1873 by Charles A. Storke, the newspaper was acquired by William Randolph Hearst in 1931. It merged with the Los Angeles Express and became an evening newspaper known as the Los Angeles Herald-Express. A 1962 combination with Hearst's morning Los Angeles Examiner resulted in its final incarnation as the evening Los Angeles Herald-Examiner. +> -When we input these sentences to our model and generate the similarity matrix, we can observe some remarkable patterns. +When we input these sentences to our model and generate the similarity matrix, we can observe some remarkable patterns. 
![Similarity Matrix for seven sentences from two documents](../assets/use_cases/embeddings_on_browser/embeddings-browser-similarity-matrix.png) -(Note: the 7x7 matrix represents seven sentences; Paragraph 2's second sentence breaks at the "A." of "Charles A. -Storke." The third sentence begins with "Storke.") +(Note: the 7x7 matrix represents seven sentences; Paragraph 2's second sentence breaks at the "A." of "Charles A. Storke." The third sentence begins with "Storke.") -Our similarity matrix uses color hue to illustrate that same-paragraph sentence pairs are more similar (darker green) -than different-paragraph sentence pairs (lighter green). The darker the hue of green, the more similar the vectors -representing the sentences are - i.e., the closer they are in semantic meaning. For example, pairing Paragraph 1's first -sentence ("The quick brown fox...") and second sentence ("The phrase is commonly...") displays as medium green squares - -\[1,2\] and \[2,1\]. So does pairing Paragraph 2's first ("The Los Angeles Herald...") and second ("Founded in 1873...") -\- \[3,4\] and \[4,3\]. The darkest green squares represent the dot product values of identical pairs - \[1,1\], \[2,2\], -\[3,3\], and so on. +Our similarity matrix uses color hue to illustrate that same-paragraph sentence pairs are more similar (darker green) than different-paragraph sentence pairs (lighter green). The darker the hue of green, the more similar the vectors representing the sentences are - i.e., the closer they are in semantic meaning. For example, pairing Paragraph 1's first sentence ("The quick brown fox...") and second sentence ("The phrase is commonly...") displays as medium green squares - [1,2] and [2,1]. So does pairing Paragraph 2's first ("The Los Angeles Herald...") and second ("Founded in 1873...") - [3,4] and [4,3]. The darkest green squares represent the dot product values of identical pairs - [1,1], [2,2], [3,3], and so on.
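To make the dot-product comparison concrete, here is a minimal TypeScript sketch. It is not the component's own `calculateSimilarityMatrix`, whose body is not shown here. The helper names `dot` and `similarityMatrix` and the toy two-dimensional vectors are assumptions chosen for illustration; the real embeddings have 512 elements.

```typescript
// Hypothetical helpers for illustration; not taken from the component's source.
type Vector = number[];

// Dot product of two equal-length vectors.
function dot(a: Vector, b: Vector): number {
  if (a.length !== b.length) {
    throw new Error(`Length mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Builds an n x n matrix whose [i][j] entry is the dot product of
// embeddings[i] and embeddings[j]. The result is symmetric.
function similarityMatrix(embeddings: Vector[]): number[][] {
  return embeddings.map((a) => embeddings.map((b) => dot(a, b)));
}

// Toy unit-length vectors standing in for 512-element sentence embeddings.
const toy: Vector[] = [
  [1, 0],
  [0.6, 0.8],
  [0, 1],
];

const matrix = similarityMatrix(toy);
console.log(matrix);
```

If the vectors are unit length, as in this toy input, each diagonal entry is 1 up to floating-point rounding, which matches the darkest squares for identical pairs. Off-diagonal entries shrink as the vectors point in more different directions.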
![Numbered sentence pairs in similarity matrix](../assets/use_cases/embeddings_on_browser/embeddings-browser-numbered-similarity-matrix.png) -As a result, each paragraph's same-paragraph sentence pairs form their own notably darker regions within the larger -matrix above. Conversely, different-paragraph sentence pairs are less similar, and therefore display as lighter green -squares. For example, pairings of Paragraph 1's first sentence \[1\] and Paragraph 2's first sentence \[3\] are -distinctively lighter green (i.e., more distant in meaning) - \[1,3\] and \[3,1\], and lie outside our two -same-paragraph sentence pair regions. +As a result, each paragraph's same-paragraph sentence pairs form their own notably darker regions within the larger matrix above. Conversely, different-paragraph sentence pairs are less similar, and therefore display as lighter green squares. For example, pairings of Paragraph 1's first sentence [1] and Paragraph 2's first sentence [3] are distinctively lighter green (i.e., more distant in meaning) - [1,3] and [3,1], and lie outside our two same-paragraph sentence pair regions. + ## A real-world-ready semantic search application And that's it! -You can now build our low cost, intuitive, ready-to-deploy, in-browser vector embedding generator and visualizer in your -own browser, and use it for your own real-world applications. +You can now build our low cost, intuitive, ready-to-deploy, in-browser vector embedding generator and visualizer in your own browser, and use it for your own real-world applications. + +This is just one example of the kind of AI apps any developer can build, using pre-trained models configured with TypeScript, and without any cloud models, expensive hardware, or specialized engineering knowledge. -This is just one example of the kind of AI apps any developer can build, using pre-trained models configured with -TypeScript, and without any cloud models, expensive hardware, or specialized engineering knowledge. 
## Contributors diff --git a/docs/use_cases/fraud_&_safety.md b/docs/use_cases/fraud_&_safety.md index 93343ea34..4f12c7eeb 100644 --- a/docs/use_cases/fraud_&_safety.md +++ b/docs/use_cases/fraud_&_safety.md @@ -8,8 +8,7 @@ your content here -______________________________________________________________________ - +--- ## Contributors - [Your Name](you_social_handle.com) diff --git a/docs/use_cases/knowledge_graph_embedding.md b/docs/use_cases/knowledge_graph_embedding.md index 9a90a75c6..bc11e8b19 100644 --- a/docs/use_cases/knowledge_graph_embedding.md +++ b/docs/use_cases/knowledge_graph_embedding.md @@ -2,89 +2,56 @@ # Answering Questions with Knowledge Graph Embeddings -Large Language Models (LLMs) are everywhere, achieving impressive results in all sorts of language-related tasks - -language understanding, sentiment analysis, text completion, and so on. But in some domains, including those involving -relational data (often stored in Knowledge Graphs), LLMs don't always perform as well. For use cases that require you to -capture semantic relationships - like relation extraction and link prediction, specialized approaches that _embed_ -relational data can perform much better than LLMs. +Large Language Models (LLMs) are everywhere, achieving impressive results in all sorts of language-related tasks - language understanding, sentiment analysis, text completion, and so on. But in some domains, including those involving relational data (often stored in Knowledge Graphs), LLMs don't always perform as well. For use cases that require you to capture semantic relationships - like relation extraction and link prediction, specialized approaches that _embed_ relational data can perform much better than LLMs. -We look at how Knowledge Graph Embedding (KGE) algorithms can improve performance on some tasks that LLMs have -difficulty with, explore some example code for training and evaluating a KGE model, and use the KGE model to perform Q&A -tasks. 
We also compare KGE and LLM performance on a Q&A task. +We look at how Knowledge Graph Embedding (KGE) algorithms can improve performance on some tasks that LLMs have difficulty with, explore some example code for training and evaluating a KGE model, and use the KGE model to perform Q&A tasks. We also compare KGE and LLM performance on a Q&A task. Let's get started. ## Knowledge Graphs and missing edges -We use Knowledge Graphs (KGs) to describe how different entities, like people, places, or more generally "things," -relate to each other. For example, a KG can show us how a writer is linked to their books, or how a book is connected to -its awards: +We use Knowledge Graphs (KGs) to describe how different entities, like people, places, or more generally "things," relate to each other. For example, a KG can show us how a writer is linked to their books, or how a book is connected to its awards: ![Knowledge Graph example](../assets/use_cases/knowledge_graph_embedding/small_kg.png) -In domains where understanding these specific connections is crucial - like recommendation systems, search engines, or -information retrieval - KGs specialize in helping computers understand the detailed relationships between things. +In domains where understanding these specific connections is crucial - like recommendation systems, search engines, or information retrieval - KGs specialize in helping computers understand the detailed relationships between things. -The problem with KGs is that they are usually incomplete. Edges that should be present are missing. These missing links -can result from inaccuracies in the data collection process, or simply reflect that our data source is imperfect. In -large open-source knowledge bases, -[we can observe a _significant_ amount of incompleteness](https://towardsdatascience.com/neural-graph-databases-cc35c9e1d04f): +The problem with KGs is that they are usually incomplete. Edges that should be present are missing. 
These missing links can result from inaccuracies in the data collection process, or simply reflect that our data source is imperfect. In large open-source knowledge bases, [we can observe a _significant_ amount of incompleteness](https://towardsdatascience.com/neural-graph-databases-cc35c9e1d04f): -> … in Freebase, 93.8% of people have no place of birth, and -> [78.5% have no nationality](https://aclanthology.org/P09-1113.pdf), -> [about 68% of people do not have any profession](https://dl.acm.org/doi/abs/10.1145/2566486.2568032), while, in -> Wikidata, [about 50% of artists have no date of birth](https://arxiv.org/abs/2207.00143), and only -> [0.4% of known buildings have information about height](https://dl.acm.org/doi/abs/10.1145/3485447.3511932). +> … in Freebase, 93.8% of people have no place of birth, and [78.5% have no nationality](https://aclanthology.org/P09-1113.pdf), [about 68% of people do not have any profession](https://dl.acm.org/doi/abs/10.1145/2566486.2568032), while, in Wikidata, [about 50% of artists have no date of birth](https://arxiv.org/abs/2207.00143), and only [0.4% of known buildings have information about height](https://dl.acm.org/doi/abs/10.1145/3485447.3511932). -The **imperfections of KGs** can lead to negative outcomes. For example, in recommendation systems, KG incompleteness -can result in **limited or biased recommendations**; on Q&A tasks, KG incompleteness can yield **substantively and -contextually incomplete or inaccurate answers to queries**. +The **imperfections of KGs** can lead to negative outcomes. For example, in recommendation systems, KG incompleteness can result in **limited or biased recommendations**; on Q&A tasks, KG incompleteness can yield **substantively and contextually incomplete or inaccurate answers to queries**. Fortunately, KGEs can help solve problems that plague KGs.
## Knowledge Graph Embeddings and how they work -Trained KGE algorithms can generalize and predict missing edges by calculating the likelihood of connections between -entities. +Trained KGE algorithms can generalize and predict missing edges by calculating the likelihood of connections between entities. -KGE algorithms do this by taking tangled complex webs of connections between entities and turning them into something AI -systems can understand: **vectors**. Embedding entities in a vector space allows KGE algorithms to define a **loss -function** that measures the discrepancy between embedding similarity and node similarity in the graph. _If the loss is -minimal, similar nodes in the graph have similar embeddings_. +KGE algorithms do this by taking tangled complex webs of connections between entities and turning them into something AI systems can understand: **vectors**. Embedding entities in a vector space allows KGE algorithms to define a **loss function** that measures the discrepancy between embedding similarity and node similarity in the graph. _If the loss is minimal, similar nodes in the graph have similar embeddings_. -The KGE model is **trained** by trying to make the similarities between embedding vectors align with the similarities of -corresponding nodes in the graph. The model adjusts its parameters during training to ensure that entities that are -similar in the KG have similar embeddings. This ensures that vector representations capture the structural and -relational aspects of entities in the graph. +The KGE model is **trained** by trying to make the similarities between embedding vectors align with the similarities of corresponding nodes in the graph. The model adjusts its parameters during training to ensure that entities that are similar in the KG have similar embeddings. This ensures that vector representations capture the structural and relational aspects of entities in the graph. 
-KGE algorithms vary in the similarity functions they employ, and how they define node similarity within a graph. A -**simple approach** is to consider nodes that are connected by an edge as similar. Using this definition, learning node -embeddings can be framed as a classification task. In this task, the goal is to determine how likely it is that any -given pair of nodes have a specific type of relationship (i.e., share a specific edge), given their embeddings. +KGE algorithms vary in the similarity functions they employ, and how they define node similarity within a graph. A **simple approach** is to consider nodes that are connected by an edge as similar. Using this definition, learning node embeddings can be framed as a classification task. In this task, the goal is to determine how likely it is that any given pair of nodes have a specific type of relationship (i.e., share a specific edge), given their embeddings. ## Demo using DistMult KGE -For our KGE model demo, we opted for the DistMult KGE algorithm. It works by representing the likelihood of -relationships between entities (i.e., similarity) as a _bilinear_ function. Essentially, DisMult KGE assumes that the -score of a given triple (comprised of a head entity _h_, a relationship _r_, and a tail entity _t_) can be computed as: -_h_^T (diag)_r_ _t_. +For our KGE model demo, we opted for the DistMult KGE algorithm. It works by representing the likelihood of relationships between entities (i.e., similarity) as a _bilinear_ function. Essentially, DisMult KGE assumes that the score of a given triple (comprised of a head entity _h_, a relationship _r_, and a tail entity _t_) can be computed as: _h_^T \(diag)_r_ _t_. 
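To make the bilinear score concrete, here is a minimal pure-Python sketch (it is not the PyTorch Geometric implementation). Because diag(_r_) is a diagonal matrix, _h_^T diag(_r_) _t_ reduces to the sum of the element-wise products of the three vectors. The function name `distmult_score` and the toy 3-dimensional vectors are assumptions made for illustration, not learned embeddings.

```python
def distmult_score(head, relation, tail):
    """Bilinear DistMult score h^T diag(r) t, written as a sum of element-wise products."""
    if not (len(head) == len(relation) == len(tail)):
        raise ValueError("head, relation and tail must have the same dimension")
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


# Toy embeddings (made up for illustration, not learned values).
h = [1.0, 0.5, -1.0]
r = [2.0, 1.0, 0.5]
t_close = [1.0, 1.0, -1.0]  # roughly aligned with h under r
t_far = [-1.0, 0.0, 1.0]    # points the opposite way

score_close = distmult_score(h, r, t_close)
score_far = distmult_score(h, r, t_far)
print(score_close, score_far)
```

One consequence is visible directly in this form: swapping head and tail leaves the score unchanged, so DistMult scores every relation symmetrically.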
![DistMult similarity function](../assets/use_cases/knowledge_graph_embedding/distmult.png) diagram source: [dglke](https://dglke.dgl.ai/doc/kg.html) -The model parameters are learned (internalizing the intricate relationships within the KG) by _minimizing cross entropy -between real and corrupted triplets_. +The model parameters are learned (internalizing the intricate relationships within the KG) by _minimizing cross entropy between real and corrupted triplets_. In the following two sections we'll walk you through: -**1. Building and training a DistMult model** **2. Using the model to answer questions** +**1. Building and training a DistMult model** +**2. Using the model to answer questions** ### Building and Training a KGE model -We use a subgraph of the Freebase Knowledge Graph, a database of general facts (transferred to Wikidata after Freebase -Knowledge Graph's 2014 shutdown). This subgraph contains 14541 different entities, 237 different relation types, and -310116 edges in total. +We use a subgraph of the Freebase Knowledge Graph, a database of general facts (transferred to Wikidata after Freebase Knowledge Graph's 2014 shutdown). This subgraph contains 14541 different entities, 237 different relation types, and 310116 edges in total. You can load the subgraph as follows: @@ -93,11 +60,9 @@ from torch_geometric.datasets import FB15k_237 train_data = FB15k_237("./data", split='train')[0] ``` -We'll use PyTorch Geometric, a library built on top of PyTorch, to construct and train the model. PyTorch Geometric is -specifically designed for building machine learning models on graph-structured data. +We'll use PyTorch Geometric, a library built on top of PyTorch, to construct and train the model. PyTorch Geometric is specifically designed for building machine learning models on graph-structured data. -The implementation of the DistMult algorithm lies under the `torch_geometric.nn` package. 
To create the model, we need -to specify the following three parameters: +The implementation of the DistMult algorithm lies under the `torch_geometric.nn` package. To create the model, we need to specify the following three parameters: - `num_nodes`: The number of distinct entities in the graph (in our case, 14541) - `num_relations`: The number of distinct relations in the graph (in our case, 237) @@ -112,14 +77,12 @@ model = DistMult( ) ``` -For additional configuration of the model, please refer to the -[PyTorch Geometric documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.kge.DistMult.html). +For additional configuration of the model, please refer to the [PyTorch Geometric documentation](https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.kge.DistMult.html). + The process of **model training in PyTorch** follows a standard set of steps: -The first step is **initialization of an optimizer**. The optimizer is a fundamental part of machine learning model -training; it adjusts the parameters of the model to reduce loss. In our demo, we use the -[Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer. +The first step is **initialization of an optimizer**. The optimizer is a fundamental part of machine learning model training; it adjusts the parameters of the model to reduce loss. In our demo, we use the [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer. ```python import torch.optim as optim @@ -127,10 +90,7 @@ import torch.optim as optim opt = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-6) ``` -Second, **creation of a data loader**. The purpose of this loader is to return a batch iterator over the entire dataset. -The batch size can be adjusted according to the specific requirements of the model and the capacity of the hardware. 
The -loader not only provides an efficient way to load the data but also shuffles it, to ensure that the model is not biased -by the order of the training samples. +Second, **creation of a data loader**. The purpose of this loader is to return a batch iterator over the entire dataset. The batch size can be adjusted according to the specific requirements of the model and the capacity of the hardware. The loader not only provides an efficient way to load the data but also shuffles it, to ensure that the model is not biased by the order of the training samples. ```python # 2. create data loader on the training set @@ -143,11 +103,7 @@ loader = model.loader( ) ``` -Finally, **execution of the training loop**. This is where the actual learning takes place. The model processes each -batch of data, then we compare the actual output to the expected output (labels). The model parameters are then adjusted -to bring the outputs closer to the labels. This process continues until the model's performance on a validation set -reaches an acceptable level, or a predefined number of iterations has been completed (we opt for the latter in our -example). +Finally, **execution of the training loop**. This is where the actual learning takes place. The model processes each batch of data, then we compare the actual output to the expected output (labels). The model parameters are then adjusted to bring the outputs closer to the labels. This process continues until the model's performance on a validation set reaches an acceptable level, or a predefined number of iterations has been completed (we opt for the latter in our example). ```python # 3. usual torch training loop @@ -164,13 +120,11 @@ for e in range(EPOCHS): print(f"Epoch {e} loss {sum(l) / len(l):.4f}") ``` -Now that we have a trained model, we can do **some experiments** to see how well the learned embeddings capture semantic -meaning. 
To do so, we will construct 3 fact triplets and then we'll use the model to score these triplets. The triplets -(each consisting of a head entity, a relationship, and a tail entity) are: +Now that we have a trained model, we can do **some experiments** to see how well the learned embeddings capture semantic meaning. To do so, we will construct 3 fact triplets and then we'll use the model to score these triplets. The triplets (each consisting of a head entity, a relationship, and a tail entity) are: 1. France contains Burgundy (which is true) -1. France contains Rio de Janeiro (which is not true) -1. France contains Bonnie and Clyde (which makes no sense) +2. France contains Rio de Janeiro (which is not true) +3. France contains Bonnie and Clyde (which makes no sense) ```python # Get node and relation IDs @@ -194,10 +148,10 @@ print(scores.tolist()) # Bonnie and Clyde gets the lowest (negative) score ``` + ### Answering questions with our model -Next, we'll demo how to apply the trained model to answer questions. To answer the question, "What is Guy Ritchie's -profession?" we start by finding the embedding vectors of the node "Guy Ritchie" and the relation "profession." +Next, we'll demo how to apply the trained model to answer questions. To answer the question, "What is Guy Ritchie's profession?" we start by finding the embedding vectors of the node "Guy Ritchie" and the relation "profession." ```python # Accessing node and relation embeddings @@ -209,10 +163,7 @@ guy_ritchie = node_embeddings[nodes["Guy Ritchie"]] profession = relation_embeddings[edges["/people/person/profession"]] ``` -Remember, the DistMult algorithm models connections as a bilinear function of a (head, relation, tail) triplet, so we -can express our question as: \<Guy Ritchie, profession, ?>. The model will answer with whichever node maximizes this -expression. That is, it will find the tail entity (the profession of Guy Ritchie) that results in the highest score when -plugged into the bilinear function. 
+Remember, the DistMult algorithm models connections as a bilinear function of a (head, relation, tail) triplet, so we can express our question as: <Guy Ritchie, profession, ?>. The model will answer with whichever node maximizes this expression. That is, it will find the tail entity (the profession of Guy Ritchie) that results in the highest score when plugged into the bilinear function. ```python # Creating embedding for the query based on the chosen relation and entity @@ -233,83 +184,57 @@ top_5_scores = scores[sorted_indices] ('artist', 2.522)] ``` -Impressively, the model **correctly interprets and infers information that isn't explicitly included in the graph**, and -provides the right answer to our question. Our model aptly demonstrates KGE's ability to make up for graph -incompleteness. +Impressively, the model **correctly interprets and infers information that isn't explicitly included in the graph**, and provides the right answer to our question. Our model aptly demonstrates KGE's ability to make up for graph incompleteness. -Furthermore, the fact that the top five relevant entities identified by the model are all professions suggests that the -model has successfully learned and understood the concept of a "profession" - that is, the model has **discerned the -broader context and implications** of "profession," rather than just recognizing the term itself. +Furthermore, the fact that the top five relevant entities identified by the model are all professions suggests that the model has successfully learned and understood the concept of a "profession" - that is, the model has **discerned the broader context and implications** of "profession," rather than just recognizing the term itself. 
-Moreover, these five professions are all closely related to the film industry, suggesting that the model has _not only_ -understood the concept of a profession but _also_ narrowed this category to film industry professions specifically; that -is, KGE has managed to **grasp the semantic meaning** of the combination of the two query terms: the head entity (Guy -Ritchie) and the relation entity (profession), and therefore was able to link the general concept of a profession to the -specific context of the film industry, a testament to its ability to capture and interpret semantic meaning. +Moreover, these five professions are all closely related to the film industry, suggesting that the model has _not only_ understood the concept of a profession but _also_ narrowed this category to film industry professions specifically; that is, KGE has managed to **grasp the semantic meaning** of the combination of the two query terms: the head entity (Guy Ritchie) and the relation entity (profession), and therefore was able to link the general concept of a profession to the specific context of the film industry, a testament to its ability to capture and interpret semantic meaning. -In sum, the model's performance in this scenario demonstrates its potential for **understanding concepts**, -**interpreting context**, and **extracting semantic meaning**. +In sum, the model's performance in this scenario demonstrates its potential for **understanding concepts**, **interpreting context**, and **extracting semantic meaning**. + +Here is the [complete code for this demo](https://github.com/superlinked/VectorHub/blob/main/docs/assets/use_cases/knowledge_graph_embedding/kge_demo.ipynb). -Here is the -[complete code for this demo](https://github.com/superlinked/VectorHub/blob/main/docs/assets/use_cases/knowledge_graph_embedding/kge_demo.ipynb). 
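The tail-ranking step from this section can be sketched without a trained model. In the minimal example below, we assume the query vector is the element-wise product of the head and relation embeddings, which is algebraically equivalent to the bilinear score. Every candidate tail is then scored by its dot product with that query, and the highest scores come first. The helper name `rank_tails`, the entity names, and the 3-dimensional vectors are invented for illustration; they are not embeddings or code from the demo.

```python
def rank_tails(head, relation, candidates, k=3):
    """Return the k candidate tails with the highest DistMult score, best first."""
    # Query vector: element-wise product of head and relation embeddings.
    query = [h * r for h, r in zip(head, relation)]
    scored = [
        (name, sum(q * t for q, t in zip(query, vec)))
        for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, chosen by hand for this example.
head = [1.0, 0.0, 1.0]
relation = [1.0, 1.0, 2.0]
candidates = {
    "film director": [1.0, 0.0, 1.0],
    "screenwriter": [0.5, 1.0, 1.0],
    "Burgundy": [-1.0, 1.0, -1.0],
}

top = rank_tails(head, relation, candidates, k=2)
print(top)
```

With real embeddings, `candidates` would hold one vector per node in the graph, and the same sort would yield the top-5 list shown earlier.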
## Comparing KGE with LLM performance on a large Knowledge Graph -Next, let's compare the performance of KGE and LLMs on the ogbl-wikikg2 dataset, drawn from Wikidata. This dataset -includes 2.5 million unique entities, 535 types of relations, and 17.1 million fact triplets. We'll evaluate their -performance using hit rates (ratio of correct answers), following the guidelines provided in -[Stanford's Open Graph Benchmark](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2). +Next, let's compare the performance of KGE and LLMs on the ogbl-wikikg2 dataset, drawn from Wikidata. This dataset includes 2.5 million unique entities, 535 types of relations, and 17.1 million fact triplets. We'll evaluate their performance using hit rates (ratio of correct answers), following the guidelines provided in [Stanford's Open Graph Benchmark](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2). -First, we create textual representations for each node within the graph by crafting sentences that describe their -connections, like this: "\[node\] \[relation1\] \[neighbor1\], \[neighbor2\]. \[node\] \[relation2\] \[neighbor3\], -\[neighbor4\]. ..." +First, we create textual representations for each node within the graph by crafting sentences that describe their connections, like this: "[node] [relation1] [neighbor1], [neighbor2]. [node] [relation2] [neighbor3], [neighbor4]. ..." -We then feed these textual representations into a LLM – specifically, the **BAAI/bge-base-en-v1.5** model available on -[HuggingFace](https://huggingface.co/BAAI/bge-base-en-v1.5). The embeddings that result from this process serve as our -node embeddings. +We then feed these textual representations into a LLM – specifically, the **BAAI/bge-base-en-v1.5** model available on [HuggingFace](https://huggingface.co/BAAI/bge-base-en-v1.5). The embeddings that result from this process serve as our node embeddings. 
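The textual representation described above can be built in a few lines of Python. This is a minimal sketch under two assumptions that the text does not spell out: only outgoing edges of a node are described, and relations appear in the order they are first seen. The function name `describe_node` and the sample triples are hypothetical.

```python
from collections import defaultdict


def describe_node(node, triples):
    """Build "[node] [relation] [neighbor1], [neighbor2]. ..." from (head, relation, tail) triples."""
    neighbors_by_relation = defaultdict(list)
    for head, relation, tail in triples:
        if head == node:  # assumption: only outgoing edges are described
            neighbors_by_relation[relation].append(tail)
    sentences = [
        f"{node} {relation} {', '.join(tails)}."
        for relation, tails in neighbors_by_relation.items()
    ]
    return " ".join(sentences)


# Hypothetical triples for illustration only.
triples = [
    ("Guy Ritchie", "profession", "film director"),
    ("Guy Ritchie", "profession", "screenwriter"),
    ("Guy Ritchie", "nationality", "United Kingdom"),
    ("France", "contains", "Burgundy"),
]

text = describe_node("Guy Ritchie", triples)
print(text)
```

The resulting string is what would be passed to the embedding model; for a node with many neighbors it grows quickly.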
-For queries, we take a similar textual representation approach, creating descriptions of the query but omitting the -specific entity in question. With these representations in hand, we utilize dot product similarity to find and rank -relevant answers. +For queries, we take a similar textual representation approach, creating descriptions of the query but omitting the specific entity in question. With these representations in hand, we utilize dot product similarity to find and rank relevant answers. For the KGE algorithm, we employ DistMult with a 250-dimensional embedding space. + ### Results You can see the results on the Open Graph Benchmark query set in the table below: -| metric/model | Random | LLM | DistMult| | --- | --- | --- | --- | | HitRate@1 | 0.001 | 0.0055 | **0.065** | | -HitRate@3 | 0.003 | 0.0154 | **0.150** | | HitRate@10 | 0.010 | 0.0436 | **0.307** | +| metric/model | Random | LLM | DistMult| +| --- | --- | --- | --- | +| HitRate@1 | 0.001 | 0.0055 | **0.065** | +| HitRate@3 | 0.003 | 0.0154 | **0.150** | +| HitRate@10 | 0.010 | 0.0436 | **0.307** | -While the LLM performs three times better than when the nodes are randomly ordered, it's KGE that really stands out as -the superior option, with **hit rates almost ten times higher than the LLM**. In addition, DistMult finds the **correct -answer on its first try more frequently** than LLM does in ten attempts. DisMult's performance is all the more -remarkable when considering that it outperforms LLM even though we used lower-dimensional (250) embeddings with DisMult -than the LLM, which outputs 768-dimensional embeddings. +While the LLM performs three times better than when the nodes are randomly ordered, it's KGE that really stands out as the superior option, with **hit rates almost ten times higher than the LLM**. In addition, DistMult finds the **correct answer on its first try more frequently** than LLM does in ten attempts. 
DisMult's performance is all the more remarkable when considering that it outperforms LLM even though we used lower-dimensional (250) embeddings with DisMult than the LLM, which outputs 768-dimensional embeddings. -Our results unequivocally demonstrate **KGE's clear advantage over LLMs for tasks where relational information is -important**. +Our results unequivocally demonstrate **KGE's clear advantage over LLMs for tasks where relational information is important**. ### DisMult limitations While DistMult stands out as a simple but powerful tool for embedding KGs, it does have limitations. It struggles with: +1. cold starts: When the graph evolves or changes over time, DistMult can't represent new nodes introduced later on, or can't model the effect of new connections introduced to the graph. +2. complex questions: While it excels in straightforward question-answering scenarios, the DistMult model falls short when faced with complex questions that demand a deeper comprehension, extending beyond immediate connections. Other KGE algorithms better suit such tasks. -1. cold starts: When the graph evolves or changes over time, DistMult can't represent new nodes introduced later on, or - can't model the effect of new connections introduced to the graph. -1. complex questions: While it excels in straightforward question-answering scenarios, the DistMult model falls short - when faced with complex questions that demand a deeper comprehension, extending beyond immediate connections. Other - KGE algorithms better suit such tasks. ## KGEs for relational data -Because LLMs have trouble encoding intricate relation structures, their performance suffers when dealing with relational -information. Creating a string representation of a node's connections tends to overload the LLM's input. Instead, their -strength lies in processing more focused and specific textual information; LLMs are typically not trained to handle -broad and diverse information within a single context. 
KGE algorithms, on the other hand, are specifically designed to -handle relational data, and can be further customized to fit the specific needs of a wide variety of use cases. +Because LLMs have trouble encoding intricate relation structures, their performance suffers when dealing with relational information. Creating a string representation of a node's connections tends to overload the LLM's input. Instead, their strength lies in processing more focused and specific textual information; LLMs are typically not trained to handle broad and diverse information within a single context. KGE algorithms, on the other hand, are specifically designed to handle relational data, and can be further customized to fit the specific needs of a wide variety of use cases. -______________________________________________________________________ +--- ## Contributors - [Richárd Kiss, author](https://www.linkedin.com/in/richard-kiss-3209a1186/) diff --git a/docs/use_cases/knowledge_graphs.md b/docs/use_cases/knowledge_graphs.md index c2cd31be0..ece299746 100644 --- a/docs/use_cases/knowledge_graphs.md +++ b/docs/use_cases/knowledge_graphs.md @@ -1,220 +1,147 @@ # Improving RAG performance with Knowledge Graphs -We look at the limitations of not just LLMs but also standard RAG solutions to LLM's knowledge and reasoning gaps, and -examine the ways Knowledge Graphs combined with vector embeddings can fill these gaps - through graph embedding -constraints, judicious choice of reasoning techniques, careful retrieval design, collaborative filtering, and flywheel -learning. +We look at the limitations of not just LLMs but also standard RAG solutions to LLM's knowledge and reasoning gaps, and examine the ways Knowledge Graphs combined with vector embeddings can fill these gaps - through graph embedding constraints, judicious choice of reasoning techniques, careful retrieval design, collaborative filtering, and flywheel learning. 
## Introduction -Large Language Models (LLMs) mark a watershed moment in natural language processing, creating new abilities in -conversational AI, creative writing, and a broad range of other applications. But they have limitations. While LLMs can -generate remarkably fluent and coherent text from nothing more than a short prompt, LLM knowledge is not real-world -data, but rather restricted to patterns learned from training data. In addition, LLMs can't do logical inference or -synthesize facts from multiple sources; as queries become more complex and open-ended, LLM responses become -contradictory or nonsense. +Large Language Models (LLMs) mark a watershed moment in natural language processing, creating new abilities in conversational AI, creative writing, and a broad range of other applications. But they have limitations. While LLMs can generate remarkably fluent and coherent text from nothing more than a short prompt, LLM knowledge is not real-world data, but rather restricted to patterns learned from training data. In addition, LLMs can't do logical inference or synthesize facts from multiple sources; as queries become more complex and open-ended, LLM responses become contradictory or nonsense. -Retrieval Augmented Generation (RAG) systems have filled some of the LLM gaps by surfacing external source data using -semantic similarity search on vector embeddings. Still, because RAG systems don't have access to network structure data -(the interconnections between contextual facts), they struggle to achieve true relevance, aggregate facts, and perform -chains of reasoning. +Retrieval Augmented Generation (RAG) systems have filled some of the LLM gaps by surfacing external source data using semantic similarity search on vector embeddings. Still, because RAG systems don't have access to network structure data (the interconnections between contextual facts), they struggle to achieve true relevance, aggregate facts, and perform chains of reasoning. 
-Knowledge Graphs (KGs), by encoding real-world entities and their connections, overcome the above deficiencies of pure -vector search. KGs enable complex, multi-hop reasoning, across diverse data sources, thereby representing a more -comprehensive understanding of the knowledge space. +Knowledge Graphs (KGs), by encoding real-world entities and their connections, overcome the above deficiencies of pure vector search. KGs enable complex, multi-hop reasoning, across diverse data sources, thereby representing a more comprehensive understanding of the knowledge space. -Let's take a closer look at how we can combine vector embeddings and KGs, fusing surface-level semantics, structured -knowledge, and logic to unlock new levels of reasoning, accuracy, and explanatory ability in LLMs. +Let's take a closer look at how we can combine vector embeddings and KGs, fusing surface-level semantics, structured knowledge, and logic to unlock new levels of reasoning, accuracy, and explanatory ability in LLMs. -We start by exploring the inherent weaknesses of relying on vector search in isolation, and then show how to combine -KGs and embeddings complementarily, to overcome the limitations of each. +We start by exploring the inherent weaknesses of relying on vector search in isolation, and then show how to combine KGs and embeddings complementarily, to overcome the limitations of each. ## RAG Vector Search: process and limits -Most RAG systems employ vector search on a document collection to surface relevant context for the LLM. This process has -**several key steps**: +Most RAG systems employ vector search on a document collection to surface relevant context for the LLM. This process has **several key steps**: -1. **Text Encoding**: Using embedding models, like BERT, the RAG system encodes and condenses passages of text from the - corpus as dense vector representations, capturing semantic meaning. -1. 
**Indexing**: To enable rapid similarity search, these passage vectors are indexed within a high-dimensional vector - space. Popular methods include ANNOY, Faiss, and Pinecone. -1. **Query Encoding**: An incoming user query is encoded as a vector representation, using the same embedding model. -1. **Similarity Retrieval**: Using distance metrics like cosine similarity, the system runs a search over the indexed - passages to find closest neighbors to the query vector. -1. **Passage Return**: The system returns the most similar passage vectors, and extracts the corresponding original text - to provide context for the LLM. +1. **Text Encoding**: Using embedding models, like BERT, the RAG system encodes and condenses passages of text from the corpus as dense vector representations, capturing semantic meaning. +2. **Indexing**: To enable rapid similarity search, these passage vectors are indexed within a high-dimensional vector space. Popular methods include ANNOY, Faiss, and Pinecone. +3. **Query Encoding**: An incoming user query is encoded as a vector representation, using the same embedding model. +4. **Similarity Retrieval**: Using distance metrics like cosine similarity, the system runs a search over the indexed passages to find closest neighbors to the query vector. +5. **Passage Return**: The system returns the most similar passage vectors, and extracts the corresponding original text to provide context for the LLM. ![RAG](../assets/use_cases/knowledge_graphs/RAG.png) - + This RAG Vector Search pipeline has **several key limitations**: -- Passage vectors can't represent inferential connections (i.e., context), and therefore usually fail to encode the - query's full semantic intent. -- Key relevant details embedded in passages (across sentences) are lost in the process of condensing entire passages - into single vectors. 
+- Passage vectors can't represent inferential connections (i.e., context), and therefore usually fail to encode the query's full semantic intent. +- Key relevant details embedded in passages (across sentences) are lost in the process of condensing entire passages into single vectors. - Each passage is matched independently, so facts can't be connected or aggregated. -- The ranking and matching process for determining relevancy remains opaque; we can't see why the system prefers certain - passages to others. +- The ranking and matching process for determining relevancy remains opaque; we can't see why the system prefers certain passages to others. - There's no encoding of relationships, structure, rules, or any other connections between content. -RAG, because it focuses only on semantic similarity, is unable to reason across content, so it fails to really -understand not only queries but also the data RAG retrieves. The more complex the query, the poorer RAG's results -become. +RAG, because it focuses only on semantic similarity, is unable to reason across content, so it fails to really understand not only queries but also the data RAG retrieves. The more complex the query, the poorer RAG's results become. ## Incorporating Knowledge Graphs -Knowledge Graphs, on the other hand, represent information in an interconnected network of entities and relationships, -enabling more complex reasoning across content. +Knowledge Graphs, on the other hand, represent information in an interconnected network of entities and relationships, enabling more complex reasoning across content. How do KGs augment retrieval? -1. **Explicit Facts** — KGs preserve key details by capturing facts directly as nodes and edges, instead of condensed - into opaque vectors. -1. **Contextual Details** — KG entities possess rich attributes like descriptions, aliases, and metadata that provide - crucial context. -1. 
**Network Structure** — KGs capture real-world relationships - rules, hierarchies, timelines, and other connections - - between entities. -1. **Multi-Hop Reasoning** — Queries can traverse relationships, and infer across multiple steps, to connect and derive - facts from diverse sources. -1. **Joint Reasoning** — Entity Resolution can identify and link references that pertain to the same real-world object, - enabling collective analysis. -1. **Explainable Relevance** — Graph topology lets us transparently analyze the connections and relationships that - determine why certain facts are retrieved as relevant. -1. **Personalization** — KGs capture and tailor query results according to user attributes, context, and historical - interactions. +1. **Explicit Facts** — KGs preserve key details by capturing facts directly as nodes and edges, instead of condensed into opaque vectors. +2. **Contextual Details** — KG entities possess rich attributes like descriptions, aliases, and metadata that provide crucial context. +3. **Network Structure** — KGs capture real-world relationships - rules, hierarchies, timelines, and other connections - between entities. +4. **Multi-Hop Reasoning** — Queries can traverse relationships, and infer across multiple steps, to connect and derive facts from diverse sources. +5. **Joint Reasoning** — Entity Resolution can identify and link references that pertain to the same real-world object, enabling collective analysis. +6. **Explainable Relevance** — Graph topology lets us transparently analyze the connections and relationships that determine why certain facts are retrieved as relevant. +7. **Personalization** — KGs capture and tailor query results according to user attributes, context, and historical interactions. 
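To illustrate the multi-hop reasoning and explainable relevance points from the list above, here is a minimal sketch of a bounded breadth-first traversal over a toy graph. The adjacency-list layout, the function name `multi_hop_paths`, and the entities are assumptions for illustration, not part of any particular KG library.

```python
from collections import deque


def multi_hop_paths(graph, start, max_hops):
    """Breadth-first traversal returning every relation path of up to max_hops edges from start."""
    paths = []
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) == max_hops:
            continue  # hop budget used up; do not expand further
        for relation, neighbor in graph.get(node, []):
            new_path = path + [(node, relation, neighbor)]
            paths.append(new_path)
            queue.append((neighbor, new_path))
    return paths


# Toy knowledge graph: node -> list of (relation, neighbor) pairs.
graph = {
    "Alice": [("works_at", "Acme")],
    "Acme": [("located_in", "Berlin")],
}

paths = multi_hop_paths(graph, "Alice", max_hops=2)
two_hop = [p for p in paths if len(p) == 2]
print(two_hop)
```

Each returned path is a list of explicit (head, relation, tail) facts, so the answer to "where is Alice's employer located?" comes with the chain of edges that justifies it, something a single passage vector cannot provide.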
![RAG + Knowledge Graph](../assets/use_cases/knowledge_graphs/rag_kg.png) -In sum, whereas RAG performs matching on disconnected nodes, KGs enable graph traversal search and retrieval of -interconnected contextual, search for query-relevant facts, make the ranking process transparent, and encode structured -facts, relationships, and context to enable complex, precise, multi-step reasoning. As a result, compared to pure vector -search, KGs can improve relevance and explanatory power. +In sum, whereas RAG performs matching on disconnected nodes, KGs enable graph traversal search and retrieval of interconnected contextual, search for query-relevant facts, make the ranking process transparent, and encode structured facts, relationships, and context to enable complex, precise, multi-step reasoning. As a result, compared to pure vector search, KGs can improve relevance and explanatory power. But KG retrieval can be optimized further by applying certain constraints. ## Using Constraints to Optimize Embeddings from Knowledge Graphs -Knowledge Graphs represent entities and relationships that can be vector embedded to enable mathematical operations. -These representations and retrieval results can be improved further by adding some **simple but universal constraints**: - -- **Non-Negativity Constraints** — Restricting entity embeddings to values between 0 and 1 ensures focus on entities' - positive properties only, and thereby improves interpretability. -- **Entailment Constraints** — Encoding expected logic rules like symmetry, inversion, and composition directly as - constraints on relation embeddings ensures incorporation of those patterns into the representations. -- **Confidence Modeling** — Soft constraints using slack variables can encode different confidence levels of logic rules - depending on evidence. 
-- **Regularization** — Introduces constraints that impose useful inductive biases to help pattern learning, without - making optimization significantly more complex; only a projection step is added. +Knowledge Graphs represent entities and relationships that can be vector embedded to enable mathematical operations. These representations and retrieval results can be improved further by adding some **simple but universal constraints**: -In addition to **improving interpretability**, **ensuring expected logic rules**, **permitting evidence-based rule -confidence levels**, and **improving pattern learning**, constraints can _also_: +- **Non-Negativity Constraints** — Restricting entity embeddings to values between 0 and 1 ensures focus on entities' positive properties only, and thereby improves interpretability. +- **Entailment Constraints** — Encoding expected logic rules like symmetry, inversion, and composition directly as constraints on relation embeddings ensures incorporation of those patterns into the representations. +- **Confidence Modeling** — Soft constraints using slack variables can encode different confidence levels of logic rules depending on evidence. +- **Regularization** — Introduces constraints that impose useful inductive biases to help pattern learning, without making optimization significantly more complex; only a projection step is added. -- **improve explainability** of the reasoning process; structured constraints make visible the patterns learned by the - model; and -- **improve accuracy** of unseen queries; constraints improve generalization by restricting the hypothesis space to - compliant representations. 
+In addition to **improving interpretability**, **ensuring expected logic rules**, **permitting evidence-based rule confidence levels**, and **improving pattern learning**, constraints can _also_: +- **improve explainability** of the reasoning process; structured constraints make visible the patterns learned by the model; and +- **improve accuracy** of unseen queries; constraints improve generalization by restricting the hypothesis space to compliant representations. -In short, applying some simple constraints can augment Knowledge Graph embeddings to produce more optimized, -explainable, and logically compliant representations, with inductive biases that mimic real-world structures and rules, -resulting in more accurate and interpretable reasoning, without much additional complexity. +In short, applying some simple constraints can augment Knowledge Graph embeddings to produce more optimized, explainable, and logically compliant representations, with inductive biases that mimic real-world structures and rules, resulting in more accurate and interpretable reasoning, without much additional complexity. ## Choosing a reasoning framework that matches your use case -Knowledge Graphs require reasoning to derive new facts, answer queries, and make predictions. But there are a diverse -range of reasoning techniques, whose respective strengths can be combined to fit the requirements of specific use cases. +Knowledge Graphs require reasoning to derive new facts, answer queries, and make predictions. But there are a diverse range of reasoning techniques, whose respective strengths can be combined to fit the requirements of specific use cases. 
-| Reasoning framework | Method | Pros | Cons | | ---- | ---- | ---- | ---- | | **Logical Rules** | Express knowledge as -logical axioms and ontologies | Sound and complete reasoning through theorem proving | Limited uncertainty handling | | -**Graph Embeddings** | Embed KG structure for vector space operations | Handle uncertainty | Lack expressivity | | -**Neural Provers** | Differentiable theorem proving modules combined with vector lookups | Adaptive | Opaque reasoning | -| **Rule Learners** | Induce rules by statistical analysis of graph structure and data | Automate rule creation | -Uncertain quality | | **Hybrid Pipeline** | Logical rules encode unambiguous constraints | Embeddings provide vector -space operations. Neural provers fuse benefits through joint training. | | | **Explainable Modeling** | Use case-based, -fuzzy, or probabilistic logic to add transparency | Can express degrees uncertainty and confidence in rules | | | -**Iterative Enrichment** | Expand knowledge by materializing inferred facts and learned rules back into the graph -| Provides a feedback loop | | +| Reasoning framework | Method | Pros | Cons | +| ---- | ---- | ---- | ---- | +| **Logical Rules** | Express knowledge as logical axioms and ontologies | Sound and complete reasoning through theorem proving | Limited uncertainty handling | +| **Graph Embeddings** | Embed KG structure for vector space operations | Handle uncertainty | Lack expressivity | +| **Neural Provers** | Differentiable theorem proving modules combined with vector lookups | Adaptive | Opaque reasoning | +| **Rule Learners** | Induce rules by statistical analysis of graph structure and data | Automate rule creation | Uncertain quality | +| **Hybrid Pipeline** | Logical rules encode unambiguous constraints | Embeddings provide vector space operations. Neural provers fuse benefits through joint training. 
| | +| **Explainable Modeling** | Use case-based, fuzzy, or probabilistic logic to add transparency | Can express degrees uncertainty and confidence in rules | | +| **Iterative Enrichment** | Expand knowledge by materializing inferred facts and learned rules back into the graph | Provides a feedback loop | | -The key to creating a suitable pipeline is identifying the types of reasoning required and mapping them to the right -combination of appropriate techniques. +The key to creating a suitable pipeline is identifying the types of reasoning required and mapping them to the right combination of appropriate techniques. ## Preserving Quality Information Flow to the LLM -Retrieving knowledge Graph facts for the LLM introduces information bottlenecks. Careful design can mitigate these -bottlenecks by ensuring relevance. Here are some methods for doing that: +Retrieving knowledge Graph facts for the LLM introduces information bottlenecks. Careful design can mitigate these bottlenecks by ensuring relevance. Here are some methods for doing that: -- **Chunking** — Splitting content into small chunks improves isolation. But it loses surrounding context, hindering - reasoning across chunks. -- **Summarization** — Generating summaries of chunks condenses key details, highlighting their significance. This makes - context more concise. +- **Chunking** — Splitting content into small chunks improves isolation. But it loses surrounding context, hindering reasoning across chunks. +- **Summarization** — Generating summaries of chunks condenses key details, highlighting their significance. This makes context more concise. - **Metadata** — Attaching summaries, titles, tags, etc. preserves the source content's context. -- **Query Rewriting** — Rewriting a more detailed version of the original query better tailors retrieval to the LLM’s - needs. +- **Query Rewriting** — Rewriting a more detailed version of the original query better tailors retrieval to the LLM’s needs. 
- **Relationship Modeling** — KG traversals preserve connections between facts, maintaining context. - **Information Ordering** — Ordering facts chronologically or by relevance optimizes information structure. - **Explicit Statements** — Converting implicit knowledge into explicit facts facilitates reasoning. -To preserve quality information flow to the LLM to maximize its reasoning ability, you need to strike a balance between -granularity and cohesiveness. KG relationships help contextualize isolated facts. Techniques that optimize the -relevance, structure, explicitness, and context of retrieved knowledge help maximize the LLM's reasoning ability. +To preserve quality information flow to the LLM to maximize its reasoning ability, you need to strike a balance between granularity and cohesiveness. KG relationships help contextualize isolated facts. Techniques that optimize the relevance, structure, explicitness, and context of retrieved knowledge help maximize the LLM's reasoning ability. ## Unlocking Reasoning Capabilities by Combining KGs and Embeddings -**Knowledge Graphs** provide structured representations of entities and relationships. KGs empower complex reasoning -through graph traversals, and handle multi-hop inferences. **Embeddings** encode information in the vector space for -similarity-based operations. Embeddings enable efficient approximate search at scale, and surface latent patterns. - -**Combining KGs and embeddings** permits their respective strengths to overcome each other’s weaknesses, and improve -reasoning capabilities, in the following ways: - -- **Joint Encoding** — Embeddings are generated for both KG entities and KG relationships. This distills statistical - patterns in the embeddings. -- **Neural Networks** — Graph neural networks (GNNs) operate on the graph structure and embedded elements through - differentiable message passing. This fuses the benefits of both KGs and embeddings. 
-- **Reasoning Flow** — KG traversals gather structured knowledge. Then, embeddings focus the search and retrieve related - content at scale. -- **Explainability** — Explicit KG relationships help make the reasoning process transparent. Embeddings lend - interpretability. +**Knowledge Graphs** provide structured representations of entities and relationships. KGs empower complex reasoning through graph traversals, and handle multi-hop inferences. **Embeddings** encode information in the vector space for similarity-based operations. Embeddings enable efficient approximate search at scale, and surface latent patterns. + +**Combining KGs and embeddings** permits their respective strengths to overcome each other’s weaknesses, and improve reasoning capabilities, in the following ways: + +- **Joint Encoding** — Embeddings are generated for both KG entities and KG relationships. This distills statistical patterns in the embeddings. +- **Neural Networks** — Graph neural networks (GNNs) operate on the graph structure and embedded elements through differentiable message passing. This fuses the benefits of both KGs and embeddings. +- **Reasoning Flow** — KG traversals gather structured knowledge. Then, embeddings focus the search and retrieve related content at scale. +- **Explainability** — Explicit KG relationships help make the reasoning process transparent. Embeddings lend interpretability. - **Iterative Improvement** — Inferred knowledge can expand the KG. GNNs provide continuous representation learning. -While KGs enable structured knowledge representation and reasoning, embeddings provide the pattern recognition -capability and scalability of neural networks, augmenting reasoning capabilities in the kinds of language AI that -require both statistical learning and symbolic logic. 
+While KGs enable structured knowledge representation and reasoning, embeddings provide the pattern recognition capability and scalability of neural networks, augmenting reasoning capabilities in the kinds of language AI that require both statistical learning and symbolic logic. ## Improving Search with Collaborative Filtering -You can use collaborative filtering's ability to leverage connections between entities to enhance search, by taking the -following steps: +You can use collaborative filtering's ability to leverage connections between entities to enhance search, by taking the following steps: 1. **Knowledge Graph** — Construct a KG with nodes representing entities and edges representing relationships. -1. **Node Embedding** — Generate an embedding vector for certain key node properties like title, description, and so on. -1. **Vector Index** — Build a vector similarity index on the node embeddings. -1. **Similarity Search** — For a given search query, find the nodes with the most similar embeddings. -1. **Collaborative Adjustment** — Propagate and adjust similarity scores based on node connections, using algorithms - like PageRank. -1. **Edge Weighting** — Weight adjustments on the basis of edge types, strengths, confidence levels, etc. -1. **Score Normalization** — Normalize adjusted scores to preserve relative rankings. -1. **Result Reranking** — Reorder initial search results on the basis of adjusted collaborative scores. -1. **User Context** — Further adapt search results based on user profile, history, and preferences. +2. **Node Embedding** — Generate an embedding vector for certain key node properties like title, description, and so on. +3. **Vector Index** — Build a vector similarity index on the node embeddings. +4. **Similarity Search** — For a given search query, find the nodes with the most similar embeddings. +5. **Collaborative Adjustment** — Propagate and adjust similarity scores based on node connections, using algorithms like PageRank. +6. 
**Edge Weighting** — Weight adjustments on the basis of edge types, strengths, confidence levels, etc. +7. **Score Normalization** — Normalize adjusted scores to preserve relative rankings. +8. **Result Reranking** — Reorder initial search results on the basis of adjusted collaborative scores. +9. **User Context** — Further adapt search results based on user profile, history, and preferences. ## Fueling Knowledge Graphs with Flywheel Learning -Knowledge Graphs unlock new reasoning capabilities for language models by providing structured real-world knowledge. But -KGs aren't perfect. They contain knowledge gaps, and have to update to remain current. Flywheel Learning can help -remediate these problems, improving KG quality by continuously analyzing system interactions and ingesting new data. +Knowledge Graphs unlock new reasoning capabilities for language models by providing structured real-world knowledge. But KGs aren't perfect. They contain knowledge gaps, and have to update to remain current. Flywheel Learning can help remediate these problems, improving KG quality by continuously analyzing system interactions and ingesting new data. ### Building the Knowledge Graph Flywheel Building an effective KG flywheel requires: -1. **Instrumentation** — logging all system queries, responses, scores, user actions, and so on, to provide visibility - into how the KG is being used. -1. **Analysis** — aggregating, clustering, and analyzing usage data to surface poor responses and issues, and identify - patterns indicating knowledge gaps. -1. **Curation** — manually reviewing problematic responses and tracing issues back to missing or incorrect facts in the - graph. -1. **Remediation** — directly modifying the graph to add missing facts, improve structure, increase clarity, etc., and - fixing the underlying data issues. -1. **Iteration** — continuously looping through the above steps. +1. 
**Instrumentation** — logging all system queries, responses, scores, user actions, and so on, to provide visibility into how the KG is being used. +2. **Analysis** — aggregating, clustering, and analyzing usage data to surface poor responses and issues, and identify patterns indicating knowledge gaps. +3. **Curation** — manually reviewing problematic responses and tracing issues back to missing or incorrect facts in the graph. +4. **Remediation** — directly modifying the graph to add missing facts, improve structure, increase clarity, etc., and fixing the underlying data issues. +5. **Iteration** — continuously looping through the above steps. Each iteration through the loop further enhances the Knowledge Graph. @@ -227,51 +154,27 @@ Flywheels can also handle high-volume ingestion of streamed live data. ### Active Learning -Streaming data pipelines, while continuously updating the KG, will not necessarily fill all knowledge gaps. To handle -these, flywheel learning also: - +Streaming data pipelines, while continuously updating the KG, will not necessarily fill all knowledge gaps. To handle these, flywheel learning also: - generates queries to identify and fill critical knowledge gaps; and - discovers holes in the graph, formulates questions, retrieves missing facts, and adds them. ### The Flywheel Effect -Each loop of the flywheel analyzes current usage patterns and remediates more data issues, incrementally improving the -quality of the Knowledge Graph. The flywheel process thus enables the KG and language model to co-evolve and improve in -accordance with feedback from real-world system operation. Flywheel learning provides a scaffolding for continuous, -automated improvement of the Knowledge Graph, tailoring it to fit the language model's needs. This powers the accuracy, -relevance, and adaptability of the language model. 
+Each loop of the flywheel analyzes current usage patterns and remediates more data issues, incrementally improving the quality of the Knowledge Graph. The flywheel process thus enables the KG and language model to co-evolve and improve in accordance with feedback from real-world system operation. Flywheel learning provides a scaffolding for continuous, automated improvement of the Knowledge Graph, tailoring it to fit the language model's needs. This powers the accuracy, relevance, and adaptability of the language model. ## Conclusion -In sum, to achieve human-level performance, language AI must be augmented by retrieving external knowledge and -reasoning. Where LLMs and RAG struggle with representing the context and relationships between real-world entities, -Knowledge Graphs excel. The Knowledge Graph's structured representations permit complex, multi-hop, logical reasoning -over interconnected facts. - -Still, while KGs provide previously missing information to language models, KGs can't surface latent patterns the way -that language models working on vector embeddings can. Together, KGs and embeddings provide a highly productive blend of -knowledge representation, logical reasoning, and statistical learning. And embedding of KGs can be optimized by applying -some simple constraints. +In sum, to achieve human-level performance, language AI must be augmented by retrieving external knowledge and reasoning. Where LLMs and RAG struggle with representing the context and relationships between real-world entities, Knowledge Graphs excel. The Knowledge Graph's structured representations permit complex, multi-hop, logical reasoning over interconnected facts. -Finally, KG's aren't perfect; they have knowledge gaps and need updating. Flywheel Learning can make up for KG knowledge -gaps through live system analysis, and handle continuous, large volume data updates to keep the KG current. 
Flywheel -learning thus enables the co-evolution of KGs and LLMs to achieve better reasoning, accuracy, and relevance in language -AI applications. +Still, while KGs provide previously missing information to language models, KGs can't surface latent patterns the way that language models working on vector embeddings can. Together, KGs and embeddings provide a highly productive blend of knowledge representation, logical reasoning, and statistical learning. And embedding of KGs can be optimized by applying some simple constraints. -The partnership of KGs and embeddings provides the building blocks moving language AI to true comprehension — -conversation agents that understand context and history, recommendation engines that discern subtle preferences, and -search systems that synthesize accurate answers by connecting facts. As we continue to improve our solutions to the -challenges of constructing high-quality Knowledge Graphs, benchmarking, noise handling, and more, a key role will no -doubt be played by hybrid techniques combining symbolic and neural approaches. +Finally, KG's aren't perfect; they have knowledge gaps and need updating. Flywheel Learning can make up for KG knowledge gaps through live system analysis, and handle continuous, large volume data updates to keep the KG current. Flywheel learning thus enables the co-evolution of KGs and LLMs to achieve better reasoning, accuracy, and relevance in language AI applications. -______________________________________________________________________ +The partnership of KGs and embeddings provides the building blocks moving language AI to true comprehension — conversation agents that understand context and history, recommendation engines that discern subtle preferences, and search systems that synthesize accurate answers by connecting facts. 
As we continue to improve our solutions to the challenges of constructing high-quality Knowledge Graphs, benchmarking, noise handling, and more, a key role will no doubt be played by hybrid techniques combining symbolic and neural approaches. +--- ## Contributors - -The author and editor have adapted this article with extensive content and format revisions from the author's previous -article -[Embeddings + Knowledge Graphs](https://towardsdatascience.com/embeddings-knowledge-graphs-the-ultimate-tools-for-rag-systems-cbbcca29f0fd), -published in Towards Data Science, Nov 14, 2023. +The author and editor have adapted this article with extensive content and format revisions from the author's previous article [Embeddings + Knowledge Graphs](https://towardsdatascience.com/embeddings-knowledge-graphs-the-ultimate-tools-for-rag-systems-cbbcca29f0fd), published in Towards Data Science, Nov 14, 2023. - [Anthony Alcaraz](https://www.linkedin.com/in/anthony-alcaraz-b80763155/) - [Robert Turner, editor](https://robertturner.co/copyedit) diff --git a/docs/use_cases/multi_agent_rag.md b/docs/use_cases/multi_agent_rag.md index eb2b44288..b75b88f27 100644 --- a/docs/use_cases/multi_agent_rag.md +++ b/docs/use_cases/multi_agent_rag.md @@ -7,92 +7,54 @@ ## Multi-Agent RAG -Retrieval-augmented generation (RAG) has shown great promise for powering conversational AI. However, in most RAG -systems today, a single model handles the full workflow of query analysis, passage retrieval, contextual ranking, -summarization, and prompt augmentation. This results in suboptimal relevance, latency, and coherence. A multi-agent -architecture that factors responsibilities across specialized retrieval, ranking, reading, and orchestration agents, -operating asynchronously, allows each agent to focus on its specialized capability using custom models and data. -Multi-agent RAG is thus able to improve relevance, latency, and coherence overall. 
+Retrieval-augmented generation (RAG) has shown great promise for powering conversational AI. However, in most RAG systems today, a single model handles the full workflow of query analysis, passage retrieval, contextual ranking, summarization, and prompt augmentation. This results in suboptimal relevance, latency, and coherence. A multi-agent architecture that factors responsibilities across specialized retrieval, ranking, reading, and orchestration agents, operating asynchronously, allows each agent to focus on its specialized capability using custom models and data. Multi-agent RAG is thus able to improve relevance, latency, and coherence overall. -While multi-agent RAG is not a panacea – for simpler conversational tasks a single RAG agent may suffice – multi-agent -RAG outperforms single agent RAG when your use case requires reasoning over diverse information sources. This article -explores a multi-agent RAG architecture and quantifies its benefits. +While multi-agent RAG is not a panacea – for simpler conversational tasks a single RAG agent may suffice – multi-agent RAG outperforms single agent RAG when your use case requires reasoning over diverse information sources. This article explores a multi-agent RAG architecture and quantifies its benefits. ## RAG Challenges and Opportunities Retrieval augmented generation faces several key challenges that limit its performance in real-world applications. -First, existing retrieval mechanisms struggle to identify the most relevant passages from corpora containing millions of -documents. Simple similarity functions often return superfluous or tangential results. When retrieval fails to return -the most relevant information, it leads to suboptimal prompting. +First, existing retrieval mechanisms struggle to identify the most relevant passages from corpora containing millions of documents. Simple similarity functions often return superfluous or tangential results. 
When retrieval fails to return the most relevant information, it leads to suboptimal prompting. -Second, retrieving supplementary information introduces latency; if the database is large, this latency can be -prohibitive. Searching terabytes of text with complex ranking creates wait times that are too long for consumer -applications. +Second, retrieving supplementary information introduces latency; if the database is large, this latency can be prohibitive. Searching terabytes of text with complex ranking creates wait times that are too long for consumer applications. -In addition, current RAG systems fail to appropriately weight the original prompt and retrieved passages. Without -dynamic contextual weighting, the model can become over-reliant on retrievals (resulting in reduced control or -adaptablity in generating meaningful responses). +In addition, current RAG systems fail to appropriately weight the original prompt and retrieved passages. Without dynamic contextual weighting, the model can become over-reliant on retrievals (resulting in reduced control or adaptablity in generating meaningful responses). ## Multi-agent RAGs address real-world challenges -Specialized agents with divided responsibilities can help address the challenges that plague single-agent architectures, -and unlock RAG's full potential. By factoring RAG into separable subtasks executed concurrently by collaborative and -specialized query understanding, retriever, ranker, reader, and orchestrator agents, multi-agent RAG can mitigate -single-agent RAG's relevance, scalability, and latency limitations. This allows RAG to scale efficiently to enterprise -workloads. +Specialized agents with divided responsibilities can help address the challenges that plague single-agent architectures, and unlock RAG's full potential. 
By factoring RAG into separable subtasks executed concurrently by collaborative and specialized query understanding, retriever, ranker, reader, and orchestrator agents, multi-agent RAG can mitigate single-agent RAG's relevance, scalability, and latency limitations. This allows RAG to scale efficiently to enterprise workloads. Let's break multi-agent RAG into its parts: -First, a query understanding / parsing agent comprehends the query, breaking it down and describing it in different -sub-queries. +First, a query understanding / parsing agent comprehends the query, breaking it down and describing it in different sub-queries. -Then x number of retriever agents, each utilizing optimized vector indices, focus solely on efficient passage retrieval -from the document corpus, based on the sub-queries. These retriever agents employ vector similarity search or knowledge -graph retrieval-based searches to quickly find potentially relevant passages, minimizing latency even when document -corpora are large. +Then x number of retriever agents, each utilizing optimized vector indices, focus solely on efficient passage retrieval from the document corpus, based on the sub-queries. These retriever agents employ vector similarity search or knowledge graph retrieval-based searches to quickly find potentially relevant passages, minimizing latency even when document corpora are large. -The ranker agent evaluates the relevance of the retrieved passages using additional ranking signals like source -credibility, passage specificity, and lexical overlap. This provides a relevance-based filtering step. This agent might -be using ontology, for example, as a way to rerank retrieved information. +The ranker agent evaluates the relevance of the retrieved passages using additional ranking signals like source credibility, passage specificity, and lexical overlap. This provides a relevance-based filtering step. 
This agent might be using ontology, for example, as a way to rerank retrieved information. -The reader agent summarizes lengthy retrieved passages into succinct snippets containing only the most salient -information. This distills the context down to key facts. +The reader agent summarizes lengthy retrieved passages into succinct snippets containing only the most salient information. This distills the context down to key facts. -Finally, the orchestrator agent dynamically adjusts the relevance weighting and integration of the prompt and filtered, -ranked context passages (i.e., prompt hybridization) to maximize coherence in the final augmented prompt. +Finally, the orchestrator agent dynamically adjusts the relevance weighting and integration of the prompt and filtered, ranked context passages (i.e., prompt hybridization) to maximize coherence in the final augmented prompt. ## Benefits of multi-agent RAG architecture -- Agent-specific, focused specialization _improves relevance and quality_. Retriever agents leverage tailored similarity - metrics, rankers weigh signals like source credibility, and readers summarize context. +- Agent-specific, focused specialization _improves relevance and quality_. Retriever agents leverage tailored similarity metrics, rankers weigh signals like source credibility, and readers summarize context. - Asynchronous operation _reduces latency_ by parallelizing retrieval. Slow operations don't block faster ones. -- Salient extraction and abstraction techniques achieve _better summarization_. Reader agents condense complex - information from retrieved passages into concise, coherent, highly informative summaries. -- Prompt hybridization achieves _optimized prompting_. Orchestrator agents balance prompt and ranked context data for - more coherent outcome prompts. -- Flexible, modular architecture _enables easy horizontal scaling_ and optional _incorporation of new data sources_. 
You - can enhance iteratively over time by adding more agents (e.g., a visualizer agent to inspect system behavior), or - substituting alternative implementations of any agent. +- Salient extraction and abstraction techniques achieve _better summarization_. Reader agents condense complex information from retrieved passages into concise, coherent, highly informative summaries. +- Prompt hybridization achieves _optimized prompting_. Orchestrator agents balance prompt and ranked context data for more coherent outcome prompts. +- Flexible, modular architecture _enables easy horizontal scaling_ and optional _incorporation of new data sources_. You can enhance iteratively over time by adding more agents (e.g., a visualizer agent to inspect system behavior), or substituting alternative implementations of any agent. -Let's look at an implementation of multi-agent RAG, and then look under the hood of the agents that make up multi-agent -RAG, examining their logic, sequence, and possible optimizations. + +Let's look at an implementation of multi-agent RAG, and then look under the hood of the agents that make up multi-agent RAG, examining their logic, sequence, and possible optimizations. ## Example with AutoGen library -Before going into the code snippet below from the [Microsoft AutoGen library](https://github.com/microsoft/autogen), -some explanation of terms: +Before going into the code snippet below from the [Microsoft AutoGen library](https://github.com/microsoft/autogen), some explanation of terms: -1. AssistantAgent: The AssistantAgent is given a name, a system message, and a configuration object (llm_config). The - system message is a string that describes the role of the agent. The llm_config object is a dictionary that contains - functions for the agent to perform its role. +1. AssistantAgent: The AssistantAgent is given a name, a system message, and a configuration object (llm_config). The system message is a string that describes the role of the agent. 
The llm_config object is a dictionary that contains functions for the agent to perform its role. -1. user_proxy is an instance of UserProxyAgent. It is given a name and several configuration options. The - is_termination_msg option is a function that determines when the user wants to terminate the conversation. The - human_input_mode option is set to "NEVER", which means the agent will never ask for input from a human. The - max_consecutive_auto_reply option is set to 10, which means the agent will automatically reply to up to 10 - consecutive messages without input from a human. The code_execution_config option is a dictionary that contains - configuration options for executing code. +2. user_proxy is an instance of UserProxyAgent. It is given a name and several configuration options. The is_termination_msg option is a function that determines when the user wants to terminate the conversation. The human_input_mode option is set to "NEVER", which means the agent will never ask for input from a human. The max_consecutive_auto_reply option is set to 10, which means the agent will automatically reply to up to 10 consecutive messages without input from a human. The code_execution_config option is a dictionary that contains configuration options for executing code. ```python @@ -118,13 +80,10 @@ boss = autogen.UserProxyAgent( ### The QueryUnderstandingAgent 1. A query is received by the QueryUnderstandingAgent. -1. The agent checks if it is a long question using logic like word count, presence of multiple question marks, etc. -1. If the query is a long question, the GuidanceQuestionGenerator breaks it into shorter sub-questions. For example, - “What is the capital of France and what is the population?” is broken into “What is the capital of France?” and “What - is the population of France?” -1. These sub-questions are then passed to the QueryRouter one by one. -1. 
The QueryRouter checks each sub-question against a set of predefined routing rules and cases to determine which query - engine it should go to. +2. The agent checks if it is a long question using logic like word count, presence of multiple question marks, etc. +3. If the query is a long question, the GuidanceQuestionGenerator breaks it into shorter sub-questions. For example, “What is the capital of France and what is the population?” is broken into “What is the capital of France?” and “What is the population of France?” +4. These sub-questions are then passed to the QueryRouter one by one. +5. The QueryRouter checks each sub-question against a set of predefined routing rules and cases to determine which query engine it should go to. ```python # QueryUnderstandingAgent @@ -135,46 +94,32 @@ query_understanding_agent = autogen.AssistantAgent( ) ``` -The goal of the QueryUnderstandingAgent is to check each subquery and determine which retriever agent is best suited to -handle it based on the database schema matching. For example, some subqueries may be better served by a vector database, -and others by a knowledge graph database. +The goal of the QueryUnderstandingAgent is to check each subquery and determine which retriever agent is best suited to handle it based on the database schema matching. For example, some subqueries may be better served by a vector database, and others by a knowledge graph database. -To implement the QueryUnderstandingAgent, we can create a SubqueryRouter component, which takes in two retriever -agents — a VectorRetrieverAgent and a KnowledgeGraphRetrieverAgent. +To implement the QueryUnderstandingAgent, we can create a SubqueryRouter component, which takes in two retriever agents — a VectorRetrieverAgent and a KnowledgeGraphRetrieverAgent. -When a subquery needs to be routed, the SubqueryRouter will check to see if the subquery matches the schema of the -vector database using some keyword or metadata matching logic. 
If there is a match, it will return the -VectorRetrieverAgent to handle the subquery. If there is no match for the vector database, the SubqueryRouter will next -check if the subquery matches the schema of the knowledge graph database. If so, it will return the -KnowledgeGraphRetrieverAgent instead. +When a subquery needs to be routed, the SubqueryRouter will check to see if the subquery matches the schema of the vector database using some keyword or metadata matching logic. If there is a match, it will return the VectorRetrieverAgent to handle the subquery. If there is no match for the vector database, the SubqueryRouter will next check if the subquery matches the schema of the knowledge graph database. If so, it will return the KnowledgeGraphRetrieverAgent instead. -The SubqueryRouter acts like a dispatcher, distributing subquery work to the optimal retriever agent. This way, each -retriever agent can focus on efficiently retrieving results from its respective databases without worrying about -handling all subquery types. +The SubqueryRouter acts like a dispatcher, distributing subquery work to the optimal retriever agent. This way, each retriever agent can focus on efficiently retrieving results from its respective databases without worrying about handling all subquery types. -This multi-agent modularity makes it easy to add more specialized retriever agents as needed for different databases or -data sources. +This multi-agent modularity makes it easy to add more specialized retriever agents as needed for different databases or data sources. ### General Flow 1. The query starts at the “Long Question?” decision point, per above. -1. If ‘Yes’, the query is broken into sub-questions and then sent to various query engines. -1. If ‘No’, the query moves to the main routing logic, which routes the query based on specific cases, or defaults to a - fallback strategy. -1. Once an engine returns a satisfactory answer, the process ends; otherwise, fallbacks are tried. +2. 
If ‘Yes’, the query is broken into sub-questions and then sent to various query engines. +3. If ‘No’, the query moves to the main routing logic, which routes the query based on specific cases, or defaults to a fallback strategy. +4. Once an engine returns a satisfactory answer, the process ends; otherwise, fallbacks are tried. ### The Retriever Agents -We can create multiple retriever agents, each focused on efficient retrieval from a specific data source or using a -particular technique. For example: +We can create multiple retriever agents, each focused on efficient retrieval from a specific data source or using a particular technique. For example: -- VectorDBRetrieverAgent: Retrieves passages using vector similarity search on an indexed document corpus. -- WikipediaRetrieverAgent: Retrieves relevant Wikipedia passages. -- KnowledgeGraphRetriever: Uses knowledge graph retrieval. +* VectorDBRetrieverAgent: Retrieves passages using vector similarity search on an indexed document corpus. +* WikipediaRetrieverAgent: Retrieves relevant Wikipedia passages. +* KnowledgeGraphRetriever: Uses knowledge graph retrieval. -When subqueries are generated, we assign each one to the optimal retriever agent based on its content and the agent -capabilities. For example, a fact-based subquery may go to the KnowledgeGraphRetriever, while a broader subquery could -use the VectorDBRetrieverAgent. +When subqueries are generated, we assign each one to the optimal retriever agent based on its content and the agent capabilities. For example, a fact-based subquery may go to the KnowledgeGraphRetriever, while a broader subquery could use the VectorDBRetrieverAgent. ```python retriever_agent_vector = autogen.AssistantAgent( @@ -196,8 +141,7 @@ retriever_agent_sql = autogen.AssistantAgent( ) ``` -To enable _asynchronous retrieval_, we use Python’s asyncio framework. When subqueries are available, we create asyncio -tasks to run the assigned retriever agent for each subquery concurrently. 
+To enable _asynchronous retrieval_, we use Python’s asyncio framework. When subqueries are available, we create asyncio tasks to run the assigned retriever agent for each subquery concurrently. For example: @@ -210,8 +154,7 @@ for subquery in subqueries: await asyncio.gather(*retrieval_tasks) ``` -This allows all retriever agents to work in parallel instead of waiting for each one to finish before moving on to the -next. Asynchronous retrieval returns passages far more quickly than single-agent retrieval. +This allows all retriever agents to work in parallel instead of waiting for each one to finish before moving on to the next. Asynchronous retrieval returns passages far more quickly than single-agent retrieval. The results from each agent can then be merged and ranked for the next stages. @@ -219,21 +162,15 @@ The results from each agent can then be merged and ranked for the next stages. The ranker agents in a multi-agent retrieval system can be specialized using different ranking tools and techniques: -- Fine-tune on domain-specific data using datasets like MS MARCO or self-supervised data from the target corpus. This - allows learning representations tailored to ranking documents for the specific domain. -- Use cross-encoder models like SBERT trained extensively on passage ranking tasks as a base. Cross-encoder models - capture nuanced relevance between queries and documents. -- Employ dense encoding models like DPR to leverage dual-encoder search through the vector space when ranking a large - set of candidates. -- For efficiency, use approximate nearest neighbor algorithms like HNSW when finding top candidates from a large corpus. -- Apply re-ranking with cross-encoders after initial fast dense retrieval – for greater accuracy in ranking the top - results. -- Exploit metadata like document freshness, author credibility, keywords, etc. to customize ranking based on query - context. 
-- Use learned models like LambdaRank to optimize the ideal combination of ranking signals. -- Specialize different ranker agents, respectively, on particular types of queries where they perform best, selected - dynamically. -- Implement ensemble ranking approaches to combine multiple underlying rankers/signals efficiently. +* Fine-tune on domain-specific data using datasets like MS MARCO or self-supervised data from the target corpus. This allows learning representations tailored to ranking documents for the specific domain. +* Use cross-encoder models like SBERT trained extensively on passage ranking tasks as a base. Cross-encoder models capture nuanced relevance between queries and documents. +* Employ dense encoding models like DPR to leverage dual-encoder search through the vector space when ranking a large set of candidates. +* For efficiency, use approximate nearest neighbor algorithms like HNSW when finding top candidates from a large corpus. +* Apply re-ranking with cross-encoders after initial fast dense retrieval – for greater accuracy in ranking the top results. +* Exploit metadata like document freshness, author credibility, keywords, etc. to customize ranking based on query context. +* Use learned models like LambdaRank to optimize the ideal combination of ranking signals. +* Specialize different ranker agents, respectively, on particular types of queries where they perform best, selected dynamically. +* Implement ensemble ranking approaches to combine multiple underlying rankers/signals efficiently. ```python # RankerAgent @@ -244,25 +181,20 @@ ranker_agent = autogen.AssistantAgent( ) ``` -To optimize accuracy, speed, and customization in your ranker agents, you need to identify which specialized techniques -enhance ranking performance in which scenarios, then use them to configure your ranker agents accordingly. 
+To optimize accuracy, speed, and customization in your ranker agents, you need to identify which specialized techniques enhance ranking performance in which scenarios, then use them to configure your ranker agents accordingly. ### The Reader Agent To optimize your reader agent, we recommend that you: -- Use Claude 2 as the base model for the ReaderAgent; leverage its long context abilities for summarization. Fine-tune - Claude 2 further on domain-specific summarization data. -- Implement a ToolComponent that wraps access to a knowledge graph containing summarization methodologies — things like - identifying key entities, events, detecting redundancy, etc. +* Use Claude 2 as the base model for the ReaderAgent; leverage its long context abilities for summarization. Fine-tune Claude 2 further on domain-specific summarization data. +* Implement a ToolComponent that wraps access to a knowledge graph containing summarization methodologies — things like identifying key entities, events, detecting redundancy, etc. By taking the above steps, you can ensure that: -- When the ReaderAgent’s run method takes the lengthy passage as input, the ReaderAgent generates a prompt for Claude 2, - combining the passage and a methodology retrieval call to the KG tool, to arrive at the optimal approach for - summarizing the content. -- Claude 2 processes this augmented prompt to produce a concise summary, extracting the key information. -- As a final step, the summary is returned. +* When the ReaderAgent’s run method takes the lengthy passage as input, the ReaderAgent generates a prompt for Claude 2, combining the passage and a methodology retrieval call to the KG tool, to arrive at the optimal approach for summarizing the content. +* Claude 2 processes this augmented prompt to produce a concise summary, extracting the key information. +* As a final step, the summary is returned. 
```python # ReaderAgent @@ -284,25 +216,17 @@ orchestrator_agent = autogen.AssistantAgent( ) ``` -The OrchestratorAgent can leverage structured knowledge and symbolic methods to complement LLM reasoning where -appropriate and produce answers that are highly accurate, contextual, and explainable. We recommend that you: - -1. Maintain a knowledge graph containing key entities, relations, and facts extracted from the documents. Use this to - verify the factual accuracy of answers. -1. Implement logic to check the final answer against known facts and rules in the knowledge graph. Flag inconsistencies - for the LLMs to re-reason. -1. Enable the OrchestratorAgent to ask clarifying questions to the ReaderAgent (i.e., solicit additional context) in the - event that answers contradict the knowledge graph. -1. Use the knowledge graph to identify and add additional context (entities and events) related to concepts in the user - query and final answer. -1. Generate concise explanations to justify the reasoning alongside (in parallel to) the final answer, using knowledge - graph relations and LLM semantics. -1. Analyze past answer reasoning patterns to identify common anomalies, biases, and fallacies, to continuously fine-tune - the LLM reasoning, and improve final answer quality. -1. Codify appropriate levels of answer certainty and entailment for different query types based on knowledge graph data - analysis. -1. Maintain provenance of answer generations to incrementally improve reasoning over time via knowledge graph and LLM - feedback. +The OrchestratorAgent can leverage structured knowledge and symbolic methods to complement LLM reasoning where appropriate and produce answers that are highly accurate, contextual, and explainable. We recommend that you: + +1. Maintain a knowledge graph containing key entities, relations, and facts extracted from the documents. Use this to verify the factual accuracy of answers. +2. 
Implement logic to check the final answer against known facts and rules in the knowledge graph. Flag inconsistencies for the LLMs to re-reason. +3. Enable the OrchestratorAgent to ask clarifying questions to the ReaderAgent (i.e., solicit additional context) in the event that answers contradict the knowledge graph. +4. Use the knowledge graph to identify and add additional context (entities and events) related to concepts in the user query and final answer. +5. Generate concise explanations to justify the reasoning alongside (in parallel to) the final answer, using knowledge graph relations and LLM semantics. +7. Analyze past answer reasoning patterns to identify common anomalies, biases, and fallacies, to continuously fine-tune the LLM reasoning, and improve final answer quality. +8. Codify appropriate levels of answer certainty and entailment for different query types based on knowledge graph data analysis. +9. Maintain provenance of answer generations to incrementally improve reasoning over time via knowledge graph and LLM feedback. + Finally, to facilitate communication and interactions among the participating agents, you need to create a group chat: @@ -317,27 +241,20 @@ manager = GroupChatManager(chat) manager.run() ``` -## Benefits of Specialized Agents -The proposed multi-agent RAG architecture delivers significant benefits in conversational AI, compared to single-agent -RAG systems: +## Benefits of Specialized Agents -- By dedicating an agent solely to passage retrieval, you can employ more advanced and efficient search algorithms, - prefetching passages in parallel across the corpus, improving overall latency. -- Using a ranker agent specialized in evaluating relevance improves retrieval precision. By filtering out lower quality - hits, model prompting stays focused on pertinent information. -- Summarization by the reader agent distills long text into concise snippets containing only the most salient facts. 
- This prevents prompt dilution and improves coherence. -- Dynamic context weighting by the orchestrator agent minimizes chances of the model ignoring the original prompt or - becoming overly reliant on retrieved information. -- Specialized agents make your RAG system more scalable and flexible. Agents can be upgraded independently, and - additional agents added to extend capabilities. +The proposed multi-agent RAG architecture delivers significant benefits in conversational AI, compared to single-agent RAG systems: -Overall, the multi-agent factored RAG system demonstrates substantial improvements in appropriateness, coherence, -reasoning, and correctness over single-agent RAG baselines. +* By dedicating an agent solely to passage retrieval, you can employ more advanced and efficient search algorithms, prefetching passages in parallel across the corpus, improving overall latency. +* Using a ranker agent specialized in evaluating relevance improves retrieval precision. By filtering out lower quality hits, model prompting stays focused on pertinent information. +* Summarization by the reader agent distills long text into concise snippets containing only the most salient facts. This prevents prompt dilution and improves coherence. +* Dynamic context weighting by the orchestrator agent minimizes chances of the model ignoring the original prompt or becoming overly reliant on retrieved information. +* Specialized agents make your RAG system more scalable and flexible. Agents can be upgraded independently, and additional agents added to extend capabilities. -______________________________________________________________________ +Overall, the multi-agent factored RAG system demonstrates substantial improvements in appropriateness, coherence, reasoning, and correctness over single-agent RAG baselines. 
+--- ## Contributors - [Anthony Alcaraz](https://www.linkedin.com/in/anthony-alcaraz-b80763155/) diff --git a/docs/use_cases/node_representation_learning.md b/docs/use_cases/node_representation_learning.md index a04a71be0..f26dad715 100644 --- a/docs/use_cases/node_representation_learning.md +++ b/docs/use_cases/node_representation_learning.md @@ -4,24 +4,15 @@ ## Introduction: representing things and relationships between them -Of the various types of information - words, pictures, and connections between things - **relationships** are especially -interesting. Relationships show how things interact and create networks. But not all ways of representing relationships -are the same. In machine learning, **how we do vector representation of things and their relationships affects -performance** on a wide range of tasks. +Of the various types of information - words, pictures, and connections between things - **relationships** are especially interesting. Relationships show how things interact and create networks. But not all ways of representing relationships are the same. In machine learning, **how we do vector representation of things and their relationships affects performance** on a wide range of tasks. -Below, we evaluate several approaches to vector representation on a real-life use case: how well each approach -classifies academic articles in a subset of the Cora citation network. +Below, we evaluate several approaches to vector representation on a real-life use case: how well each approach classifies academic articles in a subset of the Cora citation network. -We look first at Bag-of-Words (BoW), a standard approach to vectorizing text data in ML. Because BoW can't represent the -network structure well, we turn to solutions that can help BoW's performance: Node2Vec and GraphSAGE. We also look for a -solution to BoW's other shortcoming - its inability to capture semantic meaning. 
We evaluate LLM embeddings, first on -their own, then combined with Node2Vec, and, finally, GraphSAGE trained on LLM features. +We look first at Bag-of-Words (BoW), a standard approach to vectorizing text data in ML. Because BoW can't represent the network structure well, we turn to solutions that can help BoW's performance: Node2Vec and GraphSAGE. We also look for a solution to BoW's other shortcoming - its inability to capture semantic meaning. We evaluate LLM embeddings, first on their own, then combined with Node2Vec, and, finally, GraphSAGE trained on LLM features. ## Loading our dataset, and evaluating BoW - -Our use case is a subset of the **Cora citation network**. This subset comprises 2708 scientific papers (nodes) and -connections that indicate citations between them. Each paper has a BoW descriptor containing 1433 words. The papers in -the dataset are also divided into 7 different topics (classes). Each paper belongs to exactly one of them. + +Our use case is a subset of the **Cora citation network**. This subset comprises 2708 scientific papers (nodes) and connections that indicate citations between them. Each paper has a BoW descriptor containing 1433 words. The papers in the dataset are also divided into 7 different topics (classes). Each paper belongs to exactly one of them. We **load the dataset** as follows: @@ -32,9 +23,7 @@ ds = Planetoid("./data", "Cora")[0] ### Evaluating BoW on a classification task -We can evaluate how well the BoW descriptors represent the articles by measuring classification performance (Accuracy -and macro F1). We use a KNN (K-Nearest Neighbors) classifier with 15 neighbors, and cosine similarity as the similarity -metric: +We can evaluate how well the BoW descriptors represent the articles by measuring classification performance (Accuracy and macro F1). 
We use a KNN (K-Nearest Neighbors) classifier with 15 neighbors, and cosine similarity as the similarity metric: ```python from sklearn.neighbors import KNeighborsClassifier @@ -59,15 +48,11 @@ evaluate(ds.x, ds.y) >>> F1 macro 0.701 ``` -BoW's accuracy and F1 macro scores are pretty good, but leave significant room for improvement. BoW falls short of -correctly classifying papers more than 25% of the time. And on average across classes BoW is inaccurate nearly 30% of -the time. +BoW's accuracy and F1 macro scores are pretty good, but leave significant room for improvement. BoW falls short of correctly classifying papers more than 25% of the time. And on average across classes BoW is inaccurate nearly 30% of the time. ## Taking advantage of citation graph data -Can we improve on this? Our citation dataset contains not only text data but also relationship data - a citation graph. -Any given article will tend to cite other articles that belong to the same topic that it belongs to. Therefore, -representations that embed not just textual data but also citation data will probably classify articles more accurately. +Can we improve on this? Our citation dataset contains not only text data but also relationship data - a citation graph. Any given article will tend to cite other articles that belong to the same topic that it belongs to. Therefore, representations that embed not just textual data but also citation data will probably classify articles more accurately. BoW features represent text data. But how well does BoW capture the relationships between articles? @@ -75,34 +60,25 @@ BoW features represent text data. But how well does BoW capture the relationship ### Comparing citation pair similarity in BoW -To examine how well citation pairs show up in BoW features, we can make a plot comparing connected and not connected -pairs of papers based on how similar their respective BoW features are. 
+To examine how well citation pairs show up in BoW features, we can make a plot comparing connected and not connected pairs of papers based on how similar their respective BoW features are. ![BoW cosine similarity edge counts](../assets/use_cases/node_representation_learning/bins_bow.png) -In this plot, we define groups (shown on the y-axis) so that each group has about the same number of pairs as the other -groups. The only exception is the 0.00-0.05 group, where lots of pairs have _no_ similar words - they can't be split -into smaller groups. +In this plot, we define groups (shown on the y-axis) so that each group has about the same number of pairs as the other groups. The only exception is the 0.00-0.05 group, where lots of pairs have _no_ similar words - they can't be split into smaller groups. -The plot demonstrates how connected nodes usually have higher cosine similarities. Papers that cite each other often use -similar words. But if we ignore paper pairs with zero similarities (the 0.00-0.00 group), papers that have _not_ cited -each other also seem to have a wide range of common words. +The plot demonstrates how connected nodes usually have higher cosine similarities. Papers that cite each other often use similar words. But if we ignore paper pairs with zero similarities (the 0.00-0.00 group), papers that have _not_ cited each other also seem to have a wide range of common words. -Though BoW representations embody _some_ information about article connectivity, BoW features don't contain enough -citation pair information to accurately reconstruct the actual citation graph. BoW looks exclusively at word -co-occurrence between article pairs, and therefore misses word context data contained in the network structure. +Though BoW representations embody _some_ information about article connectivity, BoW features don't contain enough citation pair information to accurately reconstruct the actual citation graph. 
BoW looks exclusively at word co-occurrence between article pairs, and therefore misses word context data contained in the network structure. -**Can we make up for BoW's inability to represent the citation network's structure?** Are there methods that capture -node connectivity data better? +**Can we make up for BoW's inability to represent the citation network's structure?** +Are there methods that capture node connectivity data better? -Node2Vec is built to do precisely this, for static networks. So is GraphSAGE, for dynamic ones. Let's look at Node2Vec -first. +Node2Vec is built to do precisely this, for static networks. So is GraphSAGE, for dynamic ones. +Let's look at Node2Vec first. ## Embedding network structure with Node2Vec -As opposed to BoW vectors, node embeddings are vector representations that capture the structural role and properties of -nodes in a network. Node2Vec is an algorithm that learns node representations using the Skip-Gram method; it models the -conditional probability of encountering a context node given a source node in node sequences (random walks): +As opposed to BoW vectors, node embeddings are vector representations that capture the structural role and properties of nodes in a network. Node2Vec is an algorithm that learns node representations using the Skip-Gram method; it models the conditional probability of encountering a context node given a source node in node sequences (random walks): -Imagine a world where your online searches return results that truly understand your needs, a world where you don't have -to know the exact words to find what you're looking for. This isn't a vision of some distant possible future; it's -happening now. 
Companies like Pinterest, Spotify, eBay, Airbnb, and Doordash have already taken advantage of the -treasure trove of insights inherent in data – data that is -[growing exponentially, and projected to surpass 175 zettabytes by 2025](https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025) -– to significantly improve user experience and engagement, conversion rates, and customer satisfaction. Spotify, for -example, has been able to enhance its music recommendation system, leading to a more than 10% performance improvement in -session and track recommendation tasks, and a -[sizeable boost in user engagement and satisfaction](https://doi.org/10.1145/3383313.3412248). +Imagine a world where your online searches return results that truly understand your needs, a world where you don't have to know the exact words to find what you're looking for. This isn't a vision of some distant possible future; it's happening now. Companies like Pinterest, Spotify, eBay, Airbnb, and Doordash have already taken advantage of the treasure trove of insights inherent in data – data that is [growing exponentially, and projected to surpass 175 zettabytes by 2025](https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025) – to significantly improve user experience and engagement, conversion rates, and customer satisfaction. Spotify, for example, has been able to enhance its music recommendation system, leading to a more than 10% performance improvement in session and track recommendation tasks, and a [sizeable boost in user engagement and satisfaction](https://doi.org/10.1145/3383313.3412248). And _how_ have they done this? What's enabled these companies to harvest the inherent power of data to their benefit? -The answer is vector embeddings. +The answer is vector embeddings. 
-Vector embeddings let you return more _relevant_ results to your search queries by: 1) querying the _meaning_ of the -search terms, as opposed to just looking for search keyword _matches_; and 2) informing your search query with the -meaning of personal preference data, through the addition of a personal preference vector. +Vector embeddings let you return more _relevant_ results to your search queries by: 1) querying the _meaning_ of the search terms, as opposed to just looking for search keyword _matches_; and 2) informing your search query with the meaning of personal preference data, through the addition of a personal preference vector. -Let's look first at how vector embeddings improve the relevance of search query results generally, and then at how -vector embeddings permit us to use the meaning of personal preferences to create truly personalized searches. +Let's look first at how vector embeddings improve the relevance of search query results generally, and then at how vector embeddings permit us to use the meaning of personal preferences to create truly personalized searches. Illustration of vector embeddings ## Vector search vs. keyword-based search -Vector embeddings are revolutionizing the way we search and retrieve information. They work by converting data into -numerical representations, known as vectors. Conversion into vectors allows the search system to consider the -_semantics_ – the underlying meaning – of the data when performing a search. +Vector embeddings are revolutionizing the way we search and retrieve information. They work by converting data into numerical representations, known as vectors. Conversion into vectors allows the search system to consider the _semantics_ – the underlying meaning – of the data when performing a search. -Imagine you're searching for a book in an online store. With traditional keyword-based search, you would need to know -the exact title or author's name. 
But using vector embeddings, you can simply describe the book's theme or plot, and the -search system retrieves relevant results. This is because vector embeddings understand the _meaning_ of the query, -rather than just matching on the keywords in your query. +Imagine you're searching for a book in an online store. With traditional keyword-based search, you would need to know the exact title or author's name. But using vector embeddings, you can simply describe the book's theme or plot, and the search system retrieves relevant results. This is because vector embeddings understand the _meaning_ of the query, rather than just matching on the keywords in your query. ### How do vector embeddings return relevant results? -The power of vector embeddings lies in their ability to quantify similarity between vectors. This is done using a -distance metric. One of the most commonly used distance metrics is cosine similarity, which measures how close one -vector is to another; the distance between vectors is a measure of how similar the two pieces of data they represent -are. In this way, vector search is able to return relevant results even when the exact terms aren't present in the -query. +The power of vector embeddings lies in their ability to quantify similarity between vectors. This is done using a distance metric. One of the most commonly used distance metrics is cosine similarity, which measures how close one vector is to another; the distance between vectors is a measure of how similar the two pieces of data they represent are. In this way, vector search is able to return relevant results even when the exact terms aren't present in the query. ### Handling embedding model input limits -The embedding models used for vector search _do_ have maximum input length limits that users need to consider. 
The -twelve best-performing models, based on the -[Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), are limited to an -input size of 512 tokens. (The 13th best has an exceptional input size limit of 8192 tokens.) +The embedding models used for vector search _do_ have maximum input length limits that users need to consider. The twelve best-performing models, based on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), are limited to an input size of 512 tokens. (The 13th best has an exceptional input size limit of 8192 tokens.) -But we can handle this input size limitation by segmenting the data into smaller parts that fit the model's token -constraints, or by adopting a sliding window technique. Segmenting involves cutting the text into smaller pieces that -can be individually vectorized. The sliding window method processes the text in sections, with each new window -overlapping the previous one, to maintain context between portions. These techniques adjust the data to the model's -requirements, allowing for detailed vector representation of larger texts. +But we can handle this input size limitation by segmenting the data into smaller parts that fit the model's token constraints, or by adopting a sliding window technique. Segmenting involves cutting the text into smaller pieces that can be individually vectorized. The sliding window method processes the text in sections, with each new window overlapping the previous one, to maintain context between portions. These techniques adjust the data to the model's requirements, allowing for detailed vector representation of larger texts. -Say, for example, you're searching for a specific article about a local festival in a newspaper's digital archive. 
The -system identifies the lengthy Sunday edition where the piece appeared, and then, to ensure a thorough search, breaks the -edition down, analyzing it article by article, much like going page by page, until it pinpoints the local festival -article you're looking for. +Say, for example, you're searching for a specific article about a local festival in a newspaper's digital archive. The system identifies the lengthy Sunday edition where the piece appeared, and then, to ensure a thorough search, breaks the edition down, analyzing it article by article, much like going page by page, until it pinpoints the local festival article you're looking for. ### But it's not only text that you can search! -The general-purpose nature of vector embeddings makes it possible to represent almost any form of data, from text to -images to audio. Back in our bookstore, vector embeddings can handle more than just title searches. We can also -represent a transaction as a vector. Each dimension of the vector represents a different attribute, such as the -transaction amount, date, or product category. By comparing these transaction vectors, the search system can identify -patterns or anomalies that would be difficult to spot with traditional search methods. +The general-purpose nature of vector embeddings makes it possible to represent almost any form of data, from text to images to audio. Back in our bookstore, vector embeddings can handle more than just title searches. We can also represent a transaction as a vector. Each dimension of the vector represents a different attribute, such as the transaction amount, date, or product category. By comparing these transaction vectors, the search system can identify patterns or anomalies that would be difficult to spot with traditional search methods. ### Great! But what can I use to get started? 
-Using a vector database – a system designed to store and perform semantic search at scale – you can compare the query -vector with vectors stored in the database and return the top-k most similar ones. The key components of a vector -database include a vector index, a query engine, partitioning/sharding capabilities, replication features, and an -accessible API. Vector databases are categorized into vector-native databases, hybrid databases, and search engines. -Notable vector database providers include [Pinecone](https://pinecone.io), [Milvus](https://milvus.io), and -[Weaviate](https://weaviate.io). +Using a vector database – a system designed to store and perform semantic search at scale – you can compare the query vector with vectors stored in the database and return the top-k most similar ones. The key components of a vector database include a vector index, a query engine, partitioning/sharding capabilities, replication features, and an accessible API. Vector databases are categorized into vector-native databases, hybrid databases, and search engines. Notable vector database providers include [Pinecone](https://pinecone.io), [Milvus](https://milvus.io), and [Weaviate](https://weaviate.io). 
-| Key Component | Description | | --------------------- | ------------------------------------------------------- | | -Vector Index | Allows fast and efficient retrieval of similar vectors | | Query Engine | Performs optimized similarity -computations on the index | | Partitioning/Sharding | Enables horizontal scaling | | Replication | Ensures reliability -and data integrity | | API | Allows for efficient vector CRUD operations | +| Key Component | Description | +| --------------------- | ------------------------------------------------------- | +| Vector Index | Allows fast and efficient retrieval of similar vectors | +| Query Engine | Performs optimized similarity computations on the index | +| Partitioning/Sharding | Enables horizontal scaling | +| Replication | Ensures reliability and data integrity | +| API | Allows for efficient vector CRUD operations | ## How can I personalize my search (using a vector database)? @@ -119,44 +83,25 @@ user_preference_weight = 0.3 biased_query_embedding = query_weight * query_embedding + user_preference_weight * user_preference_vector ``` -In this code example, we convert a search query into a vector using an -[open-source, pretrained BERT model from Hugging Face](https://huggingface.co/bert-base-uncased) (you can try this out -online yourself by following the link). We also have a user preference vector, which is usually based on a user's past -clicks or choices. We then arithmetically "add" the query vector and the user preference vector to create a new query -vector that reflects both the user input and user preferences. +In this code example, we convert a search query into a vector using an [open-source, pretrained BERT model from Hugging Face](https://huggingface.co/bert-base-uncased) (you can try this out online yourself by following the link). We also have a user preference vector, which is usually based on a user's past clicks or choices. 
We then arithmetically "add" the query vector and the user preference vector to create a new query vector that reflects both the user input and user preferences. Use cases of personalized search with vector embeddings ## Conclusions and next steps 😊 -Vector embeddings are revolutionizing the way we interact with and use data. By enabling more accurate and contextually -relevant search results, they are paving the way for a new era of data-driven insights and decision-making. It's not -only early adopters like Pinterest, Spotify, eBay, Airbnb, and Doordash who have -[reaped the benefits of vector search integration](https://rockset.com/blog/introduction-to-semantic-search-from-keyword-to-vector-search/). -Any company can take advantage of vector search to enhance user experience and engagement. Home Depot, for example, -responded to increased online activity during the COVID pandemic period by integrating vector search, leading to -[improved customer service and a boost in online sales](https://www.datanami.com/2022/03/15/home-depot-finds-diy-success-with-vector-search/). -The future of search is here, and it's powered by vector embeddings. +Vector embeddings are revolutionizing the way we interact with and use data. By enabling more accurate and contextually relevant search results, they are paving the way for a new era of data-driven insights and decision-making. It's not only early adopters like Pinterest, Spotify, eBay, Airbnb, and Doordash who have [reaped the benefits of vector search integration](https://rockset.com/blog/introduction-to-semantic-search-from-keyword-to-vector-search/). Any company can take advantage of vector search to enhance user experience and engagement. Home Depot, for example, responded to increased online activity during the COVID pandemic period by integrating vector search, leading to [improved customer service and a boost in online sales](https://www.datanami.com/2022/03/15/home-depot-finds-diy-success-with-vector-search/). 
The future of search is here, and it's powered by vector embeddings. -So, what's next? How can you start implementing personalized search in your organization? There are plenty of resources -and tools available to help you get started. For instance, you can check out this -[guide on implementing vector search](https://hub.superlinked.com/vector-search) or this -[tutorial on using vector embeddings](https://hub.superlinked.com/vector-compute). +So, what's next? How can you start implementing personalized search in your organization? There are plenty of resources and tools available to help you get started. For instance, you can check out this [guide on implementing vector search](https://hub.superlinked.com/vector-search) or this [tutorial on using vector embeddings](https://hub.superlinked.com/vector-compute). ## Share your thoughts and stay updated -What are your thoughts on personalized search using vector embeddings? Have you used this technology in your -organization? If you'd like to contribute an article to the conversation, don't hesitate to -[get in touch](https://github.com/superlinked/VectorHub)! +What are your thoughts on personalized search using vector embeddings? Have you used this technology in your organization? If you'd like to contribute an article to the conversation, don't hesitate to [get in touch](https://github.com/superlinked/VectorHub)! -Stay Updated: Drop your email in the footer to stay up to date with new resources coming out of VectorHub and -Superlinked. +Stay Updated: Drop your email in the footer to stay up to date with new resources coming out of VectorHub and Superlinked. -Your feedback shapes VectorHub! Found an Issue or Have a Suggestion? If you spot something off in an article or have a -topic you want us to dive into, create a GitHub issue and we'll get on it! - -______________________________________________________________________ +Your feedback shapes VectorHub! Found an Issue or Have a Suggestion? 
If you spot something off in an article or have a topic you want us to dive into, create a GitHub issue and we'll get on it! +--- ## Contributors - [Michael Jancen-Widmer, author](https://www.contrarian.ai) diff --git a/docs/use_cases/readme.md b/docs/use_cases/readme.md index d682f2824..fbd6069f1 100644 --- a/docs/use_cases/readme.md +++ b/docs/use_cases/readme.md @@ -1,28 +1,23 @@ -# Blog - -There are a wide variety of different use cases for information retrieval and vector-powered retrieval systems. - -Our use cases include: - -- Different examples of problems you can solve using vector retrieval, improving performance in your ML Stack -- Case studies written by practitioners sharing their experiences working with these systems -- Deep dives into specific parts of the ML pipeline, highlighting key considerations when moving vector-driven outputs - and experiments into production - -In the blog section, we collate examples of these use cases and case studies from our contributors discussing how they -use and improve information retrieval systems to solve real-world problems. 
- -## Contents - -- [Personalized Search](https://hub.superlinked.com/personalized-search-harnessing-the-power-of-vector-embeddings) -- [Recommender Systems](https://hub.superlinked.com/a-recommender-system-collaborative-filtering-with-sparse-metadata) -- [Retrieval Augmented Generation](https://hub.superlinked.com/retrieval-augmented-generation) -- [Enhancing RAG With A Multi-Agent System](https://hub.superlinked.com/enhancing-rag-with-a-multi-agent-system) -- [Vector Embeddings In The Browser](https://hub.superlinked.com/vector-embeddings-in-the-browser) -- [Answering Questions with Knowledge Embeddings](https://hub.superlinked.com/answering-questions-with-knowledge-graph-embeddings) -- [Representation Learning on Graph Structured Data](https://hub.superlinked.com/representation-learning-on-graph-structured-data) -- [Improving RAG performance with Knowledge Graphs](use_cases/knowledge_graphs.md) - -We are always looking to expand our Use Cases and share the latest thinking. So if you've been working on something and -would like to share your experiences with the community, you -[get in touch and contribute](https://github.com/superlinked/VectorHub). +# Blog + +There are a wide variety of different use cases for information retrieval and vector-powered retrieval systems. + +Our use cases include: +- Different examples of problems you can solve using vector retrieval, improving performance in your ML Stack +- Case studies written by practitioners sharing their experiences working with these systems +- Deep dives into specific parts of the ML pipeline, highlighting key considerations when moving vector-driven outputs and experiments into production + +In the blog section, we collate examples of these use cases and case studies from our contributors discussing how they use and improve information retrieval systems to solve real-world problems. 
+ +## Contents +- [Personalized Search](https://hub.superlinked.com/personalized-search-harnessing-the-power-of-vector-embeddings) +- [Recommender Systems](https://hub.superlinked.com/a-recommender-system-collaborative-filtering-with-sparse-metadata) +- [Retrieval Augmented Generation](https://hub.superlinked.com/retrieval-augmented-generation) +- [Enhancing RAG With A Multi-Agent System](https://hub.superlinked.com/enhancing-rag-with-a-multi-agent-system) +- [Vector Embeddings In The Browser](https://hub.superlinked.com/vector-embeddings-in-the-browser) +- [Answering Questions with Knowledge Embeddings](https://hub.superlinked.com/answering-questions-with-knowledge-graph-embeddings) +- [Representation Learning on Graph Structured Data](https://hub.superlinked.com/representation-learning-on-graph-structured-data) +- [Improving RAG performance with Knowledge Graphs](use_cases/knowledge_graphs.md) + + +We are always looking to expand our Use Cases and share the latest thinking. So if you've been working on something and would like to share your experiences with the community, you [get in touch and contribute](https://github.com/superlinked/VectorHub). diff --git a/docs/use_cases/recommender_systems.md b/docs/use_cases/recommender_systems.md index 36a81b178..b21015799 100644 --- a/docs/use_cases/recommender_systems.md +++ b/docs/use_cases/recommender_systems.md @@ -6,81 +6,45 @@ ## Introduction -Personalized product recommendations, according to a study by IBM, can lead to a 10-30% increase in revenue. But -creating truly personalized experiences requires extensive user and product data, which is not always available. For -example, providers with restrictions on collecting user data may only have basic product information, like genre and -popularity. While collaborative filtering approaches work reasonably well when extensive user and product data is -scarce, they leave significant gains on the table. 
Incorporating side data with a recommender based on collaborative -filtering can significantly increase recommendation quality (>20% gains in precision), leading to greater customer -satisfaction, more traffic, and millions in additional revenue. +Personalized product recommendations, according to a study by IBM, can lead to a 10-30% increase in revenue. But creating truly personalized experiences requires extensive user and product data, which is not always available. For example, providers with restrictions on collecting user data may only have basic product information, like genre and popularity. While collaborative filtering approaches work reasonably well when extensive user and product data is scarce, they leave significant gains on the table. Incorporating side data with a recommender based on collaborative filtering can significantly increase recommendation quality (>20% gains in precision), leading to greater customer satisfaction, more traffic, and millions in additional revenue. ## Scarce data and side data -Recommender Systems are increasingly important, given the plethora of products offered to users/customers. Beginning -approximately twenty years ago, fashion retailers developed basic versions of content-based recommenders that increased -user engagement (compared with a no-recommendations approach). But when the capabilities of event-tracking systems -improved, it became possible to integrate new signals that could help provide even better recommendations. - -At the moment of this writing, fashion retailers have adopted more sophisticated recommendation systems that ingest not -only users' purchasing/viewing history but also user metadata (age, location, spending habits, mood, etc.) and item -metadata (category, popularity, etc.). - -But not everyone has access to the kind or amount of metadata fashion retailers do. Sometimes only scarce side is -available. 
Public service providers with their own on-demand audio and video platform, for example, are legally -restricted in collecting user metadata. Typically, they use collaborative filtering (CF) approaches, which employ -historical interactions (data consisting of user-item pairs) to extrapolate user preferences via similarities of all -users' browsing/purchasing history. Such companies still have item data - genre, popularity, and so on - that can be -used to improve the quality of recommendations. - -Developers often disregard this side information because it is scarce. While CF (i.e., extrapolating user preferences -via similarities of all users' browsing/purchasing history) works reasonably well in this use case, we can improve the -recommendation quality (thereby increasing user engagement) of CF by adding available side information, even if it's -scarce. - -More precisely, there are libraries that allow us to "inject" side information -([LightFM](https://making.lyst.com/lightfm/docs/home.html), for example). Even the most efficient and effective -collaborative filtering models, such as [ALS Matrix Factorization (MF)](http://yifanhu.net/PUB/cf.pdf) (the Python -["implicit"](https://github.com/benfred/implicit) library) or \[EASE\] (https://arxiv.org/abs/1905.03375) -(Embarrassingly Shallow Autoencoder), can be extended and improved using side information. +Recommender Systems are increasingly important, given the plethora of products offered to users/customers. Beginning approximately twenty years ago, fashion retailers developed basic versions of content-based recommenders that increased user engagement (compared with a no-recommendations approach). But when the capabilities of event-tracking systems improved, it became possible to integrate new signals that could help provide even better recommendations. 
+ +At the moment of this writing, fashion retailers have adopted more sophisticated recommendation systems that ingest not only users' purchasing/viewing history but also user metadata (age, location, spending habits, mood, etc.) and item metadata (category, popularity, etc.). + +But not everyone has access to the kind or amount of metadata fashion retailers do. Sometimes only scarce side is available. Public service providers with their own on-demand audio and video platform, for example, are legally restricted in collecting user metadata. Typically, they use collaborative filtering (CF) approaches, which employ historical interactions (data consisting of user-item pairs) to extrapolate user preferences via similarities of all users' browsing/purchasing history. Such companies still have item data - genre, popularity, and so on - that can be used to improve the quality of recommendations. + +Developers often disregard this side information because it is scarce. While CF (i.e., extrapolating user preferences via similarities of all users' browsing/purchasing history) works reasonably well in this use case, we can improve the recommendation quality (thereby increasing user engagement) of CF by adding available side information, even if it's scarce. + +More precisely, there are libraries that allow us to "inject" side information ([LightFM](https://making.lyst.com/lightfm/docs/home.html), for example). Even the most efficient and effective collaborative filtering models, such as [ALS Matrix Factorization (MF)](http://yifanhu.net/PUB/cf.pdf) (the Python ["implicit"](https://github.com/benfred/implicit) library) or [EASE] (https://arxiv.org/abs/1905.03375) (Embarrassingly Shallow Autoencoder), can be extended and improved using side information. ## Recommender systems as graphs -Matrix factorization is a common collaborative filtering approach. 
After a low-rank matrix approximation, we have two -sets of vectors, one representing the users and the other representing the items. The inner product of a user and item -vector estimates the rating a particular user gave to a particular item. +Matrix factorization is a common collaborative filtering approach. After a low-rank matrix approximation, we have two sets of vectors, one representing the users and the other representing the items. The inner product of a user and item vector estimates the rating a particular user gave to a particular item. -We can represent this process using a graph, in which users and items are graph nodes, and predicted ratings are edge -weights between them. The graph is bipartite: links appear only between nodes belonging to different groups. The list of -recommendations made to a user correspond to the most likely new item-connections for this user-node. We can easily -represent side info injection as graph nodes - for example, a "genre" node – that link related items. ​ When understood -from a graph perspective, it is easy to see how matrix factorization can be extended to include additional metadata. -What we want is to somehow let the algorithm know about the new links, which help group similar items or users together. -In other words, we want to "inject" new nodes – nodes that link nodes belonging to the same group. How do we do this? -Let's take a look at the illustration below: +We can represent this process using a graph, in which users and items are graph nodes, and predicted ratings are edge weights between them. The graph is bipartite: links appear only between nodes belonging to different groups. The list of recommendations made to a user correspond to the most likely new item-connections for this user-node. We can easily represent side info injection as graph nodes - for example, a "genre" node – that link related items. 
+​ +When understood from a graph perspective, it is easy to see how matrix factorization can be extended to include additional metadata. What we want is to somehow let the algorithm know about the new links, which help group similar items or users together. In other words, we want to "inject" new nodes – nodes that link nodes belonging to the same group. How do we do this? Let's take a look at the illustration below: Node structure -There are three users: u1, u2 and u3; and four items: i1, i2, i3, i4. The user u1 has interacted with items i1 and i3. -There is a dummy user, who links items that have the same color (i1, i3, i4). By coupling similar items, the dummy user -helps the model identify related content. This increases the chances of item i4 being recommended to user u1. +There are three users: u1, u2 and u3; and four items: i1, i2, i3, i4. The user u1 has interacted with items i1 and i3. There is a dummy user, who links items that have the same color (i1, i3, i4). By coupling similar items, the dummy user helps the model identify related content. This increases the chances of item i4 being recommended to user u1. **Adaptation** -To enrich a CF approach, we need only add dummy data: when only item side information is available, add dummy users; -when only user side information is available, add dummy items; when both user and item information are available, add -both dummy users and dummy items. Obviously, the dummy nodes should not be included in the recommendations; these nodes -are only there to help "inject" some "commonality" of the nodes belonging to a certain group. ​ The same approach (as -used above with MF) can be used with, for example, EASE, or with an explicitly graph-based approach for recommendations -such as [PageRank](https://scikit-network.readthedocs.io/en/latest/use_cases/recommendation.html). In fact, with -PageRank, the walks would include the dummy nodes. ​ A question remains: how should dummy user interactions be weighted -(rated)? 
We suggest you first start with low weights, see how recommendation quality changes, and iteratively adjust -your weights to fine-tune. +To enrich a CF approach, we need only add dummy data: when only item side information is available, add dummy users; when only user side information is available, add dummy items; when both user and item information are available, add both dummy users and dummy items. Obviously, the dummy nodes should not be included in the recommendations; these nodes are only there to help "inject" some "commonality" of the nodes belonging to a certain group. +​ +The same approach (as used above with MF) can be used with, for example, EASE, or with an explicitly graph-based approach for recommendations such as [PageRank](https://scikit-network.readthedocs.io/en/latest/use_cases/recommendation.html). In fact, with PageRank, the walks would include the dummy nodes. +​ +A question remains: how should dummy user interactions be weighted (rated)? We suggest you first start with low weights, see how recommendation quality changes, and iteratively adjust your weights to fine-tune. **Minimal Code for Adding Dummy Users** Here is some minimal Python code demonstrating the addition of dummy users, one for each category: -```python +``` python import implicit import numpy as np import scipy.sparse as sp @@ -132,33 +96,22 @@ recommended_items ## A real world use case -We evaluated an ALS matrix factorization model with genre dummy users on an audio-on-demand platform with over 250k -items (genres such as "comedy," "documentary," etc.). Compared to their production system, adding dummy nodes increased -recommendation accuracy by over 10%. Obviously, the addition of dummy nodes increases computational and memory -complexity, but in most cases this is a negligible compromise, given the scarcity of side information. Though the -platform we evaluated had over 250K items in the catalog, there were only a few hundred item categories. 
+We evaluated an ALS matrix factorization model with genre dummy users on an audio-on-demand platform with over 250k items (genres such as "comedy," "documentary," etc.). Compared to their production system, adding dummy nodes increased recommendation accuracy by over 10%. Obviously, the addition of dummy nodes increases computational and memory complexity, but in most cases this is a negligible compromise, given the scarcity of side information. Though the platform we evaluated had over 250K items in the catalog, there were only a few hundred item categories. ## Numerical data as side information -A natural question: what to do when the side information is numerical, not categorical? With numerical side information, -we advise pursuing one of the following two approaches: +A natural question: what to do when the side information is numerical, not categorical? With numerical side information, we advise pursuing one of the following two approaches: 1. Inject a dummy user, with scaled numerical values for weights -1. Analyze the distribution of the numerical data, and build categories based on value ranges +2. Analyze the distribution of the numerical data, and build categories based on value ranges ## When to use side data-injected CF vs. Neural CF -Injecting dummy nodes provides a lightweight way to improve recommendations when only limited side data is available. -The simplicity of this approach (e.g., LightFM), compared to neural models, lies in keeping training fast and preserving -interpretability. Dummy nodes are ideal for sparse categorical features like genre, but may underperform for dense -numerical data. With rich side information, neural collaborative filtering is preferred, despite increased complexity. - -Overall, dummy nodes offer a transparent way to gain value from side data, with little cost when features are simple. -They balance accuracy and speed for common cases. 
We recommend using this technique when you have sparse categorical -metadata and want straightforward gains without major modeling changes. +Injecting dummy nodes provides a lightweight way to improve recommendations when only limited side data is available. The simplicity of this approach (e.g., LightFM), compared to neural models, lies in keeping training fast and preserving interpretability. Dummy nodes are ideal for sparse categorical features like genre, but may underperform for dense numerical data. With rich side information, neural collaborative filtering is preferred, despite increased complexity. -______________________________________________________________________ +Overall, dummy nodes offer a transparent way to gain value from side data, with little cost when features are simple. They balance accuracy and speed for common cases. We recommend using this technique when you have sparse categorical metadata and want straightforward gains without major modeling changes. +--- ## Contributors - [Mirza Klimenta, PhD](https://www.linkedin.com/in/mirza-klimenta/) diff --git a/docs/use_cases/retrieval_augmented_generation.md b/docs/use_cases/retrieval_augmented_generation.md index 372a16097..cc80380ad 100644 --- a/docs/use_cases/retrieval_augmented_generation.md +++ b/docs/use_cases/retrieval_augmented_generation.md @@ -6,122 +6,64 @@ ## The case for Retrieval-augmented Generation -Retrieval-augmented Generation (RAG) balances information retrieval, which finds existing information, with generation, -which creates new content. When generation occurs without context, it can generate "hallucinations" - inaccurate / -incorrect results. In customer support and content creation generally, hallucinations can have disastrous consequences. -RAG prevents hallucinations by retrieving relevant context. It combines Large Language Models (LLMs) with external data -sources and information retrieval algorithms. 
This makes RAG an extremely valuable tool in applied settings across many -industries, including legal, education, and finance. +Retrieval-augmented Generation (RAG) balances information retrieval, which finds existing information, with generation, which creates new content. When generation occurs without context, it can generate "hallucinations" - inaccurate / incorrect results. In customer support and content creation generally, hallucinations can have disastrous consequences. RAG prevents hallucinations by retrieving relevant context. It combines Large Language Models (LLMs) with external data sources and information retrieval algorithms. This makes RAG an extremely valuable tool in applied settings across many industries, including legal, education, and finance. ## Why are RAGs getting so much attention? -RAG has become the go-to tool for professionals that want to combine the power of LLMs with their proprietary data. RAG -makes external data available to LLMs the way a prompter helps actors on stage remember their lines. RAG addresses -instances where a generative model is unable by itself to produce a correct answer to a question; RAG fetches relevant -information from an external database, thereby preventing hallucinations. +RAG has become the go-to tool for professionals that want to combine the power of LLMs with their proprietary data. RAG makes external data available to LLMs the way a prompter helps actors on stage remember their lines. RAG addresses instances where a generative model is unable by itself to produce a correct answer to a question; RAG fetches relevant information from an external database, thereby preventing hallucinations. -Hallucinations are the bogeyman that continues to haunt all generative LLMs. 
Indeed, RAGs are one of the most widely -discussed topics in the AI community (judging by experts' posts on LinkedIn and Twitter), _not_ because of RAG's -performance on real-world problems; applying RAG in industry settings only really began this year (2023), so there isn't -much robust data on them yet, [outside of academia](https://github.com/myscale/retrieval-qa-benchmark). Instead, it's -specifically RAG's ability to deal with hallucinations that makes it such a hot topic. +Hallucinations are the bogeyman that continues to haunt all generative LLMs. Indeed, RAGs are one of the most widely discussed topics in the AI community (judging by experts' posts on LinkedIn and Twitter), _not_ because of RAG's performance on real-world problems; applying RAG in industry settings only really began this year (2023), so there isn't much robust data on them yet, [outside of academia](https://github.com/myscale/retrieval-qa-benchmark). Instead, it's specifically RAG's ability to deal with hallucinations that makes it such a hot topic. ## What are hallucinations and why are they dangerous? -A machine hallucination is a false piece of information created by a generative model. Though some argue that this -anthropomorphism is -[more harmful than helpful](https://betterprogramming.pub/large-language-models-dont-hallucinate-b9bdfa202edf), saying -that a machine "sees something that isn't there" is a helpful analogy for illustrating why machine hallucinations are -bad for business. They are often named as one of the biggest blockers and concerns for industry adoption of -[Generative AI and LLMs](https://fortune.com/2023/04/17/google-ceo-sundar-pichai-artificial-intelligence-bard-hallucinations-unsolved/). -Google's inclusion of made-up information in their Chatbot new achievements presentation (February 2023) was -[followed by a 7% fall in Alphabet's stock price](https://www.cnbc.com/2023/02/08/alphabet-shares-slip-following-googles-ai-event-.html). 
-While the stock has since recovered and hit new historic highs, this hallucinatory incident demonstrated how sensitive -the public, including investors, can be to the generation of incorrect information by AI. In short, if LLMs are put into -production for use cases where there's more at stake than looking up restaurants on the internet, hallucinations can be -disastrous. +A machine hallucination is a false piece of information created by a generative model. Though some argue that this anthropomorphism is [more harmful than helpful](https://betterprogramming.pub/large-language-models-dont-hallucinate-b9bdfa202edf +), saying that a machine "sees something that isn't there" is a helpful analogy for illustrating why machine hallucinations are bad for business. They are often named as one of the biggest blockers and concerns for industry adoption of [Generative AI and LLMs](https://fortune.com/2023/04/17/google-ceo-sundar-pichai-artificial-intelligence-bard-hallucinations-unsolved/). Google's inclusion of made-up information in their Chatbot new achievements presentation (February 2023) was [followed by a 7% fall in Alphabet's stock price](https://www.cnbc.com/2023/02/08/alphabet-shares-slip-following-googles-ai-event-.html). While the stock has since recovered and hit new historic highs, this hallucinatory incident demonstrated how sensitive the public, including investors, can be to the generation of incorrect information by AI. In short, if LLMs are put into production for use cases where there's more at stake than looking up restaurants on the internet, hallucinations can be disastrous. ## Retrieval and generation: a love story -As its name indicates, RAG consists of two opposing components: retrieval and generation. Retrieval finds information -that matches an input query. It’s impossible by design to retrieve something that isn’t already there. 
Generation, on -the other hand, does the opposite: it’s prone to hallucinations because it's supposed to generate language that isn’t an -exact replication of existing data. If we had all possible reponses in our data already, there would be no need for -generation. If balanced well, these two opposites complement each other and you get a system that utilizes the best of -both worlds - retrieve what’s possible and generate what’s necessary. - -Let's take a quick look at our two components seperately and what they are doing on a basic level, and then examine how -they operate in their most common use cases. - -1. **Retrieval**: A retrieval model, usually called a "retriever," searches for information in a document or a - collection of documents. You can think of retrieval as a search problem. Traditionally, retrieval has been performed - using rather simple techniques like term frequency-inverse document frequency (TF-IDF), which basically quantifies - how relevant a piece of text is for each document in the context of other documents. Does this word occur often in - document A but not in document B and C? If so, it's probably important. The retrieved documents are then passed to - the context of our generative model. - -1. **Generation**: Generative models, or generators, _can_ generate content without external context. But because - context helps prevent hallucinations, retrievers are used to add information to generative models that they would - otherwise lack. The most popular generative text models right now are without a doubt the LLMs of OpenAI, followed by - Google's and Anthropic's. While these generators are already powerful out-of-the-box, RAG helps close the many - knowledge gaps they suffer from. The retrieved context is simply added to the instruction of the LLM, and, thanks to - a phenomenon called in-context learning, the model can incorporate the external knowledge without any updates to its - weights. 
+As its name indicates, RAG consists of two opposing components: retrieval and generation. Retrieval finds information that matches an input query. It’s impossible by design to retrieve something that isn’t already there. Generation, on the other hand, does the opposite: it’s prone to hallucinations because it's supposed to generate language that isn’t an exact replication of existing data. If we had all possible reponses in our data already, there would be no need for generation. If balanced well, these two opposites complement each other and you get a system that utilizes the best of both worlds - retrieve what’s possible and generate what’s necessary. + +Let's take a quick look at our two components seperately and what they are doing on a basic level, and then examine how they operate in their most common use cases. + +1. **Retrieval**: A retrieval model, usually called a "retriever," searches for information in a document or a collection of documents. You can think of retrieval as a search problem. Traditionally, retrieval has been performed using rather simple techniques like term frequency-inverse document frequency (TF-IDF), which basically quantifies how relevant a piece of text is for each document in the context of other documents. Does this word occur often in document A but not in document B and C? If so, it's probably important. The retrieved documents are then passed to the context of our generative model. + +2. **Generation**: Generative models, or generators, _can_ generate content without external context. But because context helps prevent hallucinations, retrievers are used to add information to generative models that they would otherwise lack. The most popular generative text models right now are without a doubt the LLMs of OpenAI, followed by Google's and Anthropic's. While these generators are already powerful out-of-the-box, RAG helps close the many knowledge gaps they suffer from. 
The retrieved context is simply added to the instruction of the LLM, and, thanks to a phenomenon called in-context learning, the model can incorporate the external knowledge without any updates to its weights. ## Common RAG use cases -Given how quickly and widely RAG is being adopted, it would be impractical here to explore all its sectors and use cases -exhaustively. Instead, we'll focus on those use cases and settings where RAGs have seen the widest adoption. +Given how quickly and widely RAG is being adopted, it would be impractical here to explore all its sectors and use cases exhaustively. Instead, we'll focus on those use cases and settings where RAGs have seen the widest adoption. ### 1. Customer Support and Chatbots -Companies providing real-time customer support increasingly use RAG-powered chatbots. By integrating retrieval -mechanisms to access customer-specific data, purchase histories, and FAQs, these chatbots can offer highly personalized -and accurate responses. This not only enhances customer satisfaction but also reduces the workload on human customer -support agents. +Companies providing real-time customer support increasingly use RAG-powered chatbots. By integrating retrieval mechanisms to access customer-specific data, purchase histories, and FAQs, these chatbots can offer highly personalized and accurate responses. This not only enhances customer satisfaction but also reduces the workload on human customer support agents. ### 2. Content Generation and Copywriting -Marketing professionals and content creators increasingly employ RAG models to generate engaging and relevant content. -By retrieving data from various sources, such as research articles, market reports, or user-generated content, RAG -assists in crafting informative and persuasive articles, product descriptions, and advertisements. +Marketing professionals and content creators increasingly employ RAG models to generate engaging and relevant content. 
By retrieving data from various sources, such as research articles, market reports, or user-generated content, RAG assists in crafting informative and persuasive articles, product descriptions, and advertisements. ### 3. Legal and Compliance -Lawyers and compliance officers deal with an extensive corpus of legal documents and regulations. RAG can simplify legal -research by retrieving relevant case law, statutes, and precedents. It can also help in drafting legal documents and -compliance reports by generating accurate and legally sound text based on the retrieved data. +Lawyers and compliance officers deal with an extensive corpus of legal documents and regulations. RAG can simplify legal research by retrieving relevant case law, statutes, and precedents. It can also help in drafting legal documents and compliance reports by generating accurate and legally sound text based on the retrieved data. ### 4. Education and Training -Educators can enhance students' learning experience by creating personalized study materials using RAG models. RAG can -retrieve relevant textbooks, research papers, and educational resources. Additionally, RAG-powered virtual tutors can -provide explanations and answer student questions in a more contextually relevant manner. +Educators can enhance students' learning experience by creating personalized study materials using RAG models. RAG can retrieve relevant textbooks, research papers, and educational resources. Additionally, RAG-powered virtual tutors can provide explanations and answer student questions in a more contextually relevant manner. ### 5. Financial Services -Financial institutions increasingly leverage RAG to provide more insightful and data-driven services. Whether it's -personalized investment advice based on market trends or retrieving historical financial data for risk analysis, RAG is -proving to be an invaluable tool for making informed financial decisions. 
+Financial institutions increasingly leverage RAG to provide more insightful and data-driven services. Whether it's personalized investment advice based on market trends or retrieving historical financial data for risk analysis, RAG is proving to be an invaluable tool for making informed financial decisions. -All the above use cases are in sectors (e.g., legal, education, finance) that deal with overwhelming amounts of -non-trivial text. RAG is built for and excels at this task, whether it's Question Answering (QA) and Summarization or -reducing response rates and improving response quality. -That being said, if you are considering using an RAG, it's important to note that it is a complementary but not -fool-proof method for improving generation context by adding external data. Most of the above use cases employ a -human-in-the-loop as a guardrail, to ensure quality outcomes. You should carefully consider whether your use case might -also warrant this kind of guardrail. +All the above use cases are in sectors (e.g., legal, education, finance) that deal with overwhelming amounts of non-trivial text. RAG is built for and excels at this task, whether it's Question Answering (QA) and Summarization or reducing response rates and improving response quality. + +That being said, if you are considering using an RAG, it's important to note that it is a complementary but not fool-proof method for improving generation context by adding external data. Most of the above use cases employ a human-in-the-loop as a guardrail, to ensure quality outcomes. You should carefully consider whether your use case might also warrant this kind of guardrail. + ## RAG implementation example - retrieving accurate financial data -Let's say you want to make an investment decision, but you need accurate financial data to base it on. You can use RAG -to ensure you're not getting erroneous information, and thereby preempt a poorly informed, potentially costly investment -choice. 
+Let's say you want to make an investment decision, but you need accurate financial data to base it on. You can use RAG to ensure you're not getting erroneous information, and thereby preempt a poorly informed, potentially costly investment choice. -To start, we need an LLM library. We'll be using LangChain, a popular LLM library, because it supports most of the -technology we will need out-of-the-box, and you'll be able to switch in any components of your preference, such as your -favorite vector database or LLM. +To start, we need an LLM library. We'll be using LangChain, a popular LLM library, because it supports most of the technology we will need out-of-the-box, and you'll be able to switch in any components of your preference, such as your favorite vector database or LLM. First, install all the libraries that we will import later: @@ -135,8 +77,7 @@ accelerate einops pypdf sentencepiece ``` -On the model side, we'll use Hugging Face's Transformers library, and Microsoft's lightweight phi-1.5, which performs -remarkably well despite being only 1.7B parameters in size. +On the model side, we'll use Hugging Face's Transformers library, and Microsoft's lightweight phi-1.5, which performs remarkably well despite being only 1.7B parameters in size. ```python @@ -152,8 +93,7 @@ tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code ``` -Imagine we are hobby investors and are considering investing in Tesla. We want to know how the company performed in Q2 -this year (2023). Let's ask our LLM about Tesla's revenue. +Imagine we are hobby investors and are considering investing in Tesla. We want to know how the company performed in Q2 this year (2023). Let's ask our LLM about Tesla's revenue. ```python @@ -181,12 +121,9 @@ Answer: Tesla's revenue for Q2 2023 was $1.2 billion. ``` -Despite our model's confident assertion, it turns out that Telsa's February earnings were _not_ the $1.2 billion it -claims. In fact, this result is way off. 
Without external data, we might have believed phi-1.5's result, and made a -poorly informed investment decision. +Despite our model's confident assertion, it turns out that Telsa's February earnings were _not_ the $1.2 billion it claims. In fact, this result is way off. Without external data, we might have believed phi-1.5's result, and made a poorly informed investment decision. -So how can we fix this? You already know the answer: RAG to the rescue. In order to retrieve relevant context, we need a -document to retrieve from in the first place. We will download Tesla's financial report for Q2 2023 from their website. +So how can we fix this? You already know the answer: RAG to the rescue. In order to retrieve relevant context, we need a document to retrieve from in the first place. We will download Tesla's financial report for Q2 2023 from their website. ```console @@ -194,8 +131,7 @@ document to retrieve from in the first place. We will download Tesla's financial ``` -Great. Next, we will transform the PDF into texts and store them as vector embeddings in a basic vector store. We will -then perform a simple similarity search on our documents (aka retrieval). +Great. Next, we will transform the PDF into texts and store them as vector embeddings in a basic vector store. We will then perform a simple similarity search on our documents (aka retrieval). ```python @@ -255,8 +191,7 @@ for doc in docs: ``` -Finally, we will use the same tokenization and generation pipeline as we did earlier, except this time we will add the -context we retrieved in front of our question. +Finally, we will use the same tokenization and generation pipeline as we did earlier, except this time we will add the context we retrieved in front of our question. ```python @@ -290,24 +225,15 @@ Answer: Tesla's revenue for Q2 2023 was $24.9 billion. ``` -Et voilà! 
If you go back to the passage from the financial report that we printed out above, you can see that our new, -retrieval augmented figure of "$24.9 billion" is the correct answer to our question. That's approximately 20x the amount -phi-1.5 hallucinated earlier ($1.2 billion). Using RAG has saved us from what could've been a terrible investment -decision. +Et voilà! If you go back to the passage from the financial report that we printed out above, you can see that our new, retrieval augmented figure of "$24.9 billion" is the correct answer to our question. That's approximately 20x the amount phi-1.5 hallucinated earlier ($1.2 billion). Using RAG has saved us from what could've been a terrible investment decision. ## Conclusion -In summary, RAG can add value to your LLM pipelines by combining the internal knowledge of your models with relevant -context from external data sources. In this way, RAG can prevent hallucinations, and more generally adapt LLMs to new -data efficiently. This makes RAG a popular choice for applications in sectors and use cases that deal with overwhelming -amounts of non-trivial text, including customer support, content generation, legal and compliance, education, and -finance. - -One final caveat: When using RAG, you have to make sure the documents you're retrieving from do *actually* include the -answers you're looking for. As with all things Data Science: garbage in, garbage out. +In summary, RAG can add value to your LLM pipelines by combining the internal knowledge of your models with relevant context from external data sources. In this way, RAG can prevent hallucinations, and more generally adapt LLMs to new data efficiently. This makes RAG a popular choice for applications in sectors and use cases that deal with overwhelming amounts of non-trivial text, including customer support, content generation, legal and compliance, education, and finance. 
-______________________________________________________________________ +One final caveat: When using RAG, you have to make sure the documents you're retrieving from do *actually* include the answers you're looking for. As with all things Data Science: garbage in, garbage out. +--- ## Contributors - [Pascal Biese, author](https://www.linkedin.com/in/pascalbiese/) diff --git a/docs/use_cases/scaling_rag_for_production.md b/docs/use_cases/scaling_rag_for_production.md index 3f191f781..b9aabdf25 100644 --- a/docs/use_cases/scaling_rag_for_production.md +++ b/docs/use_cases/scaling_rag_for_production.md @@ -1,756 +1,662 @@ - - -# Scaling RAG for Production - -Retrieval-augmented Generation (RAG) combines Large Language Models (LLMs) with external data to reduce the probability -of machine hallucinations - AI-generated information that misrepresents underlying data or reality. When developing RAG -systems, scalability is often an afterthought. This creates problems when moving from initial development to production. -Having to manually adjust code while your application grows can get very costly and is prone to errors. - -Our tutorial provides an example of **how you can develop a RAG pipeline with production workloads in mind from the -start**, using the right tools - ones that are designed to scale your application. - -## Development vs. production - -The goals and requirements of development and production are usually very different. This is particularly true for new -technologies like Large Language Models (LLMs) and Retrieval-augmented Generation (RAG), where organizations prioritize -rapid experimentation to test the waters before committing more resources. Once important stakeholders are convinced, -the focus shifts from demonstrating an application's _potential for_ creating value to _actually_ creating value, via -production. Until a system is productionized, its ROI is typically zero. 
-
-**Productionizing**, in the context of [RAG systems](https://hub.superlinked.com/retrieval-augmented-generation),
-involves transitioning from a prototype or test environment to a **stable, operational state**, in which the system is
-readily accessible and reliable for remote end users, such as via URL - i.e., independent of the end user machine state.
-Productionizing also involves **scaling** the system to handle varying levels of user demand and traffic, ensuring
-consistent performance and availability.
-
-Even though there is no ROI without productionizing, organizations often underesimate the hurdles involved in getting to
-an end product. Productionizing is always a trade-off between performance and costs, and this is no different for
-Retrieval-augmented Generation (RAG) systems. The goal is to achieve a stable, operational, and scalable end product
-while keeping costs low.
-
-Let's look more closely at the basic requirements of an
-[RAG system](https://hub.superlinked.com/retrieval-augmented-generation), before going in to the specifics of what
-you'll need to productionize it in a cost-effective but scalable way.
-
-## The basics of RAG
-
-The most basic RAG workflow looks like this:
-
-1. Submit a text query to an embedding model, which converts it into a semantically meaningful vector embedding.
-1. Send the resulting query vector embedding to your document embeddings storage location - typically a
-   [vector database](https://hub.superlinked.com/32-key-access-patterns#Ea74G).
-1. Retrieve the most relevant document chunks - based on proximity of document chunk embeddings to the query vector
-   embedding.
-1. Add the retrieved document chunks as context to the query vector embedding and send it to the LLM.
-1. The LLM generates a response utilizing the retrieved context.
-
-While RAG workflows can become significantly more complex, incorporating methods like metadata filtering and retrieval
-reranking, _all_ RAG systems must contain the components involved in the basic workflow: an embedding model, a store for
-document and vector embeddings, a retriever, and a LLM.
-
-But smart development, with productionization in mind, requires more than just setting up your components in a
-functional way. You must also develop with cost-effective scalability in mind. For this you'll need not just these basic
-components, but more specifically the tools appropriate to configuring a scalable RAG system.
-
-## Developing for scalability: the right tools
-
-### LLM library: LangChain
-
-As of this writing, LangChain, while it has also been the subject of much criticism, is arguably the most prominent LLM
-library. A lot of developers turn to Langchain to build Proof-of-Concepts (PoCs) and Minimum Viable Products (MVPs), or
-simply to experiment with new ideas. Whether one chooses LangChain or one of the other major LLM and RAG libraries - for
-example, LlamaIndex or Haystack, to name our alternate personal favorites - they can _all_ be used to productionize an
-RAG system. That is, all three have integrations for third-party libraries and providers that will handle production
-requirements. Which one you choose to interface with your other components depends on the details of your existing tech
-stack and use case.
-
-For the purpose of this tutorial, we'll use part of the Langchain documentation, along with Ray.
-
-### Scaling with Ray
-
-Because our goal is to build a 1) simple, 2) scalable, _and_ 3) economically feasible option, not reliant on proprietary
-solutions, we have chosen to use [Ray](https://github.com/ray-project/ray), a Python framework for productionizing and
-scaling machine learning (ML) workloads. Ray is designed with a range of auto-scaling features that seamlessly scale ML
-systems. It's also adaptable to both local environments and Kubernetes, efficiently managing all workload requirements.
-
-**Ray permits us to keep our tutorial system simple, non-proprietary, and on our own network, rather than the cloud**.
-While LangChain, LlamaIndex, and Haystack libraries support cloud deployment for AWS, Azure, and GCP, the details of
-cloud deployment heavily depend on - and are therefore very particular to - the specific cloud provider you choose.
-These libraries also all contain Ray integrations to enable scaling. But **using Ray directly will provide us with more
-universally applicable insights**, given that the Ray integrations within LangChain, LlamaIndex, and Haystack are built
-upon the same underlying framework.
-
-Now that we have our LLM library sorted, let's turn to data gathering and processing.
-
-## Data gathering and processing
-
-### Gathering the data
-
-Every ML journey starts with data, and that data needs to be gathered and stored somewhere. For this tutorial, we gather
-data from part of the LangChain documentation. We first download the html files and then create a
-[Ray dataset](https://docs.ray.io/en/latest/data/data.html) of them.
-
-We start by **installing all the dependencies** that we'll use:
-
-```console
-pip install ray langchain sentence-transformers qdrant-client einops openai tiktoken fastapi "ray[serve]"
-```
-
-We use the OpenAI API in this tutorial, so we'll **need an API key**. We export our API key as an environmental
-variable, and then **initialize our Ray environment** like this:
-
-```python
-import os
-import ray
-
-working_dir = "downloaded_docs"
-
-if not os.path.exists(working_dir):
-    os.makedirs(working_dir)
-
-# Setting up our Ray environment
-ray.init(runtime_env={
-    "env_vars": {
-        "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"],
-    },
-    "working_dir": str(working_dir)
-})
-```
-
-To work with the LangChain documentation, we need to **download the html files and process them**. Scraping html files
-can get very tricky and the details depend heavily on the structure of the website you’re trying to scrape. The
-functions below are only meant to be used in the context of this tutorial.
-
-```python
-import requests
-from bs4 import BeautifulSoup
-from urllib.parse import urlparse, urljoin
-from concurrent.futures import ThreadPoolExecutor, as_completed
-import re
-
-def sanitize_filename(filename):
-    filename = re.sub(r'[\\/*?:"<>|]', '', filename) # Remove problematic characters
-    filename = re.sub(r'[^\x00-\x7F]+', '_', filename) # Replace non-ASCII characters
-    return filename
-
-def is_valid(url, base_domain):
-    parsed = urlparse(url)
-    valid = bool(parsed.netloc) and parsed.path.startswith("/docs/expression_language/")
-    return valid
-
-def save_html(url, folder):
-    try:
-        headers = {'User-Agent': 'Mozilla/5.0'}
-        response = requests.get(url, headers=headers)
-        response.raise_for_status()
-
-        soup = BeautifulSoup(response.content, 'html.parser')
-        title = soup.title.string if soup.title else os.path.basename(urlparse(url).path)
-        sanitized_title = sanitize_filename(title)
-        filename = os.path.join(folder, sanitized_title.replace(" ", "_") + ".html")
-
-        if not os.path.exists(filename):
-            with open(filename, 'w', encoding='utf-8') as file:
-                file.write(str(soup))
-            print(f"Saved: {filename}")
-
-            links = [urljoin(url, link.get('href')) for link in soup.find_all('a') if link.get('href') and is_valid(urljoin(url, link.get('href')), base_domain)]
-            return links
-        else:
-            return []
-    except Exception as e:
-        print(f"Error processing {url}: {e}")
-        return []
-
-def download_all(start_url, folder, max_workers=5):
-    visited = set()
-    to_visit = {start_url}
-
-    with ThreadPoolExecutor(max_workers=max_workers) as executor:
-        while to_visit:
-            future_to_url = {executor.submit(save_html, url, folder): url for url in to_visit}
-            visited.update(to_visit)
-            to_visit.clear()
-
-            for future in as_completed(future_to_url):
-                url = future_to_url[future]
-                try:
-                    new_links = future.result()
-                    for link in new_links:
-                        if link not in visited:
-                            to_visit.add(link)
-                except Exception as e:
-                    print(f"Error with future for {url}: {e}")
-```
-
-Because the LangChain documentation is very large, we'll download only **a subset** of it: **LangChain's Expression
-Language (LCEL)**, which consists of 28 html pages.
-
-```python
-base_domain = "python.langchain.com"
-start_url = "https://python.langchain.com/docs/expression_language/"
-folder = working_dir
-
-download_all(start_url, folder, max_workers=10)
-```
-
-Now that we've downloaded the files, we can use them to **create our Ray dataset**:
-
-```python
-from pathlib import Path
-
-# Ray dataset
-document_dir = Path(folder)
-ds = ray.data.from_items([{"path": path.absolute()} for path in document_dir.rglob("*.html") if not path.is_dir()])
-print(f"{ds.count()} documents")
-```
-
-Great! But there's one more step left before we can move on to the next phase of our workflow. We need to **extract the
-relevant text from our html files and clean up all the html syntax**. For this, we import BeautifulSoup to **parse the
-files and find relevant html tags**.
-
-```python
-from bs4 import BeautifulSoup, NavigableString
-
-def extract_text_from_element(element):
-    texts = []
-    for elem in element.descendants:
-        if isinstance(elem, NavigableString):
-            text = elem.strip()
-            if text:
-                texts.append(text)
-    return "\n".join(texts)
-
-def extract_main_content(record):
-    with open(record["path"], "r", encoding="utf-8") as html_file:
-        soup = BeautifulSoup(html_file, "html.parser")
-
-    main_content = soup.find(['main', 'article']) # Add any other tags or class_="some-class-name" here
-    if main_content:
-        text = extract_text_from_element(main_content)
-    else:
-        text = "No main content found."
-
-    path = record["path"]
-    return {"path": path, "text": text}
-
-```
-
-We can now use Ray's map() function to run this extraction process. Ray lets us run multiple processes in parallel.
-
-```python
-# Extract content
-content_ds = ds.map(extract_main_content)
-content_ds.count()
-
-```
-
-Awesome! The results of the above extraction are our dataset. Because Ray datasets are optimized for scaled performance
-in production, they don't require us to make costly and error-prone adjustments to our code when our application grows.
-
-### Processing the data
-
-To process our dataset, our next three steps are **chunking, embedding, and indexing**.
-
-**Chunking the data**
-
-Chunking - splitting your documents into multiple smaller parts - is necessary to make your data meet the LLM’s context
-length limits, and helps keep contexts specific enough to remain relevant. Chunks also need to not be too small. When
-chunks are too small, the information retrieved may become too narrow to provide adequate query responses. The optimal
-chunk size will depend on your data, the models you use, and your use case. We will use a common chunking value here,
-one that has been used in a lot of applications.
- -Let’s define our text splitting logic first, using a standard text splitter from LangChain: - -```python -from functools import partial -from langchain.text_splitter import RecursiveCharacterTextSplitter - -# Defining our text splitting function -def chunking(document, chunk_size, chunk_overlap): - text_splitter = RecursiveCharacterTextSplitter( - separators=["\n\n", "\n"], - chunk_size=chunk_size, - chunk_overlap=chunk_overlap, - length_function=len) - - chunks = text_splitter.create_documents( - texts=[document["text"]], - metadatas=[{"path": document["path"]}]) - return [{"text": chunk.page_content, "path": chunk.metadata["path"]} for chunk in chunks] -``` - -Again, we utilize Ray's map() function to ensure scalability: - -```python -chunks_ds = content_ds.flat_map(partial( - chunking, - chunk_size=512, - chunk_overlap=50)) -print(f"{chunks_ds.count()} chunks") -``` - -Now that we've gathered and chunked our data scalably, we need to embed and index it, so that we can efficiently -retrieve relevant answers to our queries. - -**Embedding the data** - -We use a pretrained model to create vector embeddings for both our data chunks and the query itself. By measuring the -distance between the chunk embeddings and the query embedding, we can identify the most relevant, or "top-k," chunks. Of -the various pretrained models, we'll use the popular 'bge-base-en-v1.5' model, which, at the time of writing this -tutorial, ranks as the highest-performing model of its size on the -[MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). 
For convenience, we continue using LangChain: - -```python -from langchain.embeddings import OpenAIEmbeddings -from langchain.embeddings.huggingface import HuggingFaceEmbeddings -import numpy as np -from ray.data import ActorPoolStrategy - -def get_embedding_model(embedding_model_name, model_kwargs, encode_kwargs): - embedding_model = HuggingFaceEmbeddings( - model_name=embedding_model_name, - model_kwargs=model_kwargs, - encode_kwargs=encode_kwargs) - return embedding_model -``` - -This time, instead of map(), we want to use map_batches(), which requires defining a class object to perform a **call** -on. - -```python -class EmbedChunks: - def __init__(self, model_name): - self.embedding_model = get_embedding_model( - embedding_model_name=model_name, - model_kwargs={"device": "cuda"}, - encode_kwargs={"device": "cuda", "batch_size": 100}) - def __call__(self, batch): - embeddings = self.embedding_model.embed_documents(batch["text"]) - return {"text": batch["text"], "path": batch["path"], "embeddings": embeddings} - -# Embedding our chunks -embedding_model_name = "BAAI/bge-base-en-v1.5" -embedded_chunks = chunks_ds.map_batches( - EmbedChunks, - fn_constructor_kwargs={"model_name": embedding_model_name}, - batch_size=100, - num_gpus=1, - concurrency=1) -``` - -**Indexing the data** - -Now that our chunks are embedded, we need to **store** them somewhere. For the sake of this tutorial, we'll utilize -Qdrant’s new in-memory feature, which lets us experiment with our code rapidly without needing to set up a fully-fledged -instance. However, for deployment in a production environment, you should rely on more robust and scalable solutions — -hosted either within your own network or by a third-party provider. For example, to fully productionize, we would need -to point to our Qdrant (or your preferred hosted vendor) instance instead of using it in-memory. 
Detailed guidance on -self-hosted solutions, such as setting up a Kubernetes cluster, is beyond the scope of this tutorial. - -```python -from qdrant_client import QdrantClient -from qdrant_client.http.models import Distance, VectorParams - -# Initializing a local client in-memory -client = QdrantClient(":memory:") - -# bge-base-en-v1.5 outputs 768-dimensional vectors -embedding_size = 768 - -client.recreate_collection( - collection_name="documents", - vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE), -) -``` - -To perform the next processing step, storage, using Ray would require more than 2 CPU cores, making this tutorial -incompatible with the free tier of Google Colab. Instead, then, we'll use pandas. Fortunately, Ray allows us to convert -our dataset into a pandas DataFrame with a single line of code: - -```python -emb_chunks_df = embedded_chunks.to_pandas() -``` - -Now that our dataset is converted to pandas, we **define and execute our data storage function**: - -```python -import uuid -from qdrant_client.models import PointStruct - -def store_results(df, collection_name="documents", client=client): - # Defining our data structure - points = [ - # PointStruct is the data class used in Qdrant - PointStruct( - id=str(uuid.uuid4()), # Unique ID per chunk; hash(path) would collide for chunks of the same file - vector=embedding, - payload={ - "text": text, - "source": path - } - ) - for text, path, embedding in zip(df["text"], df["path"], df["embeddings"]) - ] - - # Adding our data points to the collection - client.upsert( - collection_name=collection_name, - points=points - ) - -store_results(emb_chunks_df) -``` - -This wraps up the data processing part! Our data is now stored in our vector database and ready to be retrieved. - -## Data retrieval - -When you retrieve data from vector storage, it's important to use the same embedding model for your query that you used -for your source data. Otherwise, vector comparison to surface relevant content may result in mismatched or non-nuanced -results (due to semantic drift, loss of context, or inconsistent distance metrics). 
- -```python -import numpy as np - -# Embed query -embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name) -query = "How to run agents?" -query_embedding = np.array(embedding_model.embed_query(query)) -len(query_embedding) -``` - -Recall from above that we measure the distance between the query embedding and chunk embeddings to identify the most -relevant, or 'top-k' chunks. In Qdrant’s search, the 'limit' parameter is equivalent to 'k'. The search uses the metric -configured for the collection (cosine similarity here), and retrieves from our database the 5 chunks closest to our query embedding: - -```python -hits = client.search( - collection_name="documents", - query_vector=query_embedding, - limit=5 # Return 5 closest points -) - -context_list = [hit.payload["text"] for hit in hits] -context = "\n".join(context_list) -``` - -We rewrite this as a function for later use: - -```python -def semantic_search(query, embedding_model, k): - query_embedding = np.array(embedding_model.embed_query(query)) - hits = client.search( - collection_name="documents", - query_vector=query_embedding, - limit=k # Return the k closest points - ) - - context_list = [{"id": hit.id, "source": str(hit.payload["source"]), "text": hit.payload["text"]} for hit in hits] - return context_list -``` - -## Generation - -We're now very close to being able to field queries and retrieve answers! We've set up everything we need to query our -LLM _at scale_. But before querying the model for a response, we want to first inform the query with our data, by -**retrieving relevant context from our vector database and then adding it to the query**. - -To do this, we use a simplified version of the generate.py script provided in Ray's -[LLM repository](https://github.com/ray-project/llm-applications/blob/main/rag/generate.py). 
This version is adapted to -our code and - to simplify and keep our focus on how to scale a basic RAG system - leaves out a bunch of advanced -retrieval techniques, such as reranking and hybrid search. For our LLM, we use gpt-3.5-turbo, and query it via the -OpenAI API. - -```python -import os -import time - -from openai import OpenAI - -def get_client(llm): - api_key = os.environ["OPENAI_API_KEY"] - client = OpenAI(api_key=api_key) - return client - -def generate_response( - llm, - max_tokens=None, - temperature=0.0, - stream=False, - system_content="", - assistant_content="", - user_content="", - max_retries=1, - retry_interval=60, -): - """Generate response from an LLM.""" - retry_count = 0 - client = get_client(llm=llm) - messages = [ - {"role": role, "content": content} - for role, content in [ - ("system", system_content), - ("assistant", assistant_content), - ("user", user_content), - ] - if content - ] - while retry_count <= max_retries: - try: - chat_completion = client.chat.completions.create( - model=llm, - max_tokens=max_tokens, - temperature=temperature, - stream=stream, - messages=messages, - ) - return prepare_response(chat_completion, stream=stream) - - except Exception as e: - print(f"Exception: {e}") - time.sleep(retry_interval) # default is per-minute rate limits - retry_count += 1 - return "" - -def response_stream(chat_completion): - for chunk in chat_completion: - content = chunk.choices[0].delta.content - if content is not None: - yield content - -def prepare_response(chat_completion, stream): - if stream: - return response_stream(chat_completion) - else: - return chat_completion.choices[0].message.content -``` - -Finally, we **generate a response**: - -```python -# Generating our response -query = "How to run agents?" -response = generate_response( - llm="gpt-3.5-turbo", - temperature=0.0, - stream=True, - system_content="Answer the query using the context provided. 
Be succinct.", - user_content=f"query: {query}, context: {context_list}") -# Stream response -for content in response: - print(content, end='', flush=True) -``` - -To **make using our application even more convenient**, we can simply adapt Ray's official documentation to **implement -our workflow within a single QueryAgent class**, which bundles together and takes care of all of the steps we -implemented above - retrieving embeddings, embedding the search query, performing vector search, processing the results, -and querying the LLM to generate a response. Using this single class approach, we no longer need to sequentially call -all of these functions, and can also include utility functions. (Specifically, `get_num_tokens` encodes our text and -gets the number of tokens, to calculate the length of the input. To maintain our standard 50:50 ratio to allocate space -to each of input and generation, we use `trim(text, max_context_length)` to trim input text if it's too long.) - -```python -import tiktoken - -def get_num_tokens(text): - enc = tiktoken.get_encoding("cl100k_base") - return len(enc.encode(text)) - - -def trim(text, max_context_length): - enc = tiktoken.get_encoding("cl100k_base") - return enc.decode(enc.encode(text)[:max_context_length]) - -class QueryAgent: - def __init__( - self, - embedding_model_name="BAAI/bge-base-en-v1.5", - llm="gpt-3.5-turbo", - temperature=0.0, - max_context_length=4096, - system_content="", - assistant_content="", - ): - # Embedding model - self.embedding_model = get_embedding_model( - embedding_model_name=embedding_model_name, - model_kwargs={"device": "cuda"}, - encode_kwargs={"device": "cuda", "batch_size": 100}, - ) - - # LLM - self.llm = llm - self.temperature = temperature - self.context_length = int( - 0.5 * max_context_length - ) - get_num_tokens( # 50% of total context reserved for input - system_content + assistant_content - ) - self.max_tokens = int( - 0.5 * max_context_length - ) # max sampled output (the other 50% of 
total context) - self.system_content = system_content - self.assistant_content = assistant_content - - def __call__( - self, - query, - num_chunks=5, - stream=True, - ): - # Get top_k context - context_results = semantic_search( - query=query, embedding_model=self.embedding_model, k=num_chunks - ) - - # Generate response - document_ids = [item["id"] for item in context_results] - context = [item["text"] for item in context_results] - sources = [item["source"] for item in context_results] - user_content = f"query: {query}, context: {context}" - answer = generate_response( - llm=self.llm, - max_tokens=self.max_tokens, - temperature=self.temperature, - stream=stream, - system_content=self.system_content, - assistant_content=self.assistant_content, - user_content=trim(user_content, self.context_length), - ) - - # Result - result = { - "question": query, - "sources": sources, - "document_ids": document_ids, - "answer": answer, - "llm": self.llm, - } - return result -``` - -To embed our query and retrieve relevant vectors, and then generate a response, we run our QueryAgent as follows: - -```python -import json - -query = "How to run an agent?" -system_content = "Answer the query using the context provided. Be succinct." -agent = QueryAgent( - embedding_model_name="BAAI/bge-base-en-v1.5", - llm="gpt-3.5-turbo", - max_context_length=4096, - system_content=system_content) -result = agent(query=query, stream=False) -print(json.dumps(result, indent=2)) -``` - -## Serving our application - -Our application is now running! Our last productionizing step is to serve it. Ray's -[Ray Serve](https://docs.ray.io/en/latest/serve/index.html) module makes this step very straightforward. We combine Ray -Serve with FastAPI and pydantic. The @serve.deployment decorator lets us define how many replicas and compute resources -we want to use, and Ray’s autoscaling will handle the rest. Two Ray Serve decorators are all we need to modify our -FastAPI application for production. 
- -```python -import pickle -import requests -from typing import List - -from fastapi import FastAPI -from pydantic import BaseModel -from ray import serve - -# Initialize application -app = FastAPI() - -class Query(BaseModel): - query: str - -class Response(BaseModel): - llm: str - question: str - sources: List[str] - answer: str # must match the "answer" key returned by QueryAgent - -@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 2, "num_gpus": 1}) -@serve.ingress(app) -class RayAssistantDeployment: - def __init__(self, embedding_model_name, embedding_dim, llm): - - # Query agent - system_content = "Answer the query using the context provided. Be succinct. " \ - "Contexts are organized in a list of dictionaries [{'text': }, {'text': }, ...]. " \ - "Feel free to ignore any contexts in the list that don't seem relevant to the query. " - self.gpt_agent = QueryAgent( - embedding_model_name=embedding_model_name, - llm="gpt-3.5-turbo", - max_context_length=4096, - system_content=system_content) - - @app.post("/query") - def query(self, query: Query) -> Response: - result = self.gpt_agent( - query=query.query, - stream=False - ) - return Response.parse_obj(result) -``` - -Now, we're ready to **deploy** our application: - -```python -# Deploying our application with Ray Serve -deployment = RayAssistantDeployment.bind( - embedding_model_name="BAAI/bge-base-en-v1.5", - embedding_dim=768, - llm="gpt-3.5-turbo") - -serve.run(deployment, route_prefix="/") -``` - -Our FastAPI endpoint can be queried like any other API, while Ray takes care of the workload automatically: - -```python -# Performing inference -data = {"query": "How to run an agent?"} -response = requests.post( -"http://127.0.0.1:8000/query", json=data -) - -try: - print(response.json()) -except ValueError: - print(response.text) -``` - -Wow! We've been on quite a journey. 
We gathered our data using Ray and some LangChain documentation, processed it by -chunking, embedding, and indexing it, set up our retrieval and generation, and, finally, served our application using -Ray Serve. Our tutorial has so far covered an example of how to develop scalably and economically - how to productionize -from the very start of development. - -Still, there is one last crucial step. - -## Production is only the start: maintenance - -To fully productionize any application, you also need to maintain it. And maintaining your application is a continuous -task. - -Maintenance involves regular assessment and improvement of your application. You may need to routinely update your -dataset if your application relies on being current with real-world changes. And, of course, you should monitor -application performance to prevent degradation. For smoother operations, we recommend integrating your workflows with -CI/CD pipelines. - -### Limitations and future discussion - -Other critical aspects of scalably productionizing fall outside of the scope of this article, but will be explored in -future articles, including: - -- **Advanced Development** Pre-training, finetuning, prompt engineering and other in-depth development techniques -- **Evaluation** Randomness and qualitative metrics, and complex multi-part structure of RAG can make LLM evaluation - difficult -- **Compliance** Adhering to data privacy laws and regulations, especially when handling personal or sensitive - information - -______________________________________________________________________ - -## Contributors - -- [Pascal Biese, author](https://www.linkedin.com/in/pascalbiese/) -- [Robert Turner, editor](https://robertturner.co/copyedit/) + + +# Scaling RAG for Production + +Retrieval-augmented Generation (RAG) combines Large Language Models (LLMs) with external data to reduce the probability of machine hallucinations - AI-generated information that misrepresents underlying data or reality. 
When developing RAG systems, scalability is often an afterthought. This creates problems when moving from initial development to production. Having to manually adjust code while your application grows can get very costly and is prone to errors. + +Our tutorial provides an example of **how you can develop a RAG pipeline with production workloads in mind from the start**, using the right tools - ones that are designed to scale your application. + +## Development vs. production + +The goals and requirements of development and production are usually very different. This is particularly true for new technologies like Large Language Models (LLMs) and Retrieval-augmented Generation (RAG), where organizations prioritize rapid experimentation to test the waters before committing more resources. Once important stakeholders are convinced, the focus shifts from demonstrating an application's _potential for_ creating value to _actually_ creating value, via production. Until a system is productionized, its ROI is typically zero. + +**Productionizing**, in the context of [RAG systems](https://hub.superlinked.com/retrieval-augmented-generation), involves transitioning from a prototype or test environment to a **stable, operational state**, in which the system is readily accessible and reliable for remote end users, such as via URL - i.e., independent of the end user machine state. Productionizing also involves **scaling** the system to handle varying levels of user demand and traffic, ensuring consistent performance and availability. + +Even though there is no ROI without productionizing, organizations often underestimate the hurdles involved in getting to an end product. Productionizing is always a trade-off between performance and costs, and this is no different for Retrieval-augmented Generation (RAG) systems. The goal is to achieve a stable, operational, and scalable end product while keeping costs low. 
+ +Let's look more closely at the basic requirements of an [RAG system](https://hub.superlinked.com/retrieval-augmented-generation), before going into the specifics of what you'll need to productionize it in a cost-effective but scalable way. + +## The basics of RAG + +The most basic RAG workflow looks like this: + +1. Submit a text query to an embedding model, which converts it into a semantically meaningful vector embedding. +2. Send the resulting query vector embedding to your document embeddings storage location - typically a [vector database](https://hub.superlinked.com/32-key-access-patterns#Ea74G). +3. Retrieve the most relevant document chunks - based on proximity of document chunk embeddings to the query vector embedding. +4. Add the retrieved document chunks as context to the original text query and send the combined prompt to the LLM. +5. The LLM generates a response utilizing the retrieved context. + +While RAG workflows can become significantly more complex, incorporating methods like metadata filtering and retrieval reranking, _all_ RAG systems must contain the components involved in the basic workflow: an embedding model, a store for document and vector embeddings, a retriever, and an LLM. + +But smart development, with productionization in mind, requires more than just setting up your components in a functional way. You must also develop with cost-effective scalability in mind. For this you'll need not just these basic components, but more specifically the tools appropriate to configuring a scalable RAG system. + +## Developing for scalability: the right tools + +### LLM library: LangChain + +As of this writing, LangChain, while it has also been the subject of much criticism, is arguably the most prominent LLM library. A lot of developers turn to LangChain to build Proof-of-Concepts (PoCs) and Minimum Viable Products (MVPs), or simply to experiment with new ideas. 
Whether one chooses LangChain or one of the other major LLM and RAG libraries - for example, LlamaIndex or Haystack, to name our alternate personal favorites - they can _all_ be used to productionize an RAG system. That is, all three have integrations for third-party libraries and providers that will handle production requirements. Which one you choose to interface with your other components depends on the details of your existing tech stack and use case. + +For the purpose of this tutorial, we'll use part of the LangChain documentation, along with Ray. + +### Scaling with Ray + +Because our goal is to build a 1) simple, 2) scalable, _and_ 3) economically feasible option, not reliant on proprietary solutions, we have chosen to use [Ray](https://github.com/ray-project/ray), a Python framework for productionizing and scaling machine learning (ML) workloads. Ray is designed with a range of auto-scaling features that seamlessly scale ML systems. It's also adaptable to both local environments and Kubernetes, efficiently managing all workload requirements. + +**Ray permits us to keep our tutorial system simple, non-proprietary, and on our own network, rather than the cloud**. While LangChain, LlamaIndex, and Haystack libraries support cloud deployment for AWS, Azure, and GCP, the details of cloud deployment heavily depend on - and are therefore very particular to - the specific cloud provider you choose. These libraries also all contain Ray integrations to enable scaling. But **using Ray directly will provide us with more universally applicable insights**, given that the Ray integrations within LangChain, LlamaIndex, and Haystack are built upon the same underlying framework. + +Now that we have our LLM library sorted, let's turn to data gathering and processing. + +## Data gathering and processing + +### Gathering the data + +Every ML journey starts with data, and that data needs to be gathered and stored somewhere. 
For this tutorial, we gather data from part of the LangChain documentation. We first download the html files and then create a [Ray dataset](https://docs.ray.io/en/latest/data/data.html) of them. + +We start by **installing all the dependencies** that we'll use: + +```console +pip install ray langchain sentence-transformers qdrant-client einops openai tiktoken fastapi "ray[serve]" +``` + +We use the OpenAI API in this tutorial, so we'll **need an API key**. We export our API key as an environment variable, and then **initialize our Ray environment** like this: + +```python +import os +import ray + +working_dir = "downloaded_docs" + +if not os.path.exists(working_dir): + os.makedirs(working_dir) + +# Setting up our Ray environment +ray.init(runtime_env={ + "env_vars": { + "OPENAI_API_KEY": os.environ["OPENAI_API_KEY"], + }, + "working_dir": str(working_dir) +}) +``` + +To work with the LangChain documentation, we need to **download the html files and process them**. Scraping html files can get very tricky and the details depend heavily on the structure of the website you’re trying to scrape. The functions below are only meant to be used in the context of this tutorial. 
+ +```python +import requests +from bs4 import BeautifulSoup +from urllib.parse import urlparse, urljoin +from concurrent.futures import ThreadPoolExecutor, as_completed +import re + +def sanitize_filename(filename): + filename = re.sub(r'[\\/*?:"<>|]', '', filename) # Remove problematic characters + filename = re.sub(r'[^\x00-\x7F]+', '_', filename) # Replace non-ASCII characters + return filename + +def is_valid(url, base_domain): + parsed = urlparse(url) + valid = bool(parsed.netloc) and parsed.path.startswith("/docs/expression_language/") + return valid + +def save_html(url, folder): + try: + headers = {'User-Agent': 'Mozilla/5.0'} + response = requests.get(url, headers=headers) + response.raise_for_status() + + soup = BeautifulSoup(response.content, 'html.parser') + title = soup.title.string if soup.title else os.path.basename(urlparse(url).path) + sanitized_title = sanitize_filename(title) + filename = os.path.join(folder, sanitized_title.replace(" ", "_") + ".html") + + if not os.path.exists(filename): + with open(filename, 'w', encoding='utf-8') as file: + file.write(str(soup)) + print(f"Saved: {filename}") + + links = [urljoin(url, link.get('href')) for link in soup.find_all('a') if link.get('href') and is_valid(urljoin(url, link.get('href')), base_domain)] + return links + else: + return [] + except Exception as e: + print(f"Error processing {url}: {e}") + return [] + +def download_all(start_url, folder, max_workers=5): + visited = set() + to_visit = {start_url} + + with ThreadPoolExecutor(max_workers=max_workers) as executor: + while to_visit: + future_to_url = {executor.submit(save_html, url, folder): url for url in to_visit} + visited.update(to_visit) + to_visit.clear() + + for future in as_completed(future_to_url): + url = future_to_url[future] + try: + new_links = future.result() + for link in new_links: + if link not in visited: + to_visit.add(link) + except Exception as e: + print(f"Error with future for {url}: {e}") +``` + +Because the LangChain 
documentation is very large, we'll download only **a subset** of it: **LangChain's Expression Language (LCEL)**, which consists of 28 html pages. + +```python +base_domain = "python.langchain.com" +start_url = "https://python.langchain.com/docs/expression_language/" +folder = working_dir + +download_all(start_url, folder, max_workers=10) +``` + +Now that we've downloaded the files, we can use them to **create our Ray dataset**: + +```python +from pathlib import Path + +# Ray dataset +document_dir = Path(folder) +ds = ray.data.from_items([{"path": path.absolute()} for path in document_dir.rglob("*.html") if not path.is_dir()]) +print(f"{ds.count()} documents") +``` + +Great! But there's one more step left before we can move on to the next phase of our workflow. We need to **extract the relevant text from our html files and clean up all the html syntax**. For this, we import BeautifulSoup to **parse the files and find relevant html tags**. + +```python +from bs4 import BeautifulSoup, NavigableString + +def extract_text_from_element(element): + texts = [] + for elem in element.descendants: + if isinstance(elem, NavigableString): + text = elem.strip() + if text: + texts.append(text) + return "\n".join(texts) + +def extract_main_content(record): + with open(record["path"], "r", encoding="utf-8") as html_file: + soup = BeautifulSoup(html_file, "html.parser") + + main_content = soup.find(['main', 'article']) # Add any other tags or class_="some-class-name" here + if main_content: + text = extract_text_from_element(main_content) + else: + text = "No main content found." + + path = record["path"] + return {"path": path, "text": text} + +``` + +We can now use Ray's map() function to run this extraction process. Ray lets us run multiple processes in parallel. + +```python +# Extract content +content_ds = ds.map(extract_main_content) +content_ds.count() + +``` + +Awesome! The results of the above extraction are our dataset. 
Because Ray datasets are optimized for scaled performance in production, they don't require us to make costly and error-prone adjustments to our code when our application grows. + +### Processing the data + +To process our dataset, our next three steps are **chunking, embedding, and indexing**. + +**Chunking the data** + +Chunking - splitting your documents into multiple smaller parts - is necessary to make your data meet the LLM’s context length limits, and helps keep contexts specific enough to remain relevant. Chunks also need to not be too small. When chunks are too small, the information retrieved may become too narrow to provide adequate query responses. The optimal chunk size will depend on your data, the models you use, and your use case. We will use a common chunking value here, one that has been used in a lot of applications. + +Let’s define our text splitting logic first, using a standard text splitter from LangChain: + +```python +from functools import partial +from langchain.text_splitter import RecursiveCharacterTextSplitter + +# Defining our text splitting function +def chunking(document, chunk_size, chunk_overlap): + text_splitter = RecursiveCharacterTextSplitter( + separators=["\n\n", "\n"], + chunk_size=chunk_size, + chunk_overlap=chunk_overlap, + length_function=len) + + chunks = text_splitter.create_documents( + texts=[document["text"]], + metadatas=[{"path": document["path"]}]) + return [{"text": chunk.page_content, "path": chunk.metadata["path"]} for chunk in chunks] +``` + +Again, we utilize Ray's map() function to ensure scalability: + +```python +chunks_ds = content_ds.flat_map(partial( + chunking, + chunk_size=512, + chunk_overlap=50)) +print(f"{chunks_ds.count()} chunks") +``` + +Now that we've gathered and chunked our data scalably, we need to embed and index it, so that we can efficiently retrieve relevant answers to our queries. 
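
To make the roles of `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in and not LangChain's `RecursiveCharacterTextSplitter`, which additionally tries to split on the configured separators; the helper name `sliding_window_chunks` is hypothetical and exists only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary still appears intact in at least one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap yields more chunks (and more embedding work) for the same text.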
+ +**Embedding the data** + +We use a pretrained model to create vector embeddings for both our data chunks and the query itself. By measuring the distance between the chunk embeddings and the query embedding, we can identify the most relevant, or "top-k," chunks. Of the various pretrained models, we'll use the popular 'bge-base-en-v1.5' model, which, at the time of writing this tutorial, ranks as the highest-performing model of its size on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). For convenience, we continue using LangChain: + +```python +from langchain.embeddings import OpenAIEmbeddings +from langchain.embeddings.huggingface import HuggingFaceEmbeddings +import numpy as np +from ray.data import ActorPoolStrategy + +def get_embedding_model(embedding_model_name, model_kwargs, encode_kwargs): + embedding_model = HuggingFaceEmbeddings( + model_name=embedding_model_name, + model_kwargs=model_kwargs, + encode_kwargs=encode_kwargs) + return embedding_model +``` + +This time, instead of map(), we want to use map_batches(), which requires defining a class object to perform a **call** on. + +```python +class EmbedChunks: + def __init__(self, model_name): + self.embedding_model = get_embedding_model( + embedding_model_name=model_name, + model_kwargs={"device": "cuda"}, + encode_kwargs={"device": "cuda", "batch_size": 100}) + def __call__(self, batch): + embeddings = self.embedding_model.embed_documents(batch["text"]) + return {"text": batch["text"], "path": batch["path"], "embeddings": embeddings} + +# Embedding our chunks +embedding_model_name = "BAAI/bge-base-en-v1.5" +embedded_chunks = chunks_ds.map_batches( + EmbedChunks, + fn_constructor_kwargs={"model_name": embedding_model_name}, + batch_size=100, + num_gpus=1, + concurrency=1) +``` + +**Indexing the data** + +Now that our chunks are embedded, we need to **store** them somewhere. 
For the sake of this tutorial, we'll utilize Qdrant’s new in-memory feature, which lets us experiment with our code rapidly without needing to set up a fully-fledged instance. However, for deployment in a production environment, you should rely on more robust and scalable solutions - hosted either within your own network or by a third-party provider. For example, to fully productionize, we would need to point to our Qdrant (or your preferred hosted vendor) instance instead of using it in-memory. Detailed guidance on self-hosted solutions, such as setting up a Kubernetes cluster, is beyond the scope of this tutorial. + +```python +from qdrant_client import QdrantClient +from qdrant_client.http.models import Distance, VectorParams + +# Initializing a local client in-memory +client = QdrantClient(":memory:") + +# bge-base-en-v1.5 outputs 768-dimensional vectors +embedding_size = 768 + +client.recreate_collection( + collection_name="documents", + vectors_config=VectorParams(size=embedding_size, distance=Distance.COSINE), +) +``` + +To perform the next processing step, storage, using Ray would require more than 2 CPU cores, making this tutorial incompatible with the free tier of Google Colab. Instead, then, we'll use pandas. 
Fortunately, Ray allows us to convert our dataset into a pandas DataFrame with a single line of code: + +```python +emb_chunks_df = embedded_chunks.to_pandas() +``` + +Now that our dataset is converted to pandas, we **define and execute our data storage function**: + +```python +import uuid + +from qdrant_client.models import PointStruct + +def store_results(df, collection_name="documents", client=client): + # Defining our data structure + points = [ + # PointStruct is the data class used in Qdrant + PointStruct( + id=str(uuid.uuid4()), # Unique ID for each point; hash(path) would collide for chunks that share a path + vector=embedding, + payload={ + "text": text, + "source": path + } + ) + for text, path, embedding in zip(df["text"], df["path"], df["embeddings"]) + ] + + # Adding our data points to the collection + client.upsert( + collection_name=collection_name, + points=points + ) + +store_results(emb_chunks_df) +``` + +This wraps up the data processing part! Our data is now stored in our vector database and ready to be retrieved. + +## Data retrieval + +When you retrieve data from vector storage, it's important to use the same embedding model for your query that you used for your source data. Otherwise, vector comparison to surface relevant content may result in mismatched or non-nuanced results (due to semantic drift, loss of context, or inconsistent distance metrics). + +```python +import numpy as np + +# Embed query +embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name) +query = "How to run agents?" +query_embedding = np.array(embedding_model.embed_query(query)) +len(query_embedding) +``` + +Recall from above that we measure the distance between the query embedding and chunk embeddings to identify the most relevant, or 'top-k', chunks. In Qdrant’s search, the 'limit' parameter is equivalent to 'k'. 
The search uses the distance metric configured on the collection (cosine similarity in our case), and here retrieves from our database the 5 chunks closest to our query embedding: + +```python +hits = client.search( + collection_name="documents", + query_vector=query_embedding, + limit=5 # Return 5 closest points +) + +context_list = [hit.payload["text"] for hit in hits] +context = "\n".join(context_list) +``` + +We rewrite this as a function for later use: + +```python +def semantic_search(query, embedding_model, k): + query_embedding = np.array(embedding_model.embed_query(query)) + hits = client.search( + collection_name="documents", + query_vector=query_embedding, + limit=k # Return the k closest points + ) + + context_list = [{"id": hit.id, "source": str(hit.payload["source"]), "text": hit.payload["text"]} for hit in hits] + return context_list +``` + +## Generation + +We're now very close to being able to field queries and retrieve answers! We've set up everything we need to query our LLM _at scale_. But before querying the model for a response, we want to first inform the query with our data, by **retrieving relevant context from our vector database and then adding it to the query**. + +To do this, we use a simplified version of the generate.py script provided in Ray's [LLM repository](https://github.com/ray-project/llm-applications/blob/main/rag/generate.py). This version is adapted to our code and - to simplify and keep our focus on how to scale a basic RAG system - leaves out several advanced retrieval techniques, such as reranking and hybrid search. For our LLM, we use gpt-3.5-turbo, and query it via the OpenAI API. 
+ +```python +import os +import time + +from openai import OpenAI + +def get_client(llm): + api_key = os.environ["OPENAI_API_KEY"] + client = OpenAI(api_key=api_key) + return client + +def generate_response( + llm, + max_tokens=None, + temperature=0.0, + stream=False, + system_content="", + assistant_content="", + user_content="", + max_retries=1, + retry_interval=60, +): + """Generate response from an LLM.""" + retry_count = 0 + client = get_client(llm=llm) + messages = [ + {"role": role, "content": content} + for role, content in [ + ("system", system_content), + ("assistant", assistant_content), + ("user", user_content), + ] + if content + ] + while retry_count <= max_retries: + try: + chat_completion = client.chat.completions.create( + model=llm, + max_tokens=max_tokens, + temperature=temperature, + stream=stream, + messages=messages, + ) + return prepare_response(chat_completion, stream=stream) + + except Exception as e: + print(f"Exception: {e}") + time.sleep(retry_interval) # default is per-minute rate limits + retry_count += 1 + return "" + +def response_stream(chat_completion): + for chunk in chat_completion: + content = chunk.choices[0].delta.content + if content is not None: + yield content + +def prepare_response(chat_completion, stream): + if stream: + return response_stream(chat_completion) + else: + return chat_completion.choices[0].message.content +``` + +Finally, we **generate a response**: + +```python +# Generating our response +query = "How to run agents?" +response = generate_response( + llm="gpt-3.5-turbo", + temperature=0.0, + stream=True, + system_content="Answer the query using the context provided. 
Be succinct.", + user_content=f"query: {query}, context: {context_list}") +# Stream response +for content in response: + print(content, end='', flush=True) +``` + +To **make using our application even more convenient**, we can simply adapt Ray's official documentation to **implement our workflow within a single QueryAgent class**, which bundles together and takes care of all of the steps we implemented above - retrieving embeddings, embedding the search query, performing vector search, processing the results, and querying the LLM to generate a response. Using this single class approach, we no longer need to sequentially call all of these functions, and can also include utility functions. (Specifically, `get_num_tokens` encodes our text and counts its tokens, giving us the length of the input. To maintain our 50:50 split of the context window between input and generation, we use `trim(text, max_context_length)` to cut the input text if it's too long.) + +```python +import tiktoken + +def get_num_tokens(text): + enc = tiktoken.get_encoding("cl100k_base") + return len(enc.encode(text)) + + +def trim(text, max_context_length): + enc = tiktoken.get_encoding("cl100k_base") + return enc.decode(enc.encode(text)[:max_context_length]) + +class QueryAgent: + def __init__( + self, + embedding_model_name="BAAI/bge-base-en-v1.5", + llm="gpt-3.5-turbo", + temperature=0.0, + max_context_length=4096, + system_content="", + assistant_content="", + ): + # Embedding model + self.embedding_model = get_embedding_model( + embedding_model_name=embedding_model_name, + model_kwargs={"device": "cuda"}, + encode_kwargs={"device": "cuda", "batch_size": 100}, + ) + + # LLM + self.llm = llm + self.temperature = temperature + self.context_length = int( + 0.5 * max_context_length + ) - get_num_tokens( # 50% of total context reserved for input + system_content + assistant_content + ) + self.max_tokens = int( + 0.5 * max_context_length + ) # max sampled output (the other 50% of total 
context) + self.system_content = system_content + self.assistant_content = assistant_content + + def __call__( + self, + query, + num_chunks=5, + stream=True, + ): + # Get top_k context + context_results = semantic_search( + query=query, embedding_model=self.embedding_model, k=num_chunks + ) + + # Generate response + document_ids = [item["id"] for item in context_results] + context = [item["text"] for item in context_results] + sources = [item["source"] for item in context_results] + user_content = f"query: {query}, context: {context}" + answer = generate_response( + llm=self.llm, + max_tokens=self.max_tokens, + temperature=self.temperature, + stream=stream, + system_content=self.system_content, + assistant_content=self.assistant_content, + user_content=trim(user_content, self.context_length), + ) + + # Result + result = { + "question": query, + "sources": sources, + "document_ids": document_ids, + "answer": answer, + "llm": self.llm, + } + return result +``` + +To embed our query and retrieve relevant vectors, and then generate a response, we run our QueryAgent as follows: + +```python +import json + +query = "How to run an agent?" +system_content = "Answer the query using the context provided. Be succinct." +agent = QueryAgent( + embedding_model_name="BAAI/bge-base-en-v1.5", + llm="gpt-3.5-turbo", + max_context_length=4096, + system_content=system_content) +result = agent(query=query, stream=False) +print(json.dumps(result, indent=2)) +``` + +## Serving our application + +Our application is now running! Our last productionizing step is to serve it. Ray's [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) module makes this step very straightforward. We combine Ray Serve with FastAPI and pydantic. The @serve.deployment decorator lets us define how many replicas and compute resources we want to use, and Ray’s autoscaling will handle the rest. Two Ray Serve decorators are all we need to modify our FastAPI application for production. 
+ +```python +import pickle +import requests +from typing import List + +from fastapi import FastAPI +from pydantic import BaseModel +from ray import serve + +# Initialize application +app = FastAPI() + +class Query(BaseModel): + query: str + +class Response(BaseModel): + llm: str + question: str + sources: List[str] + answer: str # Must match the "answer" key returned by QueryAgent + +@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 2, "num_gpus": 1}) +@serve.ingress(app) +class RayAssistantDeployment: + def __init__(self, embedding_model_name, embedding_dim, llm): + + # Query agent + system_content = "Answer the query using the context provided. Be succinct. " \ + "Contexts are organized in a list of dictionaries [{'text': }, {'text': }, ...]. " \ + "Feel free to ignore any contexts in the list that don't seem relevant to the query. " + self.gpt_agent = QueryAgent( + embedding_model_name=embedding_model_name, + llm=llm, + max_context_length=4096, + system_content=system_content) + + @app.post("/query") + def query(self, query: Query) -> Response: + result = self.gpt_agent( + query=query.query, + stream=False + ) + return Response.parse_obj(result) +``` + +Now, we're ready to **deploy** our application: + +```python +# Deploying our application with Ray Serve +deployment = RayAssistantDeployment.bind( + embedding_model_name="BAAI/bge-base-en-v1.5", + embedding_dim=768, + llm="gpt-3.5-turbo") + +serve.run(deployment, route_prefix="/") +``` + +Our FastAPI endpoint can be queried like any other API, while Ray takes care of the workload automatically: + +```python +# Performing inference +data = {"query": "How to run an agent?"} +response = requests.post( +"http://127.0.0.1:8000/query", json=data +) + +try: + print(response.json()) +except ValueError: # Response body was not valid JSON + print(response.text) +``` + +Wow! We've been on quite a journey. 
We gathered our data (some LangChain documentation) using Ray, processed it by chunking, embedding, and indexing it, set up our retrieval and generation, and, finally, served our application using Ray Serve. Our tutorial has so far covered an example of how to develop scalably and economically - how to productionize from the very start of development. + +Still, there is one last crucial step. + +## Production is only the start: maintenance + +To fully productionize any application, you also need to maintain it. And maintaining your application is a continuous task. + +Maintenance involves regular assessment and improvement of your application. You may need to routinely update your dataset if your application relies on being current with real-world changes. And, of course, you should monitor application performance to prevent degradation. For smoother operations, we recommend integrating your workflows with CI/CD pipelines. + +### Limitations and future discussion + +Other critical aspects of scalably productionizing fall outside of the scope of this article, but will be explored in future articles, including: + +- **Advanced Development** Pre-training, finetuning, prompt engineering and other in-depth development techniques +- **Evaluation** Randomness, qualitative metrics, and the complex multi-part structure of RAG can make LLM evaluation difficult +- **Compliance** Adhering to data privacy laws and regulations, especially when handling personal or sensitive information + +--- +## Contributors + +- [Pascal Biese, author](https://www.linkedin.com/in/pascalbiese/) +- [Robert Turner, editor](https://robertturner.co/copyedit/)