From 92331653f0b04a3a29d6fe43b743ce46500138da Mon Sep 17 00:00:00 2001
From: robertturner <143536791+robertdhayanturner@users.noreply.github.com>
Date: Fri, 16 Feb 2024 10:39:27 -0500
Subject: [PATCH] Update retrieval_augmented_generation_eval.md

update SEO description, title, and italicized intro statement
---
 .../retrieval_augmented_generation_eval.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/docs/use_cases/retrieval_augmented_generation_eval.md b/docs/use_cases/retrieval_augmented_generation_eval.md
index ce628461b..dbdcf80e1 100644
--- a/docs/use_cases/retrieval_augmented_generation_eval.md
+++ b/docs/use_cases/retrieval_augmented_generation_eval.md
@@ -1,7 +1,8 @@
-
+
 
-# Evaluating Retrieval Augmented Generation
+# Evaluating Retrieval Augmented Generation - a framework for assessment
+
+*In this first article of a three-part (monthly) series, we introduce RAG evaluation, outline its challenges, propose an effective evaluation framework, and provide a rough overview of the various evaluation tools and approaches you can use to evaluate your RAG application.*
 
 ## Why evaluate RAG?
 
@@ -11,7 +12,7 @@ Retrieval Augmented Generation (RAG) is probably the most useful application of
 
 *RAG system (above) using* *Qdrant* *as the knowledge store. To determine which Vector Database fits your specific use case, refer to the* [*Vector DB feature matrix*](https://vdbs.superlinked.com/).
 
-**But to see what is and isn't working in your RAG system, to refine and optimize, you have to evaluate it**. Evaluation is essential to validate and make sure your application does what users expect it to. In this article (the first of three), we go over the broad strokes of our proposed evaluation framework, which includes separate assessments of the model itself, data ingestion, semantic retrieval, and, finally, the RAG application end-to-end, providing a high level discussion of what's involved in each.
+**But to see what is and isn't working in your RAG system, to refine and optimize, you have to evaluate it**. Evaluation is essential to validate and make sure your application does what users expect it to. In this article (the first of three, one per month), we go over the broad strokes of our proposed evaluation framework, which includes separate assessments of the model itself, data ingestion, semantic retrieval, and, finally, the RAG application end-to-end, providing a high level discussion of what's involved in each.
 
 In article 2, we'll look at RAGAS ([RAG Assessment](https://github.com/explodinggradients/ragas)), learn how to set it up with an example, calculate some of the supported metrics, and compare that with our proposed framework. We'll also examine some examples of our proposed framework. Then, in article 3, we will look at [Arize AI](https://arize.com/)'s way of evaluating RAG applications, using Phoenix Evals to focus on Retrieval evaluation and Response evaluation.
 
@@ -37,7 +38,7 @@ To see where things are going well, can be improved, and also where errors may o
 
 ![Classification of Challenges of RAG Evaluation](../assets/use_cases/retrieval_augmented_generation_eval/rag_challenges.jpg)
 
-*The challenges of RAG Evaluation Presentation (above), including the* [*'Lost in the Middle'*](https://arxiv.org/abs/2307.03172) *problem*.
+*The challenges of RAG evaluation (above), including the* [*'Lost in the Middle'*](https://arxiv.org/abs/2307.03172) *problem*.
 
 The evaluation framework we propose is meant to ensure granular and thorough measurement, addressing the challenges faced in all three components. Broadly, we want to assess:
 
@@ -58,7 +59,7 @@ Let's take a closer look to see what's involved in each of these levels individu
 
 We want to ensure that the model can understand the data that we encode. The [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leverages different public/private datasets to evaluate and report on the different capabilities of individual models. We can use the MTEB to evaluate any model in its list. If, on the other hand, you're working with specialized domains, you may want to put together a specialized dataset to train the model. Another option is to run relevant 'tasks' for your custom model, using instructions available [here](https://github.com/embeddings-benchmark/mteb#leaderboard).
 
-For a custom SentenceTransformer-based model we can set up evaluation tasks as in the following code. We import, configure, initialize, and then evaluate our model:
+For a custom SentenceTransformer-based model, we can set up evaluation tasks as in the following code. We import, configure, initialize, and then evaluate our model:
 
 ```python
 import logging
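# --- Hedged sketch, not part of the patch above: a standalone example of the
# kind of evaluation-task setup the patched article's snippet introduces
# ("import, configure, initialize, and then evaluate"). It assumes the classic
# `mteb` runner API and `sentence-transformers` are installed; the model name
# and task list are illustrative placeholders, not the article's own choices.
import logging

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Surface MTEB's per-task progress messages while the benchmark runs.
logging.basicConfig(level=logging.INFO)

# Path or name of the custom SentenceTransformer checkpoint to evaluate (placeholder).
model_name = "my-custom-sentence-transformer"
model = SentenceTransformer(model_name)

# Select a small set of MTEB tasks relevant to your domain (placeholders here).
evaluation = MTEB(tasks=["SciFact", "STSBenchmark"])

# Run the selected tasks on their test splits; MTEB writes per-task JSON
# results into the output folder for later comparison across models.
evaluation.run(model, output_folder=f"results/{model_name}", eval_splits=["test"])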