
Merge pull request #237 from superlinked/robertdhayanturner-patch-3
Update retrieval_augmented_generation_eval.md
robertdhayanturner authored Feb 16, 2024
2 parents 8b08eb1 + 9233165 commit 0a41959
Showing 1 changed file with 7 additions and 6 deletions: docs/use_cases/retrieval_augmented_generation_eval.md
<!-- SEO: Retrieval augmented generation Evaluation - TODO Summary
-->
<!--First of a three-part (monthly) series on RAG evaluation. In this article, we introduce RAG evaluation and its challenges, outline the broad strokes of an effective evaluation framework, and provide an overview of the kinds of evaluation tools and approaches you can use to evaluate your RAG application. We also provide useful links to articles on RAG, RAG evaluation, a vector database feature matrix, and discuss Golden Sets, BLEU and ROUGE, and more. -->

# Evaluating Retrieval Augmented Generation - a framework for assessment

*In this first article of a three-part (monthly) series, we introduce RAG evaluation, outline its challenges, propose an effective evaluation framework, and provide a rough overview of the tools and approaches you can use to evaluate your RAG application.*

## Why evaluate RAG?

Retrieval Augmented Generation (RAG) is probably the most useful application of …

*RAG system (above) using* <a href="https://qdrant.tech" rel="nofollow">*Qdrant*</a> *as the knowledge store. To determine which Vector Database fits your specific use case, refer to the* [*Vector DB feature matrix*](https://vdbs.superlinked.com/).
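
To make the retrieval step pictured above concrete, here is a minimal sketch of querying Qdrant as the knowledge store. It assumes a running local Qdrant instance, a hypothetical, already-populated `articles` collection, and an off-the-shelf SentenceTransformer encoder; all of these specifics are illustrative choices, not requirements.

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Hypothetical setup: a local Qdrant instance with an already-populated "articles" collection.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

question = "How do I evaluate a RAG application?"

# Retrieval step: embed the user question and fetch the most similar chunks
# from the knowledge store.
hits = client.search(
    collection_name="articles",
    query_vector=encoder.encode(question).tolist(),
    limit=3,
)

# The retrieved chunks (stored here under a hypothetical "text" payload field)
# become the context passed to the LLM for generation.
context = "\n\n".join(hit.payload["text"] for hit in hits)
```

In a full RAG loop, the question and `context` are then combined into a prompt for the generating LLM; each of these stages (embedding, ingestion, retrieval, generation) is something the framework below assesses separately.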

**But to see what is and isn't working in your RAG system, to refine and optimize, you have to evaluate it**. Evaluation is essential to validating that your application does what users expect. In this article (the first of three, one per month), we go over the broad strokes of our proposed evaluation framework, which includes separate assessments of the model itself, data ingestion, semantic retrieval, and, finally, the RAG application end-to-end, providing a high-level discussion of what's involved in each.

In article 2, we'll look at RAGAS ([RAG Assessment](https://github.com/explodinggradients/ragas)), learn how to set it up with an example, calculate some of the supported metrics, and compare that with our proposed framework. We'll also examine some examples of our proposed framework. Then, in article 3, we will look at [Arize AI](https://arize.com/)'s way of evaluating RAG applications, using Phoenix Evals to focus on Retrieval evaluation and Response evaluation.

To see where things are going well, can be improved, and also where errors may occur …

![Classification of Challenges of RAG Evaluation](../assets/use_cases/retrieval_augmented_generation_eval/rag_challenges.jpg)

*The challenges of RAG evaluation (above), including the* [*'Lost in the Middle'*](https://arxiv.org/abs/2307.03172) *problem*.

The evaluation framework we propose is meant to ensure granular and thorough measurement, addressing the challenges faced in all three components. Broadly, we want to assess:

Let's take a closer look to see what's involved in each of these levels individually.

We want to ensure that the model can understand the data that we encode. The [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) leverages a range of public and private datasets to evaluate and report on the capabilities of individual models. We can use the MTEB to evaluate any model on its list. If, on the other hand, you're working with a specialized domain, you may want to put together a specialized dataset to train the model. Another option is to run relevant 'tasks' for your custom model, using the instructions available [here](https://github.com/embeddings-benchmark/mteb#leaderboard).

For a custom SentenceTransformer-based model, we can set up evaluation tasks as in the following code. We import, configure, initialize, and then evaluate our model:

```python
import logging
```
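
A minimal sketch of what such an MTEB setup for a custom SentenceTransformer-based model might look like follows; the model name, task selection, and output folder are illustrative assumptions rather than the file's actual values.

```python
import logging

from mteb import MTEB
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

# Illustrative model name: swap in your own fine-tuned SentenceTransformer checkpoint.
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)

# Illustrative task selection; MTEB also supports selecting tasks by type or language.
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])

# Run the evaluation and write per-task result files to the output folder.
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

Each task writes its scores to a JSON file in the output folder, which you can then compare against the published MTEB leaderboard numbers for off-the-shelf models.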
