
Commit

fix
Signed-off-by: Prithvi Kannan <[email protected]>
prithvikannan committed Oct 9, 2024
1 parent 05cca40 commit 42b7ceb
Showing 8 changed files with 28 additions and 21 deletions.
14 changes: 9 additions & 5 deletions genai_cookbook/nbs/1-introduction-to-agents.md
@@ -18,19 +18,23 @@ RAG addresses this issue by first retrieving relevant information from the compa

An agent with a retriever tool is one pattern for RAG, and has the advantage of deciding when it needs to perform retrieval. This cookbook will describe how to build such an agent.

-## Core components of a Agent application
+## Core components of an agent application

An agent application is an example of a [compound AI system](https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/): it expands on the language capabilities of the model alone by combining it with other tools and procedures.

When using a standalone LLM, a user submits a request, such as a question, to the LLM, and the LLM responds with an answer based solely on its training data.

-In its most basic form, the following steps happen in an agent application with a retriever tool:
+In its most basic form, the following steps happen in an agent application:

-1. **Retrieval:** The **user's request** is used to query some outside source of information. This might mean querying a vector store, conducting a keyword search over some text, or querying a SQL database. The goal of the retrieval step is to obtain **supporting data** that will help the LLM provide a useful response.
+1. **User query understanding**: First, the agent uses an LLM to understand the user's query. This step may also consider previous turns of the conversation, if provided.

-2. **Augmentation:** The **supporting data** from the retrieval step is combined with the **user's request**, often using a template with additional formatting and instructions to the LLM, to create a **prompt**.
+2. **Tool selection**: The agent will use an LLM to determine if it should use a retriever tool. In the case of a vector search retriever, the LLM will create a retriever query, which will help retrieve relevant chunks from the vector database. If no tool is selected, the agent will skip to step 4 and generate the final response.

-3. **Generation:** The resulting **prompt** is passed to the LLM, and the LLM generates a response to the **user's request**.
+3. **Tool execution**: The agent will then execute the tool with the parameters determined by the LLM and return the output.

+4. **LLM Generation**: The LLM will then generate the final response.

+The image below demonstrates a RAG agent where a retrieval tool is selected.

```{image} ../images/1-introduction-to-agents/1_img.png
:alt: RAG process
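To make the four steps above concrete, here is a minimal sketch of an agent loop with a single retriever tool. It assumes an OpenAI-compatible chat-completions client with tool calling; the model name, the `search_docs` tool, and its stub implementation are illustrative placeholders, not a definitive implementation.

```python
# Minimal sketch of the agent loop: understand the query, let the LLM decide
# whether to call the retriever tool, execute it, then generate the answer.
# Assumes an OpenAI-compatible endpoint; `search_docs` and the model name are
# illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI()

RETRIEVER_TOOL = {
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Retrieve documentation chunks relevant to the query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def search_docs(query: str) -> str:
    """Placeholder retriever: wire this to your vector search index."""
    return f"(chunks retrieved for: {query})"


def run_agent(messages: list) -> str:
    # Steps 1-2: the LLM reads the conversation and decides whether to use the tool.
    first = client.chat.completions.create(
        model="my-chat-model", messages=messages, tools=[RETRIEVER_TOOL]
    )
    reply = first.choices[0].message
    if not reply.tool_calls:
        # No tool selected: the first response is already the final answer.
        return reply.content

    # Step 3: execute the retriever with the parameters chosen by the LLM.
    call = reply.tool_calls[0]
    tool_output = search_docs(**json.loads(call.function.arguments))

    # Step 4: generate the final response grounded in the tool output.
    followup = messages + [
        reply,
        {"role": "tool", "tool_call_id": call.id, "content": tool_output},
    ]
    final = client.chat.completions.create(model="my-chat-model", messages=followup)
    return final.choices[0].message.content
```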
3 changes: 2 additions & 1 deletion genai_cookbook/nbs/2-fundamentals-unstructured-chain.md
@@ -2,9 +2,10 @@

Once the data has been processed by the data pipeline, it is suitable for use in a retriever tool. This section describes the process that occurs once the user submits a request to the agent application in an online setting.

+<!-- TODO (prithvi): add this back in once updated to agents
```{image} ../images/2-fundamentals-unstructured/3_img.png
:align: center
-```
+``` -->
<br/>

1. **User query understanding**: First, the agent uses an LLM to understand the user's query. This step may also consider previous turns of the conversation, if provided.
@@ -1,10 +1,10 @@
## Data pipeline

-Throughout this guide we will focus on preparing unstructured data for use in Agent applications. *Unstructured* data refers to data without a specific structure or organization, such as PDF documents that might include text and images, or multimedia content such as audio or videos.
+Throughout this guide we will focus on preparing unstructured data for use in agent applications. *Unstructured* data refers to data without a specific structure or organization, such as PDF documents that might include text and images, or multimedia content such as audio or videos.

Unstructured data lacks a predefined data model or schema, making it impossible to query on the basis of structure and metadata alone. As a result, unstructured data requires techniques that can understand and extract semantic meaning from raw text, images, audio, or other content.

-During data preparation, the Agent application's data pipeline takes raw unstructured data and transforms it into discrete chunks that can be queried based on their relevance to a user's query. The key steps in data preprocessing are outlined below. Each step has a variety of knobs that can be tuned - for a deeper dive discussion on these knobs, please refer to the [deep dive into RAG section.](/nbs/3-deep-dive)
+During data preparation, the agent application's data pipeline takes raw unstructured data and transforms it into discrete chunks that can be queried based on their relevance to a user's query. The key steps in data preprocessing are outlined below. Each step has a variety of knobs that can be tuned - for a deeper discussion of these knobs, please refer to the [deep dive into RAG section](/nbs/3-deep-dive).

```{image} ../images/2-fundamentals-unstructured/2_img.png
:align: center
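As a rough sketch of the preprocessing steps above, here is a minimal parse → chunk → embed → index loop. The document parser, embedding model, and vector-store writer are passed in as callables because they are assumptions, not specific libraries.

```python
# Illustrative data pipeline: parse raw documents, split them into chunks,
# embed each chunk, and write it to a vector index. The parser, embedding
# model, and index writer are hypothetical callables supplied by the caller.
from pathlib import Path
from typing import Callable


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split raw text into overlapping chunks sized for retrieval."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


def build_index(
    doc_dir: str,
    parse_document: Callable[[Path], str],
    embed: Callable[[str], list[float]],
    add_to_index: Callable[[str, str, list[float]], None],
) -> None:
    for path in sorted(Path(doc_dir).glob("*.pdf")):
        text = parse_document(path)                          # 1. parse the raw document
        for i, chunk in enumerate(chunk_text(text)):         # 2. split into chunks
            vector = embed(chunk)                            # 3. embed each chunk
            add_to_index(f"{path.name}-{i}", chunk, vector)  # 4. store chunk + embedding
```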
2 changes: 1 addition & 1 deletion genai_cookbook/nbs/2-fundamentals-unstructured-eval.md
@@ -2,7 +2,7 @@

Evaluation and monitoring are critical components to understand if your agent application is performing to the *quality*, *cost*, and *latency* requirements dictated by your use case. Technically, **evaluation** happens during development and **monitoring** happens once the application is deployed to production, but the fundamental components are similar.

-An Agent with retriever tool over unstructured data is a complex system with many components that impact the application's quality. Adjusting any single element can have cascading effects on the others. For instance, data formatting changes can influence the retrieved chunks and the LLM's ability to generate relevant responses. Therefore, it's crucial to evaluate each of the application's components in addition to the application as a whole in order to iteratively refine it based on those assessments.
+Often, an agent is a complex system with many components that impact the application's quality. Adjusting any single element can have cascading effects on the others. For instance, data formatting changes can influence the retrieved chunks and the LLM's ability to generate relevant responses. Therefore, it's crucial to evaluate each of the application's components in addition to the application as a whole in order to iteratively refine it based on those assessments.

Evaluation and monitoring of Generative AI applications, including agents, differs from classical machine learning in several ways:

4 changes: 2 additions & 2 deletions genai_cookbook/nbs/2-fundamentals-unstructured.md
@@ -1,11 +1,11 @@
# Agent fundamentals

-In [section 1](1-introduction-to-agents) of this guide, we introduced agents and RAG, explained its functionality at a high level, and highlighted its advantages over standalone LLMs.
+In [section 1](1-introduction-to-agents) of this guide, we introduced agents and RAG, explained their functionality at a high level, and highlighted their advantages over standalone LLMs.

This section will introduce the key components and principles behind developing agent applications over unstructured data. In particular, we will discuss:

1. **[Data pipeline](./2-fundamentals-unstructured-data-pipeline):** Transforming unstructured documents, such as collections of PDFs, into a format suitable for retrieval using the agent application's **data pipeline**.
-2. [**Retrieval, Augmentation, and Generation (RAG chain)**](./2-fundamentals-unstructured-chain): A series (or **chain**) of steps is called to:
+2. [**Retrieval, Augmentation, and Generation (RAG agent)**](./2-fundamentals-unstructured-chain): An agent is called to:
1. Understand the user's question
2. Retrieve the supporting data
3. Call an LLM to generate a response based on the user's question and supporting data
2 changes: 1 addition & 1 deletion genai_cookbook/nbs/3-deep-dive-chain.md
@@ -65,7 +65,7 @@ Here's how a multi-step query understanding component might look for our a custo

2. **Entity extraction:** Based on the identified intent, use another LLM call to extract relevant entities from the query, such as product names, reported errors, or account numbers.

-3. **Query rewriting:** Use the extracted intent and entities to rewrite the original query into a more specific and targeted format, e.g., "My Agent chain is failing to deploy on Model Serving, I'm seeing the following error...".
+3. **Query rewriting:** Use the extracted intent and entities to rewrite the original query into a more specific and targeted format, e.g., "My agent is failing to deploy on Model Serving, I'm seeing the following error...".
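A minimal sketch of how these three steps might be chained follows. The `llm` callable stands in for whatever chat model you call; the prompts and intent labels are illustrative only.

```python
# Sketch of a multi-step query understanding component: intent detection,
# entity extraction, then query rewriting. `llm` is a hypothetical helper
# that sends a prompt to a chat model and returns the completion text.
from typing import Callable


def understand_query(user_query: str, llm: Callable[[str], str]) -> str:
    # 1. Intent detection: classify the request into a known category.
    intent = llm(
        "Classify the intent of this support request as one of "
        f"[deployment_issue, billing, how_to, other]:\n{user_query}"
    )

    # 2. Entity extraction: pull out product names, reported errors, account numbers.
    entities = llm(
        "Extract product names, reported errors, and account numbers from this "
        f"request as a JSON object:\n{user_query}"
    )

    # 3. Query rewriting: produce a single, specific, retrieval-friendly query.
    return llm(
        "Rewrite the request below into one specific search query.\n"
        f"Intent: {intent}\nEntities: {entities}\nRequest: {user_query}"
    )
```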

### Retrieval

4 changes: 3 additions & 1 deletion genai_cookbook/nbs/3-deep-dive.md
@@ -32,13 +32,15 @@ This overlap underscores the need for a holistic approach to RAG quality improve

[**RAG Agent**](/nbs/3-deep-dive-chain)

+<!-- TODO (prithvi): add this back in once updated to agents
```{image} ../images/5-hands-on/16_img.png
:align: center
-```
+``` -->

<br/>

- The choice of LLM and its parameters (e.g., temperature, max tokens)
+- The tool selection logic (e.g., retriever tool description)
- The retrieval parameters (e.g., number of chunks/documents retrieved)
- The retrieval approach (e.g., keyword vs. hybrid vs. semantic search, rewriting the user's query, transforming a user's query into filters, re-ranking)
- How to format the prompt with retrieved context, to guide the LLM towards desired output
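One way to keep the knobs listed above explicit and easy to iterate on is a single configuration object. A hedged sketch follows; the parameter names are illustrative, not a required schema.

```python
# Illustrative configuration for the knobs listed above; names are examples,
# not a fixed schema expected by any particular library.
agent_config = {
    "llm": {
        "endpoint": "my-chat-endpoint",   # placeholder model/endpoint name
        "temperature": 0.1,
        "max_tokens": 1500,
    },
    "retriever_tool": {
        "description": "Search product documentation for relevant passages.",
        "num_results": 5,                 # number of chunks/documents retrieved
        "search_type": "hybrid",          # keyword vs. hybrid vs. semantic
        "rewrite_query": True,            # rewrite the user's query before search
        "rerank": False,                  # optionally re-rank retrieved chunks
    },
    "prompt_template": (
        "Answer the question using only the context below.\n"
        "Context: {context}\n\nQuestion: {question}"
    ),
}
```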
16 changes: 8 additions & 8 deletions genai_cookbook/nbs/4-evaluation-metrics.md
@@ -1,27 +1,27 @@
## Assessing performance: Metrics that Matter

-With an evaluation set, you are able to measure the performance of your Agent application across a number of different dimensions, including:
+With an evaluation set, you are able to measure the performance of your agent application across a number of different dimensions, including:

-- **Retrieval quality**: Retrieval metrics assess how successfully your Agent application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.
-- **Response quality**: Response quality metrics assess how well the Agent application responds to a user's request. Response metrics can measure, for instance, if the resulting answer is accurate per the ground-truth, how well-grounded the response was given the retrieved context (e.g., did the LLM hallucinate), or how safe the response was (e.g., no toxicity).
-- **System performance (cost & latency):** Metrics capture the overall cost and performance of Agent applications. Overall latency and token consumption are examples of chain performance metrics.
+- **Retrieval quality**: Retrieval metrics assess how successfully your agent application retrieves relevant supporting data. Precision and recall are two key retrieval metrics.
+- **Response quality**: Response quality metrics assess how well the agent application responds to a user's request. Response metrics can measure, for instance, if the resulting answer is accurate per the ground-truth, how well-grounded the response was given the retrieved context (e.g., did the LLM hallucinate), or how safe the response was (e.g., no toxicity).
+- **System performance (cost & latency):** Metrics capture the overall cost and performance of agent applications. Overall latency and token consumption are examples of agent performance metrics.

-It is very important to collect both response and retrieval metrics. A Agent application can respond poorly in spite of retrieving the correct context; it can also provide good responses on the basis of faulty retrievals. Only by measuring both components can we accurately diagnose and address issues in the application.
+It is very important to collect both response and retrieval metrics. An agent application can respond poorly in spite of retrieving the correct context; it can also provide good responses on the basis of faulty retrievals. Only by measuring both components can we accurately diagnose and address issues in the application.

There are two key approaches to measuring performance across these metrics:

- **Deterministic measurement:** Cost and latency metrics can be computed deterministically based on the application's outputs. If your evaluation set includes a list of documents that contain the answer to a question, a subset of the retrieval metrics can also be computed deterministically.
-- **LLM judge based measurement** In this approach, a separate [LLM acts as a judge](https://arxiv.org/abs/2306.05685) to evaluate the quality of the Agent application's retrieval and responses. Some LLM judges, such as answer correctness, compare the human-labeled ground truth vs. the app's outputs. Other LLM judges, such as groundedness, do not require human-labeled ground truth to assess their app's outputs.
+- **LLM judge based measurement:** In this approach, a separate [LLM acts as a judge](https://arxiv.org/abs/2306.05685) to evaluate the quality of the agent application's retrieval and responses. Some LLM judges, such as answer correctness, compare the human-labeled ground truth vs. the app's outputs. Other LLM judges, such as groundedness, do not require human-labeled ground truth to assess the app's outputs.
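For the deterministically computable retrieval metrics, here is a minimal sketch of precision and recall over retrieved document IDs; the function names and example IDs are illustrative.

```python
# Deterministic retrieval metrics, given the documents the retriever returned
# and the ground-truth relevant documents labeled in the evaluation set.

def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in set(retrieved_ids) if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Example: 2 of the 3 retrieved chunks are relevant, and 2 of the 4 relevant
# chunks were found.
precision = retrieval_precision(["d1", "d7", "d3"], {"d1", "d3", "d5", "d9"})  # ~0.67
recall = retrieval_recall(["d1", "d7", "d3"], {"d1", "d3", "d5", "d9"})        # 0.5
```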

```{important}
For an LLM judge to be effective, it must be tuned to understand the use case. Doing so requires careful attention to understand where the judge does and does not work well, and then tuning the judge to improve it for the failure cases.
```
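As a rough illustration of the LLM-judge approach (not how Agent Evaluation's hosted judges are implemented), here is a minimal groundedness-style judge; the `llm` callable and the prompt are assumptions.

```python
# Illustrative groundedness judge: ask a separate LLM whether every claim in
# the answer is supported by the retrieved context. `llm` is a hypothetical
# helper that returns the judge model's text completion.
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG application's answer.

Context:
{context}

Answer:
{answer}

Is every claim in the answer supported by the context? Reply "yes" or "no",
followed by a one-sentence justification."""


def judge_groundedness(answer: str, context: str, llm: Callable[[str], str]) -> bool:
    verdict = llm(JUDGE_PROMPT.format(context=context, answer=answer))
    return verdict.strip().lower().startswith("yes")
```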

-> [Mosaic AI Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html) provides an out-of-the-box implementation, using hosted LLM judge models, for each metric discussed on this page. Agent Evaluation's documentation discusses the [details](https://docs.databricks.com/generative-ai/agent-evaluation/llm-judge-metrics.html) of how these metrics and judges are implemented and provides [capabilities](https://docs.databricks.com/generative-ai/agent-evaluation/advanced-agent-eval.html#provide-examples-to-the-built-in-llm-judges) to tune the judge's with your data to increase their accuracy
+> [Mosaic AI Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html) provides an out-of-the-box implementation, using hosted LLM judge models, for each metric discussed on this page. Agent Evaluation's documentation discusses the [details](https://docs.databricks.com/generative-ai/agent-evaluation/llm-judge-metrics.html) of how these metrics and judges are implemented and provides [capabilities](https://docs.databricks.com/generative-ai/agent-evaluation/advanced-agent-eval.html#provide-examples-to-the-built-in-llm-judges) to tune the judges with your data to increase their accuracy.
### Metric overview

-Below is a summary of the metrics that Databricks recommends for measuring the quality, cost, and latency of your Agent application. These metrics are implemented in [Mosaic AI Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html).
+Below is a summary of the metrics that Databricks recommends for measuring the quality, cost, and latency of your agent application. These metrics are implemented in [Mosaic AI Agent Evaluation](https://docs.databricks.com/generative-ai/agent-evaluation/index.html).

<table class="table">
<thead>
