tutorial for testing framework #631

Merged: 18 commits, Jan 22, 2025
62 changes: 48 additions & 14 deletions docs/evaluation/concepts/index.mdx
@@ -1,20 +1,17 @@
# Evaluation concepts

Evaluations are methods designed to assess the performance and capabilities of AI applications.
The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
The building blocks of the LangSmith framework are:

- [**Datasets**](/evaluation/concepts#datasets): Collections of test inputs and reference outputs.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

A dataset is a collection of examples used for evaluating an application. An example is a test input paired with a reference output.

![Dataset](./static/dataset_concept.png)
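For example, a small dataset can be created programmatically with the LangSmith Python SDK. The following is a minimal sketch in which the dataset name and example fields are illustrative:

```python
from langsmith import Client

client = Client()

# Create a dataset to hold the examples (the name is illustrative).
dataset = client.create_dataset(dataset_name="qa-example-dataset")

# Each example pairs a test input with a reference output.
client.create_examples(
    inputs=[
        {"question": "What is an evaluator?"},
        {"question": "What does a dataset contain?"},
    ],
    outputs=[
        {"answer": "A function that scores application outputs."},
        {"answer": "Examples made up of test inputs and reference outputs."},
    ],
    dataset_id=dataset.id,
)
```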

@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
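As a rough sketch (assuming a recent Python SDK that supports evaluators declared with `outputs`/`reference_outputs` parameters), an experiment can be started with the `evaluate` function; the target function, dataset name, and experiment prefix below are placeholders:

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Placeholder target: call your application with the example's inputs.
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Simple heuristic evaluator comparing the output to the reference output.
    return outputs["answer"] == reference_outputs["answer"]

results = evaluate(
    my_app,                               # the version of the application under test
    data="qa-example-dataset",            # dataset of examples to run against
    evaluators=[exact_match],
    experiment_prefix="baseline-prompt",  # helps tell experiments on the same dataset apart
)
```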
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu

![Online](./static/online.png)

## Testing

### Evaluations vs testing

Testing and evaluation are similar, overlapping concepts that are often confused.

**An evaluation measures performance according to one or more metrics.**
Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
That is, they're often used to compare two systems against each other rather than to assert something about an individual system.

**Testing asserts correctness.**
A system can only be deployed if it passes all tests.

Evaluation metrics can be *turned into* tests.
For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
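For instance, a regression test along these lines might compare an average metric for the new version against a stored baseline; `run_app`, `correctness_score`, and the baseline value below are hypothetical:

```python
# Hypothetical helpers: run the application and score an output on a 0-1 scale.
from my_project.app import run_app
from my_project.evals import correctness_score

BASELINE_CORRECTNESS = 0.83  # score achieved by the currently deployed version

EXAMPLES = [
    {"question": "What is a dataset?", "reference": "A collection of test inputs and reference outputs."},
    {"question": "What is an evaluator?", "reference": "A function for scoring outputs."},
]

def test_correctness_does_not_regress():
    scores = [
        correctness_score(run_app(ex["question"]), ex["reference"])
        for ex in EXAMPLES
    ]
    average = sum(scores) / len(scores)
    # Turn the evaluation metric into a pass/fail assertion against the baseline.
    assert average >= BASELINE_CORRECTNESS
```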

It can also be more resource-efficient to run tests and evaluations together if your system is expensive to run and your test and evaluation datasets overlap.

You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` for convenience.

### Using `pytest` and `vitest/jest`

The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
These make it easy to:
- Track test results in LangSmith
- Write evaluations as tests

Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.

Writing evaluations as tests is useful when each example you want to evaluate requires custom logic for running the application and/or the evaluators.
The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
These kinds of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together than with the standard evaluate flow.

Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
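As an illustration, an eval written with the beta pytest integration might look roughly like the sketch below; `generate_sql` is a hypothetical application function, and the feedback key is made up:

```python
import pytest
from langsmith import testing as t

from my_project.app import generate_sql  # hypothetical application under test

@pytest.mark.langsmith  # track this test case and its results in LangSmith
def test_generates_select_statement():
    query = "Get all users from the customers table"
    t.log_inputs({"query": query})

    sql = generate_sql(query)
    t.log_outputs({"sql": sql})

    # Evaluate: record a numeric feedback score alongside the test result.
    t.log_feedback(key="contains_select", score=int("select" in sql.lower()))

    # Assert: the test still fails outright if the output is unusable.
    assert sql.strip().endswith(";")
```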

## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is the summary helpful relative to the user's need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

### Classification and tagging

Classification and tagging apply a label to a given input (e.g., for toxicity detection or sentiment analysis). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text or a user question). However, if ground-truth class labels are provided, then the evaluation objective is to score a classification/tagging chain relative to those labels (e.g., using metrics such as precision and recall).

If ground-truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare the ground-truth labels to the chain output. However, given the capabilities of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based on specified criteria (without a ground-truth reference).
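A heuristic evaluator for the with-reference case can be as simple as an exact label comparison. Below is a minimal sketch, assuming the simplified evaluator signature and an illustrative `label` field:

```python
def label_match(outputs: dict, reference_outputs: dict) -> bool:
    # Compare the predicted class label against the ground-truth label.
    return outputs["label"] == reference_outputs["label"]

# Passed to `evaluate(..., evaluators=[label_match])`, this is recorded as a
# boolean feedback score for each example in the experiment.
```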

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag/classify application inputs (e.g., for toxicity).

@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |

7 changes: 4 additions & 3 deletions docs/evaluation/how_to_guides/index.md
@@ -45,11 +45,12 @@ Evaluate and improve your application before deploying it.
- [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
- [Run an evaluation locally (beta, Python only)](./how_to_guides/local)

## Testing integrations

Run evals using your favorite testing tools:

- [Run evals with pytest (beta)](./how_to_guides/pytest)
- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)

## Online evaluation
