tutorial for testing framework #631

Merged: 18 commits, Jan 22, 2025
62 changes: 48 additions & 14 deletions docs/evaluation/concepts/index.mdx
@@ -1,20 +1,17 @@
# Evaluation concepts

Evaluations are methods designed to assess the performance and capabilities of AI applications.
The quality and development speed of AI applications are often limited by the availability of high-quality evaluation datasets and metrics, which enable you to both optimize and test your applications.

Good evaluations make it easy to iteratively improve prompts, select models, test architectures, and ensure that deployed applications continue to perform as expected.
LangSmith makes building high-quality evaluations easy.
This guide explains the LangSmith evaluation framework and AI evaluation techniques more broadly.
The building blocks of the LangSmith framework are:

- [**Datasets**](/evaluation/concepts#datasets): Collections of test inputs and reference outputs.
- [**Evaluators**](/evaluation/concepts#evaluators): Functions for scoring outputs.

## Datasets

A dataset is a collection of examples used for evaluating an application. An example is a test input paired with a reference output.

![Dataset](./static/dataset_concept.png)
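For example, a small dataset can be created programmatically with the LangSmith Python SDK. The following is a minimal sketch in which the dataset name and example fields are illustrative:

```python
from langsmith import Client

client = Client()

# Create a dataset to hold the examples (the name is illustrative).
dataset = client.create_dataset(dataset_name="qa-example-dataset")

# Each example pairs a test input with a reference output.
client.create_examples(
    inputs=[
        {"question": "What is an evaluator?"},
        {"question": "What does a dataset contain?"},
    ],
    outputs=[
        {"answer": "A function that scores application outputs."},
        {"answer": "Examples made up of test inputs and reference outputs."},
    ],
    dataset_id=dataset.id,
)
```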

@@ -141,7 +138,7 @@ Learn [how run pairwise evaluations](/evaluation/how_to_guides/evaluate_pairwise
## Experiment

Each time we evaluate an application on a dataset, we are conducting an experiment.
An experiment contains the results of running a specific version of your application on the dataset.
Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs).
In LangSmith, you can easily view all the experiments associated with your dataset.
Additionally, you can [compare multiple experiments in a comparison view](/evaluation/how_to_guides/compare_experiment_results).
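As a rough sketch (assuming a recent Python SDK that supports evaluators declared with `outputs`/`reference_outputs` parameters), an experiment can be started with the `evaluate` function; the target function, dataset name, and experiment prefix below are placeholders:

```python
from langsmith.evaluation import evaluate

def my_app(inputs: dict) -> dict:
    # Placeholder target: call your application with the example's inputs.
    return {"answer": "..."}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # Simple heuristic evaluator comparing the output to the reference output.
    return outputs["answer"] == reference_outputs["answer"]

results = evaluate(
    my_app,                               # the version of the application under test
    data="qa-example-dataset",            # dataset of examples to run against
    evaluators=[exact_match],
    experiment_prefix="baseline-prompt",  # helps tell experiments on the same dataset apart
)
```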
@@ -224,6 +221,42 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu

![Online](./static/online.png)

## Testing

### Evaluations vs testing

Testing and evaluation are similar, overlapping concepts that are often confused.

**An evaluation measures performance according to one or more metrics.**
Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones.
That is, they're often used to compare two systems against each other rather than to assert something about an individual system.

**Testing asserts correctness.**
A system can only be deployed if it passes all tests.

Evaluation metrics can be *turned into* tests.
For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics.
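For instance, a regression test along these lines might compare an average metric for the new version against a stored baseline; `run_app`, `correctness_score`, and the baseline value below are hypothetical:

```python
# Hypothetical helpers: run the application and score an output on a 0-1 scale.
from my_project.app import run_app
from my_project.evals import correctness_score

BASELINE_CORRECTNESS = 0.83  # score achieved by the currently deployed version

EXAMPLES = [
    {"question": "What is a dataset?", "reference": "A collection of test inputs and reference outputs."},
    {"question": "What is an evaluator?", "reference": "A function for scoring outputs."},
]

def test_correctness_does_not_regress():
    scores = [
        correctness_score(run_app(ex["question"]), ex["reference"])
        for ex in EXAMPLES
    ]
    average = sum(scores) / len(scores)
    # Turn the evaluation metric into a pass/fail assertion against the baseline.
    assert average >= BASELINE_CORRECTNESS
```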

It can also be more resource-efficient to run tests and evaluations together if your system is expensive to run and your test and evaluation datasets overlap.

You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` for convenience.

### Using `pytest` and `vitest/jest`

The LangSmith SDKs come with integrations for [pytest](./how_to_guides/pytest) and [`vitest/jest`](./how_to_guides/vitest_jest).
These make it easy to:
- Track test results in LangSmith
- Write evaluations as tests

Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests.

Writing evaluations as tests is useful when each example you want to evaluate requires custom logic for running the application and/or the evaluators.
The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset.
But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics.
These kinds of heterogeneous evals are much easier to write as a suite of distinct test cases that all get tracked together than with the standard evaluate flow.

Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them.
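As an illustration, an eval written with the beta pytest integration might look roughly like the sketch below; `generate_sql` is a hypothetical application function, and the feedback key is made up:

```python
import pytest
from langsmith import testing as t

from my_project.app import generate_sql  # hypothetical application under test

@pytest.mark.langsmith  # track this test case and its results in LangSmith
def test_generates_select_statement():
    query = "Get all users from the customers table"
    t.log_inputs({"query": query})

    sql = generate_sql(query)
    t.log_outputs({"sql": sql})

    # Evaluate: record a numeric feedback score alongside the test result.
    t.log_feedback(key="contains_select", score=int("select" in sql.lower()))

    # Assert: the test still fails outright if the output is unusable.
    assert sql.strip().endswith(";")
```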

## Application-specific techniques

Below, we will discuss evaluation of a few specific, popular LLM applications.
@@ -348,13 +381,13 @@ Summarization is one specific type of free-form writing. The evaluation aim is t
| Faithfulness | Is the summary grounded in the source documents (e.g., no hallucinations)? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-hallucination-evaluator) | Yes |
| Helpfulness | Is the summary helpful relative to the user's need? | No | Yes - [prompt](https://smith.langchain.com/hub/langchain-ai/summary-helpfulness-evaluator) | Yes |

### Classification and tagging

Classification and tagging apply a label to a given input (e.g., for toxicity detection or sentiment analysis). Classification/tagging evaluation typically employs the following components, which we will review in detail below:

A central consideration for classification/tagging evaluation is whether you have a dataset with `reference` labels or not. If not, users frequently want to define an evaluator that uses criteria to apply a label (e.g., toxicity) to an input (e.g., text or a user question). However, if ground-truth class labels are provided, then the evaluation objective is to score a classification/tagging chain relative to those labels (e.g., using metrics such as precision and recall).

If ground-truth reference labels are provided, then it's common to simply define a [custom heuristic evaluator](./how_to_guides/custom_evaluator) to compare the ground-truth labels to the chain output. However, given the capabilities of LLMs, it is increasingly common to simply use `LLM-as-judge` to perform the classification/tagging of an input based on specified criteria (without a ground-truth reference).
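A heuristic evaluator for the with-reference case can be as simple as an exact label comparison. Below is a minimal sketch, assuming the simplified evaluator signature and an illustrative `label` field:

```python
def label_match(outputs: dict, reference_outputs: dict) -> bool:
    # Compare the predicted class label against the ground-truth label.
    return outputs["label"] == reference_outputs["label"]

# Passed to `evaluate(..., evaluators=[label_match])`, this is recorded as a
# boolean feedback score for each example in the experiment.
```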

`Online` or `Offline` evaluation is feasible when using `LLM-as-judge` with a `Reference-free` prompt. In particular, this is well suited to `Online` evaluation when a user wants to tag/classify application inputs (e.g., for toxicity).

@@ -363,3 +396,4 @@ If ground truth reference labels are provided, then it's common to simply define
| Accuracy | Standard definition | Yes | No | No |
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |

7 changes: 4 additions & 3 deletions docs/evaluation/how_to_guides/index.md
@@ -45,11 +45,12 @@ Evaluate and improve your application before deploying it.
- [Print detailed logs (Python only)](../../observability/how_to_guides/tracing/output_detailed_logs)
- [Run an evaluation locally (beta, Python only)](./how_to_guides/local)

## Testing integrations

Run evals using your favorite testing tools:

- [Run evals with pytest (beta)](./how_to_guides/pytest)
- [Run evals with Vitest/Jest (beta)](./how_to_guides/vitest_jest)

## Online evaluation
