docs/ai-evals/getting-started.md (41 changes: 30 additions & 11 deletions)

last_update:
  date: 2025-07-25
---

# Getting Started with AI Evals

This guide walks you through creating and analyzing your first offline evaluation in about 10 minutes. You'll learn how to set up an AI Config (Prompt), create a test dataset, configure automated grading, and view evaluation results.

## Create and Analyze an Offline Eval in 10 Minutes
<!-- (Coming soon: How to start an online eval in 15 minutes) -->


### Step 1: Create an AI Config (Prompt)

An AI Config captures the instruction you provide to an LLM to accomplish a specific task, along with model configuration (provider, model name, temperature, etc.). You can create multiple versions as you iterate and choose which one is "live" (retrieved by your application).

Use the Statsig [Node](/server-core/node-core#getting-a-prompt) or [Python](/server-core/python-core/#getting-a-prompt) Server Core SDKs to retrieve and use your AI Config in production.
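
For orientation only, the retrieval flow might look roughly like the sketch below. This is a hedged sketch, not a copy of the SDK reference: the import path, the `get_prompt` call, the attribute names on the returned object, and the `translate_to_french` config name are all assumptions for illustration. Follow the linked Node or Python "Getting a Prompt" docs for the exact API.

```python
# Hypothetical sketch of retrieving a live AI Config (Prompt) with the
# Statsig Python Server Core SDK. Method and attribute names below
# (get_prompt, .prompt, .model, .temperature) are assumptions; check the
# linked "Getting a Prompt" docs for the real signatures.
from statsig_python_core import Statsig, StatsigUser  # assumed package/import path

statsig = Statsig("server-secret-key")
statsig.initialize().wait()

user = StatsigUser(user_id="user-123")

# Fetch whichever version of the config is currently marked "live".
# "translate_to_french" is a hypothetical config name for this example.
prompt_config = statsig.get_prompt(user, "translate_to_french")  # assumed method

print(prompt_config.prompt)       # instruction text sent to the LLM (assumed attribute)
print(prompt_config.model)        # provider/model name (assumed attribute)
print(prompt_config.temperature)  # sampling temperature (assumed attribute)
```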

<img alt="image" src="https://github.com/user-attachments/assets/a17b3c4d-2126-4dfe-8d4b-d40b1838f878" />



### Step 2: Create an Evaluation Dataset

Build a dataset to evaluate LLM completions for your prompt. For the translation example above, this would include a list of words alongside known correct translations in French.

You can enter small datasets manually or upload a CSV file for larger test sets.
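
If you assemble the CSV programmatically, it might look like the minimal sketch below. The column names (`input`, `expected_output`) are assumptions for illustration; match whatever schema the dataset upload screen expects.

```python
# Minimal sketch: write a small translation test set to CSV for upload.
# Column names are illustrative assumptions; align them with the schema
# shown in the Statsig dataset upload UI.
import csv

rows = [
    {"input": "hello", "expected_output": "bonjour"},
    {"input": "thank you", "expected_output": "merci"},
    {"input": "good night", "expected_output": "bonne nuit"},
]

with open("translation_eval_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "expected_output"])
    writer.writeheader()
    writer.writerows(rows)
```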

<img alt="image" src="https://github.com/user-attachments/assets/6d4b1abc-bde9-4d63-9d0c-95fef60b3f9a" />


### Step 3: Configure a Grader

Set up a grader to automatically score your LLM outputs. You can:
- Use built-in string evaluators (exact match, contains, regex)
- Configure an LLM-as-a-Judge evaluator that applies human-like grading rubrics
- Create custom graders for domain-specific evaluation criteria

The grader compares the LLM completion with the reference output and assigns a score from 0 (fail) to 1 (pass).
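
To make that scoring contract concrete, here is a minimal sketch of what the built-in string checks compute. It is illustrative only: in Statsig you configure these graders in the console rather than writing them by hand.

```python
# Illustrative only: the 0 (fail) to 1 (pass) contract a string grader applies.
# Statsig's built-in exact-match / contains / regex graders are configured in
# the console; this just shows the comparison they perform.
import re


def exact_match(completion: str, reference: str) -> float:
    return 1.0 if completion.strip().lower() == reference.strip().lower() else 0.0


def contains(completion: str, reference: str) -> float:
    return 1.0 if reference.lower() in completion.lower() else 0.0


def regex_match(completion: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, completion) else 0.0


print(exact_match("Bonjour", "bonjour"))         # 1.0
print(contains("I would say: merci!", "merci"))  # 1.0
print(regex_match("bonne nuit", r"^bonne\s"))    # 1.0
```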

<img alt="image" src="https://github.com/user-attachments/assets/3cd510f7-c267-4cdd-bebe-dbee527a5318" />

### Step 4: Run Your Evaluation

Execute the evaluation on a specific version of your AI Config. Results typically appear within a few minutes, showing:
- Overall score distribution
- Per-row evaluation details (click any row to see the full assessment)
- Pass/fail rates across your dataset

<img alt="image" src="https://github.com/user-attachments/assets/c450f277-b2ba-4657-b747-440b43859f20" />

### Step 5: Analyze and Compare Results

**Categorize and segment your results:**
You can categorize your dataset (e.g., by language, complexity, topic) and view scores broken down by category to identify specific areas for improvement.
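
The category breakdown is computed for you in the console; the sketch below (using pandas, with made-up categories and scores) only illustrates what "scores broken down by category" means.

```python
# Illustration of a per-category score breakdown (the console computes this
# for you). Categories and scores here are made up for the example.
import pandas as pd

results = pd.DataFrame(
    {
        "category": ["greeting", "greeting", "idiom", "idiom"],
        "score": [1.0, 1.0, 0.0, 1.0],
    }
)

print(results.groupby("category")["score"].mean())
# greeting    1.0
# idiom       0.5
```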

<img alt="image" src="https://github.com/user-attachments/assets/3c0de7c4-6721-4a45-9a61-04a63db68913" />

**Compare versions:**
If you have evaluation scores for multiple AI Config versions, compare them side-by-side to understand what changed between versions and identify improvements or regressions.

<img alt="image" src="https://github.com/user-attachments/assets/fd593e52-ddec-4826-bf4b-c2ca1d43e4f0" />

docs/ai-evals/overview.mdx (32 changes: 24 additions & 8 deletions)
:::info
AI Evals are currently in beta; reach out in Slack to get access. For a quick overview with screenshots, see the [getting started guide](/ai-evals/getting-started).
:::
Statsig AI Evals helps you deploy AI with confidence by providing tools to benchmark, iterate, and launch AI systems without code changes. Run offline and online evaluations, control deployments with AI Configs, and optimize with experimentation.



## What are AI Evals?
Statsig AI Evals provides a comprehensive platform for developing, testing, and deploying LLM applications in production. The core components include:
1. **[AI Configs (Prompts)](/ai-evals/prompts)**: AI Configs store your LLM prompt, model configuration, and parameters (like Model Provider, Model, Temperature, etc.) in Statsig. This gives you version control, deployment management, and the ability to test configurations without code changes. You can version your AI configs, choose which version is live, and retrieve configurations in production using the Statsig server SDKs. AI Configs serve as the control plane for your LLM applications and can be used independently or with the full Evals suite.
2. **[Offline Evals](/ai-evals/offline-evals)**: Run automated evaluations of model outputs on curated test datasets before deploying to production. Offline evals catch wins and regressions early—before real users are exposed. For example, compare a new support bot's replies against gold-standard (human-curated) answers to validate quality. You can grade outputs even without a golden dataset using techniques like LLM-as-a-judge for tasks such as translation quality assessment.
3. **[Online Evals](/ai-evals/online-evals)**: Evaluate model performance in production on real-world use cases. Run your "live" version for users while shadow-testing "candidate" versions in the background without user impact. Online evals grade model outputs directly in production using automated graders that work without ground truth comparisons, enabling continuous quality monitoring and A/B testing of AI features.

## Integration with Statsig Platform
The full suite of Statsig's product development capabilities works seamlessly with AI Evals (a short code sketch follows this list):
- **Feature Gates**: Target AI features to specific user segments or gradually roll out new AI configurations
- **Experiments**: A/B test different prompt versions or models and measure impact on business metrics
- **Analytics**: Track evaluation performance, costs, latency, and user engagement metrics in real-time dashboards
- **Version Management**: Use AI Configs to manage releases, run automatic evaluations on every new configuration, and maintain deployment control
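
For instance, gating an AI feature and reading an experiment assignment might look like the sketch below. The gate name, experiment name, and parameter key are hypothetical, and the method names are assumed from the usual Statsig server SDK surface (`check_gate`, `get_experiment`); check the SDK docs for your language for the exact signatures.

```python
# Hedged sketch: combining AI Configs with Feature Gates and Experiments.
# Gate, experiment, and parameter names are hypothetical; method names follow
# the typical Statsig server SDK surface but should be verified in the docs.
from statsig_python_core import Statsig, StatsigUser  # assumed package/import path

statsig = Statsig("server-secret-key")
statsig.initialize().wait()

user = StatsigUser(user_id="user-123")

# Only expose the AI feature to users who pass the rollout gate.
if statsig.check_gate(user, "ai_summarizer_enabled"):
    # Let an experiment decide which prompt version this user should see.
    experiment = statsig.get_experiment(user, "summarizer_prompt_test")
    prompt_version = experiment.get("prompt_version", "control")  # assumed getter
    # ...retrieve and use the matching AI Config version here...
```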

## Automated Grading with LLM as a Judge

Some evaluations can use simple heuristics (e.g., checking if the AI output matches an expected value like "High", "Medium", or "Low"). However, many real-world scenarios require more nuanced evaluation—such as determining whether "Your ticket has been escalated" and "This ticket has been escalated" convey the same meaning.

LLM-as-a-judge enables fast, scalable evaluation of AI outputs without requiring extensive human review. It mimics human assessment of quality and, while not perfect, provides consistent and efficient comparison across different model versions or prompts. For example, you can create an LLM-as-a-judge grader with a prompt like: "Score how close this answer is to the ideal one on a scale from 0 to 1.0, where 1.0 means perfect semantic equivalence."
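
As a concrete illustration of that rubric, an LLM-as-a-judge grader boils down to a call like the sketch below, shown here with the OpenAI Python client purely as an example judge model. In Statsig you configure the judge prompt in the console rather than writing this code; the float parse also assumes the judge complies with the "reply with only the number" instruction.

```python
# Illustrative LLM-as-a-judge call, using the OpenAI Python client as an
# example judge model. In Statsig the judge prompt is configured in the
# console; this sketch only shows the shape of the grading request.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Score how close this answer is to the ideal one on a scale from 0 to 1.0, "
    "where 1.0 means perfect semantic equivalence. Reply with only the number.\n\n"
    "Ideal answer: {reference}\nModel answer: {completion}"
)


def judge(completion: str, reference: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(reference=reference, completion=completion),
        }],
        temperature=0,
    )
    # Assumes the judge returns a bare number, per the prompt's instruction.
    return float(response.choices[0].message.content.strip())


print(judge("This ticket has been escalated", "Your ticket has been escalated"))
```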

This automated grading approach is particularly valuable for:
- Evaluating semantic similarity and tone
- Assessing output quality at scale
- Comparing multiple AI configurations without manual review
- Maintaining consistent evaluation criteria across iterations