fix: split BLEU into SentenceBLEU and CorpusBLEU
kadamrahul18 committed Jan 14, 2025
1 parent 84b7669 commit 53e8014
Showing 4 changed files with 248 additions and 204 deletions.
@@ -9,15 +9,15 @@ Heuristic metrics are rule-based evaluation methods that allow you to check spec

You can use the following heuristic metrics:

| Metric       | Description                                                                                             |
|--------------|---------------------------------------------------------------------------------------------------------|
| Equals       | Checks if the output exactly matches an expected string                                                  |
| Contains     | Checks if the output contains a specific substring; matching can be case sensitive or case insensitive   |
| RegexMatch   | Checks if the output matches a specified regular expression pattern                                      |
| IsJson       | Checks if the output is a valid JSON object                                                              |
| Levenshtein  | Calculates the Levenshtein distance between the output and an expected string                            |
| SentenceBLEU | Calculates a single-sentence BLEU score for a candidate vs. one or more references                       |
| CorpusBLEU   | Calculates a corpus-level BLEU score for multiple candidates vs. their references                        |

## Score an LLM response

You can score an LLM response by first initializing the metrics and then calling the `score` method:
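
The concrete call is collapsed in this diff view. As a minimal sketch of the pattern described above (assuming the `Equals` metric from the table accepts the same `output`/`reference` keyword arguments as the BLEU examples further down), it looks roughly like this:

```python
from opik.evaluation.metrics import Equals  # assumed import path, matching the other heuristic metrics

# Initialize the metric, then call `score` with the LLM output and the expected value.
metric = Equals()
score = metric.score(output="Hello world!", reference="Hello world!")
print(score.value, score.reason)
```
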
@@ -101,55 +101,61 @@ print(score)

### BLEU

The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:

- `SentenceBLEU` – Single-sentence BLEU
- `CorpusBLEU` – Corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.

#### Single-Sentence BLEU

Use `SentenceBLEU` to compute single-sentence BLEU between a single candidate and one (or more) references:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(n_grams=4, smoothing_method="method1")

# Single reference
score = metric.score(
    output="Hello world!",
    reference="Hello world"
)
print(score.value, score.reason)

# Multiple references
score = metric.score(
    output="Hello world!",
    reference=["Hello planet", "Hello world"]
)
print(score.value, score.reason)
```

#### Corpus-Level BLEU

Use `CorpusBLEU` to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:

```python
from opik.evaluation.metrics import CorpusBLEU

metric = CorpusBLEU()

outputs = ["Hello there", "This is a test."]
references = [
    # For the first candidate, two references
    ["Hello world", "Hello there"],
    # For the second candidate, one reference
    "This is a test."
]

score = metric.score(output=outputs, reference=references)
print(score.value, score.reason)
```

You can also customize n-grams, smoothing methods, or weights:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

# The arguments to `score` are collapsed in this diff view; they follow the
# same output/reference pattern shown in the SentenceBLEU example above.
score = metric.score(
    # ...
)
print(score.value, score.reason)
```
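
The constructor options above are described as applying to both classes, since both wrap the same NLTK BLEU implementation, so presumably `CorpusBLEU` accepts them as well. A hedged sketch (the keyword arguments are assumed to match `SentenceBLEU`):

```python
from opik.evaluation.metrics import CorpusBLEU

# Assumption: CorpusBLEU exposes the same n_grams / smoothing_method / weights
# options as SentenceBLEU, since both wrap NLTK's BLEU implementation.
metric = CorpusBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

score = metric.score(
    output=["Hello there", "This is a test."],
    reference=[["Hello world", "Hello there"], "This is a test."]
)
print(score.value, score.reason)
```
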
**Note:** If any candidate or reference is empty, `SentenceBLEU` and `CorpusBLEU` raise a `MetricComputationError`. Handle or validate inputs accordingly.
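
A minimal sketch of guarding against that error (the `MetricComputationError` import is taken from the `__init__.py` change below; default `SentenceBLEU()` arguments are assumed, as with `CorpusBLEU()` above):

```python
from opik.evaluation.metrics import SentenceBLEU, MetricComputationError

metric = SentenceBLEU()

try:
    # An empty candidate or reference raises MetricComputationError per the note above.
    score = metric.score(output="", reference="Hello world")
    print(score.value, score.reason)
except MetricComputationError as exc:
    print(f"Could not compute BLEU: {exc}")
```
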
5 changes: 3 additions & 2 deletions sdks/python/src/opik/evaluation/metrics/__init__.py
@@ -3,7 +3,7 @@
from .heuristics.is_json import IsJson
from .heuristics.levenshtein_ratio import LevenshteinRatio
from .heuristics.regex_match import RegexMatch
from .heuristics.bleu import SentenceBLEU, CorpusBLEU
from .llm_judges.answer_relevance.metric import AnswerRelevance
from .llm_judges.context_precision.metric import ContextPrecision
from .llm_judges.context_recall.metric import ContextRecall
@@ -30,5 +30,6 @@
"RegexMatch",
"MetricComputationError",
"BaseMetric",
"BLEU",
"SentenceBLEU",
"CorpusBLEU",
]