fix: split BLEU into SentenceBLEU and CorpusBLEU
kadamrahul18 committed Jan 14, 2025
1 parent 84b7669 commit 53e8014
Showing 4 changed files with 248 additions and 204 deletions.
@@ -9,15 +9,15 @@ Heuristic metrics are rule-based evaluation methods that allow you to check spec

You can use the following heuristic metrics:

| Metric       | Description                                                                                             |
|--------------|---------------------------------------------------------------------------------------------------------|
| Equals       | Checks if the output exactly matches an expected string                                                  |
| Contains     | Checks if the output contains a specific substring; matching can be case sensitive or case insensitive   |
| RegexMatch   | Checks if the output matches a specified regular expression pattern                                      |
| IsJson       | Checks if the output is a valid JSON object                                                              |
| Levenshtein  | Calculates the Levenshtein distance between the output and an expected string                            |
| SentenceBLEU | Calculates a single-sentence BLEU score for a candidate vs. one or more references                       |
| CorpusBLEU   | Calculates a corpus-level BLEU score for multiple candidates vs. their references                        |

## Score an LLM response

You can score an LLM response by first initializing the metrics and then calling the `score` method:
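
The concrete call is collapsed in this diff view. As a minimal sketch of the pattern described above (assuming the `Equals` metric from the table accepts the same `output`/`reference` keyword arguments as the BLEU examples further down), it looks roughly like this:

```python
from opik.evaluation.metrics import Equals  # assumed import path, matching the other heuristic metrics

# Initialize the metric, then call `score` with the LLM output and the expected value.
metric = Equals()
score = metric.score(output="Hello world!", reference="Hello world!")
print(score.value, score.reason)
```
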
@@ -101,55 +101,61 @@ print(score)

### BLEU

The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:

- `SentenceBLEU` – Single-sentence BLEU
- `CorpusBLEU` – Corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.

#### Single-Sentence BLEU

Use `SentenceBLEU` to compute single-sentence BLEU between a single candidate and one (or more) references:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(n_grams=4, smoothing_method="method1")

# Single reference
score = metric.score(
    output="Hello world!",
    reference="Hello world"
)
print(score.value, score.reason)

# Multiple references
score = metric.score(
    output="Hello world!",
    reference=["Hello planet", "Hello world"]
)
print(score.value, score.reason)
```

#### Corpus-Level BLEU

Use `CorpusBLEU` to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:

```python
from opik.evaluation.metrics import CorpusBLEU

metric = CorpusBLEU()

outputs = ["Hello there", "This is a test."]
references = [
    # For the first candidate, two references
    ["Hello world", "Hello there"],
    # For the second candidate, one reference
    "This is a test."
]

score = metric.score(output=outputs, reference=references)
print(score.value, score.reason)
```

You can also customize n-grams, smoothing methods, or weights:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

# The arguments to `score` are collapsed in this diff view; they follow the
# same output/reference pattern shown in the SentenceBLEU example above.
score = metric.score(
    # ...
)
print(score.value, score.reason)
```
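
The constructor options above are described as applying to both classes, since both wrap the same NLTK BLEU implementation, so presumably `CorpusBLEU` accepts them as well. A hedged sketch (the keyword arguments are assumed to match `SentenceBLEU`):

```python
from opik.evaluation.metrics import CorpusBLEU

# Assumption: CorpusBLEU exposes the same n_grams / smoothing_method / weights
# options as SentenceBLEU, since both wrap NLTK's BLEU implementation.
metric = CorpusBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

score = metric.score(
    output=["Hello there", "This is a test."],
    reference=[["Hello world", "Hello there"], "This is a test."]
)
print(score.value, score.reason)
```
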
**Note:** If any candidate or reference is empty, `SentenceBLEU` and `CorpusBLEU` raise a `MetricComputationError`. Handle or validate inputs accordingly.
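
A minimal sketch of guarding against that error (the `MetricComputationError` import is taken from the `__init__.py` change below; default `SentenceBLEU()` arguments are assumed, as with `CorpusBLEU()` above):

```python
from opik.evaluation.metrics import SentenceBLEU, MetricComputationError

metric = SentenceBLEU()

try:
    # An empty candidate or reference raises MetricComputationError per the note above.
    score = metric.score(output="", reference="Hello world")
    print(score.value, score.reason)
except MetricComputationError as exc:
    print(f"Could not compute BLEU: {exc}")
```
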
5 changes: 3 additions & 2 deletions sdks/python/src/opik/evaluation/metrics/__init__.py
@@ -3,7 +3,7 @@
from .heuristics.is_json import IsJson
from .heuristics.levenshtein_ratio import LevenshteinRatio
from .heuristics.regex_match import RegexMatch
from .heuristics.bleu import SentenceBLEU, CorpusBLEU
from .llm_judges.answer_relevance.metric import AnswerRelevance
from .llm_judges.context_precision.metric import ContextPrecision
from .llm_judges.context_recall.metric import ContextRecall
@@ -30,5 +30,6 @@
"RegexMatch",
"MetricComputationError",
"BaseMetric",
"BLEU",
"SentenceBLEU",
"CorpusBLEU",
]