Commit
Implemented BLEU score, wrote unit tests and documentation for it. (#1006)

* Implemented BLEU score, wrote unit tests and documentation for it.

* modified bleu.py to use nltk.translate.bleu_score and rewrote unit tests

* fix: implemented requested code review changes in bleu.py

* fix: split BLEU into SentenceBLEU and CorpusBLEU

* fix: gave nltk_bleu_score.SmoothingFunction quotes so that it passes e2e tests

---------

Co-authored-by: Aliaksandr Kuzmik <[email protected]>
kadamrahul18 and alexkuzmik authored Jan 17, 2025
1 parent 5268758 commit e835dfd
Showing 6 changed files with 508 additions and 49 deletions.
42 changes: 6 additions & 36 deletions apps/opik-documentation/documentation/docs/cookbook/dspy.ipynb
@@ -37,17 +37,9 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"OPIK: Opik is already configured. You can check the settings by viewing the config file at /Users/jacquesverre/.opik.config\n"
]
}
],
"outputs": [],
"source": [
"import opik\n",
"\n",
@@ -56,7 +48,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -78,7 +70,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -95,31 +87,9 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:langfuse:Langfuse client is disabled since no public_key was provided as a parameter or environment variable 'LANGFUSE_PUBLIC_KEY'. See our docs: https://langfuse.com/docs/sdk/python/low-level-sdk#initialize-client\n",
"OPIK: Started logging traces to the \"DSPY\" project at https://www.comet.com/opik/jacques-comet/redirect/projects?name=DSPY.\n"
]
},
{
"data": {
"text/plain": [
"Prediction(\n",
" reasoning='The meaning of life is a philosophical question that has been contemplated by humans for centuries. Different cultures, religions, and individuals have proposed various interpretations. Some suggest that the meaning of life is to seek happiness, fulfillment, and personal growth, while others believe it is about serving a higher purpose or contributing to the well-being of others. Ultimately, the meaning of life may vary from person to person, shaped by personal experiences, beliefs, and values.',\n",
" answer=\"The meaning of life is subjective and can vary greatly among individuals. It may involve seeking happiness, personal growth, and contributing to the well-being of others, or fulfilling a higher purpose, depending on one's beliefs and experiences.\"\n",
")"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"cot = dspy.ChainOfThought(\"question -> answer\")\n",
"cot(question=\"What is the meaning of life?\")"
@@ -9,14 +9,15 @@ Heuristic metrics are rule-based evaluation methods that allow you to check spec

You can use the following heuristic metrics:

| Metric | Description |
| ----------- | ------------------------------------------------------------------------------------------------- |
| Equals | Checks if the output exactly matches an expected string |
| Contains | Check if the output contains a specific substring, can be both case sensitive or case insensitive |
| RegexMatch | Checks if the output matches a specified regular expression pattern |
| IsJson | Checks if the output is a valid JSON object |
| Levenshtein | Calculates the Levenshtein distance between the output and an expected string |

| Metric | Description |
|--------------|---------------------------------------------------------------------------------------------------|
| Equals | Checks if the output exactly matches an expected string |
| Contains     | Checks if the output contains a specific substring (case sensitive or case insensitive)            |
| RegexMatch | Checks if the output matches a specified regular expression pattern |
| IsJson | Checks if the output is a valid JSON object |
| Levenshtein | Calculates the Levenshtein distance between the output and an expected string |
| SentenceBLEU | Calculates a single-sentence BLEU score for a candidate vs. one or more references |
| CorpusBLEU | Calculates a corpus-level BLEU score for multiple candidates vs. their references |

## Score an LLM response

You can score an LLM response by first initializing the metrics and then calling the `score` method:
@@ -97,3 +98,71 @@ metric = LevenshteinRatio()
score = metric.score(output="Hello world !", reference="hello")
print(score)
```

### BLEU

The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:

- `SentenceBLEU` – Single-sentence BLEU
- `CorpusBLEU` – Corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.
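
As a rough sketch of what this NLTK-backed computation amounts to at the sentence level — assuming naive whitespace tokenization and uniform 4-gram weights here, which may not match Opik's exact preprocessing — the equivalent raw call to `nltk.translate.bleu_score` looks roughly like this:

```python
# Hedged sketch of the underlying NLTK call; the tokenization and defaults
# shown here are assumptions, not necessarily what Opik's wrapper does.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

candidate = "Hello world !".split()  # naive whitespace tokenization
references = [ref.split() for ref in ["Hello world", "Hello planet"]]

bleu = sentence_bleu(
    references,
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),                # uniform weights up to 4-grams
    smoothing_function=SmoothingFunction().method1,  # NLTK smoothing "method1"
)
print(bleu)
```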

Use `SentenceBLEU` to compute single-sentence BLEU between a single candidate and one (or more) references:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(n_grams=4, smoothing_method="method1")

# Single reference
score = metric.score(
    output="Hello world!",
    reference="Hello world"
)
print(score.value, score.reason)

# Multiple references
score = metric.score(
    output="Hello world!",
    reference=["Hello planet", "Hello world"]
)
print(score.value, score.reason)
```

Use `CorpusBLEU` to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:

```python
from opik.evaluation.metrics import CorpusBLEU

metric = CorpusBLEU()

outputs = ["Hello there", "This is a test."]
references = [
    # For the first candidate, two references
    ["Hello world", "Hello there"],
    # For the second candidate, one reference
    "This is a test."
]

score = metric.score(output=outputs, reference=references)
print(score.value, score.reason)
```

You can also customize n-grams, smoothing methods, or weights:

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

score = metric.score(
    output="The cat sat on the mat",
    reference=["The cat is on the mat", "A cat sat here on the mat"]
)
print(score.value, score.reason)
```

**Note:** If any candidate or reference is empty, `SentenceBLEU` or `CorpusBLEU` will raise a `MetricComputationError`. Handle or validate inputs accordingly, as sketched below.
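
For example, a minimal sketch of guarding a call against empty inputs (assuming `MetricComputationError` can be imported from `opik.evaluation.metrics`, as the export list below suggests):

```python
# Hedged sketch: catch the error raised when a candidate or reference is empty.
from opik.evaluation.metrics import MetricComputationError, SentenceBLEU

metric = SentenceBLEU()

try:
    # An empty output is expected to raise MetricComputationError.
    result = metric.score(output="", reference="Hello world")
    print(result.value, result.reason)
except MetricComputationError as exc:
    print(f"Could not compute BLEU score: {exc}")
```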
3 changes: 3 additions & 0 deletions sdks/python/src/opik/evaluation/metrics/__init__.py
@@ -3,6 +3,7 @@
from .heuristics.is_json import IsJson
from .heuristics.levenshtein_ratio import LevenshteinRatio
from .heuristics.regex_match import RegexMatch
from .heuristics.bleu import SentenceBLEU, CorpusBLEU
from .llm_judges.answer_relevance.metric import AnswerRelevance
from .llm_judges.context_precision.metric import ContextPrecision
from .llm_judges.context_recall.metric import ContextRecall
@@ -29,4 +30,6 @@
"RegexMatch",
"MetricComputationError",
"BaseMetric",
"SentenceBLEU",
"CorpusBLEU",
]