Upstream checkpoint 2 #6

Open · wants to merge 21 commits into base: based-fork-3

21 commits
f3b7917
Update README.md (#1430)
davidbhoffmann Feb 15, 2024
a604f05
improve hf_hub activation (#1438)
michaelfeil Feb 18, 2024
19cbb29
Correct typo in task name (#1443)
larekrow Feb 19, 2024
89deeea
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zero…
thnkinbtfly Feb 19, 2024
8680e93
Add a new task HaeRae-Bench (#1445)
h-albert-lee Feb 20, 2024
45941c6
Group reqs by context (#1425)
baberabb Feb 20, 2024
5ab295c
Add a new task GPQA (the part without CoT) (#1434)
uanu2002 Feb 20, 2024
c26a6ac
Added KMMLU evaluation method and changed ReadMe (#1447)
h-albert-lee Feb 21, 2024
ba5cdf0
Add TemplateLM boilerplate LM class (#1279)
anjor Feb 22, 2024
00dc996
Log which subtasks were called with which groups (#1456)
haileyschoelkopf Feb 22, 2024
a72babb
PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440)
leocnj Feb 22, 2024
2683fbb
feat: Add Weights and Biases support (#1339)
ayulockin Feb 22, 2024
75ac1f4
Fixed generation args issue affection OpenAI completion model (#1458)
Am1n3e Feb 22, 2024
8371662
update parsing logic of mgsm following gsm8k (#1462)
thnkinbtfly Feb 23, 2024
eacb74e
Adding documentation for Weights and Biases CLI interface (#1466)
veekaybee Feb 23, 2024
f78e2da
Add environment and transformers version logging in results dump (#1464)
LSinev Feb 24, 2024
d27c0c0
Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (#1…
LSinev Feb 26, 2024
c1145df
setting trust_remote_code (#1467)
veekaybee Feb 26, 2024
7de7b27
add arabic mmlu (#1402)
khalil-Hennara Feb 26, 2024
4c51111
Add Gemma support (Add flag to control BOS token usage) (#1465)
haileyschoelkopf Feb 26, 2024
b80707e
Merge branch 'based-fork-3' into upstream-checkpoint-2
sedrick-keh-tri Mar 24, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -16,3 +16,5 @@ temp
# IPython
profile_default/
ipython_config.py
wandb
examples/wandb
39 changes: 39 additions & 0 deletions README.md
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github

## Visualizing Results

You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.

### Zeno

You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.

First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).

### Weights and Biases

With the [Weights and Biases](https://wandb.ai/site) (W&B) integration, you can spend more time extracting insights from your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the W&B platform.

The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task and CLI-specific configs,
- and capture more out of the box, such as the command used to run the evaluation, GPU/CPU counts, and timestamp.

First, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.

Authenticate your machine with your unique W&B token (visit https://wandb.ai/authorize to get one), then run `wandb login` in your terminal.

Run the eval harness as usual, adding the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string.

```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```

In the stdout, you will find a link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).
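
If you call the harness from Python rather than through the CLI, the same `WandbLogger` used internally by `lm_eval/__main__.py` can be driven directly. The snippet below is a minimal sketch that mirrors the call sequence in that file; the `WandbLogger` constructor signature, the `simple_evaluate` keyword arguments, and the `"samples"` result key are assumptions taken from this version of the code and may change.

```python
# A hedged sketch of logging to W&B from Python; not a documented API.
# It mirrors the call sequence in lm_eval/__main__.py, so the WandbLogger
# constructor (which takes the parsed CLI namespace) and the "samples"
# result key are assumptions tied to this version of the code.
import argparse

from lm_eval import evaluator
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in task configs

args = argparse.Namespace(wandb_args="project=lm-eval-harness-integration")
wandb_logger = WandbLogger(args)

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,trust_remote_code=True",
    tasks=["hellaswag"],
    limit=10,
    log_samples=True,
)

samples = results.pop("samples")  # __main__.py strips samples before post_init
wandb_logger.post_init(results)   # give the logger access to results and config
wandb_logger.log_eval_result()    # log metrics and artifacts to W&B
if samples:
    wandb_logger.log_eval_samples(samples)  # per-task W&B Tables
wandb_logger.run.finish()         # tear down the wandb run once logging is done
```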

## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
2 changes: 2 additions & 0 deletions docs/interface.md
@@ -48,6 +48,8 @@ This mode supports a number of command-line arguments, the details of which can

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42.

* `--wandb_args`: Enables logging of evaluation runs to Weights and Biases. Takes a comma-separated string of arguments passed to `wandb.init`, such as `project` and `job_type`. The full list of accepted arguments is [here](https://docs.wandb.ai/ref/python/init).
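  For example, `--wandb_args project=lm-eval,job_type=eval,name=phi-2-run` (the project and run names here are just illustrative) creates a run named `phi-2-run` in the `lm-eval` project with job type `eval`; any keyword argument accepted by `wandb.init` can be passed this way.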

## External Library Usage

We also support using the library's external API for use within model training loops or other scripts.
2 changes: 1 addition & 1 deletion docs/model_guide.md
@@ -66,7 +66,7 @@ All three request types take as input `requests` of type `list[Instance]` that h
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.


To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` !
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py`! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple of methods!
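
A minimal, hypothetical sketch of a `TemplateLM` subclass is shown below. It implements only the abstract interface defined in `lm_eval/api/model.py`; the `tokenizer` and `client` objects and their methods (`encode`, `score`, `rolling_logprob`, `generate`) are placeholders standing in for your own inference stack, not real APIs.

```python
from typing import List, Tuple

from lm_eval.api.model import TemplateLM
from lm_eval.api.registry import register_model


@register_model("my-backend")  # hypothetical registry name
class MyBackendLM(TemplateLM):
    def __init__(self, tokenizer, client):
        super().__init__()
        self.tokenizer = tokenizer  # placeholder: anything exposing encode()
        self.client = client        # placeholder: your inference backend

    @property
    def eot_token_id(self) -> int:
        # Token used as context when a request has an empty context string.
        return self.tokenizer.eos_token_id

    def tok_encode(self, string: str, **kwargs) -> List[int]:
        return self.tokenizer.encode(string)

    def _loglikelihood_tokens(self, requests, **kwargs) -> List[Tuple[float, bool]]:
        # TemplateLM.loglikelihood() passes ((context, continuation), ctx_tokens,
        # cont_tokens) tuples; return the continuation's total logprob and whether
        # it would have been produced greedily.
        out = []
        for (_ctx_str, _cont_str), ctx_tokens, cont_tokens in requests:
            logprob, is_greedy = self.client.score(ctx_tokens, cont_tokens)
            out.append((logprob, is_greedy))
        return out

    def loglikelihood_rolling(self, requests):
        # One loglikelihood per document, scored with no starting context.
        return [self.client.rolling_logprob(self.tok_encode(req.args[0])) for req in requests]

    def generate_until(self, requests) -> List[str]:
        # Each request carries (context, generation_kwargs).
        return [
            self.client.generate(context, **gen_kwargs)
            for context, gen_kwargs in (req.args for req in requests)
        ]
```
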

**Tip: be careful of indexing in loglikelihood!**

130 changes: 130 additions & 0 deletions examples/visualize-wandb.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
"> Note that if you are using command line you can simply authenticate your machine by doing `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
23 changes: 23 additions & 0 deletions lm_eval/__main__.py
@@ -11,6 +11,7 @@
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table

@@ -167,6 +168,11 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--wandb_args",
default="",
help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval",
)
parser.add_argument(
"--predict_only",
"-x",
@@ -195,6 +201,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
# we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args()

if args.wandb_args:
wandb_logger = WandbLogger(args)

eval_logger = utils.eval_logger
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
eval_logger.info(f"Verbosity set to {args.verbosity}")
@@ -309,6 +318,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))

# Add W&B logging
if args.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if args.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to Weights and Biases failed due to {e}")

if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)

@@ -334,6 +353,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if "groups" in results:
print(make_table(results, "groups"))

if args.wandb_args:
# Tear down wandb run once all the logging is done.
wandb_logger.run.finish()


if __name__ == "__main__":
cli_evaluate()
4 changes: 2 additions & 2 deletions lm_eval/api/metrics.py
@@ -4,11 +4,11 @@
from collections.abc import Iterable
from typing import List

import evaluate as hf_evaluate
import numpy as np
import sacrebleu
import sklearn.metrics

import evaluate
from lm_eval.api.registry import register_aggregation, register_metric


@@ -146,7 +146,7 @@ def acc_mutual_info_fn(items): # This is a passthrough function
return items


exact_match = evaluate.load("exact_match")
exact_match = hf_evaluate.load("exact_match")


@register_metric(
58 changes: 58 additions & 0 deletions lm_eval/api/model.py
@@ -247,3 +247,61 @@ def fn(requests):

def get_cache_hook(self):
return CacheHook(self)


class TemplateLM(LM):
"""
A class acting as intermediary between the LM base class
and boilerplate often included in other LM subclasses.
"""

@property
@abc.abstractmethod
def eot_token_id(self):
pass

@abc.abstractmethod
def tok_encode(self, string: str, **kwargs):
pass

@abc.abstractmethod
def _loglikelihood_tokens(self, requests, **kwargs):
pass

def _encode_pair(self, context, continuation):
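# If the context ends in whitespace, move that whitespace onto the continuation
# so the context is tokenized the same way it appears in the concatenated
# string (many BPE tokenizers attach leading spaces to the following token).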
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]

whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)

context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]

return context_enc, continuation_enc

def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
# end of text as context
context_enc, continuation_enc = (
[self.eot_token_id],
self.tok_encode(continuation),
)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)

new_reqs.append(((context, continuation), context_enc, continuation_enc))

return self._loglikelihood_tokens(new_reqs)

@abc.abstractmethod
def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
pass

@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
pass
5 changes: 3 additions & 2 deletions lm_eval/api/registry.py
@@ -1,7 +1,8 @@
import logging
from typing import Callable, Dict

import evaluate
import evaluate as hf_evaluate

from lm_eval.api.model import LM


@@ -128,7 +129,7 @@ def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
)

try:
metric_object = evaluate.load(name)
metric_object = hf_evaluate.load(name)
return metric_object.compute
except Exception:
eval_logger.error(
3 changes: 2 additions & 1 deletion lm_eval/api/task.py
@@ -4,6 +4,7 @@
import random
import re
from collections.abc import Callable
from copy import deepcopy
from dataclasses import asdict, dataclass
from inspect import getsource
from typing import Any, List, Literal, Tuple, Union
@@ -1064,7 +1065,7 @@ def construct_requests(
return request_list

elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self.config.generation_kwargs)
arguments = (ctx, deepcopy(self.config.generation_kwargs))

return Instance(
request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs