Upstream checkpoint 2 #6

Open · wants to merge 21 commits into base: based-fork-3

21 commits
f3b7917
Update README.md (#1430)
davidbhoffmann Feb 15, 2024
a604f05
improve hf_hub activation (#1438)
michaelfeil Feb 18, 2024
19cbb29
Correct typo in task name (#1443)
larekrow Feb 19, 2024
89deeea
update bbh, gsm8k, mmlu parsing logic and prompts (Orca2 bbh_cot_zero…
thnkinbtfly Feb 19, 2024
8680e93
Add a new task HaeRae-Bench (#1445)
h-albert-lee Feb 20, 2024
45941c6
Group reqs by context (#1425)
baberabb Feb 20, 2024
5ab295c
Add a new task GPQA (the part without CoT) (#1434)
uanu2002 Feb 20, 2024
c26a6ac
Added KMMLU evaluation method and changed ReadMe (#1447)
h-albert-lee Feb 21, 2024
ba5cdf0
Add TemplateLM boilerplate LM class (#1279)
anjor Feb 22, 2024
00dc996
Log which subtasks were called with which groups (#1456)
haileyschoelkopf Feb 22, 2024
a72babb
PR fixing the issue #1391 (wrong contexts in the mgsm task) (#1440)
leocnj Feb 22, 2024
2683fbb
feat: Add Weights and Biases support (#1339)
ayulockin Feb 22, 2024
75ac1f4
Fixed generation args issue affection OpenAI completion model (#1458)
Am1n3e Feb 22, 2024
8371662
update parsing logic of mgsm following gsm8k (#1462)
thnkinbtfly Feb 23, 2024
eacb74e
Adding documentation for Weights and Biases CLI interface (#1466)
veekaybee Feb 23, 2024
f78e2da
Add environment and transformers version logging in results dump (#1464)
LSinev Feb 24, 2024
d27c0c0
Apply code autoformatting with Ruff to tasks/*.py an *__init__.py (#1…
LSinev Feb 26, 2024
c1145df
setting trust_remote_code (#1467)
veekaybee Feb 26, 2024
7de7b27
add arabic mmlu (#1402)
khalil-Hennara Feb 26, 2024
4c51111
Add Gemma support (Add flag to control BOS token usage) (#1465)
haileyschoelkopf Feb 26, 2024
b80707e
Merge branch 'based-fork-3' into upstream-checkpoint-2
sedrick-keh-tri Mar 24, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -16,3 +16,5 @@ temp
# IPython
profile_default/
ipython_config.py
wandb
examples/wandb
39 changes: 39 additions & 0 deletions README.md
@@ -245,6 +245,10 @@ For a full list of supported arguments, check out the [interface](https://github

## Visualizing Results

You can seamlessly visualize and analyze the results of your evaluation harness runs using both Weights & Biases (W&B) and Zeno.

### Zeno

You can use [Zeno](https://zenoml.com) to visualize the results of your eval harness runs.

First, head to [hub.zenoml.com](https://hub.zenoml.com) to create an account and get an API key [on your account page](https://hub.zenoml.com/account).
@@ -284,6 +288,41 @@ If you run the eval harness on multiple tasks, the `project_name` will be used a

You can find an example of this workflow in [examples/visualize-zeno.ipynb](examples/visualize-zeno.ipynb).

### Weights and Biases

With the [Weights and Biases](https://wandb.ai/site) (W&B) integration, you can spend more time extracting insights from your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the W&B platform.

The integration provides functionality to:

- automatically log the evaluation results,
- log the samples as W&B Tables for easy visualization,
- log the `results.json` file as an artifact for version control,
- log the `<task_name>_eval_samples.json` file if the samples are logged,
- generate a comprehensive report for analysis and visualization with all the important metrics,
- log task and CLI-specific configs,
- and capture more out of the box, such as the command used to run the evaluation, GPU/CPU counts, and timestamp.

First, install the `lm_eval[wandb]` package extra: `pip install lm_eval[wandb]`.

Authenticate your machine with your unique W&B token (visit https://wandb.ai/authorize to get one), then run `wandb login` in your terminal.

Run the eval harness as usual, adding the `--wandb_args` flag. Use this flag to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as a comma-separated string.

```bash
lm_eval \
--model hf \
--model_args pretrained=microsoft/phi-2,trust_remote_code=True \
--tasks hellaswag,mmlu_abstract_algebra \
--device cuda:0 \
--batch_size 8 \
--output_path output/phi-2 \
--limit 10 \
--wandb_args project=lm-eval-harness-integration \
--log_samples
```

In the stdout, you will find a link to the W&B run page as well as a link to the generated report. You can find an example of this workflow in [examples/visualize-wandb.ipynb](examples/visualize-wandb.ipynb).
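
If you call the harness from Python rather than through the CLI, the same `WandbLogger` used internally by `lm_eval/__main__.py` can be driven directly. The snippet below is a minimal sketch that mirrors the call sequence in that file; the `WandbLogger` constructor signature, the `simple_evaluate` keyword arguments, and the `"samples"` result key are assumptions taken from this version of the code and may change.

```python
# A hedged sketch of logging to W&B from Python; not a documented API.
# It mirrors the call sequence in lm_eval/__main__.py, so the WandbLogger
# constructor (which takes the parsed CLI namespace) and the "samples"
# result key are assumptions tied to this version of the code.
import argparse

from lm_eval import evaluator
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import initialize_tasks

initialize_tasks()  # register the built-in task configs

args = argparse.Namespace(wandb_args="project=lm-eval-harness-integration")
wandb_logger = WandbLogger(args)

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/phi-2,trust_remote_code=True",
    tasks=["hellaswag"],
    limit=10,
    log_samples=True,
)

samples = results.pop("samples")  # __main__.py strips samples before post_init
wandb_logger.post_init(results)   # give the logger access to results and config
wandb_logger.log_eval_result()    # log metrics and artifacts to W&B
if samples:
    wandb_logger.log_eval_samples(samples)  # per-task W&B Tables
wandb_logger.run.finish()         # tear down the wandb run once logging is done
```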

## How to Contribute or Learn More?

For more information on the library and how everything fits together, check out all of our [documentation pages](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)! We plan to post a larger roadmap of desired + planned library improvements soon, with more information on how contributors can help.
2 changes: 2 additions & 0 deletions docs/interface.md
@@ -48,6 +48,8 @@ This mode supports a number of command-line arguments, the details of which can

* `--seed`: Set seed for python's random, numpy and torch. Accepts a comma-separated list of 3 values for python's random, numpy, and torch seeds, respectively, or a single integer to set the same seed for all three. The values are either an integer or 'None' to not set the seed. Default is `0,1234,1234` (for backward compatibility). E.g. `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`. Here numpy's seed is not set since the second value is `None`. E.g, `--seed 42` sets all three seeds to 42.

* `--wandb_args`: Enables logging of evaluation runs to Weights and Biases. Takes a comma-separated string of arguments passed to `wandb.init`, such as `project` and `job_type`. The full list of accepted arguments is [here](https://docs.wandb.ai/ref/python/init).
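  For example, `--wandb_args project=lm-eval,job_type=eval,name=phi-2-run` (the project and run names here are just illustrative) creates a run named `phi-2-run` in the `lm-eval` project with job type `eval`; any keyword argument accepted by `wandb.init` can be passed this way.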

## External Library Usage

We also support using the library's external API for use within model training loops or other scripts.
2 changes: 1 addition & 1 deletion docs/model_guide.md
@@ -66,7 +66,7 @@ All three request types take as input `requests` of type `list[Instance]` that h
- It should return `(ll,) : Tuple[float]` , a.k.a. solely the *loglikelihood* of producing each piece of text given no starting input.


To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py` !
To allow a model to be evaluated on all types of tasks, you will need to implement these three types of measurements (note that `loglikelihood_rolling` is a special case of `loglikelihood`). For a reference implementation, check out `lm_eval/models/huggingface.py`! Additionally, check out `lm_eval.api.model.TemplateLM` for a class that abstracts away some commonly used functions across LM subclasses, or see if your model would lend itself well to subclassing the `lm_eval.models.huggingface.HFLM` class and overriding just the initialization or a couple of methods!
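
A minimal, hypothetical sketch of a `TemplateLM` subclass is shown below. It implements only the abstract interface defined in `lm_eval/api/model.py`; the `tokenizer` and `client` objects and their methods (`encode`, `score`, `rolling_logprob`, `generate`) are placeholders standing in for your own inference stack, not real APIs.

```python
from typing import List, Tuple

from lm_eval.api.model import TemplateLM
from lm_eval.api.registry import register_model


@register_model("my-backend")  # hypothetical registry name
class MyBackendLM(TemplateLM):
    def __init__(self, tokenizer, client):
        super().__init__()
        self.tokenizer = tokenizer  # placeholder: anything exposing encode()
        self.client = client        # placeholder: your inference backend

    @property
    def eot_token_id(self) -> int:
        # Token used as context when a request has an empty context string.
        return self.tokenizer.eos_token_id

    def tok_encode(self, string: str, **kwargs) -> List[int]:
        return self.tokenizer.encode(string)

    def _loglikelihood_tokens(self, requests, **kwargs) -> List[Tuple[float, bool]]:
        # TemplateLM.loglikelihood() passes ((context, continuation), ctx_tokens,
        # cont_tokens) tuples; return the continuation's total logprob and whether
        # it would have been produced greedily.
        out = []
        for (_ctx_str, _cont_str), ctx_tokens, cont_tokens in requests:
            logprob, is_greedy = self.client.score(ctx_tokens, cont_tokens)
            out.append((logprob, is_greedy))
        return out

    def loglikelihood_rolling(self, requests):
        # One loglikelihood per document, scored with no starting context.
        return [self.client.rolling_logprob(self.tok_encode(req.args[0])) for req in requests]

    def generate_until(self, requests) -> List[str]:
        # Each request carries (context, generation_kwargs).
        return [
            self.client.generate(context, **gen_kwargs)
            for context, gen_kwargs in (req.args for req in requests)
        ]
```
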

**Tip: be careful of indexing in loglikelihood!**

130 changes: 130 additions & 0 deletions examples/visualize-wandb.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "fc477b96-adee-4829-a9d7-a5eb990df358",
"metadata": {},
"source": [
"# Visualizing Results in Weights and Biases\n",
"\n",
"With the Weights and Biases integration, you can now spend more time extracting deeper insights into your evaluation results. The integration is designed to streamline the process of logging and visualizing experiment results using the Weights & Biases (W&B) platform.\n",
"\n",
"The integration provide functionalities\n",
"\n",
"- to automatically log the evaluation results,\n",
"- log the samples as W&B Tables for easy visualization,\n",
"- log the `results.json` file as an artifact for version control,\n",
"- log the `<task_name>_eval_samples.json` file if the samples are logged,\n",
"- generate a comprehensive report for analysis and visualization with all the important metric,\n",
"- log task and cli configs,\n",
"- and more out of the box like the command used to run the evaluation, GPU/CPU counts, timestamp, etc.\n",
"\n",
"The integration is super easy to use with the eval harness. Let's see how!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3851439a-bff4-41f2-bf21-1b3d8704913b",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Install this project if you did not already have it.\n",
"# This is all that is needed to be installed to start using Weights and Biases\n",
"\n",
"!pip -qq install -e ..[wandb]"
]
},
{
"cell_type": "markdown",
"id": "8507fd7e-3b99-4a92-89fa-9eaada74ba91",
"metadata": {},
"source": [
"# Run the Eval Harness\n",
"\n",
"Run the eval harness as usual with a `wandb_args` flag. This flag is used to provide arguments for initializing a wandb run ([wandb.init](https://docs.wandb.ai/ref/python/init)) as comma separated string arguments.\n",
"\n",
"If `wandb_args` flag is used, the metrics and all other goodness will be automatically logged to Weights and Biases. In the stdout, you will find the link to the W&B run page as well as link to the generated report."
]
},
{
"cell_type": "markdown",
"id": "eec5866e-f01e-42f8-8803-9d77472ef991",
"metadata": {},
"source": [
"## Set your API Key\n",
"\n",
"Before you can use W&B, you need to authenticate your machine with an authentication key. Visit https://wandb.ai/authorize to get one."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d824d163-71a9-4313-935d-f1d56397841c",
"metadata": {},
"outputs": [],
"source": [
"import wandb\n",
"wandb.login()"
]
},
{
"cell_type": "markdown",
"id": "124e4a34-1547-4bed-bc09-db012bacbda6",
"metadata": {},
"source": [
"> Note that if you are using command line you can simply authenticate your machine by doing `wandb login` in your terminal. For more info check out the [documentation](https://docs.wandb.ai/quickstart#2-log-in-to-wb)."
]
},
{
"cell_type": "markdown",
"id": "abc6f6b6-179a-4aff-ada9-f380fb74df6e",
"metadata": {},
"source": [
"## Run and log to W&B"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd0a8130-a97b-451a-acd2-3f9885b88643",
"metadata": {},
"outputs": [],
"source": [
"!lm_eval \\\n",
" --model hf \\\n",
" --model_args pretrained=microsoft/phi-2,trust_remote_code=True \\\n",
" --tasks hellaswag,mmlu_abstract_algebra \\\n",
" --device cuda:0 \\\n",
" --batch_size 8 \\\n",
" --output_path output/phi-2 \\\n",
" --limit 10 \\\n",
" --wandb_args project=lm-eval-harness-integration \\\n",
" --log_samples"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
23 changes: 23 additions & 0 deletions lm_eval/__main__.py
@@ -11,6 +11,7 @@
import numpy as np

from lm_eval import evaluator, utils
from lm_eval.logging_utils import WandbLogger
from lm_eval.tasks import TaskManager, include_path, initialize_tasks
from lm_eval.utils import make_table

@@ -167,6 +168,11 @@ def parse_eval_args() -> argparse.Namespace:
metavar="CRITICAL|ERROR|WARNING|INFO|DEBUG",
help="Controls the reported logging error level. Set to DEBUG when testing + adding new task configurations for comprehensive log output.",
)
parser.add_argument(
"--wandb_args",
default="",
help="Comma separated string arguments passed to wandb.init, e.g. `project=lm-eval,job_type=eval",
)
parser.add_argument(
"--predict_only",
"-x",
@@ -195,6 +201,9 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
# we allow for args to be passed externally, else we parse them ourselves
args = parse_eval_args()

if args.wandb_args:
wandb_logger = WandbLogger(args)

eval_logger = utils.eval_logger
eval_logger.setLevel(getattr(logging, f"{args.verbosity}"))
eval_logger.info(f"Verbosity set to {args.verbosity}")
@@ -309,6 +318,16 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:

batch_sizes = ",".join(map(str, results["config"]["batch_sizes"]))

# Add W&B logging
if args.wandb_args:
try:
wandb_logger.post_init(results)
wandb_logger.log_eval_result()
if args.log_samples:
wandb_logger.log_eval_samples(samples)
except Exception as e:
eval_logger.info(f"Logging to Weights and Biases failed due to {e}")

if args.output_path:
output_path_file.open("w", encoding="utf-8").write(dumped)

@@ -334,6 +353,10 @@ def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
if "groups" in results:
print(make_table(results, "groups"))

if args.wandb_args:
# Tear down wandb run once all the logging is done.
wandb_logger.run.finish()


if __name__ == "__main__":
cli_evaluate()
4 changes: 2 additions & 2 deletions lm_eval/api/metrics.py
@@ -4,11 +4,11 @@
from collections.abc import Iterable
from typing import List

import evaluate as hf_evaluate
import numpy as np
import sacrebleu
import sklearn.metrics

import evaluate
from lm_eval.api.registry import register_aggregation, register_metric


@@ -146,7 +146,7 @@ def acc_mutual_info_fn(items): # This is a passthrough function
return items


exact_match = evaluate.load("exact_match")
exact_match = hf_evaluate.load("exact_match")


@register_metric(
58 changes: 58 additions & 0 deletions lm_eval/api/model.py
@@ -247,3 +247,61 @@ def fn(requests):

def get_cache_hook(self):
return CacheHook(self)


class TemplateLM(LM):
"""
A class acting as intermediary between the LM base class
and boilerplate often included in other LM subclasses.
"""

@property
@abc.abstractmethod
def eot_token_id(self):
pass

@abc.abstractmethod
def tok_encode(self, string: str, **kwargs):
pass

@abc.abstractmethod
def _loglikelihood_tokens(self, requests, **kwargs):
pass

def _encode_pair(self, context, continuation):
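# If the context ends in whitespace, move that whitespace onto the continuation
# so the context is tokenized the same way it appears in the concatenated
# string (many BPE tokenizers attach leading spaces to the following token).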
n_spaces = len(context) - len(context.rstrip())
if n_spaces > 0:
continuation = context[-n_spaces:] + continuation
context = context[:-n_spaces]

whole_enc = self.tok_encode(context + continuation)
context_enc = self.tok_encode(context)

context_enc_len = len(context_enc)
continuation_enc = whole_enc[context_enc_len:]

return context_enc, continuation_enc

def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
new_reqs = []
for context, continuation in [req.args for req in requests]:
if context == "":
# end of text as context
context_enc, continuation_enc = (
[self.eot_token_id],
self.tok_encode(continuation),
)
else:
context_enc, continuation_enc = self._encode_pair(context, continuation)

new_reqs.append(((context, continuation), context_enc, continuation_enc))

return self._loglikelihood_tokens(new_reqs)

@abc.abstractmethod
def loglikelihood_rolling(self, requests) -> List[Tuple[float, bool]]:
pass

@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
pass
5 changes: 3 additions & 2 deletions lm_eval/api/registry.py
@@ -1,7 +1,8 @@
import logging
from typing import Callable, Dict

import evaluate
import evaluate as hf_evaluate

from lm_eval.api.model import LM


@@ -128,7 +129,7 @@ def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
)

try:
metric_object = evaluate.load(name)
metric_object = hf_evaluate.load(name)
return metric_object.compute
except Exception:
eval_logger.error(
3 changes: 2 additions & 1 deletion lm_eval/api/task.py
@@ -4,6 +4,7 @@
import random
import re
from collections.abc import Callable
from copy import deepcopy
from dataclasses import asdict, dataclass
from inspect import getsource
from typing import Any, List, Literal, Tuple, Union
@@ -1064,7 +1065,7 @@ def construct_requests(
return request_list

elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self.config.generation_kwargs)
arguments = (ctx, deepcopy(self.config.generation_kwargs))

return Instance(
request_type=self.OUTPUT_TYPE, doc=doc, arguments=arguments, idx=0, **kwargs