New mechanism for evaluation contributions (#47)
* new mechanism for eval contributions

* added doc following Nathan's comments
clefourrier authored Feb 24, 2024
1 parent 831ad47 commit 92e9b50
Showing 3 changed files with 105 additions and 5 deletions.
18 changes: 14 additions & 4 deletions README.md
@@ -108,13 +108,16 @@ If you want to add a new metric, first check if you can use one of the parametri…
Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.

### Adding a new task
-To add a new task, first **add its dataset** on the hub.
+To add a new task, first open an issue to determine whether it will be integrated in the core evaluations of lighteval or in the community tasks, then **add its dataset** on the hub.
+Note: core evaluations are evals we will add to our test suite to ensure non-regression over time, and which already see high usage in the community.
+A popular community evaluation can become a core evaluation over time.

-Then, **find a suitable prompt function** or **create a new prompt function** in `src.lighteval.tasks.task_prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and the index or indices of the correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an `instruction` field.
+#### Core evaluations
+Prompt function: **find a suitable prompt function** in `src.lighteval.tasks.task_prompt_formatting.py`, or code your own. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and the index or indices of the correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an `instruction` field.
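For illustration, a minimal prompt function for a hypothetical multiple-choice dataset could look like the sketch below; the function name and the dataset columns `question`, `choices`, and `answer` are assumptions, not a real schema, while the `Doc` fields are the ones described above and used in the template file further down:

```python
from lighteval.tasks.requests import Doc


def mytask_prompt_fn(line, task_name: str = None):
    # Turn one dataset line into the Doc object consumed by lighteval.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        # One string per candidate answer; gold_index points into this list.
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=line["answer"],  # index (or list of indices) of the correct choice(s)
        instruction="",  # only set this if the instruction should not be repeated in few-shot examples
    )
```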

-Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
+Summary: create a **line summary** of your evaluation in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
- `name` (str), your evaluation name
-- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for tests, we recommend using "custom").
+- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval", "community", "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
@@ -133,6 +136,13 @@ Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/ta…
- `output_regex` (str), a regex string that will be used to filter your generation. (Generative metrics will only select tokens that are between the first and the second sequence matched by the regex. For example, for a regex matching `\n` and a generation `\nModel generation output\nSome other text`, the metric will only be fed `Model generation output`.)
- `frozen` (bool), for now set to `False`, but we will steadily move all stable tasks to `True`.
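Putting these fields together, a summary line for a hypothetical task might look like the sketch below. Every value is a placeholder: the task name, repo path, metric name, and few-shot selection strategy are assumptions to be replaced with the ones you actually use, and the fields elided from the hunk above follow the same pattern as in the template file shown further down.

```json
{"name": "mynewtask", "suite": ["lighteval"], "prompt_function": "mynewtask_prompt_fn", "hf_repo": "your-org/your-dataset", "hf_subset": "default", "hf_avail_splits": ["train", "test"], "evaluation_splits": ["test"], "few_shots_split": "train", "few_shots_select": "random", "generation_size": -1, "metric": ["loglikelihood_acc"], "stop_sequence": ["\n"], "output_regex": null, "frozen": false}
```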

+Make sure you can launch your model with your new task using `--tasks lighteval|yournewtask|2|0`.
+
+#### Community evaluations
+Copy `community_tasks/_template.py` to `community_tasks/yourevalname.py` and edit it to add your custom tasks (the parameters you can use are explained above). It contains a useful mechanism for datasets with a large number of subsets.
+
+Make sure you can launch your model with your new task using `--tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py`.
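As a sketch of a full invocation: only the `--tasks` and `--custom_tasks` arguments come from this diff, while the entrypoint script, `--model_args`, and `--output_dir` below are assumptions about the current lighteval launcher and may differ in your checkout.

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "community|yournewtask|2|0" \
    --custom_tasks "community_tasks/yourevalname.py" \
    --output_dir "./evals"
```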

## Available metrics
### Metrics for multiple choice tasks
These metrics use log-likelihood of the different possible targets.
90 changes: 90 additions & 0 deletions community_tasks/_template.py
@@ -0,0 +1,90 @@
# ruff: noqa: F405, F403, F401
"""
Custom evaluation tasks for lighteval. Copy this file and complete it with the info for your task.
This file generally create just a TASKS_TABLE and TASKS_GROUPS which are then imported by LightEval.
Author:
"""
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.tasks.tasks_prompt_formatting import LETTER_INDICES


## EVAL WITH NO SUBSET ##
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function="prompt_fn",  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split="",
    few_shots_select="",
    metric=[""],
)

## EVALS WITH SUBSET
# This is how you create a subset task (like MMLU), which has several subsets,
# each being its own evaluation task.

# fmt: off
SAMPLE_SUBSETS = [] # list of all the subsets to use for this eval
# fmt: on


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function="prompt_fn",  # must be defined in the file
            hf_repo="",
            metric=[""],
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split="",
            few_shots_select="",
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


## DEFINE YOUR PROMPT FUNCTIONS
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query="",
        choices=[""],  # list of possible answer strings
        gold_index=0,
        instruction="",
    )


## STORE YOUR EVALS
SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
_TASKS = SUBSET_TASKS + [task]

## MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
TASKS_TABLE = [task.as_dict() for task in _TASKS]

if __name__ == "__main__":
    print([t["name"] for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
2 changes: 1 addition & 1 deletion src/lighteval/tasks/registry.py
@@ -15,7 +15,7 @@

# original is the reimplementation of original evals
# custom is to play around
DEFAULT_SUITES = ["helm", "bigbench", "lighteval", "original", "custom"]
DEFAULT_SUITES = ["helm", "bigbench", "lighteval", "original", "custom", "community"]

TRUNCATE_FEW_SHOTS_DEFAULTS = True

