New mechanism for evaluation contributions (#47)
* new mechanism for eval contributions

* added doc following Nathan's comments
clefourrier authored Feb 24, 2024
1 parent 831ad47 commit 92e9b50
Showing 3 changed files with 105 additions and 5 deletions.
18 changes: 14 additions & 4 deletions README.md
@@ -108,13 +108,16 @@ If you want to add a new metric, first check if you can use one of the parametri…
Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.

### Adding a new task
-To add a new task, first **add its dataset** on the hub.
+To add a new task, first open an issue to determine whether it will be integrated in the core evaluations of lighteval or in the community tasks, then **add its dataset** on the hub.
+Note: core evaluations are evals we will add to our test suite to ensure non-regression over time, and which already see high usage in the community.
+A popular community evaluation can become a core evaluation over time.

-Then, **find a suitable prompt function** or **create a new prompt function** in `src.lighteval.tasks.task_prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and the index or indices of the correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an `instruction` field.
+#### Core evaluations
+Prompt function: **find a suitable prompt function** in `src.lighteval.tasks.task_prompt_formatting.py`, or code your own. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and the index or indices of the correct answers. If your query contains an instruction which should not be repeated in a few-shot setup, add it to an `instruction` field.
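For illustration, a minimal prompt function for a hypothetical multiple-choice dataset could look like the sketch below; the function name and the dataset columns `question`, `choices`, and `answer` are assumptions, not a real schema, while the `Doc` fields are the ones described above and used in the template file further down:

```python
from lighteval.tasks.requests import Doc


def mytask_prompt_fn(line, task_name: str = None):
    # Turn one dataset line into the Doc object consumed by lighteval.
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        # One string per candidate answer; gold_index points into this list.
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=line["answer"],  # index (or list of indices) of the correct choice(s)
        instruction="",  # only set this if the instruction should not be repeated in few-shot examples
    )
```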

-Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
+Summary: create a **line summary** of your evaluation in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
- `name` (str), your evaluation name
-- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for tests, we recommend using "custom").
+- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval", "community", "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
@@ -133,6 +136,13 @@ Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/ta…
- `output_regex` (str), a regex string that will be used to filter your generation. (Generative metrics will only select tokens that are between the first and the second sequence matched by the regex. For example, for a regex matching `\n` and a generation `\nModel generation output\nSome other text`, the metric will only be fed `Model generation output`.)
- `frozen` (bool), for now set to `False`, but we will steadily move all stable tasks to `True`.
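Putting these fields together, a summary line for a hypothetical task might look like the sketch below. Every value is a placeholder: the task name, repo path, metric name, and few-shot selection strategy are assumptions to be replaced with the ones you actually use, and the fields elided from the hunk above follow the same pattern as in the template file shown further down.

```json
{"name": "mynewtask", "suite": ["lighteval"], "prompt_function": "mynewtask_prompt_fn", "hf_repo": "your-org/your-dataset", "hf_subset": "default", "hf_avail_splits": ["train", "test"], "evaluation_splits": ["test"], "few_shots_split": "train", "few_shots_select": "random", "generation_size": -1, "metric": ["loglikelihood_acc"], "stop_sequence": ["\n"], "output_regex": null, "frozen": false}
```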

+Make sure you can launch your model with your new task using `--tasks lighteval|yournewtask|2|0`.
+
+#### Community evaluations
+Copy `community_tasks/_template.py` to `community_tasks/yourevalname.py` and edit it to add your custom tasks (the parameters you can use are explained above). It contains a useful mechanism for datasets with a large number of subsets.
+
+Make sure you can launch your model with your new task using `--tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py`.
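As a sketch of a full invocation: only the `--tasks` and `--custom_tasks` arguments come from this diff, while the entrypoint script, `--model_args`, and `--output_dir` below are assumptions about the current lighteval launcher and may differ in your checkout.

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "community|yournewtask|2|0" \
    --custom_tasks "community_tasks/yourevalname.py" \
    --output_dir "./evals"
```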

## Available metrics
### Metrics for multiple choice tasks
These metrics use log-likelihood of the different possible targets.
90 changes: 90 additions & 0 deletions community_tasks/_template.py
@@ -0,0 +1,90 @@
# ruff: noqa: F405, F403, F401
"""
Custom evaluation tasks for lighteval. Copy this file and complete it with the info for your task.
This file generally create just a TASKS_TABLE and TASKS_GROUPS which are then imported by LightEval.
Author:
"""
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
from lighteval.tasks.tasks_prompt_formatting import LETTER_INDICES


## EVAL WITH NO SUBSET ##
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function="prompt_fn",  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split="",
    few_shots_select="",
    metric=[""],
)

## EVALS WITH SUBSET
# This is how you create a subset task (like MMLU), which has several subsets,
# each being its own evaluation task.

# fmt: off
SAMPLE_SUBSETS = [] # list of all the subsets to use for this eval
# fmt: on


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function="prompt_fn",  # must be defined in the file
            hf_repo="",
            metric=[""],
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split="",
            few_shots_select="",
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


## DEFINE YOUR PROMPT FUNCTIONS
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query="",
        choices=[""],  # list of possible answer strings
        gold_index=0,
        instruction="",
    )


## STORE YOUR EVALS
SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
_TASKS = SUBSET_TASKS + [task]

## MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
TASKS_TABLE = [task.as_dict() for task in _TASKS]

if __name__ == "__main__":
    print([t["name"] for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
2 changes: 1 addition & 1 deletion src/lighteval/tasks/registry.py
@@ -15,7 +15,7 @@

# original is the reimplementation of original evals
# custom is to play around
DEFAULT_SUITES = ["helm", "bigbench", "lighteval", "original", "custom"]
DEFAULT_SUITES = ["helm", "bigbench", "lighteval", "original", "custom", "community"]

TRUNCATE_FEW_SHOTS_DEFAULTS = True

