* Add docs
* Add wiki to docs
* Adapt wiki as docs
* Force docs build
* Fix link in _toctree
* Add titles to docs pages
* Update docs/source/evaluate-the-model-on-a-server-or-container.mdx

Co-authored-by: Alvaro Bartolome <[email protected]>

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>

1 parent f109de1 · commit 0c80801

Showing 16 changed files with 2,407 additions and 0 deletions.

@@ -0,0 +1,18 @@

```yaml
name: Build Documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: lighteval
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
```

@@ -0,0 +1,16 @@

```yaml
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: lighteval
```

@@ -0,0 +1,30 @@

```yaml
- sections:
  - local: index
    title: 🤗 Lighteval
  - local: installation
    title: Installation
  - local: quicktour
    title: Quicktour
  title: Getting started
- sections:
  - local: saving-and-reading-results
    title: Save and read results
  - local: using-the-python-api
    title: Use the Python API
  - local: adding-a-custom-task
    title: Add a custom task
  - local: adding-a-new-metric
    title: Add a custom metric
  - local: use-vllm-as-backend
    title: Use VLLM as backend
  - local: evaluate-the-model-on-a-server-or-container
    title: Evaluate on Server
  - local: contributing-to-multilingual-evaluations
    title: Contributing to multilingual evaluations
  title: Guides
- sections:
  - local: metric-list
    title: Available Metrics
  - local: available-tasks
    title: Available Tasks
  title: API
```

@@ -0,0 +1,196 @@

# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated into the core evaluations of lighteval, the extended tasks, or the
community tasks, and add its dataset to the Hugging Face Hub.

- Core evaluations only require standard logic in their metrics and
  processing, and are added to our test suite to ensure non-regression over
  time. They already see high usage in the community.
- Extended evaluations require custom logic in their metrics (complex
  normalisation, an LLM as a judge, ...); we added them to make users' lives
  easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can become an extended or core evaluation over time.

> [!TIP]
> You can find examples of custom tasks in the <a href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a> directory.

## Step-by-step creation of a custom task

> [!WARNING]
> To contribute your custom task to the lighteval repo, you first need to
> install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset to a document to be used for evaluation.

```python
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```

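
As a quick sanity check, you can call the prompt function on a dataset-style row by hand, building on the `prompt_fn` defined above. The row and the `task_name` string below are purely hypothetical and only mirror the fields the function accesses:

```python
# Hypothetical dataset row, matching the fields accessed in prompt_fn above
line = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Paris", "Rome"],
    "gold": 1,
}

doc = prompt_fn(line, task_name="community|myothertask")
# doc.query      -> "What is the capital of France?"
# doc.choices    -> [" Berlin", " Paris", " Rome"]
# doc.gold_index -> 1
```
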
Then you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric).

```python
custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.IGNORED,
    use_case=MetricUseCase.NONE,
    sample_level_fn=lambda x: x,  # how to compute the score for one sample
    corpus_level_fn=np.mean,  # how to aggregate the sample-level scores
)
```

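
If an existing metric fits your task, you can skip the custom definition and reference a member of the `Metrics` enum directly in your task's `metric` field. A minimal sketch, assuming `Metrics.loglikelihood_acc` suits a multi-choice task (check `lighteval/metrics/metrics.py` for the full list):

```python
from lighteval.metrics import Metrics

# Reuse a built-in metric instead of defining a custom one; loglikelihood_acc
# is only an illustration here, pick whichever member matches your evaluation.
metric = [Metrics.loglikelihood_acc]
```
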
Then you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split=None,
    few_shots_select=None,
    metric=[],  # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = []  # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a filled-in example follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
  field allows us to compare different task implementations and is used as a
  task selection to differentiate the versions to launch. At the moment, you'll
  find the keywords ["helm", "bigbench", "original", "lighteval", "community",
  "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
  above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
  (note: when the dataset has no subset, fill this field with `"default"`, not
  with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
  valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
  want to select samples for your few-shot examples. It should be different
  from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method you will use to
  select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced
    labels, to avoid skewing the few-shot examples (hence the model
    generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is kept! Only use this if you know what you are
    doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
  generative evaluation. If your evaluation is a log-likelihood evaluation
  (multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end-of-sentence tokens
  for your generation
- `metric` (list), the metrics you want to use for your evaluation (see the
  next section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`.)
- `frozen` (bool), for now set to False, but we will steadily pass all
  stable tasks to True.
- `trust_dataset` (bool), set to True if you trust the dataset.

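
Putting these parameters together, a filled-in configuration for a hypothetical multiple-choice dataset might look like the sketch below. The dataset path `my-org/my-eval-dataset`, the split names, the chosen metric, and the `LightevalTaskConfig` import location are assumptions for illustration only:

```python
from lighteval.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig  # assumed import path

# A hypothetical multiple-choice task; adapt the dataset path, splits and metric
# to your own evaluation.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,
    suite=["community"],
    hf_repo="my-org/my-eval-dataset",  # placeholder dataset on the hub
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling",
    metric=[Metrics.loglikelihood_acc],
    generation_size=-1,  # log-likelihood (multi-choice) evaluation
    stop_sequence=["\n"],
    trust_dataset=True,
)
```
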
Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subset:
TASKS_TABLE = SUBSET_TASKS

# tasks without subset:
# TASKS_TABLE = [task]
```

Finally, you need to add the module logic that converts your tasks to a dict for lighteval.

```python
# MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```

Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```

@@ -0,0 +1,93 @@

# Adding a New Metric

First, check whether you can use one of the parametrized functions in
`src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.

If not, you can use the `custom_task` system to register your new metric:

> [!TIP]
> To see an example of a custom metric added along with a custom task, look at the IFEval custom task in the repository.

> [!WARNING]
> To contribute your custom metric to the lighteval repo, you first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

- Create a new Python file which should contain the full logic of your metric.
- The file also needs to start with these imports:

```python
from aenum import extend_enum

from lighteval.metrics import Metrics
```

You need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here the sample-level metric only returns one value. If you want to return several metrics per sample, return a dictionary with the metric names as keys and their values as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```

Then you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

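
To make the aggregation concrete, here is a small worked example of what `agg_function` returns; the nested shape of `items` is purely illustrative:

```python
# Per-sample scores as they might be collected across samples (shape is illustrative)
items = [[1, 0], [1, 1]]

print(agg_function(items))  # (1 + 0 + 1 + 1) / 4 = 0.75
```
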
Finally, you can define your metric. If it's a sample-level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```

If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```

To finish, add the following so that your metric is added to our metrics list
when your file is loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "metric_name", metric_function)
if __name__ == "__main__":
    print("Imported metric")
```

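
With the names used earlier on this page, the registration would look roughly like the sketch below; the string `"my_custom_metric"` is the name under which the metric becomes selectable.

```python
# Register the sample-level metric object defined above under a new enum member
extend_enum(Metrics, "my_custom_metric", my_custom_metric)
```
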
You can then give your custom metric to lighteval by using `--custom_tasks
path_to_your_file` when launching it.