Commit

Set up docs (#403)
* Add docs

* Add wiki to docs

* Adapt wiki as docs

* Force docs build

* Fix link in _toctree

* Add titles to docs pages

* Update docs/source/evaluate-the-model-on-a-server-or-container.mdx

Co-authored-by: Alvaro Bartolome <[email protected]>

---------

Co-authored-by: Clémentine Fourrier <[email protected]>
Co-authored-by: Alvaro Bartolome <[email protected]>
3 people authored Nov 28, 2024
1 parent f109de1 commit 0c80801
Showing 16 changed files with 2,407 additions and 0 deletions.
18 changes: 18 additions & 0 deletions .github/workflows/doc-build.yml
@@ -0,0 +1,18 @@
name: Build Documentation

on:
  push:
    branches:
      - main
      - doc-builder*
      - v*-release

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: lighteval
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
16 changes: 16 additions & 0 deletions .github/workflows/doc-pr-build.yml
@@ -0,0 +1,16 @@
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: lighteval
30 changes: 30 additions & 0 deletions docs/source/_toctree.yml
@@ -0,0 +1,30 @@
- sections:
  - local: index
    title: 🤗 Lighteval
  - local: installation
    title: Installation
  - local: quicktour
    title: Quicktour
  title: Getting started
- sections:
  - local: saving-and-reading-results
    title: Save and read results
  - local: using-the-python-api
    title: Use the Python API
  - local: adding-a-custom-task
    title: Add a custom task
  - local: adding-a-new-metric
    title: Add a custom metric
  - local: use-vllm-as-backend
    title: Use VLLM as backend
  - local: evaluate-the-model-on-a-server-or-container
    title: Evaluate on Server
  - local: contributing-to-multilingual-evaluations
    title: Contributing to multilingual evaluations
  title: Guides
- sections:
  - local: metric-list
    title: Available Metrics
  - local: available-tasks
    title: Available Tasks
  title: API
196 changes: 196 additions & 0 deletions docs/source/adding-a-custom-task.mdx
@@ -0,0 +1,196 @@
# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated in the core evaluations of lighteval, in the extended tasks, or in
the community tasks, and add its dataset to the Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we add to our test suite to ensure non-regression
  over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They also see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can move to become an extended or core evaluation over time.

> [!TIP]
> You can find examples of custom tasks in the <a href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a> directory.

## Step-by-step creation of a custom task

> [!WARNING]
> To contribute your custom task to the lighteval repo, you first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset to a document to be used for evaluation.

```python
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[f" {c}" for c in line["choices"]],
        gold_index=line["gold"],
        instruction="",
    )
```
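
As a quick sanity check, you can call the prompt function on a hand-written row. The row below is purely illustrative; the field names (`question`, `choices`, `gold`) must match the columns of your own dataset:

```python
# Hypothetical dataset row, for illustration only; use your dataset's real column names
line = {"question": "What is the capital of France?", "choices": ["Paris", "London", "Rome"], "gold": 0}

doc = prompt_fn(line, task_name="community|myothertask")
print(doc.query)       # "What is the capital of France?"
print(doc.choices)     # [" Paris", " London", " Rome"]
print(doc.gold_index)  # 0
```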

Then, you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](adding-a-new-metric).

```python
custom_metric = SampleLevelMetric(
    metric_name="my_custom_metric_name",
    higher_is_better=True,
    category=MetricCategory.IGNORED,
    use_case=MetricUseCase.NONE,
    sample_level_fn=lambda x: x,  # how to compute the score for one sample
    corpus_level_fn=np.mean,  # how to aggregate the sample-level scores
)
```

Then, you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
    suite=["community"],
    hf_repo="",
    hf_subset="default",
    hf_avail_splits=[],
    evaluation_splits=[],
    few_shots_split=None,
    few_shots_select=None,
    metric=[],  # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = []  # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a fully filled-in example
follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
  field allows us to compare different task implementations and is used as a
  task selector to differentiate the versions to launch. At the moment, you'll
  find the keywords ["helm", "bigbench", "original", "lighteval", "community",
  "custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
  above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
  (note: when the dataset has no subset, fill this field with `"default"`, not
  with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
  valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
  want to select samples for your few-shot examples. It should be different
  from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
  select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced
    labels, to avoid skewing the few-shot examples (hence the model
    generations) toward one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is removed from the available samples
  - `random_sampling_from_train` selects new examples at random from the
    `few_shots_split` for every new item, but if a sampled item is equal to
    the current one, it is kept! Only use this if you know what you are
    doing.
  - `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
  generative evaluation. If your evaluation is a log-likelihood evaluation
  (multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end-of-sentence tokens
  for your generation
- `metric` (list), the metrics you want to use for your evaluation (see the
  next section for a detailed explanation)
- `output_regex` (str), a regex string that will be used to filter your
  generation. (Generative metrics will only select tokens that are between the
  first and the second sequence matched by the regex. For example, for a regex
  matching `\n` and a generation `\nModel generation output\nSome other text`,
  the metric will only be fed with `Model generation output`)
- `frozen` (bool), currently set to `False`, but we will progressively set all
  stable tasks to `True`.
- `trust_dataset` (bool), set to `True` if you trust the dataset.
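
For reference, here is what a fully filled-in configuration might look like. The repository id and split names below are placeholders, not a real dataset:

```python
# Hypothetical, fully filled-in configuration; replace the repo and splits
# with the ones from your own dataset on the Hub.
task = LightevalTaskConfig(
    name="mytask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="your-username/your-dataset",  # placeholder repository id
    hf_subset="default",
    hf_avail_splits=["train", "validation", "test"],
    evaluation_splits=["test"],
    few_shots_split="validation",
    few_shots_select="balanced",
    generation_size=-1,  # -1 because this example is a multi-choice (log-likelihood) evaluation
    stop_sequence=["\n"],
    metric=[custom_metric],
)
```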


Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subset:
TASKS_TABLE = SUBSET_TASKS

# tasks without subset:
# TASKS_TABLE = [task]
```

Finally, you need to add some module logic to convert your task to a dict for lighteval.

```python
# MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
if __name__ == "__main__":
    print([t.name for t in TASKS_TABLE])
    print(len(TASKS_TABLE))
```

Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
    --custom_tasks {path_to_your_custom_task_file} \
    --output_dir "./evals"
```
93 changes: 93 additions & 0 deletions docs/source/adding-a-new-metric.mdx
@@ -0,0 +1,93 @@
# Adding a New Metric

First, check if you can use one of the parametrized functions in
`src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.

If not, you can use the `custom_task` system to register your new metric:

> [!TIP]
> To see an example of a custom metric added along with a custom task, look at <a href="">the IFEval custom task</a>.

> [!WARNING]
> To contribute your custom metric to the lighteval repo, you would first need
> to install the required dev dependencies by running `pip install -e .[dev]`
> and then run `pre-commit install` to install the pre-commit hooks.

- Create a new Python file, which should contain the full logic of your metric.
- The file also needs to start with these imports:

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```

You need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here, the sample-level metric returns only one value; if you want to return multiple metrics per sample, you need to return a dictionary with the metric names as keys and their values as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```
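
As a quick, hypothetical sanity check of the metric above (the `Doc` import path and the field values are assumptions for illustration only):

```python
# Assumption: Doc lives in lighteval.tasks.requests; adjust the import if needed
from lighteval.tasks.requests import Doc

doc = Doc(task_name="mytask", query="2 + 2 = ?", choices=[" 4", " 5"], gold_index=0)
print(custom_metric(predictions=[" 4"], formatted_doc=doc))
# {"accuracy": True, "other_metric": 0.5}
```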

Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```

Finally, you can define your metric. If it's a sample-level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```
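
For example, a concrete instantiation could look like the following sketch; the metric name, category, and use case below are assumptions for illustration, so pick the ones that actually match your task:

```python
# Hypothetical instantiation; the category and use case are illustrative only
my_custom_metric = SampleLevelMetric(
    metric_name="my_accuracy",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=custom_metric,
    corpus_level_fn=np.mean,
)
```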

If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```

To finish, add the following so that your metric is added to our metrics list
when loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "metric_name", metric_function)
if __name__ == "__main__":
    print("Imported metric")
```

You can then give your custom metric to lighteval by using `--custom-tasks
path_to_your_file` when launching it.
