Add `--examples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] #2520

felipemaiapolo · 2024-11-26T19:47:09Z

This PR adds an `--examples` argument to the evaluation pipeline in lm-evaluation-harness, so that users can evaluate specific examples across multiple tasks. It extends the `--limit` argument by letting users control which examples are included in the evaluation, not just how many. Examples are specified in a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:

```json
{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}
```
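A file in this shape can be sanity-checked before starting a long run. The sketch below is not part of this PR: `validate_examples` and `load_examples` are hypothetical helper names, and the rule that indices must be non-negative integers is an assumption about what the harness expects.

```python
import json


def validate_examples(mapping):
    """Check that `mapping` looks like {task name: [non-negative int, ...]}.

    Hypothetical helper; the non-negativity rule is an assumption.
    """
    if not isinstance(mapping, dict):
        raise ValueError("top-level JSON value must be an object")
    for task, indices in mapping.items():
        if not isinstance(indices, list):
            raise ValueError(f"{task}: expected a list of indices")
        for i in indices:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(i, bool) or not isinstance(i, int) or i < 0:
                raise ValueError(f"{task}: invalid index {i!r}")
    return mapping


def load_examples(path):
    """Read the JSON file and validate its structure."""
    with open(path, encoding="utf-8") as f:
        return validate_examples(json.load(f))
```

Calling `load_examples("/path/to/examples.json")` would then either return the dictionary or fail early with a message naming the offending task.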

To use this feature, save the dictionary to a file (e.g., /path/to/examples.json) and run the following command:

```bash
lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen1.5-0.5B \
  --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
  --device cuda:0 \
  --log_samples \
  --output_path "/path/to/output" \
  --examples "/path/to/examples.json"
```

If no examples are specified for a task, all of its examples are evaluated.
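The fallback rule can be illustrated with a small sketch. This is not the code in this PR: `resolve_indices` is a hypothetical name, and the document counts used below are made up for the example.

```python
def resolve_indices(task_name, num_docs, examples=None):
    """Return the document indices to evaluate for one task.

    Hypothetical helper: tasks missing from `examples` (or a missing
    `examples` mapping altogether) fall back to every document.
    """
    if examples is None or task_name not in examples:
        return list(range(num_docs))
    return list(examples[task_name])


examples = {"mmlu_astronomy": [0, 3, 6]}

# Listed task: only the requested indices are kept.
print(resolve_indices("mmlu_astronomy", 10, examples))  # [0, 3, 6]

# Unlisted task: all documents are evaluated (5 is an illustrative size).
print(resolve_indices("mmlu_anatomy", 5, examples))  # [0, 1, 2, 3, 4]
```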

This feature has several applications. It lets practitioners evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation with PromptEval [1,2]: a few selected examples are evaluated for each prompt template, and the performance distribution is then estimated from those results. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
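As a rough illustration of the multi-prompt use case, the sketch below draws a few example indices per prompt template and prints a dictionary in the format expected by `--examples`. It makes several assumptions: each prompt template is registered as its own task, the task names and sizes are hypothetical, and uniform random sampling stands in for whatever selection strategy PromptEval actually uses.

```python
import json
import random


def sample_examples(task_sizes, k, seed=0):
    """Pick up to `k` distinct example indices per task, uniformly at random.

    `task_sizes` maps a task name to its number of documents.
    Uniform sampling is an assumption, not PromptEval's documented strategy.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible
    return {
        task: sorted(rng.sample(range(n), min(k, n)))
        for task, n in task_sizes.items()
    }


# Hypothetical task names: one task per prompt template.
task_sizes = {"my_task_template_a": 100, "my_task_template_b": 100}
examples = sample_examples(task_sizes, k=5)

# This JSON text is what would be saved and passed via --examples.
print(json.dumps(examples, indent=2))
```

Fixing the seed keeps the selected subset stable across runs, which makes results for different models comparable on the same examples.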

References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval

CLAassistant · 2024-11-26T19:47:16Z

All committers have signed the CLA.

Signed-off-by: Mírian Silva <[email protected]>

felipemaiapolo added 2 commits October 28, 2024 18:24

added option --examples

f06ea84

specifying examples in dictionary

8f6de73

felipemaiapolo requested review from baberabb and lintangsutawika as code owners November 26, 2024 19:47

mirianfsilva and others added 4 commits November 26, 2024 19:49

run pre-commit - fix arg type

4863977

Signed-off-by: Mírian Silva <[email protected]>

fixing bug when examples==None

28c322a

fixing bug when examples==None

724612d

limit or examples must be None in simple_evaluate.py and in evaluator.py

7613990
