Add --examples Argument for Fine-Grained Task Evaluation in lm-evaluation-harness. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2] #2520

Open · wants to merge 6 commits into main
Conversation

felipemaiapolo

This PR introduces a new --examples argument to the evaluation pipeline in lm-evaluation-harness, enabling users to evaluate specific examples across multiple tasks. This enhancement extends the functionality of the --limit argument by allowing users to control which examples are included in the evaluation. Users can specify task examples via a JSON file containing a dictionary where keys are task names and values are lists of example indices. For instance, a JSON file might look like this:

{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}
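
Such a file can also be written from Python rather than by hand. A minimal sketch, using only the task names, indices, and path from the example above:

import json

# Map each task name to the example indices that should be evaluated.
examples_by_task = {
    "mmlu_astronomy": [0, 3, 6],
    "mmlu_anatomy": [1, 4, 7, 10],
    "mmlu_econometrics": [2, 5, 8, 11, 14],
}

with open("/path/to/examples.json", "w") as f:
    json.dump(examples_by_task, f, indent=2)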

With the dictionary saved to a file (e.g., /path/to/examples.json), you can then run the following command:

lm_eval \
  --model hf \
  --model_args pretrained=Qwen/Qwen1.5-0.5B \
  --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
  --device cuda:0 \
  --log_samples \
  --output_path "/path/to/output" \
  --examples "/path/to/examples.json"

If no examples are specified for a task, all of its examples will be evaluated.
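
For intuition, the selection semantics can be sketched as follows. This is an illustrative sketch rather than the PR's actual internals; `task_docs` and `examples_by_task` are hypothetical names:

import json

def select_docs(task_name, task_docs, examples_path):
    """Return only the requested examples for a task; fall back to all docs."""
    with open(examples_path) as f:
        examples_by_task = json.load(f)
    indices = examples_by_task.get(task_name)
    if indices is None:
        return task_docs  # task not listed in the JSON: evaluate every example
    return [task_docs[i] for i in indices]  # keep only the requested indices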

This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
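
As an illustration of the PromptEval use case, one could sample a small random subset of indices per task (and, if desired, per prompt template) and then write the file as shown above. The dataset sizes, subset size, and seed below are placeholders; in practice use the real number of documents per task:

import json
import random

random.seed(0)  # arbitrary seed, for a reproducible subset

# Hypothetical per-task dataset sizes; replace with the actual counts.
task_sizes = {"mmlu_astronomy": 152, "mmlu_anatomy": 135, "mmlu_econometrics": 114}
n_per_task = 10  # evaluate only a handful of examples per task/prompt template

examples_by_task = {
    task: sorted(random.sample(range(size), n_per_task))
    for task, size in task_sizes.items()
}

with open("/path/to/examples.json", "w") as f:
    json.dump(examples_by_task, f, indent=2)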

References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval

@CLAassistant

CLAassistant commented Nov 26, 2024

CLA assistant check
All committers have signed the CLA.
