GitHub - huggingface/lighteval: Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends—whether your model is being served somewhere or already loaded in memory. Dive deep into your model's performance by saving and exploring detailed, sample-by-sample results to debug and see how your models stack-up.

Customization at your fingertips: letting you either browse all our existing tasks and metrics or effortlessly create your own custom task and custom metric, tailored to your needs.

Available Tasks

Lighteval supports 1000+ evaluation tasks across multiple domains and languages. Use this space to find what you need, or, here's an overview of some popular benchmarks:

📚 Knowledge

General Knowledge: MMLU, MMLU-Pro, MMMU, BIG-Bench
Question Answering: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)
Specialized: GPQA, AGIEval

🧮 Math and Code

Math Problems: GSM8K, GSM-Plus, MATH, MATH500
Competition Math: AIME24, AIME25
Multilingual Math: MGSM (Grade School Math in 10+ languages)
Coding Benchmarks: LCB (LiveCodeBench)

🎯 Chat Model Evaluation

Instruction Following: IFEval, IFEval-fr
Reasoning: MUSR, DROP (discrete reasoning)
Long Context: RULER
Dialogue: MT-Bench
Holistic Evaluation: HELM, BIG-Bench

🌍 Multilingual Evaluation

Cross-lingual: XTREME, Flores200 (200 languages), XCOPA, XQuAD
Language-specific:
- Arabic: ArabicMMLU
- Filipino: FilBench
- French: IFEval-fr, GPQA-fr, BAC-fr
- German: German RAG Eval
- Serbian: Serbian LLM Benchmark, OZ Eval
- Turkic: TUMLU (9 Turkic languages)
- Chinese: CMMLU, CEval, AGIEval
- Russian: RUMMLU, Russian SQuAD
- And many more...

🧠 Core Language Understanding

NLU: GLUE, SuperGLUE, TriviaQA, Natural Questions
Commonsense: HellaSwag, WinoGrande, ProtoQA
Natural Language Inference: XNLI
Reading Comprehension: SQuAD, XQuAD, MLQA, Belebele

⚡️ Installation

Note: lighteval is currently completely untested on Windows, and we don't support it yet. (Should be fully functional on Mac/Linux)

pip install lighteval

Lighteval allows for many extras when installing, see here for a complete list.

If you want to push results to the Hugging Face Hub, add your access token as an environment variable:

huggingface-cli login

🚀 Quickstart

Lighteval offers the following entry points for model evaluation:

lighteval accelerate: Evaluate models on CPU or one or more GPUs using 🤗 Accelerate
lighteval nanotron: Evaluate models in distributed settings using ⚡️ Nanotron
lighteval vllm: Evaluate models on one or more GPUs using 🚀 VLLM
lighteval sglang: Evaluate models using SGLang as backend
lighteval endpoint: Evaluate models using various endpoints as backend
- lighteval endpoint inference-endpoint: Evaluate models using Hugging Face's Inference Endpoints API
- lighteval endpoint tgi: Evaluate models using 🔗 Text Generation Inference running locally
- lighteval endpoint litellm: Evaluate models on any compatible API using LiteLLM
- lighteval endpoint inference-providers: Evaluate models using HuggingFace's inference providers as backend

Did not find what you need ? You can always make your custom model API by following this guide

lighteval custom: Evaluate custom models (can be anything)

Here's a quick command to evaluate using the Accelerate backend:

lighteval accelerate \
    "model_name=gpt2" \
    "leaderboard|truthfulqa:mc|0"

Or use the Python API to run a model already loaded in memory!

from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "lighteval|gsm8k|0"

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    max_samples=2
)

model = AutoModelForCausalLM.from_pretrained(
  MODEL_NAME, device_map="auto"
)
config = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)
model = TransformersModel.from_model(model, config)

pipeline = Pipeline(
    model=model,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    tasks=BENCHMARKS,
)

results = pipeline.evaluate()
pipeline.show_results()
results = pipeline.get_results()

🙏 Acknowledgements

Lighteval took inspiration from the following amazing frameworks: Eleuther's AI Harness and Stanford's HELM. We are grateful to their teams for their pioneering work on LLM evaluations.

We'd also like to offer our thanks to all the community members who have contributed to the library, adding new features and reporting or fixing bugs.

🌟 Contributions Welcome 💙💚💛💜🧡

Got ideas? Found a bug? Want to add a task or metric? Contributions are warmly welcomed!

If you're adding a new feature, please open an issue first.

If you open a PR, don't forget to run the styling!

pip install -e .[dev]
pre-commit install
pre-commit run --all-files

📜 Citation

@misc{lighteval,
  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.11.0},
  url = {https://github.com/huggingface/lighteval}
}

Name		Name	Last commit message	Last commit date
Latest commit History 507 Commits
.github		.github
assets		assets
docs/source		docs/source
examples		examples
src/lighteval		src/lighteval
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Available Tasks

📚 Knowledge

🧮 Math and Code

🎯 Chat Model Evaluation

🌍 Multilingual Evaluation

🧠 Core Language Understanding

⚡️ Installation

🚀 Quickstart

🙏 Acknowledgements

🌟 Contributions Welcome 💙💚💛💜🧡

📜 Citation

About

Uh oh!

Releases 11

Uh oh!

Contributors 107

Languages

License

huggingface/lighteval

Folders and files

Latest commit

History

Repository files navigation

Available Tasks

📚 Knowledge

🧮 Math and Code

🎯 Chat Model Evaluation

🌍 Multilingual Evaluation

🧠 Core Language Understanding

⚡️ Installation

🚀 Quickstart

🙏 Acknowledgements

🌟 Contributions Welcome 💙💚💛💜🧡

📜 Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 11

Uh oh!

Contributors 107

Languages