Tasks

A list of supported tasks and task groupings can be viewed with `lm-eval --tasks list`.

For more information, including a full list of task names and their precise meanings or sources, follow the links provided to the individual README.md files for each subfolder.

Task Family	Description	Language(s)
aclue	Tasks focusing on ancient Chinese language understanding and cultural aspects.	Ancient Chinese
aexams	Tasks in Arabic related to various academic exams covering a range of subjects.	Arabic
agieval	Human-centric benchmark built from standardized exams such as college entrance exams, law school admission tests, and math competitions.	English, Chinese
anli	Adversarial natural language inference tasks designed to test model robustness.	English
arabic_leaderboard_complete	A full version of the tasks in the Open Arabic LLM Leaderboard, focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated.	Arabic (Some MT)
arabic_leaderboard_light	A light version of the tasks in the Open Arabic LLM Leaderboard (i.e., 10% samples of the test set in the original benchmarks), focusing on the evaluation of models that reflect the characteristics of Arabic language understanding and comprehension, culture, and heritage. Note that some of these tasks are machine-translated.	Arabic (Some MT)
arabicmmlu	Localized Arabic version of MMLU with multiple-choice questions from 40 subjects.	Arabic
arc	AI2 Reasoning Challenge: grade-school science questions that require complex reasoning.	English
arithmetic	Tasks involving numerical computations and arithmetic reasoning.	English
asdiv	Tasks involving arithmetic and mathematical reasoning challenges.	English
babi	Question answering challenges based on simulated stories.	English
basque_bench	Collection of tasks in Basque encompassing various evaluation areas.	Basque
basqueglue	Tasks designed to evaluate language understanding in the Basque language.	Basque
bbh	BIG-Bench Hard, a suite of challenging BIG-bench tasks focused on multi-step reasoning.	English, German
belebele	Language understanding tasks in a variety of languages and scripts.	Multiple (122 languages)
benchmarks	General benchmarking tasks that test a wide range of language understanding capabilities.
bertaqa	Local Basque cultural trivia QA tests in English and Basque languages.	English, Basque, Basque (MT)
bigbench	Broad tasks from the BIG-bench benchmark designed to push the boundaries of large models.	Multiple
blimp	Tasks testing grammatical phenomena to evaluate a language model's linguistic capabilities.	English
catalan_bench	Collection of tasks in Catalan encompassing various evaluation areas.	Catalan
ceval	Tasks that evaluate language understanding and reasoning in an educational context.	Chinese
cmmlu	Multi-subject multiple choice question tasks for comprehensive academic assessment.	Chinese
code_x_glue	Tasks that involve understanding and generating code across multiple programming languages.	Go, Java, JS, PHP, Python, Ruby
commonsense_qa	CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge.	English
copal_id	Indonesian causal commonsense reasoning dataset that captures local nuances.	Indonesian
coqa	Conversational question answering tasks to test dialog understanding.	English
crows_pairs	Tasks designed to test model biases in various sociodemographic groups.	English, French
csatqa	Tasks based on questions from the Korean College Scholastic Ability Test (CSAT).	Korean
drop	Tasks requiring numerical reasoning, reading comprehension, and question answering.	English
eq_bench	Tasks that assess emotional intelligence by asking models to rate the intensity of characters' emotions in dialogues.	English
eus_exams	Tasks based on various professional and academic exams in the Basque language.	Basque
eus_proficiency	Tasks designed to test proficiency in the Basque language across various topics.	Basque
eus_reading	Reading comprehension tasks specifically designed for the Basque language.	Basque
eus_trivia	Trivia and knowledge testing tasks in the Basque language.	Basque
fda	Tasks for extracting key-value pairs from FDA documents to test information extraction.	English
fld	Formal Logic Deduction tasks requiring multi-step deductive reasoning.	English
french_bench	Set of tasks designed to assess language model performance in French.	French
galician_bench	Collection of tasks in Galician encompassing various evaluation areas.	Galician
glue	General Language Understanding Evaluation benchmark to test broad language abilities.	English
gpqa	Graduate-level, "Google-proof" multiple-choice questions in biology, physics, and chemistry written by domain experts.	English
gsm8k	A benchmark of grade school math problems aimed at evaluating reasoning capabilities.	English
haerae	Tasks assessing knowledge specific to the Korean language and culture, including history and general knowledge.	Korean
headqa	A high-level education-based question answering dataset to test specialized knowledge.	Spanish, English
hellaswag	Tasks to predict the ending of stories or scenarios, testing comprehension and creativity.	English
hendrycks_ethics	Tasks designed to evaluate the ethical reasoning capabilities of models.	English
hendrycks_math	Mathematical problem-solving tasks to test numerical reasoning and problem-solving.	English
ifeval	Instruction-following evaluation tasks built around verifiable instructions, such as length or formatting constraints.	English
inverse_scaling	Multiple-choice tasks from the Inverse Scaling Prize, designed to find settings where larger language models perform worse.	English
japanese_leaderboard	Japanese language understanding tasks to benchmark model performance on various linguistic aspects.	Japanese
kbl	Korean Benchmark for Legal Language Understanding.	Korean
kmmlu	Knowledge-based multi-subject multiple choice questions for academic evaluation.	Korean
kobest	A collection of tasks designed to evaluate understanding of the Korean language.	Korean
kormedmcqa	Medical question answering tasks in Korean to test specialized domain knowledge.	Korean
lambada	Tasks designed to predict the endings of text passages, testing language prediction skills.	English
lambada_cloze	Cloze-style LAMBADA dataset.	English
lambada_multilingual	Multilingual LAMBADA dataset. This is a legacy version of the multilingual dataset, and users should instead use `lambada_multilingual_stablelm`.	German, English, Spanish, French, Italian
lambada_multilingual_stablelm	Multilingual LAMBADA dataset. Users should prefer evaluating on this version of the multilingual dataset instead of on `lambada_multilingual`.	German, English, Spanish, French, Italian, Dutch, Portuguese
leaderboard	Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time.	English
lingoly	Challenging logical reasoning benchmark in low-resource languages with controls for memorization.	English, Multilingual
logiqa	Logical reasoning tasks requiring advanced inference and deduction.	English, Chinese
logiqa2	Large-scale logical reasoning dataset adapted from the Chinese Civil Service Examination.	English, Chinese
mathqa	Question answering tasks involving mathematical reasoning and problem-solving.	English
mc_taco	Question-answer pairs that require temporal commonsense comprehension.	English
med_concepts_qa	Benchmark for evaluating LLMs on their ability to interpret medical codes and distinguish between medical concepts.	English
metabench	Distilled versions of six popular benchmarks which are highly predictive of overall benchmark performance and of a single general ability latent trait.	English
medmcqa	Medical multiple choice questions assessing detailed medical knowledge.	English
medqa	Multiple choice question answering based on the United States Medical License Exams.
mgsm	Benchmark of multilingual grade-school math problems.	Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu
minerva_math	Mathematics-focused tasks requiring numerical reasoning and problem-solving skills.	English
mmlu	Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.	English
mmlu_pro	A refined set of MMLU, integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.	English
mmlusr	Variation of MMLU designed to be more rigorous.	English
model_written_evals	Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns.
mutual	A retrieval-based dataset for multi-turn dialogue reasoning.	English
nq_open	Open domain question answering tasks based on the Natural Questions dataset.	English
okapi/arc_multilingual	Machine-translated multilingual version of the ARC science question answering benchmark.	Multiple (31 languages) Machine Translated.
okapi/hellaswag_multilingual	Machine-translated multilingual version of the HellaSwag commonsense sentence-completion benchmark.	Multiple (30 languages) Machine Translated.
okapi/mmlu_multilingual	Machine-translated multilingual version of the MMLU multi-subject multiple-choice benchmark.	Multiple (34 languages) Machine Translated.
okapi/truthfulqa_multilingual	Machine-translated multilingual version of the TruthfulQA truthfulness benchmark.	Multiple (31 languages) Machine Translated.
openbookqa	Open-book question answering tasks that require external knowledge and reasoning.	English
paloma	Paloma is a comprehensive benchmark designed to evaluate open language models across a wide range of domains, ranging from niche artist communities to mental health forums on Reddit.	English
paws-x	Paraphrase Adversaries from Word Scrambling, focusing on cross-lingual capabilities.	English, French, Spanish, German, Chinese, Japanese, Korean
pile	Open source language modelling data set that consists of 22 smaller, high-quality datasets.	English
pile_10k	The first 10K elements of The Pile, useful for debugging models trained on it.	English
piqa	Physical Interaction Question Answering tasks to test physical commonsense reasoning.	English
polemo2	Sentiment analysis and emotion detection tasks based on Polish language data.	Polish
portuguese_bench	Collection of tasks in European Portuguese encompassing various evaluation areas.	Portuguese
prost	Physical Reasoning about Objects Through Space and Time, testing physical commonsense reasoning.	English
pubmedqa	Question answering tasks based on PubMed research articles for biomedical understanding.	English
qa4mre	Question Answering for Machine Reading Evaluation, assessing comprehension and reasoning.	English
qasper	Question Answering dataset based on academic papers, testing in-depth scientific knowledge.	English
race	Reading comprehension assessment tasks based on English exams in China.	English
realtoxicityprompts	Tasks to evaluate language models for generating text with potential toxicity.
sciq	Science Question Answering tasks to assess understanding of scientific concepts.	English
score	Systematic consistency and robustness evaluation for LLMs on 3 datasets (MMLU-Pro, AGIEval, and MATH).	English
scrolls	Tasks that involve long-form reading comprehension across various domains.	English
siqa	Social Interaction Question Answering to evaluate common sense and social reasoning.	English
spanish_bench	Collection of tasks in Spanish encompassing various evaluation areas.	Spanish
squad_completion	A variant of the SQuAD question answering task designed for zero-shot evaluation of small LMs.	English
squadv2	Stanford Question Answering Dataset version 2, a reading comprehension benchmark.	English
storycloze	Tasks to predict story endings, focusing on narrative logic and coherence.	English
super_glue	A suite of challenging tasks designed to test a range of language understanding skills.	English
swag	Situations With Adversarial Generations, predicting the most plausible continuation of a described situation.	English
swde	Information extraction tasks from semi-structured web pages.	English
tinyBenchmarks	Evaluation of large language models with fewer examples using tiny versions of popular benchmarks.	English
tmmluplus	An extended set of tasks under the TMMLU framework for broader academic assessments.	Traditional Chinese
toxigen	Tasks designed to evaluate language models on their propensity to generate toxic content.	English
translation	Tasks focused on evaluating the language translation capabilities of models.	Arabic, English, Spanish, Basque, Hindi, Indonesian, Burmese, Russian, Swahili, Telugu, Chinese
triviaqa	A large-scale dataset for trivia question answering to test general knowledge.	English
truthfulqa	A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.	English
turkishmmlu	A multiple-choice QA test modeled after MMLU, written in Turkish based on Turkish high-school level exams.	Turkish
unitxt	A number of tasks implemented using the unitxt library for flexible, shareable, and reusable data preparation and evaluation for generative AI.	English
unscramble	Tasks involving the unscrambling of words whose characters have been rearranged, testing character-level manipulation.	English
webqs	Question answering tasks based on WebQuestions, whose questions are answerable from the Freebase knowledge base.	English
wikitext	Tasks based on text from Wikipedia articles to assess language modeling and generation.	English
winogrande	A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge.	English
wmdp	A benchmark with the objective of minimizing performance, based on potentially-sensitive multiple-choice knowledge questions.	English
wmt2016	Tasks from the WMT 2016 shared task, focusing on translation between multiple languages.	English, Czech, German, Finnish, Russian, Romanian, Turkish
wsc273	The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution.	English
xcopa	Cross-lingual Choice of Plausible Alternatives, testing reasoning in multiple languages.	Estonian, Haitian, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese
xnli	Cross-Lingual Natural Language Inference to test understanding across different languages.	Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese
xnli_eu	Cross-lingual Natural Language Inference tasks in Basque.	Basque
xquad	Cross-lingual Question Answering Dataset in multiple languages.	Arabic, German, Greek, English, Spanish, Hindi, Romanian, Russian, Thai, Turkish, Vietnamese, Chinese
xstorycloze	Cross-lingual narrative understanding tasks to predict story endings in multiple languages.	Russian, Simplified Chinese, Spanish, Arabic, Hindi, Indonesian, Telugu, Swahili, Basque, Burmese
xwinograd	Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.	English, French, Japanese, Portuguese, Russian, Chinese
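
To pick task families by language, the table above can be filtered programmatically. The following is a minimal sketch, assuming the table is available as tab-separated text in the layout shown here; `SAMPLE_TABLE` and `families_for_language` are hypothetical names used for illustration and are not part of `lm-eval`.

```python
# A few rows copied from the table above (tab-separated, as in this document).
SAMPLE_TABLE = "\n".join([
    "Task Family\tDescription\tLanguage(s)",
    "basque_bench\tCollection of tasks in Basque encompassing various evaluation areas.\tBasque",
    "benchmarks\tGeneral benchmarking tasks that test a wide range of language understanding capabilities.",
    "bertaqa\tLocal Basque cultural trivia QA tests in English and Basque languages.\tEnglish, Basque, Basque (MT)",
    "kbl\tKorean Benchmark for Legal Language Understanding.\tKorean",
])


def families_for_language(table_text, language):
    """Return task families whose Language(s) cell lists `language` exactly.

    Hypothetical helper for illustration; not an lm-eval API.
    """
    matches = []
    # Skip the header row, then inspect each table row.
    for line in table_text.splitlines()[1:]:
        cells = line.split("\t")
        if len(cells) < 3:
            continue  # some rows have no language cell
        # Split "English, Basque, Basque (MT)" into individual entries.
        languages = [entry.strip() for entry in cells[2].split(",")]
        if language in languages:
            matches.append(cells[0])
    return matches


print(families_for_language(SAMPLE_TABLE, "Basque"))  # ['basque_bench', 'bertaqa']
```

The match is exact, so cells such as "Multiple (122 languages)" or "Basque (MT)" are not matched by a plain language name; a looser match would need to be chosen deliberately.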
