HPLT-E is a framework for comprehensive multilingual and multi-prompt k-shot evaluation across 124 tasks in eight typologically diverse languages: Catalan, Spanish, Basque, French, Norwegian, Ukrainian, Czech, and Finnish.
08.10.2025: We are releasing HPLT v3 together with HPLT-E, used to benchmark HPLT v3 against HPLT v2 in our multilingual ablation studies.
HPLT-E combines existing monolingual benchmarks for Catalan (CatalanBench), Spanish (SpanishBench), Basque (BasqueBench), French (FrenchBench), Norwegian (NorEval), Finnish (FinBench v2), and Czech (BenCzechMark). In addition, we create a multi-task benchmark for Ukrainian (UkrainianBench) and extend single-prompt benchmarks to the multi-prompt scenario (Catalan, Spanish, Basque, French, and Ukrainian). HPLT-E covers a diverse set of 124 natural language understanding and generation tasks, each supporting 3-7 human-written prompts. Our main evaluation principles include:
- Diversity: broader representation of lesser-resourced languages in the context of pretraining corpus comparison.
- Data quality: use of human-created datasets to ensure reliable evaluation.
- Robust evaluation: evaluation across 450+ prompts written by native speakers to account for prompt sensitivity.
- Reproducibility: full integration of HPLT-E into LM Evaluation Harness for user-friendly standardized evaluation.
- Benchmark: CatalanBench
- Paper: aclanthology.org/2025.coling-main.699
- Homepage: N/A
- Language code: cat_Latn
- Original LM Evaluation Harness implementation: github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench
- HPLT-E multi-prompt implementation: cat_Latn
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
ARC-ca | arc_ca_challenge_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
ARC-ca | arc_ca_easy_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
Belebele | catbelebele_p[0-2] | Multiple-choice QA | Reading comprehension
CatalanQA | catalanqa_p[0-2] | Generative QA | Language-specific & world knowledge
CatCoLA | catcola_p[0-2] | Text classification | Language knowledge
COPA-ca | copa_ca_p[0-2] | Text classification | Commonsense reasoning
CoQCat | coqcat_p[0-2] | Generative QA | Reading comprehension
MGSM-cat | mgsm_direct_ca_p[0-2] | Generative QA | Mathematical reasoning
OpenBookQA-cat | openbookqa_ca_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
Parafraseja | parafraseja_p[0-2] | Text classification | Paraphrasing
PAWS-ca | paws_ca_p[0-2] | Text classification | Paraphrasing
PIQA-ca | piqa_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning
SIQA-ca | siqa_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning
TE-ca | teca_p[0-2] | Text classification | Entailment
VeritasQA-cat Generation | veritasqa_ca_gen_p[0-2] | Generative QA | Truthfulness
VeritasQA-cat Multiple-choice | veritasqa_ca_mc1_p[0-2] | Multiple-choice QA | Truthfulness
VeritasQA-cat Multiple-choice | veritasqa_ca_mc2_p[0-2] | Multiple-choice QA | Truthfulness
WNLI | wnli_ca_p[0-2] | Text classification | Entailment
XNLI | xnli_ca_p[0-2] | Text classification | Entailment
XQuAD | xquad_ca_p[0-2] | Generative QA | Reading comprehension
xStoryCloze | xstorycloze_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning
Cocoteros | cocoteros_va_p[0-2] | Text generation | Commonsense reasoning
FLORES | flores_en-ca_p[0-2] | Sequence-to-sequence generation | Machine translation
- Benchmark: SpanishBench
- Paper: aclanthology.org/2025.coling-main.699
- Homepage: N/A
- Language code: spa_Latn
- Original LM Evaluation Harness implementation: github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench
- HPLT-E multi-prompt implementation: spa_Latn
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
Belebele | spabelebele_p[0-2] | Multiple-choice QA | Reading comprehension
COPA | copa_es_p[0-2] | Text classification | Commonsense reasoning
ESCoLA | escola_p[0-2] | Text classification | Language knowledge
MGSM-es | mgsm_direct_es_p[0-2] | Generative QA | Mathematical reasoning
OpenBookQA-es | openbookqa_es_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
PAWS-es | paws_es_p[0-2] | Text classification | Paraphrasing
VeritasQA-es Generation | veritasqa_es_gen_p[0-2] | Generative QA | Truthfulness
VeritasQA-es Multiple-choice | veritasqa_es_mc1_p[0-2] | Multiple-choice QA | Truthfulness
VeritasQA-es Multiple-choice | veritasqa_es_mc2_p[0-2] | Multiple-choice QA | Truthfulness
XNLI | xnli_es_p[0-2] | Text classification | Entailment
XQuAD | xquad_es_p[0-2] | Generative QA | Reading comprehension
xStoryCloze | xstorycloze_es_p[0-2] | Multiple-choice QA | Commonsense reasoning
Cocoteros | cocoteros_es_p[0-2] | Text generation | Commonsense reasoning
FLORES | flores_en-es_p[0-2] | Sequence-to-sequence generation | Machine translation
INCLUDE | include_spanish_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
Global-MMLU | global_mmlu_spanish_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
- Benchmark: BasqueBench
- Paper: aclanthology.org/2025.coling-main.699
- Homepage: N/A
- Language code: eus_Latn
- Original LM Evaluation Harness implementation: github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench
- HPLT-E multi-prompt implementation: eus_Latn
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
Belebele | eusbelebele_p[0-2] | Multiple-choice QA | Reading comprehension
EusExams | eus_exams_eu_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
EusProficiency | eus_proficiency_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
EusReading | eus_reading_p[0-2] | Multiple-choice QA | Reading comprehension
EusTrivia | eus_trivia_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
MGSM-eu | mgsm_direct_eu_p[0-2] | Generative QA | Mathematical reasoning
PIQA-eu | piqa_eu_p[0-2] | Multiple-choice QA | Commonsense reasoning
WNLI | wnli_eu_p[0-2] | Text classification | Entailment
XCOPA | xcopa_eu_p[0-2] | Text classification | Commonsense reasoning
XNLI | xnli_eu_native_p[0-2] | Text classification | Entailment
xStoryCloze | xstorycloze_eu_p[0-2] | Multiple-choice QA | Commonsense reasoning
PAWS-eu | paws_eu_p[0-2] | Text classification | Paraphrasing
ARC-eu | arc_eu_easy_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
ARC-eu | arc_eu_challenge_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
FLORES | flores_en-eu_p[0-2] | Sequence-to-sequence generation | Machine translation
INCLUDE | include_basque_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
- Benchmark: NorEval
- Paper: aclanthology.org/2025.findings-acl.181
- Homepage: github.com/ltgoslo/noreval
- Language code: nor_Latn
- Original LM Evaluation Harness implementation: github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/noreval
- HPLT-E multi-prompt implementation: N/A
Tasks
Name | LM Evaluation Harness (Bokmål) | LM Evaluation Harness (Nynorsk) | Task type | Task category
---|---|---|---|---
NoReC Sentence | norec_sentence_p[0-4] | ❌ | Text classification | Sentiment analysis
NoReC Document | norec_document_p[0-4] | ❌ | Text classification | Sentiment analysis
NorIdiom | noridiom_nob_p[0-4] | noridiom_nno_p[0-4] | Sentence completion | Language knowledge
Belebele | norbelebele_p[0-4] | ❌ | Multiple-choice QA | Reading comprehension
NRK-Quiz-QA | nrk_quiz_qa_nob_p[0-4] | nrk_quiz_qa_nno_p[0-4] | Multiple-choice QA | Language-specific & world knowledge
NorOpenBookQA | noropenbookqa_nob_p[0-4] | noropenbookqa_nno_p[0-4] | Multiple-choice QA | Language-specific & world knowledge
NorCommonsenseQA | norcommonsenseqa_nob_p[0-4] | norcommonsenseqa_nno_p[0-4] | Multiple-choice QA | Commonsense reasoning
NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob_p[0-4] | nortruthfulqa_mc_nno_p[0-4] | Multiple-choice QA | Truthfulness
NorQuAD | norquad_p[0-4] | ❌ | Generative QA | Reading comprehension
NorTruthfulQA Generation | nortruthfulqa_gen_nob_p[0-4] | nortruthfulqa_gen_nno_p[0-4] | Generative QA | Truthfulness
Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob_p[0-4] | tatoeba_eng_nno_p[0-4] | Sequence-to-sequence generation | Machine translation
- Benchmark: UkrainianBench
- Paper: N/A
- Homepage: github.com/hplt-project/hplt-e
- Language code: ukr_Cyrl
- Original LM Evaluation Harness implementation: N/A
- HPLT-E multi-prompt implementation: ukr_Cyrl
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
Global-MMLU | global_mmlu_ukrainian_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
ZNO | zno_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
INCLUDE | include_ukrainian_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
TextDetox | textdetox_ukr_p[0-2] | Text classification | Toxicity detection
UA-SQuAD | ua_squad_p[0-2] | Generative QA | Reading comprehension
Belebele | ukrbelebele_p[0-2] | Multiple-choice QA | Reading comprehension
WMT24PP | wmt24pp_en-uk_p[0-2] | Sequence-to-sequence generation | Machine translation
- Benchmark: BenCzechMark
- Paper: arxiv.org/abs/2412.17933
- Homepage: github.com/DCGM/lm-evaluation-harness
- Language code: ces_Latn
- Original LM Evaluation Harness implementation: github.com/DCGM/lm-evaluation-harness/tree/main/lm_eval/tasks/benczechmark
- HPLT-E multi-prompt implementation: ces_Latn
NB: we update BenCzechMark to support the latest LM Evaluation Harness versions and create prompts for Global-MMLU.
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
Belebele | cesbelebele_p[0-4] | Multiple-choice QA | Reading comprehension
Global-MMLU | global_mmlu_czech_p[0-4] | Multiple-choice QA | Language-specific & world knowledge
SQAD3.2 | cs_sqad32_p[0-4] | Generative QA | Reading comprehension
Umimeto | umimeto_p[0-4] | Multiple-choice QA | Language-specific & world knowledge
CERMAT OPEN | cermat_czech_open_p[0-4] | Generative QA | Language knowledge
CERMAT TF | cermat_czech_tf_p[0-4] | Multiple-choice QA | Language knowledge
CERMAT MC | cermat_czech_mc_p[0-4] | Multiple-choice QA | Language knowledge
Klokan QA | klokan_qa_p[0-4] | Multiple-choice QA | Mathematical reasoning
CERMAT (Math) MC | cermat_czmath_mc_p[0-4] | Multiple-choice QA | Mathematical reasoning
CERMAT (Math) OPEN | cermat_czmath_open_p[0-4] | Generative QA | Mathematical reasoning
CTKFacts | ctkfacts_nli_p[0-4] | Text classification | Entailment
Subjectivity | ces_subjectivity_p[0-4] | Text classification | Sentiment analysis
CzechSentiment - Mall | sentiment_mall_p[0-4] | Text classification | Sentiment analysis
CzechSentiment - CSFD | sentiment_csfd_p[0-4] | Text classification | Sentiment analysis
CzechSentiment - FB | sentiment_fb_p[0-4] | Text classification | Sentiment analysis
- Benchmark: FrenchBench
- Paper: arxiv.org/abs/2402.00786
- Homepage: huggingface.co/croissantllm
- Language code: fra_Latn
- Original LM Evaluation Harness implementation: github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/french_bench
- HPLT-E multi-prompt implementation: fra_Latn
Tasks
Name | LM Evaluation Harness | Task type | Task category
---|---|---|---
FQuaD | fquad_p[0-2] | Generative QA | Reading comprehension
French Language Test: Grammar | french_bench_grammar_p[0-2] | Multiple-choice QA | Language knowledge
French Language Test: Vocabulary | french_bench_vocabulary_p[0-2] | Multiple-choice QA | Language knowledge
French Language Test: Reading | french_bench_reading_p[0-2] | Multiple-choice QA | Reading comprehension
Belebele | frabelebele_p[0-2] | Multiple-choice QA | Reading comprehension
French NLI | topic_based_nli_p[0-2] | Text classification | Entailment
XNLI | french_xnli_p[0-2] | Text classification | Entailment
INCLUDE | include_french_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
Global-MMLU | global_mmlu_french_p[0-2] | Multiple-choice QA | Language-specific & world knowledge
- Benchmark: FinBench v2
- Paper: TBA
- Homepage: N/A
- Language code: fin_Latn
- Original LM Evaluation Harness implementation: github.com/LumiOpen/lm-evaluation-harness/tree/finbench_v2/lm_eval/tasks/finbench_v2
- HPLT-E multi-prompt implementation: N/A
Tasks
| Name | Formulation | LM Evaluation Harness | Task type | Task category | FinBench v2 dataset version |
|---|---|---|---|---|---|
| ARC-challenge-fi | mcf | arc_challenge_fi_mcf_fbv2_p[0-4] | Multiple-choice QA | Language-specific & world knowledge | finbenchv2-arc-c-fi-ht |
| | cf | arc_challenge_fi_cf_fbv2_p[0-4] | | | |
| Belebele | mcf | belebele_fin_Latn_mcf_fbv2_p[0-4] | Multiple-choice QA | Reading comprehension | finbenchv2-belebele-fi-og |
| | cf | belebele_fin_Latn_cf_fbv2_p[0-4] | | | |
| GoldenSwag | mcf | goldenswag_ht_fi_mcf_fbv2_p[0-4] | Sentence completion | Commonsense reasoning | finbenchv2-goldenswag-fi-ht |
| | cf | goldenswag_ht_fi_cf_fbv2_p[0-4] | | | |
| FIN-Bench | mcf | finbench_analogies_mcf_fbv2_p[0-4] | Multiple-choice QA | Relational reasoning | FIN-bench |
| | cf | finbench_analogies_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_emotions_mcf_fbv2_p[0-4] | Multiple-choice QA | Sentiment analysis | FIN-bench |
| | cf | finbench_emotions_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_empirical_judgments_mcf_fbv2_p[0-4] | Multiple-choice QA | Causal reasoning | FIN-bench |
| | cf | finbench_empirical_judgments_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_general_knowledge_mcf_fbv2_p[0-4] | Multiple-choice QA | Language-specific & world knowledge | FIN-bench |
| | cf | finbench_general_knowledge_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_hhh_alignment_mcf_fbv2_p[0-4] | Multiple-choice QA | Alignment and safety | FIN-bench |
| | cf | finbench_hhh_alignment_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_paraphrase_mcf_fbv2_p[0-4] | Multiple-choice QA | Paraphrasing | FIN-bench |
| | cf | finbench_paraphrase_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_similarities_abstraction_mcf_fbv2_p[0-4] | Multiple-choice QA | Commonsense reasoning | FIN-bench |
| | cf | finbench_similarities_abstraction_cf_fbv2_p[0-4] | | | |
In our HPLT v3 release, we conduct a series of multilingual ablation studies to compare HPLT v2 and HPLT v3 on the eight HPLT-E languages. We pretrain monolingual Llama-style 2.15B decoder-only models using 30B tokens from each language corpus, keeping the hyperparameters and tokenizer fixed across experiments:
- Hidden size: 2048
- Attention heads: 32
- Layers: 24
- Sequence length: 2048
- Tokenizer: Gemma-3 (SentencePiece, vocabulary size 262K)
Pretraining is performed with the Megatron-LM framework on the LUMI HPC supercomputer, using 16 AMD MI250x nodes and totaling approx. 1k GPU hours. We evaluate the ablation models at regular checkpoint intervals (every 1B tokens) in a 0-shot setup, aggregating results across all prompts and selecting tasks that provide reliable signal during pretraining.
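For reference, the configuration below restates the listed hyperparameters as a Hugging Face `LlamaConfig`. This is an illustrative sketch only: the ablation models are trained with Megatron-LM, and any value not stated above (e.g., the FFN intermediate size and the number of KV heads) is an assumption here, not a description of the actual setup.

```python
# Illustrative sketch of the ablation model architecture (not the actual Megatron-LM config).
from transformers import LlamaConfig

ablation_config = LlamaConfig(
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=32,
    max_position_embeddings=2048,  # sequence length used during pretraining
    vocab_size=262_144,            # Gemma-3 SentencePiece tokenizer (~262K entries)
    intermediate_size=5632,        # assumption: the FFN size is not stated in the post
    num_key_value_heads=32,        # assumption: standard multi-head attention
)
print(ablation_config)
```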
We use the standard task-specific metrics and report the maximum score across the prompts as the main performance aggregation method. We adapt the FineWeb2 evaluation design to examine the signal HPLT-E tasks provide based on the criteria and statistics summarized below.
Criterion | Scope | Description | Requirement |
---|---|---|---|
Prompt-level median absolute deviation (MAD) | Mid-late pretraining window (15B-30B) | Typical prompt sensitivity | ≤ 5 |
Consistency (Kendall τ) | Mid-late pretraining window (15B-30B) | Stability of corpus rankings across pretraining intervals | No strict threshold |
Trajectory-level coefficient of variation (CV) | Mid-late pretraining window (15B-30B) | Relative variation around upper-bound performance | ≤ 10-12% |
Prompt-switch rate (%) | Mid-late pretraining window (15B-30B) | Measures best prompt consistency across checkpoints (prompt lottery) | No strict threshold |
Spread | Final checkpoint (30B) | The absolute difference between the maximum and minimum scores across prompts | No strict threshold |
Signal-to-noise ratio (SNR) | Final checkpoint (30B) | Noise from prompt variability | ≥ 3 |
Non-randomness | Final checkpoint (30B) | The absolute difference between the final score and random baseline | Must be positive and satisfactory |
Monotonicity (Spearman correlation) | All checkpoints (1B-30B) | Correlation between step and performance score | ≥ 0.5 |
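To make the criteria concrete, here is a minimal sketch of how these statistics can be computed from a per-task score matrix. It assumes `scores` is a checkpoints × prompts array (one checkpoint per 1B tokens) for a single task and corpus; the actual HPLT-E analysis may differ in details such as windowing and tie handling, and the Kendall τ consistency check is omitted because it compares corpus rankings across several corpora.

```python
# Sketch of the per-task signal diagnostics described above (names and details are ours).
import numpy as np
from scipy.stats import spearmanr

def task_diagnostics(scores: np.ndarray, random_baseline: float,
                     mid_late_start: int = 15) -> dict:
    """scores: shape (n_checkpoints, n_prompts), one checkpoint per 1B tokens (1B-30B)."""
    best_per_ckpt = scores.max(axis=1)        # max-over-prompts aggregation
    window = scores[mid_late_start - 1:]      # mid-late pretraining window (15B-30B)

    # Prompt-level median absolute deviation: typical prompt sensitivity in the window.
    mad = float(np.median(np.abs(window - np.median(window, axis=1, keepdims=True)), axis=1).mean())

    # Trajectory-level coefficient of variation around upper-bound (best-prompt) performance.
    window_best = window.max(axis=1)
    cv = float(100 * window_best.std() / window_best.mean())

    # Prompt-switch rate: how often the best prompt changes between consecutive checkpoints.
    best_prompt = window.argmax(axis=1)
    switch_rate = float(100 * np.mean(best_prompt[1:] != best_prompt[:-1]))

    # Final-checkpoint statistics: spread, signal-to-noise ratio, non-randomness.
    final = scores[-1]
    spread = float(final.max() - final.min())
    snr = float(final.mean() / final.std()) if final.std() > 0 else float("inf")
    non_randomness = float(final.max() - random_baseline)

    # Monotonicity: Spearman correlation between checkpoint index and best-prompt score.
    monotonicity, _ = spearmanr(np.arange(len(best_per_ckpt)), best_per_ckpt)

    return {"MAD": mad, "CV%": cv, "prompt_switch%": switch_rate, "spread": spread,
            "SNR": snr, "non_randomness": non_randomness, "monotonicity": float(monotonicity)}
```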
To compute a language-level score across the selected tasks, we take the following steps (sketched in code after this list):
- Rescale performance scores relative to a random baseline using min–max normalization.
- Average the normalized scores within each task category.
- Compute the final language score as the mean of these category averages.
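A minimal sketch of these three steps, assuming each task score and its random baseline lie in [0, 1] and that 1.0 is used as the upper bound of the rescaling (the function and variable names are ours, not the actual HPLT-E code):

```python
# Sketch of the language-level score aggregation described above.
import numpy as np

def language_score(task_scores: dict, random_baselines: dict, task_categories: dict) -> float:
    """task_scores: {task: score}; random_baselines: {task: random-baseline score};
    task_categories: {task: category name}. Returns the language-level score."""
    # 1) Rescale each score relative to its random baseline (min-max normalization,
    #    assuming 1.0 as the maximum attainable score).
    normalized = {t: (s - random_baselines[t]) / (1.0 - random_baselines[t])
                  for t, s in task_scores.items()}
    # 2) Average the normalized scores within each task category.
    by_category = {}
    for task, score in normalized.items():
        by_category.setdefault(task_categories[task], []).append(score)
    category_means = [float(np.mean(v)) for v in by_category.values()]
    # 3) The language score is the mean of the category averages.
    return float(np.mean(category_means))
```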
Results for all languages are presented below.
To compute the multilingual score, we utilize several approaches (sketched in code after this list):
- Average normalized score: we average the min-max normalized language scores.
- Average rank: we rank the 30B models' language scores across all corpus configurations and average their ranks.
- Borda count: we first rank the 30B models for each language, then apply the Borda count to the language-wise rankings to compute the final ranking, using the Vote'n'Rank framework.
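The sketch below is a simplified stand-in for these aggregations (the actual Borda count is computed with Vote'n'Rank, which may handle ties and scoring rules differently):

```python
# Sketch of the multilingual aggregation: average score, average rank, and Borda count.
import numpy as np

def multilingual_scores(language_scores: dict) -> dict:
    """language_scores: {corpus: {language: normalized language score}}."""
    corpora = list(language_scores)
    languages = list(next(iter(language_scores.values())))
    avg_score = {c: float(np.mean([language_scores[c][l] for l in languages])) for c in corpora}
    avg_rank = {c: 0.0 for c in corpora}
    borda = {c: 0 for c in corpora}
    for lang in languages:
        # Rank corpora for this language (rank 1 = best score); ties are not handled here.
        ordered = sorted(corpora, key=lambda c: language_scores[c][lang], reverse=True)
        for rank, corpus in enumerate(ordered, start=1):
            avg_rank[corpus] += rank / len(languages)
            borda[corpus] += len(corpora) - rank  # n-1 points for the winner, ..., 0 for the last
    return {"avg_normalized_score": avg_score, "avg_rank": avg_rank, "borda_count": borda}
```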
Rank-based aggregation
Corpus | Avg. rank | Borda count |
---|---|---|
HPLT v3 | 1.25 | 5 |
HPLT v2 | 1.75 | 2 |
Our preliminary pretraining corpus comparison shows that LLMs pretrained on HPLT v3 generally outperform those pretrained on HPLT v2 across the HPLT-E languages. In particular, the v3 models achieve stronger results for Ukrainian, Basque, Catalan, and French; perform on par with the v2 models for Finnish and Czech; and show minor decreases for Spanish and Norwegian.
Stay tuned for more evaluation and methodological updates 💥
- Install LM Evaluation Harness as described here.
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
- Clone our HPLT-E GitHub repository to get access to the multi-prompt versions of SpanishBench, CatalanBench, BasqueBench, BenCzechMark, FrenchBench, and UkrainianBench.
git clone https://github.com/hplt-project/hplt-e.git
- Get the `finbench_v2` folder from the FinBench v2 GitHub repository.
Detailed guidelines on how to use LM Evaluation Harness can be found here. The task names are listed in the LM Evaluation Harness column of the language-specific task tables above; the suffix `_p[i-j]` denotes the range of supported prompts.
Below is an example of basic framework usage with the must-have arguments. The evaluation requires the `--include_path` argument to ensure our tasks are registered in the framework:
lm_eval \
  --model hf \
  --model_args pretrained=my_hf_model_name \
  --tasks global_mmlu_ukrainian_p0 \
  --include_path hplt-e/ \
  --output_path results/ukrainian/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
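The same evaluation can also be launched from Python. The snippet below is a sketch assuming a recent Harness version in which `simple_evaluate` accepts a `TaskManager`; check the Harness documentation if the API differs in your installation, and note that `my_hf_model_name` is a placeholder.

```python
# Sketch: running an HPLT-E task through the LM Evaluation Harness Python API.
import lm_eval
from lm_eval.tasks import TaskManager

# Register the HPLT-E task configs in addition to the built-in ones.
task_manager = TaskManager(include_path="hplt-e/")

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=my_hf_model_name",
    tasks=["global_mmlu_ukrainian_p0"],  # any task names from the tables above
    num_fewshot=0,
    batch_size="auto",
    task_manager=task_manager,
)
print(results["results"])
```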
An alternative approach to running all tasks of interest at once is to create a task group. LM Evaluation Harness allows you to group tasks as described here. An example for the Ukrainian `global_mmlu_ukrainian_p0` group task can be found here.
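As an illustration, the sketch below programmatically writes a group config that bundles the p0 variants of the Ukrainian tasks. The group name, output path, and the `group`/`task` YAML keys are assumptions based on recent Harness versions; refer to the linked example for the authoritative format.

```python
# Sketch: generating a task-group YAML for the Ukrainian p0 tasks (schema assumed).
import yaml

ukrainian_p0_tasks = [
    "global_mmlu_ukrainian_p0", "zno_p0", "include_ukrainian_p0",
    "textdetox_ukr_p0", "ua_squad_p0", "ukrbelebele_p0", "wmt24pp_en-uk_p0",
]

group_config = {
    "group": "ukrainianbench_p0",  # hypothetical group name
    "task": ukrainian_p0_tasks,
}

# Hypothetical output path inside the cloned HPLT-E repository.
with open("hplt-e/ukrainianbench_p0.yaml", "w", encoding="utf-8") as f:
    yaml.safe_dump(group_config, f, sort_keys=False)
```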
We thank Étienne Simon (UiO), Lucas Georges Gabriel Charpentier (UiO), and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian.