HPLT-E: Multilingual and Comprehensive LLM Evaluation

HPLT-E is a framework for comprehensive multilingual and multi-prompt k-shot evaluation across 124 tasks in eight typologically diverse languages: Catalan, Spanish, Basque, Galician, Norwegian, Ukrainian, Czech, and Finnish.

Updates

  • 08.10.2025: We are releasing HPLT v3 together with HPLT-E, which we use to benchmark HPLT v3 against HPLT v2 in our multilingual ablation studies.

Overview

HPLT-E combines existing monolingual benchmarks for Catalan (CatalanBench), Spanish (SpanishBench), Basque (BasqueBench), Galician (GalicianBench), Norwegian (NorEval), Finnish (FinBench v2), and Czech (BenCzechMark). In addition, we create a multi-task benchmark for Ukrainian (UkrainianBench) and extend single-prompt benchmarks to the multi-prompt scenario (Catalan, Spanish, Basque, Galician, and Ukrainian). HPLT-E covers a diverse set of 124 natural language understanding and generation tasks, each supporting 3-7 human-written prompts. Our main evaluation principles include:

  • Diversity: broader representation of lesser-resourced languages in the context of pretraining corpora comparison.
  • Data quality: use of human-created datasets to ensure reliable evaluation.
  • Robust evaluation: evaluation across 450+ prompts written by native speakers to account for prompt sensitivity.
  • Reproducibility: full integration of HPLT-E into LM Evaluation Harness for user-friendly standardized evaluation.

Evaluation suite

Catalan

Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| ARC-ca | arc_ca_challenge_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| ARC-ca | arc_ca_easy_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| Belebele | catbelebele_p[0-2] | Multiple-choice QA | Reading comprehension |
| CatalanQA | catalanqa_p[0-2] | Generative QA | Language-specific & world knowledge |
| CatCoLA | catcola_p[0-2] | Text classification | Language knowledge |
| COPA-ca | copa_ca_p[0-2] | Text classification | Commonsense reasoning |
| CoQCat | coqcat_p[0-2] | Generative QA | Reading comprehension |
| MGSM-cat | mgsm_direct_ca_p[0-2] | Generative QA | Mathematical reasoning |
| OpenBookQA-cat | openbookqa_ca_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| Parafraseja | parafraseja_p[0-2] | Text classification | Paraphrasing |
| PAWS-ca | paws_ca_p[0-2] | Text classification | Paraphrasing |
| PIQA-ca | piqa_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| SIQA-ca | siqa_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| TE-ca | teca_p[0-2] | Text classification | Entailment |
| VeritasQA-cat Generation | veritasqa_ca_gen_p[0-2] | Generative QA | Truthfulness |
| VeritasQA-cat Multiple-choice | veritasqa_ca_mc1_p[0-2] | Multiple-choice QA | Truthfulness |
| VeritasQA-cat Multiple-choice | veritasqa_ca_mc2_p[0-2] | Multiple-choice QA | Truthfulness |
| WNLI | wnli_ca_p[0-2] | Text classification | Entailment |
| XNLI | xnli_ca_p[0-2] | Text classification | Entailment |
| XQuAD | xquad_ca_p[0-2] | Generative QA | Reading comprehension |
| xStoryCloze | xstorycloze_ca_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| Cocoteros | cocoteros_va_p[0-2] | Text generation | Commonsense reasoning |
| FLORES | flores_en-ca_p[0-2] | Sequence-to-sequence generation | Machine translation |

Spanish

Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| Belebele | spabelebele_p[0-2] | Multiple-choice QA | Reading comprehension |
| COPA | copa_es_p[0-2] | Text classification | Commonsense reasoning |
| ESCoLA | escola_p[0-2] | Text classification | Language knowledge |
| MGSM-es | mgsm_direct_es_p[0-2] | Generative QA | Mathematical reasoning |
| OpenBookQA-es | openbookqa_es_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| PAWS-es | paws_es_p[0-2] | Text classification | Paraphrasing |
| VeritasQA-es Generation | veritasqa_es_gen_p[0-2] | Generative QA | Truthfulness |
| VeritasQA-es Multiple-choice | veritasqa_es_mc1_p[0-2] | Multiple-choice QA | Truthfulness |
| VeritasQA-es Multiple-choice | veritasqa_es_mc2_p[0-2] | Multiple-choice QA | Truthfulness |
| XNLI | xnli_es_p[0-2] | Text classification | Entailment |
| XQuAD | xquad_es_p[0-2] | Generative QA | Reading comprehension |
| xStoryCloze | xstorycloze_es_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| Cocoteros | cocoteros_es_p[0-2] | Text generation | Commonsense reasoning |
| FLORES | flores_en-es_p[0-2] | Sequence-to-sequence generation | Machine translation |
| INCLUDE | include_spanish_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| Global-MMLU | global_mmlu_spanish_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |

Basque

Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| Belebele | eusbelebele_p[0-2] | Multiple-choice QA | Reading comprehension |
| EusExams | eus_exams_eu_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| EusProficiency | eus_proficiency_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| EusReading | eus_reading_p[0-2] | Multiple-choice QA | Reading comprehension |
| EusTrivia | eus_trivia_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| MGSM-eu | mgsm_direct_eu_p[0-2] | Generative QA | Mathematical reasoning |
| PIQA-eu | piqa_eu_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| WNLI | wnli_eu_p[0-2] | Text classification | Entailment |
| XCOPA | xcopa_eu_p[0-2] | Text classification | Commonsense reasoning |
| XNLI | xnli_eu_native_p[0-2] | Text classification | Entailment |
| xStoryCloze | xstorycloze_eu_p[0-2] | Multiple-choice QA | Commonsense reasoning |
| PAWS-eu | paws_eu_p[0-2] | Text classification | Paraphrasing |
| ARC-eu | arc_eu_easy_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| ARC-eu | arc_eu_challenge_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| FLORES | flores_en-eu_p[0-2] | Sequence-to-sequence generation | Machine translation |
| INCLUDE | include_basque_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |

Norwegian

Tasks
| Name | LM Evaluation Harness (Bokmål) | LM Evaluation Harness (Nynorsk) | Task type | Task category |
|------|--------------------------------|---------------------------------|-----------|---------------|
| NoReC Sentence | norec_sentence_p[0-4] | | Text classification | Sentiment analysis |
| NoReC Document | norec_document_p[0-4] | | Text classification | Sentiment analysis |
| NorIdiom | noridiom_nob_p[0-4] | noridiom_nno_p[0-4] | Sentence completion | Language knowledge |
| Belebele | norbelebele_p[0-4] | | Multiple-choice QA | Reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob_p[0-4] | nrk_quiz_qa_nno_p[0-4] | Multiple-choice QA | Language-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob_p[0-4] | noropenbookqa_nno_p[0-4] | Multiple-choice QA | Language-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob_p[0-4] | norcommonsenseqa_nno_p[0-4] | Multiple-choice QA | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob_p[0-4] | nortruthfulqa_mc_nno_p[0-4] | Multiple-choice QA | Truthfulness |
| NorQuAD | norquad_p[0-4] | | Generative QA | Reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob_p[0-4] | nortruthfulqa_gen_nno_p[0-4] | Generative QA | Truthfulness |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob_p[0-4] | tatoeba_eng_nno_p[0-4] | Sequence-to-sequence generation | Machine translation |

Ukrainian

  • Benchmark: UkrainianBench
  • Paper: N/A
  • Homepage: github.com/hplt-project/hplt-e
  • Language code: ukr_Cyrl
  • Original LM Evaluation Harness implementation: N/A
  • HPLT-E multi-prompt implementation: ukr_Cyrl
Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| Global-MMLU | global_mmlu_ukrainian_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| ZNO | zno_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| INCLUDE | include_ukrainian_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| TextDetox | textdetox_ukr_p[0-2] | Text classification | Toxicity detection |
| UA-SQuAD | ua_squad_p[0-2] | Generative QA | Reading comprehension |
| Belebele | ukrbelebele_p[0-2] | Multiple-choice QA | Reading comprehension |
| WMT24PP | wmt24pp_en-uk_p[0-2] | Sequence-to-sequence generation | Machine translation |

Czech

NB: we update BenCzechMark to support the latest LM Evaluation Harness versions and create prompts for Global-MMLU.

Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| Belebele | cesbelebele_p[0-4] | Multiple-choice QA | Reading comprehension |
| Global-MMLU | global_mmlu_czech_p[0-4] | Multiple-choice QA | Language-specific & world knowledge |
| SQAD3.2 | cs_sqad32_p[0-4] | Generative QA | Reading comprehension |
| Umimeto | umimeto_p[0-4] | Multiple-choice QA | Language-specific & world knowledge |
| CERMAT OPEN | cermat_czech_open_p[0-4] | Generative QA | Language knowledge |
| CERMAT TF | cermat_czech_tf_p[0-4] | Multiple-choice QA | Language knowledge |
| CERMAT MC | cermat_czech_mc_p[0-4] | Multiple-choice QA | Language knowledge |
| Klokan QA | klokan_qa_p[0-4] | Multiple-choice QA | Mathematical reasoning |
| CERMAT (Math) MC | cermat_czmath_mc_p[0-4] | Multiple-choice QA | Mathematical reasoning |
| CERMAT (Math) OPEN | cermat_czmath_open_p[0-4] | Generative QA | Mathematical reasoning |
| CTKFacts | ctkfacts_nli_p[0-4] | Text classification | Entailment |
| Subjectivity | ces_subjectivity_p[0-4] | Text classification | Sentiment analysis |
| CzechSentiment - Mall | sentiment_mall_p[0-4] | Text classification | Sentiment analysis |
| CzechSentiment - CSFD | sentiment_csfd_p[0-4] | Text classification | Sentiment analysis |
| CzechSentiment - FB | sentiment_fb_p[0-4] | Text classification | Sentiment analysis |

French

Tasks
| Name | LM Evaluation Harness | Task type | Task category |
|------|-----------------------|-----------|---------------|
| FQuaD | fquad_p[0-2] | Generative QA | Reading comprehension |
| French Language Test: Grammar | french_bench_grammar_p[0-2] | Multiple-choice QA | Language knowledge |
| French Language Test: Vocabulary | french_bench_vocabulary_p[0-2] | Multiple-choice QA | Language knowledge |
| French Language Test: Reading | french_bench_reading_p[0-2] | Multiple-choice QA | Reading comprehension |
| Belebele | frabelebele_p[0-2] | Multiple-choice QA | Reading comprehension |
| French NLI | topic_based_nli_p[0-2] | Text classification | Entailment |
| XNLI | french_xnli_p[0-2] | Text classification | Entailment |
| INCLUDE | include_french_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |
| Global-MMLU | global_mmlu_french_p[0-2] | Multiple-choice QA | Language-specific & world knowledge |

Finnish

Tasks

| Name | Formulation | LM Evaluation Harness | Task type | Task category | FinBench v2 dataset version |
|------|-------------|-----------------------|-----------|---------------|-----------------------------|
| ARC-challenge-fi | mcf | arc_challenge_fi_mcf_fbv2_p[0-4] | Multiple-choice QA | Language-specific & world knowledge | finbenchv2-arc-c-fi-ht |
| | cf | arc_challenge_fi_cf_fbv2_p[0-4] | | | |
| Belebele | mcf | belebele_fin_Latn_mcf_fbv2_p[0-4] | Multiple-choice QA | Reading comprehension | finbenchv2-belebele-fi-og |
| | cf | belebele_fin_Latn_cf_fbv2_p[0-4] | | | |
| GoldenSwag | mcf | goldenswag_ht_fi_mcf_fbv2_p[0-4] | Sentence completion | Commonsense reasoning | finbenchv2-goldenswag-fi-ht |
| | cf | goldenswag_ht_fi_cf_fbv2_p[0-4] | | | |
| FIN-Bench | mcf | finbench_analogies_mcf_fbv2_p[0-4] | Multiple-choice | Relational reasoning | FIN-bench |
| | cf | finbench_analogies_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_emotions_mcf_fbv2_p[0-4] | Multiple-choice | Sentiment analysis | FIN-bench |
| | cf | finbench_emotions_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_empirical_judgments_mcf_fbv2_p[0-4] | Multiple-choice | Causal reasoning | FIN-bench |
| | cf | finbench_empirical_judgments_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_general_knowledge_mcf_fbv2_p[0-4] | Multiple-choice | Language-specific & world knowledge | FIN-bench |
| | cf | finbench_general_knowledge_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_hhh_alignment_mcf_fbv2_p[0-4] | Multiple-choice | Alignment and safety | FIN-bench |
| | cf | finbench_hhh_alignment_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_paraphrase_mcf_fbv2_p[0-4] | Multiple-choice | Paraphrasing | FIN-bench |
| | cf | finbench_paraphrase_cf_fbv2_p[0-4] | | | |
| | mcf | finbench_similarities_abstraction_mcf_fbv2_p[0-4] | Multiple-choice | Commonsense reasoning | FIN-bench |
| | cf | finbench_similarities_abstraction_cf_fbv2_p[0-4] | | | |

Ablation studies

In our HPLT v3 release, we conduct a series of multilingual ablation studies to compare HPLT v2 and HPLT v3 on the eight HPLT-E languages. We pretrain monolingual Llama-style 2.15B decoder-only models on 30B tokens from each language corpus, keeping the hyperparameters and tokenizer fixed across experiments (a configuration sketch follows the list):

  • Hidden size: 2048
  • Attention heads: 32
  • Layers: 24
  • Sequence length: 2048
  • Tokenizer: Gemma-3 (SentencePiece, vocabulary size 262K)
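
For reference, the snippet below is a minimal sketch of one Llama-style parameterization consistent with the hyperparameters above, expressed as a Hugging Face transformers LlamaConfig. The intermediate (MLP) size and tied embeddings are our assumptions rather than the actual training configuration; under these assumptions the rough parameter count works out to the stated ~2.15B.

# Hypothetical configuration sketch of the ablation model architecture.
# intermediate_size and tie_word_embeddings are assumptions; only the
# dimensions listed above are specified in this README.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=262_144,            # Gemma-3 SentencePiece tokenizer (~262K entries)
    hidden_size=2048,
    num_hidden_layers=24,
    num_attention_heads=32,
    max_position_embeddings=2048,  # sequence length
    intermediate_size=8192,        # assumption: 4x hidden size
    tie_word_embeddings=True,      # assumption
)

# Rough parameter count (ignoring norm weights and biases).
embed = config.vocab_size * config.hidden_size * (1 if config.tie_word_embeddings else 2)
per_layer = 4 * config.hidden_size**2 + 3 * config.hidden_size * config.intermediate_size
total = embed + config.num_hidden_layers * per_layer
print(f"~{total / 1e9:.2f}B parameters")   # ~2.15B under these assumptions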

Pretraining is performed with the Megatron-LM framework on the LUMI HPC supercomputer, using 16 AMD MI250x nodes and totaling approx. 1k GPU hours. We evaluate the ablation models at regular checkpoint intervals (every 1B tokens) in a 0-shot setup, aggregating results across all prompts and selecting tasks that provide reliable signal during pretraining.

Task selection

We use standard task-specific metrics and report the maximum score across prompts as the main performance aggregation method. We adapt the FineWeb2 evaluation design to examine the signal HPLT-E tasks provide, based on the criteria and statistics summarized below (an illustrative computation follows the table).

| Criterion | Scope | Description | Requirement |
|-----------|-------|-------------|-------------|
| Prompt-level median absolute deviation (MAD) | Mid-late pretraining window (15B-30B) | Typical prompt sensitivity | ≤ 5 |
| Consistency (Kendall τ) | Mid-late pretraining window (15B-30B) | Stability of corpus rankings across pretraining intervals | No strict threshold |
| Trajectory-level coefficient of variation (CV) | Mid-late pretraining window (15B-30B) | Relative variation around upper-bound performance | ≤ 10-12% |
| Prompt-switch rate (%) | Mid-late pretraining window (15B-30B) | Consistency of the best prompt across checkpoints (prompt lottery) | No strict threshold |
| Spread | Final checkpoint (30B) | Absolute difference between the maximum and minimum scores across prompts | No strict threshold |
| Signal-to-noise ratio (SNR) | Final checkpoint (30B) | Noise from prompt variability | ≥ 3 |
| Non-randomness | Final checkpoint (30B) | Absolute difference between the final score and the random baseline | Must be positive and satisfactory |
| Monotonicity (Spearman correlation) | All checkpoints (1B-30B) | Correlation between pretraining step and performance score | ≥ 0.5 |
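
The snippet below illustrates how several of these statistics can be computed from a prompts × checkpoints matrix of scores. It is a sketch under our own simplifying assumptions (e.g., window boundaries, SNR defined as the mean over the standard deviation across prompts), not the exact selection pipeline.

# Illustrative computation of task-selection statistics from a
# prompts x checkpoints score matrix (placeholder random scores on a 0-100 scale).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
scores = rng.random((3, 30)) * 100        # 3 prompts, 30 checkpoints (1B-30B)
random_baseline = 25.0                    # e.g., 4-way multiple-choice accuracy
window = scores[:, 14:]                   # mid-late pretraining window (15B-30B)

# Prompt-level MAD: typical deviation from the median prompt, averaged over checkpoints.
mad = np.median(np.abs(window - np.median(window, axis=0)), axis=0).mean()

# Trajectory-level CV: relative variation of the best-prompt (upper-bound) trajectory.
best = window.max(axis=0)
cv = 100 * best.std() / best.mean()

# Prompt-switch rate: how often the identity of the best prompt changes ("prompt lottery").
switch_rate = 100 * (np.diff(window.argmax(axis=0)) != 0).mean()

# Final-checkpoint statistics.
final = scores[:, -1]
spread = final.max() - final.min()
snr = final.mean() / final.std()                 # assumption: noise = std across prompts
non_randomness = final.max() - random_baseline

# Monotonicity: Spearman correlation between checkpoint index and best-prompt score.
monotonicity = spearmanr(np.arange(scores.shape[1]), scores.max(axis=0)).correlation

print(mad, cv, switch_rate, spread, snr, non_randomness, monotonicity)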

Language-level performance aggregation

Computing a language score

To compute a language-level score across the selected tasks, we take the following steps (sketched in code below):

  1. Rescale performance scores relative to a random baseline using min–max normalization.
  2. Average the normalized scores within each task category.
  3. Compute the final language score as the mean of these category averages.
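
As an illustration of this procedure, the sketch below aggregates a handful of placeholder task scores; the task names are taken from the tables above, while the scores, baselines, and score ranges are made up.

# Illustrative language-score aggregation with placeholder scores.
# 1) rescale each task score against its random baseline, 2) average within
# task categories, 3) average the category means into a single language score.
from collections import defaultdict

# (task, category) -> (score, random_baseline, max_score)
results = {
    ("ukrbelebele_p0", "Reading comprehension"): (41.0, 25.0, 100.0),
    ("zno_p0", "Language-specific & world knowledge"): (33.0, 25.0, 100.0),
    ("textdetox_ukr_p0", "Toxicity detection"): (61.0, 50.0, 100.0),
}

by_category = defaultdict(list)
for (task, category), (score, baseline, max_score) in results.items():
    normalized = (score - baseline) / (max_score - baseline)   # min-max vs. random baseline
    by_category[category].append(normalized)

category_means = {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}
language_score = sum(category_means.values()) / len(category_means)
print(category_means, language_score)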

Results for all languages are presented below.

[Per-language results figures: Spanish, Catalan, Basque, Czech, Finnish, Norwegian, Ukrainian, and French.]

Computing a "multilingual" score

To compute the multilingual score, we use several approaches (a sketch of the rank-based aggregations follows the list):

  1. Average normalized score: we average the min-max normalized language scores.
  2. Average rank: we rank the 30B models' language scores across all corpus configurations and average their ranks.
  3. Borda count: we first rank the 30B models for each language and then apply the Borda count to the language-wise rankings to obtain the final ranking, using the Vote'n'Rank framework.
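
The sketch below illustrates the two rank-based aggregations on placeholder language scores; in practice we compute the Borda count with the Vote'n'Rank framework rather than by hand as done here.

# Illustrative rank-based aggregation across languages for two corpora
# (placeholder language scores; rank 1 = best corpus for that language).
from scipy.stats import rankdata

language_scores = {               # language -> {corpus: language-level score}
    "Ukrainian": {"HPLT v3": 0.31, "HPLT v2": 0.27},
    "Basque":    {"HPLT v3": 0.24, "HPLT v2": 0.22},
    "Spanish":   {"HPLT v3": 0.40, "HPLT v2": 0.41},
}
corpora = ["HPLT v3", "HPLT v2"]

avg_rank = {c: 0.0 for c in corpora}
borda = {c: 0.0 for c in corpora}
for scores in language_scores.values():
    ranks = rankdata([-scores[c] for c in corpora])   # negate: higher score -> better rank
    for corpus, rank in zip(corpora, ranks):
        avg_rank[corpus] += rank / len(language_scores)
        borda[corpus] += len(corpora) - rank           # Borda points: higher is better

print(avg_rank, borda)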
[Figures: average normalized scores with the maximum and with the median as the prompt-level aggregation.]
Rank-based aggregation

| Corpus | Avg. rank | Borda count |
|--------|-----------|-------------|
| HPLT v3 | 1.25 | 5 |
| HPLT v2 | 1.75 | 2 |

Key Takeaways

Our preliminary pretraining corpus comparison shows that LLMs pretrained on HPLT v3 consistently outperform those pretrained on HPLT v2 across the HPLT-E languages. In particular, v3 models achieve stronger results for Ukrainian, Basque, Catalan, and French, perform on par with v2 models for Finnish and Czech, and show minor decreases for Spanish and Norwegian.

Stay tuned for more evaluation and methodological updates 💥

Installation and usage

  1. Install LM Evaluation Harness as described here.
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
  2. Clone our HPLT-E GitHub repository to get access to the multi-prompt versions of SpanishBench, CatalanBench, BasqueBench, BenCzechMark, FrenchBench, and UkrainianBench.
git clone https://github.com/hplt-project/hplt-e.git
  3. Get the finbench_v2 folder from the FinBench v2 GitHub repository.

Examples

Detailed guidelines on how to use LM Evaluation Harness can be found here. The task names are listed in the LM Evaluation Harness column of the language-specific task tables above; the _p[i-j] suffix denotes the corresponding supported prompts.

Basic usage

Below is an example of basic framework usage with the required arguments. The evaluation requires the include_path argument so that our tasks are registered in the framework:

lm_eval \
  --model hf \
  --model_args pretrained=my_hf_model_name \
  --tasks global_mmlu_ukrainian_p0 \
  --include_path hplt-e/ \
  --output_path results/ukrainian/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
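
The harness can also be driven from Python, which makes it easy to run all prompt variants of a task in one call. The sketch below follows the documented lm_eval Python API (simple_evaluate with a TaskManager pointing at the hplt-e/ directory); the model name is a placeholder, and we assume the script is run from the directory containing the cloned hplt-e/ folder.

# Sketch: evaluating all prompt variants of a task via the lm_eval Python API.
import lm_eval

# Register the HPLT-E task configs, analogous to --include_path hplt-e/ on the CLI.
task_manager = lm_eval.tasks.TaskManager(include_path="hplt-e/")

tasks = [f"global_mmlu_ukrainian_p{i}" for i in range(3)]   # p0, p1, p2

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=my_hf_model_name",   # placeholder model
    tasks=tasks,
    num_fewshot=0,
    task_manager=task_manager,
)

for task_name, metrics in results["results"].items():
    print(task_name, metrics)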

Task groups

An alternative approach to running all tasks of interest at once is to create a task group. LM Evaluation Harness allows grouping tasks as described here. An example for the Ukrainian global_mmlu_ukrainian_p0 group task can be found here.

Acknowledgements

We thank Étienne Simon (UiO), Lucas Georges Gabriel Charpentier (UiO), and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian.
