CatalanBench is a benchmark for evaluating language models on Catalan tasks. That is, it evaluates the ability of a language model to understand and generate Catalan text. CatalanBench offers a combination of pre-existing, open datasets and datasets developed exclusively for this benchmark. All the details of CatalanBench will be published in a paper soon.
The new evaluation datasets included in CatalanBench are:
Task | Category | Homepage |
---|---|---|
ARC_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/arc_ca |
MGSM_ca | Math | https://huggingface.co/datasets/projecte-aina/mgsm_ca |
OpenBookQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/openbookqa_ca |
Parafraseja | Paraphrasing | https://huggingface.co/datasets/projecte-aina/Parafraseja |
PIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/piqa_ca |
SIQA_ca | Question Answering | https://huggingface.co/datasets/projecte-aina/siqa_ca |
XStoryCloze_ca | Commonsense Reasoning | https://huggingface.co/datasets/projecte-aina/xstorycloze_ca |
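All of these datasets are hosted on the Hugging Face Hub, so they can be inspected directly with the `datasets` library before running the harness. Below is a minimal sketch; the repository id comes from the Homepage column above, and whether a given repository requires an explicit configuration name is an assumption to verify on its dataset page.

```python
from datasets import load_dataset

# Load one of the new CatalanBench datasets from the Hugging Face Hub.
# The repository id is taken from the Homepage column above; some
# repositories may also require a configuration name (see the dataset page).
ds = load_dataset("projecte-aina/openbookqa_ca")

# Inspect the available splits and features before wiring the dataset
# into an evaluation pipeline.
print(ds)
```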
The datasets included in CatalanBench that have been made public in previous publications are listed in the task overview below. The paper for CatalanBench is coming soon.

CatalanBench exposes the following groups and tags:
`catalan_bench`
: All tasks included in CatalanBench.

`flores_ca`
: All FLORES translation tasks from or to Catalan.

`cabreu`
: Three CaBREU tasks, one for each type of summary (extractive, abstractive, and extreme).

`phrases_va`
: Two Phrases_va tasks for language adaptation between Catalan and Valencian.
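These group and tag names can be passed directly to the harness. Here is a minimal sketch using the Python API, assuming a recent lm-evaluation-harness that exports `simple_evaluate`; the checkpoint below is only a placeholder.

```python
import lm_eval

# Run every task in the CatalanBench group against a Hugging Face model.
# "EleutherAI/pythia-160m" is a placeholder checkpoint; substitute the
# model you actually want to benchmark.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["catalan_bench"],  # or "flores_ca", "cabreu", "phrases_va"
)

# Per-task metrics are keyed by task name in the results dictionary.
print(results["results"])
```

Swapping `catalan_bench` for `flores_ca` restricts the run to the translation tasks only.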
The following tasks evaluate models on the CatalanBench datasets using various scoring methods (a registry-check sketch follows the list):
- `arc_ca_challenge`
- `arc_ca_easy`
- `belebele_cat_Latn`
- `cabreu`
- `catalanqa`
- `catcola`
- `copa_ca`
- `coqcat`
- `flores_ca`
- `flores_ca-de`
- `flores_ca-en`
- `flores_ca-es`
- `flores_ca-eu`
- `flores_ca-fr`
- `flores_ca-gl`
- `flores_ca-it`
- `flores_ca-pt`
- `flores_de-ca`
- `flores_en-ca`
- `flores_es-ca`
- `flores_eu-ca`
- `flores_fr-ca`
- `flores_gl-ca`
- `flores_it-ca`
- `flores_pt-ca`
- `mgsm_direct_ca`
- `openbookqa_ca`
- `parafraseja`
- `paws_ca`
- `phrases_ca`
- `piqa_ca`
- `siqa_ca`
- `teca`
- `veritasqa_gen_ca`
- `veritasqa_mc1_ca`
- `veritasqa_mc2_ca`
- `wnli_ca`
- `xnli_ca`
- `xquad_ca`
- `xstorycloze_ca`
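To confirm that these task names are registered in your installed version of the harness before launching a long run, you can query the task registry. This is a sketch assuming the `TaskManager` API of lm-evaluation-harness v0.4+:

```python
from lm_eval.tasks import TaskManager

# Build the registry of all tasks known to the installed harness and
# check a few CatalanBench task names against it.
task_manager = TaskManager()
for name in ["arc_ca_easy", "flores_en-ca", "veritasqa_gen_ca", "xstorycloze_ca"]:
    status = "registered" if name in task_manager.all_tasks else "missing"
    print(f"{name}: {status}")
```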
Some of these tasks are taken from benchmarks already available in LM Evaluation Harness. These are:
`belebele_cat_Latn`
: Belebele Catalan
- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation?
    - Yes, original implementation contributed by author of the benchmark
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?