Main repository for the sharing of Quechua-Spanish Speech Translation data as part of the low-resource shared task at IWSLT 2025.
In addition to the 1 hour, 40 minutes of Quechua audio data aligned with Spanish translations, we also provide participants with a corpus of 48 hours of fully transcribed Quechua audio without translations for the unconstrained task. The audio data and corresponding transcriptions are a larger extract from the Siminchik dataset. We hope this data can be used directly to help develop speech recognition components for the unconstrained task. The data can be downloaded directly from here: Unconstrained QUE-SPA Additional Audio 1.
Please Note: Participants are not required to use this data but are free to use it under the license below.
@article{cardenas2018siminchik,
title={Siminchik: A speech corpus for preservation of southern quechua},
author={Cardenas, Ronald and Zevallos, Rodolfo and Baquerizo, Reynaldo and Camacho, Luis},
journal={ISI-NLP 2},
pages={21},
year={2018}
}
We are also sharing a secondary dataset named que_spa_synthetic_translation, which contains a set of synthetic Quechua-Spanish translations. This dataset is intended as additional training data for the unconstrained task. Although these translations are machine-generated (translated by Google) and then post-edited by a Quechua speaker, they can be valuable for experimenting with training strategies in the unconstrained setup. The dataset can be found in the que_spa_synthetic_translation folder, and participants are encouraged to explore its utility. This secondary corpus contains the Spanish translations of the Huqariq corpus (Zevallos et al., 2022). It comprises about 8 hours of Quechua audio with transcriptions and translations into Spanish.
We are highly interested in feedback about this data. Please contact John E. Ortega (j.ortega [email symbol] northeastern.edu) and Rodolfo Zevallos (rodolfojoel.zevallos [email symbol] upf.edu) if you plan on using it.
Please Note: Participants are not required to use this data but are free to use it under the license below.
@inproceedings{zevallos2022huqariq,
title={Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition},
author={Zevallos, Rodolfo and Camacho, Luis and Melgarejo, Nelsi},
booktitle={Proceedings of the Thirteenth Language Resources and Evaluation Conference},
pages={5029--5034},
year={2022}
}
As part of the unconstrained task, we allow the use of Machine Translation parallel text from previous work. Participants are not required to use this data either.
The data is found in this repository in the folder: additional_mt_text.
The text was extracted from the JW300 and Hinantin websites and used in the work cited below.
Please make sure to cite the work below if you use this data.
@article{ortega2020neural,
title={Neural machine translation with a polysynthetic low resource language},
author={Ortega, John E and Castro Mamani, Richard and Cho, Kyunghyun},
journal={Machine Translation},
volume={34},
number={4},
pages={325--346},
year={2020},
publisher={Springer}
}
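For participants who want to try the parallel text, the following is a minimal Python sketch of reading a parallel corpus into sentence pairs. It assumes the text is stored as two line-aligned plain-text files (line i of the Quechua file corresponds to line i of the Spanish file); this README does not specify that layout or any file names, so check the contents of additional_mt_text before relying on it. The function name `read_parallel` and the paths passed to it are hypothetical.

```python
def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Return (source, target) sentence pairs from two line-aligned text files.

    Assumption: both files hold one sentence per line and are aligned
    line by line. This layout is NOT confirmed by the repository README.
    """
    with open(src_path, encoding=encoding) as f:
        src_lines = [line.rstrip("\n") for line in f]
    with open(tgt_path, encoding=encoding) as f:
        tgt_lines = [line.rstrip("\n") for line in f]

    # A length mismatch means the files are not line-aligned after all.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )

    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs
```

It could be called as `read_parallel("path/to/quechua.txt", "path/to/spanish.txt")`, with the placeholder paths replaced by whichever files the folder actually contains. Raising an error on a line-count mismatch is a deliberate choice here: silently truncating to the shorter file would hide alignment problems.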
All audio recordings are property of Siminchikkunarayku and Llamacha.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Part of this work has been funded by AmericasNLP-2022, John E. Ortega, and Llamacha. Special thanks to Eva Mühlbauer, Maximilian Torres and Anku Kichka for their support.