This repository provides data and code for the paper "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription".
The collected transcriptions are stored in data/*-crowd.tsv, and the ground-truth transcriptions are stored in data/*-gt.txt. We also provide code for the annotation process and for speech synthesis in the annotation and speech_sythesis folders, respectively.
The CrowdSpeech and VoxDIY datasets are stored in the data folder. Each dataset is associated with two files: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first file contains three columns: INPUT:audio (the audio file given to crowd workers), OUTPUT:transcription (the worker's transcription), and ASSIGNMENT:worker_id (a unique worker identifier). The second file contains two tab-separated columns without a header: an audio file and its ground-truth transcription.
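As a minimal sketch of reading the two formats described above, the following uses only the Python standard library. It assumes the crowd file is tab-separated with a header row naming the three columns. The function names `read_crowd` and `read_gt` and the sample rows are illustrative and are not part of this repository.

```python
import csv
import io
from collections import defaultdict


def read_crowd(handle):
    """Group crowd transcriptions by audio file: {audio: [(worker_id, text), ...]}."""
    by_audio = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        by_audio[row["INPUT:audio"]].append(
            (row["ASSIGNMENT:worker_id"], row["OUTPUT:transcription"])
        )
    return dict(by_audio)


def read_gt(handle):
    """Read the headerless ground-truth file: {audio: transcription}."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Made-up sample rows standing in for the real files
crowd_sample = io.StringIO(
    "INPUT:audio\tOUTPUT:transcription\tASSIGNMENT:worker_id\n"
    "a.mp3\thello world\tw1\n"
    "a.mp3\thello word\tw2\n"
    "b.mp3\tgood morning\tw1\n"
)
gt_sample = io.StringIO("a.mp3\thello world\nb.mp3\tgood morning\n")

crowd = read_crowd(crowd_sample)
gt = read_gt(gt_sample)
print(len(crowd["a.mp3"]), gt["b.mp3"])
```

For real files, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory samples.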
You can also download the CrowdSpeech dataset from HuggingFace.
First, you may need to install some dependencies:
pip3 install crowd-kit toloka-kit jiwer
Then you can evaluate all our baseline aggregation methods with a single command:
python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
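To illustrate what scoring an aggregated transcription against the ground truth can look like, here is a minimal sketch of a corpus-level word error rate. That the evaluation metric is word error rate is an assumption, suggested by jiwer being among the dependencies. This is a standard-library illustration with hypothetical function names and made-up data, not the code in baselines.py.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words.

    A reference with no matching hypothesis is scored against an empty string.
    """
    errors = 0
    total = 0
    for audio, ref in references.items():
        ref_words = ref.split()
        hyp_words = hypotheses.get(audio, "").split()
        errors += word_edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total if total else 0.0


# Made-up example: one wrong word out of four reference words
ground_truth = {"a.mp3": "hello world", "b.mp3": "good morning"}
aggregated = {"a.mp3": "hello word", "b.mp3": "good morning"}
print(corpus_wer(ground_truth, aggregated))  # 0.25
```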
To get the Oracle result, run:
python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
You can also get the Inter-Rater Agreement by running:
python3 agreement.py data/<dataset>-crowd.tsv
An IPython notebook with code for the VoxDIY data collection process is available. For quality control, we use a special class, TaskProcessor, that gets all submissions that have not yet been accepted or rejected, calculates workers' skills, and checks whether each submission should be accepted or rejected.
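The three steps described above (select pending submissions, compute worker skills, decide) can be sketched roughly as follows. Everything here is hypothetical: the `Submission` record, the `quality` score, the mean-based skill, and the 0.5 threshold are assumptions chosen for illustration. The actual TaskProcessor in the notebook works with Toloka assignments and may compute skills differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical record; the real notebook works with Toloka assignments.
    submission_id: str
    worker_id: str
    status: str      # "SUBMITTED", "ACCEPTED" or "REJECTED"
    quality: float   # hypothetical per-submission quality score in [0, 1]


def pending(submissions):
    """Keep only submissions that have not been accepted or rejected yet."""
    return [s for s in submissions if s.status == "SUBMITTED"]


def worker_skills(submissions):
    """Hypothetical skill: mean quality score over a worker's submissions."""
    scores = defaultdict(list)
    for s in submissions:
        scores[s.worker_id].append(s.quality)
    return {w: sum(v) / len(v) for w, v in scores.items()}


def decide(submissions, threshold=0.5):
    """Return {submission_id: "ACCEPT" | "REJECT"} for pending submissions."""
    todo = pending(submissions)
    skills = worker_skills(todo)
    return {
        s.submission_id: "ACCEPT" if skills[s.worker_id] >= threshold else "REJECT"
        for s in todo
    }


# Made-up submissions: s4 is already accepted, so it is skipped
subs = [
    Submission("s1", "w1", "SUBMITTED", 0.9),
    Submission("s2", "w1", "SUBMITTED", 0.7),
    Submission("s3", "w2", "SUBMITTED", 0.2),
    Submission("s4", "w2", "ACCEPTED", 1.0),
]
decisions = decide(subs)
print(decisions)
```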
Our data is also available on the HuggingFace Hub, along with a T5 model trained on the train-clean, dev-clean, and dev-other parts of CrowdSpeech.
The following snippet shows an example of model inference:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the aggregation model from the Hub
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Noisy transcriptions of the same audio, separated by " | "
input_text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
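To prepare inputs of this shape from crowd data, one possible approach is to group the transcriptions by audio file and join them with the same separator. The sketch below assumes that " | " is the expected separator, based only on the example input above. The function name `build_model_inputs` and the sample rows are hypothetical.

```python
from collections import defaultdict

SEPARATOR = " | "  # assumption: taken from the example input string above


def build_model_inputs(rows):
    """Join all transcriptions of each audio file into one input string.

    `rows` is an iterable of (audio, transcription) pairs.
    """
    grouped = defaultdict(list)
    for audio, transcription in rows:
        grouped[audio].append(transcription.strip())
    return {audio: SEPARATOR.join(texts) for audio, texts in grouped.items()}


# Made-up rows in the order they might appear in a crowd file
rows = [
    ("a.mp3", "samplee text"),
    ("a.mp3", "sampl text"),
    ("b.mp3", "another one"),
    ("a.mp3", "sample textt"),
]
inputs = build_model_inputs(rows)
print(inputs["a.mp3"])
```

Each resulting string can then be passed to the tokenizer in place of the hard-coded example.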
© YANDEX LLC, 2021. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.
© YANDEX LLC, 2021. Licensed under the Creative Commons Attribution 4.0 license. See data/LICENSE file for more details.
LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.
CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.