Making Supervised Large Datasets for English / German / Spanish #35

snakers4 · 2020-04-26T08:32:57Z

Hi,

I could not find any contact details in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander, I am the main author of Open STT and these recent articles from The Gradient:

TL;DR: we have collected 30k hours of annotated speech in Russian with close to zero investment in manual annotation, and we are doing the same for English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k in German + Spanish. We chose these languages (apart from English, of course) because they are popular, we speak them (at least I can read them), and their phonetics are fairly simple and similar to Russian.

On Russian data we have built production-grade models and have even deployed some high-load services into production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder if FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration or in sponsoring our efforts to fully open-source our models and datasets.

Libri-Light offers 60k+ hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from LibriVox (granted, there will be some noise in the data, of course). I wonder why you did not do or share this. This is such low-hanging fruit!

Best,
Alexander

snakers4 · 2020-04-27T04:39:52Z

Also, I wonder why you use FLAC rather than a modern speech-oriented codec like Opus?
FLAC is a lossless format made for music, and it takes much more space than Opus.
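
To put rough numbers on the size difference, here is a back-of-the-envelope sketch. Everything in it beyond the 60k hours is an assumption for illustration: 16 kHz 16-bit mono audio, FLAC compressing speech to about 55% of raw PCM, and Opus at 24 kbps. The helper `audio_size_gb` is hypothetical and not part of any library; plug in your own bitrates.

```python
def audio_size_gb(hours: float, bitrate_kbps: float) -> float:
    """Size in GB (10^9 bytes) of `hours` of audio at an average bitrate in kbit/s."""
    seconds = hours * 3600
    bits = seconds * bitrate_kbps * 1000
    return bits / 8 / 1e9


HOURS = 60_000  # roughly the unlabelled part of Libri-Light

# Raw PCM bitrate for 16 kHz, 16-bit, mono audio (assumed format): 256 kbit/s.
PCM_KBPS = 16_000 * 16 / 1000

# ASSUMPTION: FLAC shrinks speech to ~55% of raw PCM; real ratios vary per file.
FLAC_KBPS = PCM_KBPS * 0.55

# ASSUMPTION: 24 kbit/s Opus is taken as a typical speech setting.
OPUS_KBPS = 24

for name, kbps in [("PCM", PCM_KBPS), ("FLAC", FLAC_KBPS), ("Opus", OPUS_KBPS)]:
    print(f"{name:>4}: {kbps:6.1f} kbit/s -> {audio_size_gb(HOURS, kbps):8.0f} GB")
```

Under these assumptions the gap is several-fold, but the exact ratio depends entirely on the bitrates you pick, and Opus is lossy, so the comparison only matters if that loss is acceptable for training.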

snakers4 changed the title ~~Supervised large~~ Making Supervised Large Datasets for English / German / Spanish Apr 26, 2020
