Making Supervised Large Datasets for English / German / Spanish #35

Open
snakers4 opened this issue Apr 26, 2020 · 1 comment

Hi,

I could not find any contact details in the press release or in the paper (please correct me if I am wrong), so I decided to open an issue here to reach out.

My name is Alexander, I am the main author of Open STT and these recent articles from The Gradient:

TLDR - we have collected 30k hours of annotated speech in Russian with close to zero investment in manual annotation, and we are doing the same for English / German / Spanish. My personal goal is to collect 10-20k hours in English and 10k in German + Spanish. We chose these languages (apart from English ofc) because they are popular, we speak them (at least I can read them), and their phonetics are simple and similar to Russian.

On the Russian data we have built production-grade models and have even deployed some high-load services to production (if you speak Russian, please follow these links: http://silero.ai/, https://mobile-demo.silero.ai/, https://habr.com/ru/post/494006/).

I wonder whether FAIR (please correct me if FAIR and facebookresearch are not the same entity) would be interested in a win-win collaboration or in sponsoring our efforts to fully open-source our models and datasets.

Libri-Light offers 60+ k hours of unlabelled speech, a small training set for limited supervision (10h, 1h or 10 minutes of labelled speech), and a common set of metrics to evaluate three settings:

You can build almost fully supervised datasets from Librivox (granted, there will be some noise in the data ofc). I wonder why you did not do / share this. It is such low-hanging fruit!

Best,
Alexander

snakers4 changed the title from "Supervised large" to "Making Supervised Large Datasets for English / German / Spanish" on Apr 26, 2020

@snakers4 (Author)

Also, I wonder why you use FLAC rather than a modern speech-oriented codec like Opus?
FLAC is a lossless format made for music, and it takes much more space than Opus.
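
To make the suggestion concrete, here is a minimal sketch of how a single file could be re-encoded. It assumes an ffmpeg build with libopus support; the helper name `opus_command`, the 24 kbit/s bitrate, and the file name are illustrative assumptions, not something taken from Libri-Light.

```python
import shlex
from pathlib import Path


def opus_command(src, bitrate_kbps=24):
    """Build an ffmpeg command that re-encodes one FLAC file to Opus.

    The bitrate default is an assumed value for speech, not a
    recommendation from the Libri-Light authors.
    """
    # Keep the same file name, only swap the extension to .opus.
    dst = Path(src).with_suffix(".opus")
    return [
        "ffmpeg",
        "-i", src,                    # input FLAC file
        "-c:a", "libopus",            # encode audio with the Opus codec
        "-b:a", f"{bitrate_kbps}k",   # target audio bitrate
        str(dst),
    ]


# Print the command instead of running it, so the sketch has no side effects.
print(shlex.join(opus_command("chapter_01.flac")))
```

The list it returns can be passed to `subprocess.run` once the settings look right. Unlike FLAC, Opus is lossy, so the conversion trades some fidelity for the smaller size.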
