Create dataset loader for VoxLingua107 #328

SamuelCahyawijaya · 2022-11-20T05:48:05Z

Dataset	voxlingua
Description	VoxLingua107 is a speech dataset for training spoken language identification models. The dataset consists of short speech segments automatically extracted from YouTube videos and labeled according to the language of the video title and description, with some post-processing steps to filter out false positives. VoxLingua107 contains data for 107 languages, including Indonesian, Javanese, and Sundanese.
License	CC-BY 4.0
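
As a rough starting point for whoever picks this up, here is a minimal sketch of how a loader might enumerate labeled segments for the three languages named above. It assumes the archives have already been extracted into one sub-directory per language code (`<root>/<code>/*.wav`); that layout, the use of ISO 639-1 codes as directory names, and the names `LANGUAGES` and `iter_examples` are all hypothetical and should be checked against the actual VoxLingua107 release.

```python
from pathlib import Path
from typing import Dict, Iterator, Tuple

# ISO 639-1 codes for the three languages named in the description.
# Assumption: the extracted data has one sub-directory per code.
LANGUAGES = {"id": "Indonesian", "jv": "Javanese", "su": "Sundanese"}


def iter_examples(root: Path) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (key, example) pairs for every .wav file under root/<code>/."""
    for code in sorted(LANGUAGES):
        lang_dir = root / code
        if not lang_dir.is_dir():
            continue  # this language was not downloaded; skip it
        for wav in sorted(lang_dir.glob("*.wav")):
            key = f"{code}/{wav.stem}"
            # The label is the directory's language code, mirroring how the
            # dataset labels each segment by the language of its source video.
            yield key, {"id": key, "path": str(wav), "label": code}
```

Yielding `(key, example)` pairs keeps the function easy to call from a generator-style dataset builder, but the exact schema the project expects is not specified in this issue.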

haryoa · 2022-12-20T12:27:53Z

#self-assign

SamuelCahyawijaya added this to Nusantara Dataset Initiative Nov 20, 2022

github-actions bot assigned haryoa Dec 20, 2022
