Adding your data to OmniLingo takes three main steps: importing the data into IPFS, indexing it, and publishing it.
Import data into your local IPFS node and generate an index:
$ importer.py dataset_dir index_path
e.g.
$ importer.py ./cv-corpus-7.0-2021-07-21/tr/ tr.json
where dataset_dir is in Common Voice format.
Index the data, extracting a balanced subset of clips by a complexity metric:
$ indexer.py locale index_path
e.g.
$ indexer.py tr tr.json
This will return a CID that looks like QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
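Since the CID is copied by hand into the next command, a quick shape check can catch a truncated paste. The sketch below assumes the indexer prints a CIDv0 like the one shown above (46 characters, "Qm" followed by 44 base58 characters); `looks_like_cidv0` is a hypothetical helper, not part of OmniLingo, and it would reject a CIDv1.

```python
import re

# CIDv0 strings are 46 characters long: "Qm" plus 44 base58btc characters.
# The base58btc alphabet leaves out 0, O, I and l.
CIDV0_PATTERN = re.compile(r"Qm[1-9A-HJ-NP-Za-km-z]{44}")


def looks_like_cidv0(text: str) -> bool:
    """Return True if text has the shape of a CIDv0 (hypothetical helper).

    This only checks the textual shape; it does not verify that the
    content exists on any IPFS node.
    """
    return CIDV0_PATTERN.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    print(looks_like_cidv0("QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho"))
```

A string that fails this check was most likely cut off or mistyped when copying it from the indexer output.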
Publish data to the global index in OmniLingo on IPFS:
$ publisher.py locale cid
e.g.
$ publisher.py tr QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
Publish to a name using the local node ID:
$ ipfs name publish cid
e.g.
$ ipfs name publish QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho
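The four commands above can be collected in one place, which is handy when repeating the process for several locales. This is only a sketch: `build_commands` is a hypothetical helper, it assumes the scripts are invoked exactly as written above, and it only builds the command lines without running them. The CID still has to be taken from the indexer output.

```python
import shlex


def build_commands(dataset_dir, locale, index_path, cid):
    """Return the import, index, publish and name-publish commands in order.

    Hypothetical helper: `cid` is whatever indexer.py printed, so the last
    two commands can only be run after the second one has finished.
    """
    return [
        ["importer.py", dataset_dir, index_path],
        ["indexer.py", locale, index_path],
        ["publisher.py", locale, cid],
        ["ipfs", "name", "publish", cid],
    ]


if __name__ == "__main__":
    commands = build_commands(
        "./cv-corpus-7.0-2021-07-21/tr/",
        "tr",
        "tr.json",
        "QmXpgcavH2shpBbfnFoymPxEw2zpr4MdAgi1aaoZT4Yeho",
    )
    for argv in commands:
        # shlex.join quotes arguments that need it, e.g. paths with spaces.
        print("$ " + shlex.join(argv))
```

Keeping the commands as argument lists rather than one shell string means a dataset path containing spaces is passed through intact.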
To publish model files (e.g. for pronunciation assistance), you need a directory containing two files:
models/LOCALE.tflite
: The binary for the ASR model
models/LOCALE.json
: Metadata for the model
The metadata file, e.g. pt.json for Portuguese, should look like:
{"format": "coqui", "type": "acoustic", "licence":"AGPL-3.0", "src":"https://itml.cl.indiana.edu/models/"}
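If you prefer to generate the metadata file rather than write it by hand, a minimal sketch could look like the following. `render_model_metadata` is a hypothetical helper; the four keys are taken from the example above, and treating all of them as required is an assumption.

```python
import json


def render_model_metadata(fmt, model_type, licence, src):
    """Return the metadata as a JSON string (hypothetical helper).

    Assumption: every field shown in the example is required, so an
    empty value is rejected here.
    """
    metadata = {
        "format": fmt,
        "type": model_type,
        "licence": licence,
        "src": src,
    }
    empty = [key for key, value in metadata.items() if not value]
    if empty:
        raise ValueError("missing metadata fields: " + ", ".join(empty))
    return json.dumps(metadata)


if __name__ == "__main__":
    print(render_model_metadata(
        "coqui",
        "acoustic",
        "AGPL-3.0",
        "https://itml.cl.indiana.edu/models/",
    ))
```

The printed string can be saved as models/pt.json next to models/pt.tflite.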
You can publish using:
$ python3 publisher.py --merge QmXMp1Dv1Sf7ZHXcH6puqbudBhDNkqngopadzcy8Qikuqt --with-model models/pt.tflite pt QmbWXcHWVdRFh3ZmXEbf4tXTk6nqp8zkaNa4aAxaeQ9VTQ