Dictionaries 🇧🇷

🦊 Resources stored on GitLab (LFS): https://gitlab.com/fb-resources/dicts-br

Phonetic, syllabic and stressed-vowel dictionaries generated by FalaBrasil's tagger tool 🦊. The scripts in this repo, however, use the dockerized version of that tool 🐳.

The word list contains ~2.4 million unfiltered words extracted from 4 portions of the Portuguese version of the OSCAR corpus (v2019). A slim, filtered version with the 250k most frequent words is also available.
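As a rough illustration of how a frequency-ranked subset can be taken from a count list, here is a minimal sketch. It is not the filtering used to build the slim list: the `top_words` helper is hypothetical, and it assumes a gzip-compressed file with one `word count` pair per line, a layout this README does not confirm.

```shell
# Hypothetical helper, not part of this repo.
# Prints the N most frequent words from a gzipped count list.
# ASSUMPTION: each line is "word<whitespace>count".
top_words() {
  local counts="$1" n="${2:-250000}"
  # Sort numerically on the 2nd column, highest count first,
  # keep the first N lines and print only the word column.
  gzip -dc "$counts" | LC_ALL=C sort -k2,2nr | head -n "$n" | awk '{print $1}'
}
```

If that layout holds, a call such as `top_words ./data/count.txt.gz 250000` would print one word per line, most frequent first.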

Usage

$ ./run.sh

The output should be as follows:

run.sh: generating phonetic dictionary (lexicon)
local/g2p.sh ./data/count.txt.gz ./data/log/g2p-count.log ./data/lexicon.count.txt.gz
Time: 29:06.35 (20.37 secs). RAM: 578948 KB
run.sh: generating syllabic dictionary
local/syl.sh ./data/count.txt.gz ./data/log/syl-count.log ./data/syllables.count.txt.gz
Time: 0:24.04 (7.81 secs). RAM: 48784 KB
run.sh: generating vowel stressing dictionary 
local/stress.sh ./data/count.txt.gz ./data/log/stress-count.log ./data/stress.count.txt.gz
Time: 0:20.40 (7.01 secs). RAM: 48808 KB
run.sh: generating phonetic dictionary (lexicon)
local/g2p.sh ./data/vocab.txt.gz ./data/log/g2p-vocab.log ./data/lexicon.vocab.txt.gz
Time: 3:14.06 (1.45 secs). RAM: 71708 KB
run.sh: generating syllabic dictionary
local/syl.sh ./data/vocab.txt.gz ./data/log/syl-vocab.log ./data/syllables.vocab.txt.gz
Time: 0:04.81 (0.83 secs). RAM: 49044 KB
run.sh: generating vowel stressing dictionary 
local/stress.sh ./data/vocab.txt.gz ./data/log/stress-vocab.log ./data/stress.vocab.txt.gz
Time: 0:02.82 (0.60 secs). RAM: 48532 KB
run.sh: success! check out './data' dir
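The generated dictionaries are gzip-compressed, so a quick sanity check after a run is to decompress them to stdout rather than unpack them on disk. The `peek_dict` helper below is a hypothetical convenience, not something shipped with this repo; it makes no assumption about the line format and only reports the line count and the first few lines.

```shell
# Hypothetical helper, not part of this repo.
# Prints the line count and the first N lines of a gzipped dictionary.
peek_dict() {
  local file="$1" n="${2:-5}"
  if [ ! -f "$file" ]; then
    echo "peek_dict: no such file: $file" >&2
    return 1
  fi
  # Check archive integrity before reading it.
  gzip -t "$file" || return 1
  echo "entries: $(gzip -dc "$file" | wc -l | tr -d ' ')"
  gzip -dc "$file" | head -n "$n"
}
```

For example, `peek_dict ./data/lexicon.vocab.txt.gz 3` would show the number of lines in the slim phonetic dictionary followed by its first three lines.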

License

MIT

Citation

TBD (Eurasip 2022?)

FalaBrasil UFPA

Grupo FalaBrasil (2021) - https://ufpafalabrasil.gitlab.io/
Universidade Federal do Pará (UFPA) - https://portal.ufpa.br/
Cassio Batista - https://cassota.gitlab.io/