Skip to content

Latest commit

 

History

History
78 lines (49 loc) · 3.3 KB

README.md

File metadata and controls

78 lines (49 loc) · 3.3 KB

Celosia

Fast and accurate phonemizer1, built in Rust

Celosia (/siːˈloʊʃiə/ see-LOH-shee-ə) is a Rust crate that turns a sentence of natural language into its phoneme transcript automatically. It supports English (amepd), Japanese (romaji), Mandarin (pinyin), French and German (prosodylab), with language-specific data (stress, accent and tones).

🚧 WIP, DO NOT USE 🚧

Overview

This section briefly introduces the phonemization pipeline for each language.

English

  1. Look up words in amepd for spelling and stress.
  2. For words that have multiple spellings, use the POS tag provided by amepd and a Averaged Perceptron Tagger to disambiguate them.
  3. For OOV (out-of-vocabulary) words, predict the spelling with a g2p2 model.

Japanese

  1. Retrieve the full context label from the text with openjtalk-rs.
  2. Parse the context label, retrieve accent phrases and their mora boundary information.
  3. We ignore OOV words for the UTF code doesn't contain any information of spelling.

Mandarin Chinese

  1. Segment text with jieba-rs into words.
  2. Look up words in CC-CEDICT for pinyin and tones.
  3. For words that have multiple spellings, use a CRF model to disambiguate them.
  4. We ignore OOV words for the UTF code doesn't contain any information of spelling.

French & German

Thanks for the orthography, French & German generally don't have the disambiguation (i.e. one word, multiple spelling) problem that is commonly seen in the languages above.

  1. Look up words in prosody-lab's dictionaries.
  2. For OOVs, we predict the spelling with a g2p2 model.

G2P

The G2P model we're using is a seq2seq transformer model, you can find more information in the module.

Development

# test
cargo test
# benchmark
cargo bench
# build
cargo build --release

License

Celosia is dual-licensed under the Apache License, Version 2.0 or the MIT license, at your option. This file may not be copied, modified, or distributed except according to those terms.

About the name

Celosiawikipedia is a small genus of edible and ornamental plants in the amaranth family, Amaranthaceae.

Celosia.JPG
By Hariya1234 - Own work, CC BY-SA 3.0, Link

Credits

The 3-rd party licences can be referred at third_party/README.md.

Footnotes

  1. Phonemize here refers to the procedure of transforming one or more word(s) into a phoneme sequence:
    "hello, world" -> "hh ax l ow1 _ w er1 l d" # Yes
    "world" -> "w er1 l d" # Yes

  2. G2P/g2p here refers to the procedure of transforming one single word into a phoneme sequence:
    "hello, world" -> "hh ax l ow1 _ w er1 l d" # No
    "world" -> "w er1 l d" # Yes 2