Grapheme-to-phoneme (G2P) conversion generates pronunciations (phonemes) for words from their written form (graphemes). It plays an important role in automatic speech recognition systems, natural language processing, and text-to-speech engines. This G2P model implements a transformer architecture in Python using PyTorch and FairSeq. The repo exposes two APIs:
- load_g2p_model: Loads the G2P model from disk.
- decode_word: Outputs phonemes given a word. It optionally exposes phoneme stress information.
This repo works on Python>=3.7.8 and uses poetry to install dependencies. Assuming pyenv and poetry are installed, the repo can be set up as follows:
cd g2p_seq2seq_pytorch/
pyenv virtualenv 3.7.8 g2p
pyenv activate g2p
poetry install
We provide a pretrained 3x3 layer transformer model with 256 hidden units here. The model should be named 20210722.pt. Place the model file in the g2p_seq2seq_pytorch/g2p_seq2seq_pytorch/models/ folder.
from g2p_seq2seq_pytorch.g2p import G2PPytorch
model = G2PPytorch()
model.load_model()
model.decode_word("amsterdam") # "AE M S T ER D AE M"
model.decode_word("amsterdam", with_stress=True) # "AE1 M S T ER0 D AE2 M"
We use CMUDict latest for training and validation; the validation set is ~10% of the total dataset. CMUDict latest has no test split, but it does include phoneme stress information.
For testing, we use the CMUDict PRONALSYL 2007 test set, which does not include stress information.
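As a rough illustration of the train/validation split described above, the sketch below parses dictionary lines and holds out about 10% of the entries. The line format (a word followed by whitespace-separated phonemes, `;;;` comments, and `word(2)` markers for alternate pronunciations) is an assumption about the dictionary file, and `parse_cmudict_line` and `train_valid_split` are hypothetical helpers, not part of this repo.

```python
import random
import re


def parse_cmudict_line(line):
    """Split one dictionary line into (word, phonemes); None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phonemes = line.split()
    # Assumption: alternate pronunciations are marked as "word(2)"; drop the marker.
    word = re.sub(r"\(\d+\)$", "", word).lower()
    return word, phonemes


def train_valid_split(entries, valid_fraction=0.1, seed=0):
    """Shuffle entries reproducibly and hold out a fraction for validation."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_valid = int(len(entries) * valid_fraction)
    return entries[n_valid:], entries[:n_valid]
```

Note that this simple split can place alternate pronunciations of the same word on both sides; splitting by unique word would avoid that.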
- Prepare the training/validation/test data for model ingestion. This step involves tokenization, removing stop words, and binarization of the data.
- Train the model on the binarized data and generate predictions on the test data.
The generated predictions cannot be compared with the test set directly, because the test set has no stress information. We therefore remove the stress markers from the generated output before scoring. This lets the model learn from the stress information during training while its performance is still measured on the stress-free test set.
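A minimal sketch of the stress-removal step, assuming stress is encoded as a trailing digit (0, 1, or 2) on vowel phonemes, as in the `decode_word` example output. `strip_stress` is a hypothetical helper name, not a function from this repo.

```python
def strip_stress(phonemes: str) -> str:
    """Remove trailing stress digits from each space-separated phoneme."""
    return " ".join(p.rstrip("012") for p in phonemes.split())


print(strip_stress("AE1 M S T ER0 D AE2 M"))  # AE M S T ER D AE M
```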
cd scripts/
sh prepare-g2p.sh
sh train-and-generate.sh
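To give a feel for what the tokenization part of data preparation might look like, the sketch below turns (word, phonemes) pairs into aligned source and target lines of space-separated tokens, which is the plain-text shape sequence-to-sequence toolkits typically binarize. This is an assumption about the approach, not a description of what `prepare-g2p.sh` does; both function names are hypothetical.

```python
def tokenize_graphemes(word: str) -> str:
    """Lowercase a word and separate its characters with spaces."""
    return " ".join(word.strip().lower())


def make_parallel_lines(entries):
    """Turn (word, phoneme_list) pairs into aligned source and target lines."""
    source_lines, target_lines = [], []
    for word, phonemes in entries:
        source_lines.append(tokenize_graphemes(word))
        target_lines.append(" ".join(phonemes))
    return source_lines, target_lines


src, tgt = make_parallel_lines([("amsterdam", ["AE1", "M", "S", "T", "ER0", "D", "AE2", "M"])])
print(src[0])  # a m s t e r d a m
print(tgt[0])  # AE1 M S T ER0 D AE2 M
```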
We benchmarked the PyTorch model against the CMUSphinx TensorFlow model with the following metrics:
- Phonetic error rate (PER, %): For each word, calculate the percentage of predicted phonemes that are incorrect when compared to the gold phonemes. Average this across all words.
- Word error rate (WER, %): For each word, compare the entire sequence of predicted phonemes to the gold phonemes. We calculate the percentage of words whose predicted phonemes are not an exact match to the gold phonemes.
- CPU Latency (milliseconds): Time taken to execute the G2P function on a CPU instance.
- GPU Latency (milliseconds): Time taken to execute the G2P function on a GPU instance.
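The two error rates can be sketched as follows. This is one plausible reading of the definitions, assuming phoneme errors are counted with Levenshtein edit distance and normalized by the number of gold phonemes; the benchmark's exact scoring may differ. `edit_distance` and `error_rates` are hypothetical helpers.

```python
def edit_distance(predicted, gold):
    """Levenshtein distance between two phoneme lists."""
    previous = list(range(len(gold) + 1))
    for i, p in enumerate(predicted, start=1):
        current = [i]
        for j, g in enumerate(gold, start=1):
            current.append(min(
                previous[j] + 1,             # drop a predicted phoneme
                current[j - 1] + 1,          # add a missing gold phoneme
                previous[j - 1] + (p != g),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def error_rates(predictions, golds):
    """Return (PER, WER) in percent for parallel lists of phoneme strings.

    Assumes every gold pronunciation contains at least one phoneme.
    """
    per_word, wrong_words = [], 0
    for predicted, gold in zip(predictions, golds):
        p, g = predicted.split(), gold.split()
        per_word.append(100.0 * edit_distance(p, g) / len(g))
        wrong_words += p != g
    return sum(per_word) / len(per_word), 100.0 * wrong_words / len(per_word)


per, wer = error_rates(
    ["AE M S T ER D AE M", "K AE T"],
    ["AE M S T ER D AH M", "K AE T"],
)
print(per, wer)  # 6.25 50.0
```

In the example, one of eight phonemes is wrong in the first word (12.5%) and none in the second, so the averaged PER is 6.25%, while one of two words fails the exact-match check, giving a WER of 50%.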
| Architecture | PER (%) | WER (%) | CPU Latency (ms) | GPU Latency (ms) |
|---|---|---|---|---|
| CMUSphinx | 4.16 | 19.91 | 13.76 | - |
| PyTorch | 5.26 | 23.80 | 10.19 | 5.41 |
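Latency figures like those in the table could be measured along the following lines. This is a generic timing sketch, not the benchmark script used for the numbers above; `mean_latency_ms` is a hypothetical helper, and the warm-up count is an arbitrary choice.

```python
import statistics
import time


def mean_latency_ms(fn, inputs, warmup=3):
    """Average wall-clock time per call of fn over inputs, in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs do not skew the mean.
    for item in inputs[:warmup]:
        fn(item)
    timings = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


# Hypothetical usage with a loaded model:
# mean_latency_ms(model.decode_word, ["amsterdam", "rotterdam", "utrecht"])
```

For GPU timing, CUDA kernels run asynchronously, so a call such as `torch.cuda.synchronize()` is needed before reading the clock; otherwise the measured time can understate the real latency.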
More details on the benchmarking datasets can be found in our blog post.