pngnw_bert

PRETRAINED WEIGHTS LINK


REPO STATUS: WORKS BUT NOT GOING TO BE MAINTAINED


Unofficial PyTorch implementation of PnG BERT with some changes.

Dubbed "Phoneme and Grapheme and Word BERT", this model adds word-level embeddings on both the grapheme and phoneme sides of the model.

It also includes an additional text-to-emoji objective that uses DeepMoji as a teacher model.


I no longer recommend using PnG BERT or this modified version because of the high compute costs.

Since each input is characters + phonemes instead of just wordpieces, the input sequence is around 6x longer than BERT's.

With dot-product attention scaling with the square of the input length, attention is theoretically ~36x more expensive in PnG BERT than in normal BERT.
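A rough back-of-the-envelope check of that claim (the sequence lengths here are illustrative assumptions, not measured values):

```python
# Rough cost comparison, assuming per-layer attention cost grows as n^2
# (n = sequence length). Numbers are illustrative, not measured.
bert_len = 128               # typical wordpiece sequence length (assumed)
pngnw_len = 6 * bert_len     # graphemes + phonemes are roughly 6x longer

relative_attention_cost = (pngnw_len / bert_len) ** 2
print(relative_attention_cost)  # 36.0 -> ~36x more attention compute
```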


pre_training_architecture.png

Here's the modified architecture.

New stuff:

  • Word Values Embeddings
  • Rel Word and Rel Token Position Embeddings
  • Subword Position Embeddings
  • Emoji Teacher Loss

The position embeddings are configurable, and I will likely disable some of them once I find the best training configuration.
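As a rough illustration of what toggling these components might look like (the key names below are hypothetical, not the repo's actual config fields):

```python
# Hypothetical config sketch -- field names are illustrative only,
# not the actual keys used by this repo.
pngnw_bert_config = {
    "use_word_value_embeddings": True,
    "use_rel_word_position_embeddings": True,
    "use_rel_token_position_embeddings": True,
    "use_subword_position_embeddings": True,
    "use_emoji_teacher_loss": True,  # distillation target from DeepMoji
}
```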


Update 19th Feb

I tested a 5%-trained PnGnW BERT checkpoint with a Tacotron2 decoder.

pngnw_bert_tacotron2_alignment.png

Alignment was achieved within 300k samples, about 80% faster than with the original Tacotron2 text encoder [1].

I'll look into adding Flash Attention next since training is taking longer than I'd like.

[1] Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis


Update 3rd March

I've:

  • added Flash Attention
  • trained Tacotron2, prosody prediction, and prosody-to-mel models with PnGnW BERT
  • experimented with different position embeddings (learned vs. sinusoidal); see the sketch below
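
For reference, a minimal sketch of the two position-embedding variants I compared, using the standard formulations (this is not the repo's actual implementation):

```python
import math
import torch
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    """Learned absolute position embedding (BERT-style)."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(max_len, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.emb(positions)

class SinusoidalPositionEmbedding(nn.Module):
    """Fixed sinusoidal position embedding (Transformer-style)."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        return x + self.pe[: x.size(1)]
```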

I found that, in downstream TTS tasks, fine-tuned PnGnW BERT is about on par with fine-tuning normal BERT + DeepMoji + a G2P model, while requiring much more VRAM and compute.

I can't recommend using this repo. The idea sounded really cool, but after experimenting, it seems the only benefit of this method is simplifying the pipeline by using a single model instead of multiple smaller ones. There is no noticeable improvement in quality (which makes me really sad), and it requires ~10x more compute.

It's still possible that this method will help a lot with accented speakers or other more challenging cases, but for normal English speakers it's just not worth it.