pngnw_bert

PRETRAINED WEIGHTS LINK


REPO STATUS: WORKS BUT NOT GOING TO BE MAINTAINED


Unofficial PyTorch implementation of PnG BERT with some changes.

Dubbed "Phoneme and Grapheme and Word BERT", this model adds word-level embeddings on both the grapheme and phoneme sides of the model.

It also includes an additional text-to-emoji objective that uses DeepMoji as a teacher model.


I no longer recommend using PnG BERT or this modified version because of the high compute costs.

Since each input is characters + phonemes instead of just wordpieces, the input sequence is around 6x longer than BERT's.

With dot-product attention scaling with the square of the input length, attention is theoretically ~36x more expensive in PnG BERT than in normal BERT.
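A rough back-of-the-envelope check of that claim (the sequence lengths here are illustrative assumptions, not measured values):

```python
# Rough cost comparison, assuming per-layer attention cost grows as n^2
# (n = sequence length). Numbers are illustrative, not measured.
bert_len = 128               # typical wordpiece sequence length (assumed)
pngnw_len = 6 * bert_len     # graphemes + phonemes are roughly 6x longer

relative_attention_cost = (pngnw_len / bert_len) ** 2
print(relative_attention_cost)  # 36.0 -> ~36x more attention compute
```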


pre_training_architecture.png

Here's the modified architecture.

New stuff:

  • Word Values Embeddings
  • Rel Word and Rel Token Position Embeddings
  • Subword Position Embeddings
  • Emoji Teacher Loss

The position embeddings are configurable, and I will likely disable some of them once I find the best training configuration.
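As a rough illustration of what toggling these components might look like (the key names below are hypothetical, not the repo's actual config fields):

```python
# Hypothetical config sketch -- field names are illustrative only,
# not the actual keys used by this repo.
pngnw_bert_config = {
    "use_word_value_embeddings": True,
    "use_rel_word_position_embeddings": True,
    "use_rel_token_position_embeddings": True,
    "use_subword_position_embeddings": True,
    "use_emoji_teacher_loss": True,  # distillation target from DeepMoji
}
```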


Update 19th Feb

I tested a 5%-trained PnGnW BERT checkpoint with a Tacotron2 decoder.

pngnw_bert_tacotron2_alignment.png

Alignment was achieved within 300k samples, about 80% faster than with the original Tacotron2 text encoder [1].

I'll look into adding Flash Attention next since training is taking longer than I'd like.

[1] Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis


Update 3rd March

I've:

  • added Flash Attention
  • trained Tacotron2, prosody prediction, and prosody-to-mel models with PnGnW BERT
  • experimented with different position embeddings (learned vs. sinusoidal); see the sketch below
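
For reference, a minimal sketch of the two position-embedding variants I compared, using the standard formulations (this is not the repo's actual implementation):

```python
import math
import torch
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    """Learned absolute position embedding (BERT-style)."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(max_len, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.emb(positions)

class SinusoidalPositionEmbedding(nn.Module):
    """Fixed sinusoidal position embedding (Transformer-style)."""
    def __init__(self, max_len: int, dim: int):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        pe = torch.zeros(max_len, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        return x + self.pe[: x.size(1)]
```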

I found that, in downstream TTS tasks, fine-tuned PnGnW BERT is about on par with fine-tuning normal BERT + DeepMoji + a G2P model, while requiring much more VRAM and compute.

I can't recommend using this repo. The idea sounded really cool, but after experimenting, it seems the only benefit of this method is simplifying the pipeline by using a single model instead of multiple smaller ones. There is no noticeable improvement in quality (which makes me really sad), and it requires ~10x more compute.

It's still possible that this method will help a lot with accented speakers or other more challenging cases, but for normal English speakers it's just not worth it.