This library tracks and implements speech language models, with a particular focus on recent advances in speech language modelling using acoustic codes.
The implementations are for research purposes and are intended to closely follow the papers that proposed them.
- VQ-VAE (2017): Proposed vector quantization for acoustic code discovery, using a straight-through estimator to pass gradients through the discrete latent codes. Also showed that the learned acoustic codes are closely related to phoneme categories. A minimal quantizer sketch appears after this list.
- wav2vec: Proposed contrastive prediction of future time steps, distinguishing the true future latent from negative samples (see the contrastive-loss sketch below).
- vq-wav2vec: Applied vector quantization to wav2vec and showed promising results for speech language modelling (via masked language modelling over the discrete codes) and for ASR. They also showed a better quality-to-compression ratio than conventional audio compression algorithms (see the Gumbel-softmax quantizer sketch below).
- wav2vec Discrete and Continuous training: Compared quantized speech codes against continuous wav2vec features, filterbanks, and MFCCs, showing that quantized features perform better than continuous features for ASR (the continuous baselines are sketched below).
- wav2vec2: Similar to vq-wav2vec, but with performance improvements and changes to the training strategy. A Transformer is used as the context network to learn long-term dependencies; I substituted the Transformer with a Conformer in my implementation (see the Conformer sketch below).
- PyTorch Lightning: used for multi-GPU training and inference (see the Trainer sketch below).
- Tabisha: His implementation of the WaveNet vocoder is adapted as the VQ-VAE decoder.
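
A minimal sketch of the VQ-VAE-style straight-through vector quantizer referenced above. The class and parameter names (`VectorQuantizer`, `beta`) are illustrative, not this repo's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codebook lookup with a straight-through gradient (VQ-VAE style)."""

    def __init__(self, num_codes: int, code_dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) encoder outputs
        flat = z_e.reshape(-1, z_e.size(-1))
        distances = torch.cdist(flat, self.codebook.weight)      # (B*T, num_codes)
        indices = distances.argmin(dim=-1).view(z_e.shape[:-1])  # discrete acoustic codes
        z_q = self.codebook(indices)                             # quantized latents

        # Codebook loss moves codes toward encoder outputs; commitment loss
        # keeps encoder outputs close to their assigned codes.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: copy gradients from z_q to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, loss
```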
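A sketch of the wav2vec-style contrastive objective: predict the latent `k` steps ahead from the context network output and score it against negatives drawn from the same utterance. The function name and the uniform negative-sampling scheme are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_future_loss(context, latents, step_proj, k, num_negatives=10):
    """Score the true latent k steps ahead against negatives from the same utterance.

    context: (B, T, D) context-network outputs; latents: (B, T, D) encoder latents.
    """
    B, T, D = latents.shape
    preds = step_proj(context[:, : T - k])                    # predicted z_{t+k}
    targets = latents[:, k:]                                  # true z_{t+k}
    pos_logits = (preds * targets).sum(-1)                    # (B, T-k)

    # Negatives: latents drawn uniformly from other time steps.
    neg_idx = torch.randint(0, T, (B, T - k, num_negatives), device=latents.device)
    negs = latents[torch.arange(B)[:, None, None], neg_idx]   # (B, T-k, N, D)
    neg_logits = (preds.unsqueeze(2) * negs).sum(-1)          # (B, T-k, N)

    # Positives should score high, negatives low.
    return (
        F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
        + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    )

# Usage with random tensors and a per-step projection head:
context, latents = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
loss = contrastive_future_loss(context, latents, nn.Linear(256, 256), k=3)
```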
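For vq-wav2vec, a sketch of the Gumbel-softmax quantizer variant: the forward pass emits a hard one-hot code while gradients flow through the soft sample. Names and the fixed temperature are illustrative; the paper anneals the temperature over training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    """Hard one-hot codes in the forward pass, soft gradients in the backward pass."""

    def __init__(self, input_dim: int, num_codes: int, code_dim: int, tau: float = 2.0):
        super().__init__()
        self.to_logits = nn.Linear(input_dim, num_codes)
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.tau = tau  # temperature (annealed during training in the paper)

    def forward(self, x):
        logits = self.to_logits(x)                                # (B, T, num_codes)
        one_hot = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        z_q = one_hot @ self.codebook.weight                      # (B, T, code_dim)
        return z_q, one_hot.argmax(dim=-1)                        # latents + discrete codes
```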
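The continuous baselines from the discrete-vs-continuous comparison can be extracted with torchaudio; the file path and parameter values below are placeholders:

```python
import torchaudio

wav, sr = torchaudio.load("utterance.wav")  # placeholder path

# 13-dimensional MFCCs: (channels, n_mfcc, frames)
mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=13)(wav)

# 80-bin log-mel filterbanks: (frames, num_mel_bins)
fbank = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80, sample_frequency=sr)
```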
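A sketch of the Conformer substitution for the context network, assuming torchaudio's `torchaudio.models.Conformer`; all dimensions are illustrative, not the values used in this repo:

```python
import torch
from torchaudio.models import Conformer

# Conformer stack standing in for the Transformer context network.
context_net = Conformer(
    input_dim=512,                 # latent dimensionality (assumed)
    num_heads=8,
    ffn_dim=2048,
    num_layers=12,
    depthwise_conv_kernel_size=31,
)

latents = torch.randn(4, 100, 512)          # (batch, frames, dim)
lengths = torch.full((4,), 100)             # valid frames per utterance
context, _ = context_net(latents, lengths)  # (4, 100, 512)
```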
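A minimal PyTorch Lightning setup for multi-GPU training. `ToySpeechLM`, the data, and all hyperparameters are placeholders; only the Trainer flags (`accelerator`, `devices`, `strategy`) illustrate the multi-GPU configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class ToySpeechLM(pl.LightningModule):
    """Placeholder for the real model (encoder + quantizer + context network)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(80, 80)

    def training_step(self, batch, batch_idx):
        loss = F.mse_loss(self.net(batch), batch)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

loader = torch.utils.data.DataLoader(torch.randn(256, 80), batch_size=32)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # adjust to the available hardware
    strategy="ddp",   # distributed data parallel across the GPUs
    max_epochs=1,
)
trainer.fit(ToySpeechLM(), loader)
```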