Training a NN using preprocessed Linear Predictive Coding (LPC) values

Ky-Ng/Vowel-Detection-NN

Vowel Classification Neural Network


Table of Contents

  1. Motivation
  2. High Level Overview
  3. Tangent: Formants and LPCs
    1. Linear Predictive Coding
    2. Sampling Rates
    3. Neural Network Motivation
  4. Technologies Used
    1. Signal Processing
    2. Neural Network Training
  5. Vowel NN Classifier
    1. Dataset Building
    2. Vowel Classification Training
    3. Side Tangent: Vowel Backness
    4. Limitations

Motivation

Feedforward Neural Network for speech recognition of vowels using Linear Predictive Coding (LPC) Coefficients.

For a walkthrough of this repository, check out this video.

Shoutout to my Linguistics professor who wrote this in 1992 and is still teaching this to us at the ripe age of 70+.


High Level Overview

| Task | Components | Scripts Used | Full Writeup |
| --- | --- | --- | --- |
| Vowel NN Classifier | 1. Create LPC coefficients for training and ground truth data<br>2. Train feedforward models on the vowel classification task | Create_LPC_Data_Sets, Vowel_Classification_NN | Link |
| Vowel Backness | Identify vowel backness using LPC data | Generate_LPC_Data | Link |

Tangent: Formants and LPCs

Linear Predictive Coding

  • Linear Predictive Coding, or LPC, is a technique developed at Bell Labs in the late 1960s to quantize and compress speech signals.
  • LPC uses an auto-regressive model: it predicts the ith wave sample from the past N samples and learned constants
wave(i) ≈ c1 * wave(i-1) + c2 * wave(i-2) + c3 * wave(i-3) + ... + cN * wave(i-N)
  • We refer to LPC N as the quantization of the wave using N LPC coefficients
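The recurrence above can be sketched in a few lines of numpy. This is a minimal illustration, not the repository's pipeline: `fit_lpc` is a made-up name that fits the N constants by plain least squares, whereas Matlab's and librosa's `lpc` use the more efficient Levinson-Durbin recursion.

```python
import numpy as np

def fit_lpc(wave, N):
    """Fit N autoregressive constants c1..cN by least squares so that
    wave[i] ≈ c1*wave[i-1] + c2*wave[i-2] + ... + cN*wave[i-N]."""
    # Each row of X holds the N past samples for one predicted sample
    X = np.column_stack([wave[N - k : len(wave) - k] for k in range(1, N + 1)])
    y = wave[N:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

# A decaying sinusoid stands in for one frame of voiced speech
t = np.arange(400)
wave = np.exp(-t / 200.0) * np.sin(2 * np.pi * 0.05 * t)

c = fit_lpc(wave, N=14)  # "LPC 14": 14 coefficients
```

A damped sinusoid exactly satisfies a second-order linear recurrence, so the 14-coefficient fit predicts it almost perfectly; real speech frames leave a residual that the coefficients compress away.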

Sampling Rates

  • In this project, we use LPC 14 since our sample rate is 14 kHz, or 14,000 samples per second
  • In the original Bell Labs research, the researchers used the heuristic of 1 LPC coefficient per 1 kHz of sampling rate
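Under that heuristic, the LPC order is just the sampling rate expressed in kHz; a hypothetical one-line helper makes the arithmetic concrete:

```python
def lpc_order(sample_rate_hz):
    """Heuristic from the Bell Labs work: 1 LPC coefficient per 1 kHz of sampling rate."""
    return sample_rate_hz // 1000

# 14,000 samples/second → LPC 14, matching this project
order = lpc_order(14_000)
```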

Neural Network Motivation

  • Since LPC coefficients are linguistically grounded in vocal tract constrictions (they are related to formants), each vowel exhibits a unique LPC/formant "fingerprint"
  • Thus, our goal is to use a Feedforward Neural Network to classify which vowel is being produced in a speech signal

(TODO: More citations needed for this)


Technologies Used

Signal Processing

  • lpc function for generating LPC Coefficients from waveforms
  • resample for changing the original 44,100 Hz (44.1 kHz) sampling rate to 14,000 Hz (14 kHz)

Libraries:

  • Matlab Signal Processing Toolbox
  • Librosa
    • Equivalent to Matlab's Signal Processing Toolbox; provides the lpc and resample functions
  • SciPy
    • Deserializes Matlab .mat files into Python objects (numpy arrays)
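As a rough sketch of how these pieces fit together on the Python side (the file name, array key, and stand-in audio below are made up for illustration), SciPy can both resample the raw audio and round-trip the .mat files:

```python
import numpy as np
from scipy import signal
from scipy.io import savemat, loadmat

sr_in, sr_out = 44_100, 14_000

# One second of stand-in audio at the original 44.1 kHz rate
rng = np.random.default_rng(0)
wave = rng.standard_normal(sr_in)

# Downsample to 14 kHz so LPC 14 is appropriate
wave_14k = signal.resample(wave, int(len(wave) * sr_out / sr_in))

# Round-trip through a .mat file, as the Python training script would read it
savemat("utterance_demo.mat", {"wave": wave_14k})
loaded = loadmat("utterance_demo.mat")["wave"].ravel()
```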

Neural Network Training

  • PyTorch for Neural Network Architecture and Training

Vowel NN Classifier

  • NN Classifier taking LPC coefficients as inputs and producing a one-hot encoding over 10 vowels as its output
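A minimal PyTorch sketch of such a classifier follows; the layer sizes and class name are illustrative, and the repository's exact architecture lives in Vowel_Classification_NN.

```python
import torch
import torch.nn as nn

class VowelClassifier(nn.Module):
    """14 LPC coefficients in, scores over 10 vowel classes out."""
    def __init__(self, n_lpc=14, n_hidden=5, n_vowels=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_lpc, n_hidden),
            nn.Sigmoid(),
            nn.Linear(n_hidden, n_vowels),
        )

    def forward(self, x):
        # Raw logits; pair with nn.CrossEntropyLoss, which takes class
        # indices rather than explicit one-hot targets
        return self.net(x)

model = VowelClassifier()
logits = model(torch.randn(1, 14))  # one utterance's LPC vector
```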

1) Dataset Building

  1. A Matlab script resamples, truncates, and preprocesses the utterances in order to ensure the LPC coefficients are reflective of the target vowel and not random noise
  2. For more details on the sampling process, read the "Preprocessing Methods" section of the writeup

2) Vowel Classification Training

  1. A Python script takes in the .mat files from the previous step and trains a simple feedforward network on the vowel classification task
  2. The neural networks also have varied hidden layer sizes, where increasing the number of hidden neurons appears to improve learning
    1. For more details on the hidden neurons, read another writeup, "1 vs. 5 Hidden Neurons"
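The training step above can be sketched as a standard PyTorch loop. Everything here is stand-in material: random data in place of the real 218 LPC samples, and guessed hyperparameters rather than the repository's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(218, 14)           # stand-in LPC-14 vectors
y = torch.randint(0, 10, (218,))   # stand-in vowel labels

# Tiny feedforward net: 14 inputs → 5 hidden neurons → 10 vowel classes
model = nn.Sequential(nn.Linear(14, 5), nn.Sigmoid(), nn.Linear(5, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

initial_loss = loss_fn(model(X), y).item()
for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
final_loss = loss_fn(model(X), y).item()
```

Even on random labels the full-batch loop drives the loss down by memorizing the 218 samples, which is why the writeup's comparison of hidden layer sizes matters more than raw training loss.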

Side Tangent: Vowel Backness

  • Rather than taking a Neural Network approach to identifying vowel backness, I computed the effect size (Cohen's d) between the front and back vowels to identify which LPC coefficients could distinguish frontness from backness
  • The script for the effect size calculations can be found here: Generate_LPC_Data.ipynb
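For illustration, here is a Cohen's d helper using the pooled standard deviation; the helper name and the random stand-in data are hypothetical, not taken from Generate_LPC_Data.ipynb.

```python
import numpy as np

def cohens_d(a, b):
    """Effect size between two samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                     / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Stand-in data: 100 front and 100 back vowel tokens, 14 LPC coefficients each
rng = np.random.default_rng(0)
front = rng.normal(0.5, 1.0, size=(100, 14))
back = rng.normal(0.0, 1.0, size=(100, 14))

# One effect size per coefficient: a large |d| means that coefficient
# separates front from back vowels well
d_per_coeff = np.array([cohens_d(front[:, k], back[:, k]) for k in range(14)])
```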

Limitations

The model architecture used for this project is quite simple and more of a proof of concept for more sophisticated speech detection tasks.

Furthermore, only 218 data samples were used. Since there are 10 output vowels, hidden layers with fewer than 4 neurons would also be expected to perform poorly: 3 roughly binary neurons can encode only 2^3 = 8 distinct patterns, fewer than the 10 vowel classes.

