
Revealing The Emotions In Speech With Deep Learning

Human communication extends beyond words; it also encompasses the nuances of speech, such as tone and pitch, which can dramatically alter the meaning of a message. Recognizing the complexity of this form of communication, I have developed a sophisticated deep-learning system designed to capture these subtle variations in human speech. This system is capable of accurately identifying and interpreting the emotions conveyed through spoken language. The model is trained to detect seven distinct emotions: Angry, Happy, Neutral, Sad, Fearful, Disgusted, and Surprised. By analyzing these emotional cues, the system provides a deeper understanding of the underlying sentiment in speech, making it a valuable tool for applications in areas such as customer service, mental health, and human-computer interaction.

Libraries Used

  • TensorFlow
  • Keras
  • NumPy
  • Pandas
  • Matplotlib
  • Seaborn
  • Librosa
  • Scikit-learn
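
A minimal sketch of the imports this stack implies; the exact module usage is an assumption on my part, since the README does not list it:

```python
import numpy as np                      # numerical arrays
import pandas as pd                     # tabular metadata handling
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # statistical visualizations
import librosa                          # audio loading and feature extraction
import tensorflow as tf                 # deep learning backend
from tensorflow import keras            # high-level model API
from sklearn.model_selection import train_test_split  # train/test split
```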

Audio Analysis

[Figure: exploratory visualizations of the audio data]

Target Class Distribution

[Figure: number of samples per emotion class]

Model Details

To effectively capture the emotions embedded within speech, I employed the Mel-Frequency Cepstral Coefficients (MFCCs) technique for feature extraction from the audio data. This process laid the groundwork for the development of a deep learning model, which is structured around Long Short-Term Memory (LSTM) layers and fully connected neural layers. The model architecture comprises two LSTM layers and three dense layers. The LSTM layers, known for their ability to learn and retain long-term dependencies through their unique gating mechanisms, are crucial in processing the temporal aspects of speech data. The fully connected layers then leverage this information to accurately classify the audio into one of the predefined emotion categories.
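
A minimal sketch of MFCC extraction with Librosa; the number of coefficients (40 here) and the exact preprocessing are assumptions, as the README does not state them:

```python
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    """Return an MFCC sequence of shape (frames, n_mfcc) for one audio file.

    n_mfcc=40 is an illustrative choice, not necessarily the repository's.
    """
    signal, sr = librosa.load(path, sr=None)                   # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                              # time-major, as LSTMs expect

# Clips of different lengths would need padding or truncation to a
# common number of frames before batching.
```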

Each layer in the model, except for the final dense layer, utilizes the Rectified Linear Unit (ReLU) activation function, which introduces non-linearity and helps the model learn complex patterns. The final dense layer, however, uses the softmax activation function, transforming the output into probabilities that can be interpreted as the likelihood of the audio belonging to each emotion class. To ensure the model's robustness and prevent overfitting, techniques such as batch normalization and dropout are integrated into the architecture, enhancing its generalization capabilities.
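
A sketch of this architecture in Keras: two LSTM layers, three dense layers, ReLU activations except for the softmax output, with batch normalization and dropout for regularization. The layer widths, dropout rate, and input shape are assumptions, as the README does not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 7  # Angry, Happy, Neutral, Sad, Fearful, Disgusted, Surprised

model = keras.Sequential([
    layers.Input(shape=(None, 40)),            # (frames, MFCC coefficients); illustrative
    layers.LSTM(128, return_sequences=True),   # first LSTM passes the full sequence on
    layers.LSTM(64),                           # second LSTM returns its final state
    layers.BatchNormalization(),               # stabilizes the dense stack
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                       # dropout rate is an assumption
    layers.Dense(32, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),  # per-class probabilities
])
```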

Model Training

The model underwent a comprehensive training process spanning 33 epochs to optimize its performance on the given task. During each epoch, the model iteratively processed the entire training dataset, allowing it to progressively learn and refine its internal parameters for improved accuracy and generalization.

For optimization, the model employed the Stochastic Gradient Descent (SGD) algorithm with a learning rate of 0.001 and a momentum factor of 0.9. SGD is a widely used optimization technique that updates the model's weights incrementally by calculating the gradient of the loss function with respect to the weights for a subset of the data. The chosen learning rate of 0.001 dictates the step size at which the model updates its weights during training; a relatively small learning rate like this ensures stable and gradual convergence towards the minimum of the loss function, reducing the risk of overshooting optimal values.

Incorporating a momentum term of 0.9 further enhances the efficiency of the SGD optimizer. Momentum helps accelerate SGD in relevant directions and dampens oscillations by accumulating a fraction of the previous weight updates. This leads to faster convergence, especially in scenarios where the loss surface has many local minima or is ravine-shaped, by maintaining consistent progress and preventing the optimization process from getting stuck in suboptimal points.
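
In update-rule form (following the formulation Keras uses), the optimizer keeps a velocity $v$ that accumulates a fraction of past updates:

$$v_{t+1} = 0.9\,v_t - 0.001\,\nabla_w L(w_t), \qquad w_{t+1} = w_t + v_{t+1}$$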

The training process utilized the categorical cross-entropy loss function to evaluate and guide the model's learning. Categorical cross-entropy is particularly suitable for multi-class classification problems as it quantifies the difference between the predicted probability distribution and the true distribution of the classes. By penalizing the model more heavily for incorrect or confident yet wrong predictions, this loss function effectively encourages the model to output probability distributions that closely align with the actual distribution of the data. This rigorous penalization mechanism ensures that the model not only learns to make correct predictions but also accurately represents the confidence levels associated with each prediction.
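
Concretely, for a one-hot target $y$ and predicted distribution $\hat{y}$ over the seven classes, the per-example loss is

$$\mathcal{L}(y, \hat{y}) = -\sum_{c=1}^{7} y_c \log \hat{y}_c,$$

which reduces to the negative log-probability the model assigns to the true class, so confident wrong predictions incur a large penalty.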

Overall, the combination of multiple training epochs, the SGD optimizer with carefully chosen learning rate and momentum, and the categorical cross-entropy loss function collectively contributed to a robust and effective training regimen. This setup enabled the model to systematically reduce prediction errors and enhance its ability to generalize from the training data to unseen data, thereby achieving reliable and accurate performance in its designated tasks.
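
Continuing the architecture sketch above, the training configuration would look roughly like this; the optimizer, loss, and epoch count follow the text, while the batch size and validation split are assumptions:

```python
model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# X_train: MFCC feature sequences, y_train: one-hot emotion labels (placeholders)
history = model.fit(
    X_train, y_train,
    epochs=33,              # as described above
    batch_size=32,          # assumption; not stated in the README
    validation_split=0.1,   # assumption
)
```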

Model Evaluation

The model demonstrates strong performance, achieving a training accuracy of 92% with a corresponding loss of 0.22, indicating it has effectively learned the patterns in the training data. Moreover, its test accuracy of 94% and reduced loss of 0.12 suggest that the model generalizes well to unseen data, performing even better on the test set. The lower loss on the test data, combined with the higher accuracy, underscores the model's robustness and ability to make precise predictions without overfitting, making it well-suited for real-world applications.

  • Training Data Accuracy: 92%
  • Training Data Loss: 0.22
  • Test Data Accuracy: 94%
  • Test Data Loss: 0.12
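
A sketch of how such figures would be read off with Keras; `X_train`, `y_train`, `X_test`, and `y_test` are placeholders for the prepared splits:

```python
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"train: acc={train_acc:.2%}, loss={train_loss:.2f}")
print(f"test:  acc={test_acc:.2%}, loss={test_loss:.2f}")
```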

Dataset

The dataset on which the model was trained comes from the University of Toronto's Psychology Department and can be downloaded from Kaggle: https://bit.ly/3yAatHM or from the official website: https://bit.ly/3yzKv7B.

Conclusion

In this project, I have created an LSTM-based deep learning system that is capable of recognizing seven emotions (Angry, Happy, Neutral, Sad, Fearful, Disgusted, and Surprised) in human speech with an impressive accuracy of 94 percent.
