
Welcome to the multi-modal-emotion-prediction wiki!

The code in this repo was developed as a Google Summer of Code project in which we explored the problem of emotion recognition from dialogs. The approach we took was building a multi-modal machine learning model based on both the audio signal and the text of the dialogue transcripts.

Using the IEMOCAP (Interactive Emotional Dyadic Motion Capture) data set, which comprises (among other things) audio recordings and transcripts aligned word by word with the speech, we could split the audio into short functional utterances. This allowed us to use a hierarchical Long Short-Term Memory (LSTM) model that first analyses the audio segment corresponding to a single word, then combines the information extracted from all of the words in the sentence and, based on this collective knowledge, makes a prediction about the emotional load of the whole sentence. We hoped that this approach, inspired by the success of such architectures in rating sentences (reference), could offer currently unavailable insights into how a sequence of given audio features results in a given emotional response.
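To make the hierarchical idea concrete, here is a minimal sketch of such a two-level LSTM written in PyTorch. The framework choice, the class and variable names, and all dimensions (number of audio features per frame, hidden sizes, number of emotion classes) are illustrative assumptions, not taken from the repository; the point is only to show a word-level LSTM summarising each word's audio frames and a sentence-level LSTM aggregating those summaries into one sentence-level prediction.

```python
# Hypothetical sketch of a hierarchical LSTM for sentence-level emotion prediction.
# All names and dimensions below are assumptions for illustration only.
import torch
import torch.nn as nn


class HierarchicalEmotionLSTM(nn.Module):
    def __init__(self, n_audio_features=34, word_hidden=64,
                 sentence_hidden=128, n_emotions=4):
        super().__init__()
        # Word-level LSTM: consumes the audio frames of a single word.
        self.word_lstm = nn.LSTM(n_audio_features, word_hidden, batch_first=True)
        # Sentence-level LSTM: consumes one summary vector per word.
        self.sentence_lstm = nn.LSTM(word_hidden, sentence_hidden, batch_first=True)
        # Classifier over the emotion categories of the whole sentence.
        self.classifier = nn.Linear(sentence_hidden, n_emotions)

    def forward(self, sentence):
        # sentence: list of tensors, one per word,
        # each of shape (n_frames_in_word, n_audio_features).
        word_vectors = []
        for word_frames in sentence:
            _, (h_n, _) = self.word_lstm(word_frames.unsqueeze(0))
            word_vectors.append(h_n[-1])  # last hidden state summarises the word
        word_sequence = torch.stack(word_vectors, dim=1)  # (1, n_words, word_hidden)
        _, (h_n, _) = self.sentence_lstm(word_sequence)
        return self.classifier(h_n[-1])  # emotion logits for the whole sentence


# Example with random data: a 5-word sentence, each word 20-40 audio frames long.
model = HierarchicalEmotionLSTM()
fake_sentence = [torch.randn(torch.randint(20, 40, (1,)).item(), 34) for _ in range(5)]
print(model(fake_sentence).shape)  # torch.Size([1, 4])
```

The nesting mirrors the word-by-word alignment of the transcripts: each word's audio frames are first compressed into a single vector, and only then does the sentence-level recurrence reason over the sequence of words.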

This report presents the data preprocessing steps, the detailed architecture of the model, and preliminary training results.
