Detection of fillers in conversational speech

About fillers

This work is an NLP master's degree project aimed at improving ASR systems, in particular on the specific task of detecting fillers in spontaneous speech.

Fillers are a specific type of disfluency. Disfluencies can be defined as interruptions in the normal flow of speech that make utterances longer without adding semantic information. Fillers, in particular, are vocalizations such as “uhm”, “eh”, “ah”, etc., which fill the time that would otherwise be occupied by a word (generally corresponding to hesitation, uncertainty, or attempts to hold the floor).

Improving Automatic Speech Recognition systems

Automatic Speech Recognition (ASR) systems are generally trained on read speech data, which makes them weak at recognizing disfluencies, since read speech is generally free of them. Training ASR on spontaneous speech data would be the best way to overcome this problem; however, such data is expensive and laborious to obtain. Our approach is to implement an event detection sequence tagger, a type of neural network (NN) that requires less data to train. The NN tags each frame of a sequence with one of the labels silence, speech, or filler. By combining the NN with a classical ASR architecture, a stronger ASR system able to recognize natural spontaneous speech becomes possible.
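The frame-level tagger described above could be sketched as follows. This is a hypothetical illustration, not the project's actual model: the architecture (a small bidirectional LSTM), the feature dimension, and the hidden size are all assumptions.

```python
import torch
import torch.nn as nn

# Illustrative label set from the text: silence, speech, filler.
LABELS = ["silence", "speech", "filler"]

class FrameTagger(nn.Module):
    """Toy sequence tagger: maps each MFCC frame to one of three labels.

    A bidirectional LSTM reads the frame sequence; a linear layer
    projects each hidden state to per-label logits. Sizes are guesses.
    """
    def __init__(self, n_mfcc=13, hidden=64, n_labels=len(LABELS)):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, x):          # x: (batch, frames, n_mfcc)
        h, _ = self.lstm(x)        # h: (batch, frames, 2 * hidden)
        return self.out(h)         # logits: (batch, frames, n_labels)

# One training step on random tensors, just to show the shapes involved.
model = FrameTagger()
mfcc = torch.randn(2, 100, 13)               # 2 utterances, 100 frames each
labels = torch.randint(0, 3, (2, 100))       # one label per frame
logits = model(mfcc)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3), labels.reshape(-1))
loss.backward()
```

The per-frame cross-entropy loss treats tagging as independent classification of each frame, which is the simplest way to train such a tagger; the real project may use a different architecture or objective.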

Approach

Using the CallHome corpus, we performed forced alignment between transcriptions and audio files using the Montreal Forced Aligner. This alignment allowed us to automatically assign _silence_, _speech_, or _filler_ to each frame of the audio files. Finally, we fed the sequences of Mel-Frequency Cepstral Coefficients (MFCC) extracted over the above-mentioned frames to a sequence-tagging neural network.

Notebook

To run the notebook, you will need to install the following Python libraries beforehand: Pydub, soundfile, textgrids, pandas, torch, torchaudio, and sklearn. The following software is also needed: Montreal Forced Aligner (for the forced alignment process) and SoX (for the stereo-to-mono conversion).
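A possible one-line install for the Python dependencies is below. Note that the pip package names are guesses based on the import names: sklearn is published on PyPI as scikit-learn, and the textgrids module as praat-textgrids, so adjust if your environment differs. The Montreal Forced Aligner and SoX must be installed separately.

```shell
# pip names assumed from the import names listed above
pip install pydub soundfile praat-textgrids pandas torch torchaudio scikit-learn
```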
