A speech recognition system that uses feature extraction and dynamic time warping (DTW) to identify words and to find the most similar speaker.
- Word Recognition: Identifies words spoken by different speakers using a weighted voting mechanism across multiple feature types.
- Speaker Similarity Detection: Determines the most similar speaker to a target speaker based on feature comparisons.
- Dynamic Time Warping (DTW): Measures similarity between time-series data with a focus on minimizing cost paths.
- Multiple Feature Types: Supports MFCC, LPC, and Mel Spectrogram features for diverse and accurate audio analysis.
- Confidence and Threshold Handling: Ensures reliable predictions by applying confidence thresholds and cost checks.
- Clone the repository:
git clone https://github.com/mradovic38/dtw-speech-recognition
cd dtw-speech-recognition
- Install dependencies:
pip install -r requirements.txt
- Create a directory of audio files of different speakers saying the same group of words.
- Name your audio files as
word-speaker.wav
(ex.down-mark.wav
). - For details on how to run the project on your data, refer to the
run.py
file. It contains examples and explanations for word recognition and speaker similarity detection.
- Adjust feature weights in
word_recognition.py
to prioritize specific features. - Modify confidence and cost thresholds to suit different datasets.
Extracts audio features such as MFCC, LPC, and Mel spectrograms from input audio files.
Computes similarity between time-series data (input audio vs. database).
Aggregates feature-based predictions using weighted voting across all three feature types for reliable word identification.
Compares speakers by averaging DTW costs across shared vocabulary based on a passed feature extraction alghoritm.
This project is licensed under the MIT License. See the LICENSE file for details.