- Focus: Speech-to-text conversion using the DeepSpeech model
- Qualitative & Quantitative analysis
- Evaluate the performance of DeepSpeech on Switchboard data
- What goes right?
- What goes wrong?
- Can we reproduce research-level metrics?
- Evaluate the results using Word Error Rate (WER)
Analyse the results and identify what contributed to the wrong predictions. Some ideas:
- The length of the sentences and speech.
- The length of the words (correctly vs wrongly predicted).
- The structure of the wrongly predicted words. For example, words that start with W may be predicted wrongly in general; we need to analyse this.
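The ideas above can be sketched as a small grouping helper: given per-word outcomes, it computes the fraction of wrong predictions per group (first letter, word length, or anything else). This is only a sketch; the function name `error_rate_by` and the sample outcomes are made up for illustration and are not results from our runs.

```python
from collections import defaultdict


def error_rate_by(outcomes, key):
    """Group (word, correct) pairs by key(word) and return the error rate per group."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for word, correct in outcomes:
        group = key(word)
        totals[group] += 1
        if not correct:
            wrong[group] += 1
    return {group: wrong[group] / totals[group] for group in totals}


# Hypothetical outcomes: (ground-truth word, was it predicted correctly?)
sample_outcomes = [("what", False), ("we", False), ("were", True), ("the", True)]

by_first_letter = error_rate_by(sample_outcomes, key=lambda w: w[0])
by_word_length = error_rate_by(sample_outcomes, key=len)
```

The same helper works at sentence level if each item is a (sentence, correct) pair and the key is the number of words in the sentence.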
- Cem:
- Deadline for writing the report is Saturday 23:59
- Report
- Explain what DeepSpeech does
- Mention why we removed 2 samples (ground truth is wrong)
- Explain that we don't know exactly which dataset the paper uses (our data has no labels marking easy or hard examples)
- Report
- Add:
- Explanation about paper:
- Dataset and network architecture
- A couple of sentences per section
- Table 3 from paper
- Graph Joris made
- Add "questions answered" part
- Explain RIFF error and how we fixed it
- Explain WER (denominator is total number of words in ground truth)
- Joris:
- Compute seconds of transcription per number of words
- Add hardware specs: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
- Check report on Sunday
Very easy: installing the model and library was straightforward. The only issue was that we had to rewrite the WAV file (but this would not occur on deployment of the app).
# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate
# Install DeepSpeech
pip3 install deepspeech
# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
# Download example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz
# Transcribe an audio file
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
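The WAV rewrite mentioned before the install steps could look roughly like the sketch below. These notes do not record the exact cause of the RIFF error or the fix we applied, so this rests on an assumption: the file lacks a RIFF/WAVE header and its payload is headerless 16-bit mono PCM. The 16 kHz default matches what the 0.9.3 English model expects. The function names are hypothetical.

```python
import wave


def has_riff_header(path):
    """Return True if the file starts with a RIFF/WAVE container header."""
    with open(path, "rb") as f:
        header = f.read(12)
    return header[:4] == b"RIFF" and header[8:12] == b"WAVE"


def wrap_raw_pcm_as_wav(raw_path, wav_path, sample_rate=16000):
    """Write the bytes of raw_path into a new WAV container.

    ASSUMPTION: raw_path holds headerless 16-bit mono PCM samples.
    If the source is in another format, a proper converter is needed instead.
    """
    with open(raw_path, "rb") as f:
        pcm = f.read()
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(1)       # mono
        out.setsampwidth(2)       # 16-bit samples
        out.setframerate(sample_rate)
        out.writeframes(pcm)
```

A typical use would be to call `has_riff_header` on each input file and rewrite only the files that fail the check.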
Yes (add seconds of transcription per number of words).
Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz
Can probably be inferred from hardware and transcription time.
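A minimal sketch of how seconds of transcription per word could be measured. The helper takes any `transcribe` callable, so it does not depend on a particular engine; with the `deepspeech` Python package, the `stt` method of a loaded `Model` could presumably be passed in. The name `seconds_per_word` and the stand-in transcriber are made up for illustration.

```python
import time


def seconds_per_word(transcribe, audio):
    """Time one transcription call.

    Returns (text, elapsed_seconds, seconds_per_word); the last value is
    None when the transcript contains no words.
    """
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    n_words = len(text.split())
    per_word = elapsed / n_words if n_words else None
    return text, elapsed, per_word


# Stand-in transcriber so the sketch runs without a model installed.
def fake_transcribe(audio):
    return "one two three"


text, elapsed, per_word = seconds_per_word(fake_transcribe, audio=None)
```

Timings measured this way only make sense alongside the hardware they were taken on, so they should be reported together with the CPU listed above.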
Qualitatively, can you illustrate performance with well-chosen perfect recognitions and illustrate typical mistakes?
Further analysis? E.g. words/letters that are most often misclassified
Quantitatively, what is the Word Error Rate? How does it compare with state-of-the-art engines on the same dataset?
Metric: word error rate
WER = float(S + D + I) / float(H + S + D)
Based on substitutions, deletions, insertions and hits (explain each term)
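To make the four terms concrete, here is a minimal sketch that aligns the ground truth with the prediction by word-level edit distance and counts each term. A hit (H) is a correctly recognised word, a substitution (S) is a ground-truth word replaced by a different word, a deletion (D) is a ground-truth word missing from the prediction, and an insertion (I) is an extra predicted word. The denominator H + S + D is the total number of words in the ground truth. The function names are illustrative and not taken from a library; ties between equally cheap alignments are broken in favour of hits and substitutions.

```python
def count_edits(reference, hypothesis):
    """Return (hits, substitutions, deletions, insertions) for two word lists."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete every reference word
    for j in range(1, m + 1):
        cost[0][j] = j  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # hit or substitution
                             cost[i - 1][j] + 1,             # deletion
                             cost[i][j - 1] + 1)             # insertion
    # Walk back through the table to count each kind of operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and reference[i - 1] != hypothesis[j - 1])
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            if mismatch:
                subs += 1
            else:
                hits += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def wer(reference_text, hypothesis_text):
    """Word error rate: (S + D + I) / (H + S + D)."""
    h, s, d, i = count_edits(reference_text.split(), hypothesis_text.split())
    if h + s + d == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + i) / (h + s + d)


# "sat" -> "sit" is a substitution, the second "the" is deleted.
print(count_edits("the cat sat on the mat".split(), "the cat sit on mat".split()))  # (4, 1, 1, 0)
```

Because insertions count in the numerator but not in the denominator, WER can exceed 1 when the prediction is much longer than the ground truth.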
Maybe add this for intro? (some statistics about voice recognition)
- Quick explanation of the model (presented by Cem)
- architecture
- ideas behind the model
- what's innovative in this model?
- Presentation of the way we solved the problem (presented by Joris)
- Code architecture (one slide per box)
- RIFF error (explanation + solution)
- Quantitative results (WER, etc.) (Cem)
- Qualitative results (types of words that go right/wrong, etc.) (Joris)