---
title: 'Historical Error Rates in Voice and Image Recognition'
author: "Sami Kallinen \\@sakalli"
date: "August 21, 2016"
output: html_document
---
**MACHINE LEARNING** | **SPEECH RECOGNITION** | **IMAGE RECOGNITION** | **DEEP LEARNING**
### Historical Phoneme Error Rates on the TIMIT corpus
This is a simple exercise to illustrate the advances machine learning has made in speech recognition. The TIMIT corpus was chosen because it was first published in 1988, so there is historical data to compare against. It could be argued that neither PER nor the TIMIT corpus is an optimal way to measure the development of methods and techniques, but they have the clear advantage of giving us material from the end of the eighties.
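For reference, the phoneme error rate is the Levenshtein (edit) distance between the recognized phoneme sequence and the reference transcription, divided by the length of the reference: $PER = (S + D + I) / N$, where $S$, $D$ and $I$ are substitutions, deletions and insertions, and $N$ is the number of reference phonemes. A minimal sketch in R, with made-up phoneme sequences, could look like this:
```{r}
# Minimal PER sketch: edit distance over two phoneme sequences divided by
# the reference length. The phoneme strings below are made up.
per <- function(ref, hyp) {
  n <- length(ref); m <- length(hyp)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- 0:n  # cost of deleting all reference phonemes
  d[1, ] <- 0:m  # cost of inserting all hypothesis phonemes
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (ref[i] == hyp[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # deletion
                             d[i + 1, j] + 1,  # insertion
                             d[i, j] + cost)   # substitution
    }
  }
  d[n + 1, m + 1] / n
}
per(c("sh", "ix", "hh", "eh", "d"), c("sh", "ix", "eh", "d"))  # one deletion: 0.2
```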
```{r echo=FALSE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
# load libraries -----------------------------------------------------
library(googlesheets)
library(dplyr)
library(ggplot2)
library(tidyr)
library(plotly)
library(knitr)
library(lubridate)
setwd("~/R Scripts/timit-per/")
```
The data has been manually collected in an open [spreadsheet](https://docs.google.com/spreadsheets/d/1_AjakTpWBPsHc3DGdV3h8p65QYEY6EE9gqkwiHeOdUw). The phoneme error rates on the TIMIT corpus have been collected from the following three sources: *Lopes & Perdigao 2011*, ["the wer are we" github repo](https://github.com/syhw/wer_are_we) and *Mohamed, Dahl & Hinton, 2011*. Feel free to improve and correct the data.
```{r echo=FALSE, results='hide', message=FALSE, warning=FALSE, error=FALSE}
# load spreadsheet into R --------------------------------------------
ss <- gs_url("https://docs.google.com/spreadsheets/d/1_AjakTpWBPsHc3DGdV3h8p65QYEY6EE9gqkwiHeOdUw/pubhtml", lookup = FALSE)
sheet_timit <- ss %>% gs_read(ws="TIMIT")
sheet_imagenet <- ss %>% gs_read(ws="IMAGENET")
# Splitting the Methods variable into separate
# rows and then factorize it.
speech_recognition <- separate_rows(sheet_timit, `Methods`, sep = ", ")
speech_recognition$Methods <- speech_recognition$Methods %>% as.factor
speech_recognition$Papers <- speech_recognition$Papers %>% as.factor
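# For example (hypothetical values): a row with Methods = "HMM, MLP"
# becomes two rows, one with Methods = "HMM" and one with Methods = "MLP".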
# Finding best PERs historically ------------------------------------
get_best_per <- function(x) {
  x$change <- 0
  min_diff <- -1
  x <- x %>%
    arrange(Year, desc(PER))
  # Iteratively drop rows where PER went up relative to the previous
  # row, until the sequence of PERs is non-increasing over time.
  while (min_diff < 0) {
    x <- x %>%
      filter(change >= 0) %>%
      mutate(change = lag(PER) - PER) %>%
      mutate(change = ifelse(is.na(change), 0, change))
    min_diff <- x$change %>% min
  }
  x <- x %>% group_by(Year) %>% summarize(PER = min(PER))
  x
}
# Finding the historical best results
best_per <- speech_recognition %>%
get_best_per %>%
rename(best_per = PER)
```
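For the step line, an equivalent and arguably simpler formulation (a sketch, not what the chunk above uses) is to take the per-year minimum and then a running minimum with base R's `cummin`; it traces the same frontier, just with a vertex for every year in the data:
```{r eval=FALSE}
# Alternative sketch: per-year minimum followed by a running minimum.
best_per_alt <- speech_recognition %>%
  group_by(Year) %>%
  summarize(PER = min(PER)) %>%
  arrange(Year) %>%
  transmute(Year, best_per = cummin(PER))
```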
Here is a plot of the data; the dotted line shows the best error rate reached by year:
```{r echo=FALSE, warning=FALSE, error=FALSE, message=FALSE}
plot_sr <- ggplot(speech_recognition, aes(Year, PER, color=Methods)) +
geom_point(position = position_jitter(w = 0.3, h = 0.3), aes(shape=Papers)) +
geom_step(data=best_per, aes(Year, best_per), color="gray40", linetype="dotted") +
ylab("Phoneme Error Rate") +
theme(legend.position = "none") +
ggtitle("Phoneme Error Rate development on TIMIT corpus")
ggplotly(plot_sr)
```
You can clearly see the advance that the famous Graves paper brought in 2012.
### Historical Error Rates of Winning Entries in the Imagenet Competition
"The ImageNet project is a large visual database designed for use in visual object recognition software research. /.../ Since 2010, the annual Imagenet Large Scale Visual Recognition Challenge (ILSVCR) is a competition where research teams submit programs that classify and detect objects and scenes." - Wikipedia
Below we plot the error rate development of the winning entries in the ILSVRC competition.
```{r echo=FALSE}
image_recognition <- sheet_imagenet %>% mutate(Researchers = paste0(Authors, ", ", Institution), Error = Error*100)
image_recognition$Method <- image_recognition$Method %>% as.factor
image_recognition$Researchers <- image_recognition$Researchers %>% as.factor
plot_ir <- ggplot(image_recognition, aes(Year, Error)) +
geom_step(color="gray40", linetype="dotted") +
geom_point(aes(color = Method, shape=Researchers)) +
ylim(0,NA) +
ylab("Error Rate") +
theme(legend.position = "none") +
ggtitle("Error Rate Imagenet Image Classificcation Contest")
ggplotly(plot_ir)
```
Here the introduction of deep convolutional neural networks in 2012 produced a massive drop in the error rate, from 25% to 15%, a relative reduction of roughly 40%. The further development of deep learning has brought the rate down considerably in the successive years.
Two tables with the data used above are shown below. This only scratches the surface, and there are many aspects of the data and of this presentation that could be improved. Feel free to contribute. The code can be found in this [Github repo](https://github.com/skallinen/timit-phoneme-error-rates) and the Google spreadsheet with the data [here.](https://docs.google.com/spreadsheets/d/1_AjakTpWBPsHc3DGdV3h8p65QYEY6EE9gqkwiHeOdUw)
The Imagenet result data:
```{r echo=FALSE}
kable(sheet_imagenet, format = "markdown")
```
The TIMIT result data:
```{r echo=FALSE}
kable(sheet_timit, format = "markdown")
```