<!DOCTYPE HTML>
<html lang="en">
<head>
<title>Neural Network Prediction Scores are not Probabilities
</title>
<base target="_blank">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta charset="UTF-8">
<meta property="og:site_name" content="LearningKirvs">
<meta property="og:title" content="Prediction Scores are not Probabilities">
<meta property="og:image" content="https://jtuckerk.github.io/prediction_probabilities/images/correct_ratio.png">
<meta property="og:description" content="When a classification network outputs a softmaxed value of .99, is it right 99% of the time?">
<!-- Font --> <!-- TODO IMAGES -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="@learningkirvs">
<meta name="twitter:creator" content="@tuckerkirven">
<meta name="twitter:title" content="Prediction Scores are not Probabilities">
<meta name="twitter:description" content="When a classification network outputs a softmaxed value of .99, is it right 99% of the time?">
<meta name="twitter:image" content="https://jtuckerk.github.io/prediction_probabilities/images/correct_ratio.png">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet">
<!-- Stylesheets -->
<link href="common-css/bootstrap.css" rel="stylesheet">
<link href="common-css/ionicons.css" rel="stylesheet">
<link href="visualize_nn_predictions/css/styles.css" rel="stylesheet">
<link href="visualize_nn_predictions/css/responsive.css" rel="stylesheet">
<link href="visualize_nn_predictions/css/custom.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-140927467-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); } gtag('js', new Date());
gtag('config', 'UA-140927467-1');
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({"HTML-CSS": { scale: 85}, tex2jax: {inlineMath: [['$','$'],
['\\(','\\)']]}});
</script>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"
type="text/javascript"></script>
</head>
<body>
<header>
<div class="row no-gutters">
<div class="col-12">
<a href="index.html" target="_self" class="logo float-left noline">Learning
Kirvs</a>
<a class="float-right menu">
<div class="float-left" style="position:relative; color: #555555;">Code</div>
<a href="https://colab.research.google.com/drive/1qYEtxLvT5Z7CXBe1aVRnhw2xt32Y8lJM#forceEdit=true&sandboxMode=true" class="float-right colab-icon" style="width: 40px; height: 40px; padding-top: 5px; padding-left: 7px; margin-left:10px;margin-right:5px;">
<img src="images/tensorflow.png" class="img-fluid mx-auto">
</a>
<a href="https://colab.research.google.com/drive/1_xjoPnj6JD75zBIxmu67JQB7NinXOKlS#forceEdit=true&sandboxMode=true" class="float-right colab-icon" style="width: 40px; height: 40px; padding-top: 5px; padding-left: 7px; margin-left:10px">
<img src="images/pytorch.png" class="img-fluid mx-auto">
</a>
</div>
</div>
</div>
<!-- container -->
</header>
<section class="post-area section">
<div class="container">
<div class="row">
<div class="col-lg-22 col-md-22 col-sm-12 col-xs-12 mx-auto">
<div class="main-post">
<div class="blog-post-inner">
<h2 class="title center-text">
Prediction Scores are not Probabilities
</h2>
<br>
<h3>TLDR:</h3>
<p class="para">
In order for a multi-class classifier output to be a
valid <i>probability distribution</i> over the classes, a score's
value would have to indicate how often a label with that score is
the correct label. For example, outputs with a score of $.75$ should
be correct $75\%$ of the time, no more, no less. This isn't usually
the case with neural networks, and it's easy to show. <b>Just adding
a softmax activation does not magically turn outputs into
probabilities</b>. The image below shows the proportion of correct
predictions vs. the output scores for a very simple neural network.
</p>
<img src="prediction_probabilities/images/correct_ratio.png" class="img-fluid mx-auto">
<p class="para">As models and training algorithms get more complex, the outputs typically diverge further from ideal probability estimates.
</p>
<h3>Why do I care?</h3>
<p class="para">
Improving the accuracy of Machine Learning model predictions is the
subject of much study. Estimating and calibrating model uncertainty
is another field of study, but receives much less attention. I hope
to do a full post or series on uncertainty in ML, but for this post
I'll be focusing on neural networks and how their outputs lack
probabilistically reliable information.
</p>
<p class="para">
Improving the accuracy of Machine Learning model predictions is the
subject of much study. Estimating and calibrating model uncertainty
is another field of study, but receives much less attention. I hope
to do a full post or series on uncertainty in ML, but for this post
I'll be focusing on neural networks and how their outputs lack
probabilistically reliable information.
</p>
<p class="para">
Imagine a neural net used for a medical diagnosis that determines if
a surgery is necessary. Such a model should have high accuracy when
recommending that a surgery be performed, but it would be
additionally useful if the model provided a confidence estimate of
its predictions. For example, if it recommended surgery with $75\%$
confidence, that should mean that out of 1000 cases where surgery
was recommended with $75\%$ confidence, around 750 cases would have
actually required surgery. This would allow experts and patients
to determine whether that prediction was worth acting on or whether
more tests or consideration were needed before proceeding.
</p>
<h3>Prove it</h3>
<p class="para">
I've provided code using
both <a href="https://colab.research.google.com/drive/1_xjoPnj6JD75zBIxmu67JQB7NinXOKlS#forceEdit=true&sandboxMode=true">PyTorch</a>
and <a href="https://colab.research.google.com/drive/1qYEtxLvT5Z7CXBe1aVRnhw2xt32Y8lJM#forceEdit=true&sandboxMode=true">TensorFlow</a>
so that you can see for yourself how factors like the amount
of data, training time, and model complexity affect the over- or
under-confidence of a model's predictions. This post and code
illustrate how common neural networks and training
procedures don't inherently result in models that produce
well-calibrated confidence estimates.
</p>
<p class="para">I'll be using the same architecture
as <a href="visualize_nn_predictions.html#Model3"
target="_blank">Model 3</a> (a 2 layer nueral network) from my first post with one change:
Instead of using a single output value, I'll use two outputs with a
Softmax function. This will make the model more similar to models
typically used for
<a href="https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a">multi-class
classification</a> problems.</p>
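<p class="para">As a rough sketch of what that looks like in PyTorch (the
hidden size and activation here are placeholders, not necessarily the exact
values used in the linked notebooks):</p>
<pre><code>import torch
import torch.nn as nn

# Two-layer network for the 2-input XOR data: two output logits,
# one per class (not_xor, is_xor). Hidden size 16 is an arbitrary choice.
model = nn.Sequential(
    nn.Linear(2, 16),
    nn.ReLU(),
    nn.Linear(16, 2),
)

# Softmax turns the two logits into scores that sum to 1.
scores = torch.softmax(model(torch.rand(8, 2)), dim=1)
</code></pre>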
<div class="center-text">
$$ softmax(\textbf{x})_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $$
</div>
<div class="figure-text"><b>Softmax Function</b> This function takes a
sequence $\textbf{x}$ and returns a sequence $\hat{\textbf{x}}$
whose elements are positive, sum to 1, and preserve each element's
rank (position when sorted). </div>
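<p class="para">As a quick numeric check of the formula (plain NumPy, just
for illustration):</p>
<pre><code>import numpy as np

def softmax(x):
    # Subtracting the max is for numerical stability; it doesn't change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])
print(softmax(logits))        # approximately [0.705 0.259 0.035], same ordering as the logits
print(softmax(logits).sum())  # 1.0 (up to floating point error)
</code></pre>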
<p class="para"> It's this Softmax function that seems to be at the
source of the confusion about whether or not the outputs for a NN
constitute valid probabilities. If you search something along the
lines of "how to get a probability from neural network output" in
Google, you'll get things like a medium article with the title "The
Softmax Function, Neural Net Outputs as Probabilities", and
a <a target="_blank"
href="https://stats.stackexchange.com/questions/256420/neural-networks-output-probability-estimates">StackExchange
post</a> asking a similar question where the top 2 answers suggest
using the softmax function. The 3rd response with only 1 upvote
suggests "Softmax of state-of-art deep learning models is more of a
score than probability estimates."</p>
<p class="para"><b>So which is right?</b> Let's quickly define what it
would mean if the softmaxed outputs of a K-class classifier were
probabilities. First, the probability estimates for all classes
should add to $1$. If the sum were greater than $1$ this would imply a
greater than $100\%$ chance of some outcome(s), which is not possible
assuming mutually exclusive classes (a model cannot predict that an
image is both a cat and a dog). The Softmax function covers this
requirement, converting any set of real-valued scores into a set of
positive values that sum to 1. Second, as discussed in the case of
a surgery recommendation system, the probability estimates should
be <i>well calibrated:</i> prediction scores should equal the
proportion of correct predictions, e.g. predictions with $75\%$
confidence should be correct $75\%$ of the time. This is the
requirement that we'll be verifying.</p>
<p class="para"> We'll use the
synthetic <a href="visualize_nn_predictions.html#XOR_Dataset"
target="_blank">XOR dataset</a> from post 1 with $2$ output classes:
$not\_xor$ and $is\_xor$. The benefit of a synthetic dataset is that
we can generate as little or as much train and validation data as we
want, in order to explore the effect that has on the model
outputs. After training, we can count how many predictions were
classified correctly and plot those counts in bins according to
their prediction score.</p>
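<p class="para">One way to do that counting (a NumPy sketch;
<code>is_xor_scores</code> and <code>labels</code> are placeholder names for
the softmaxed $is\_xor$ scores and the $0/1$ ground truth; the linked
notebooks do the equivalent):</p>
<pre><code>import numpy as np

def bin_by_score(is_xor_scores, labels, n_bins=10):
    """Count correct and incorrect predictions per is_xor-score bin and
    return the per-bin correct ratio (NaN where a bin is empty)."""
    predicted = (is_xor_scores >= 0.5).astype(int)   # predicted class per example
    correct = (predicted == labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each score to a bin; clip so a score of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(is_xor_scores, edges) - 1, 0, n_bins - 1)
    correct_counts = np.bincount(bin_ids[correct], minlength=n_bins)
    total_counts = np.bincount(bin_ids, minlength=n_bins)
    with np.errstate(invalid="ignore"):
        ratios = correct_counts / total_counts       # NaN where a bin has no predictions
    return edges, correct_counts, total_counts, ratios
</code></pre>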
<img src="prediction_probabilities/images/correct_incorrect_counts.png" class="img-fluid mx-auto">
<div class="figure-text-center"><b>Figure 1.)</b> Counts by $is\_xor$
prediction score and correctness (validation dataset)</div>
<p class="para"> This model gets around 95% accuracy and we can see that
there are a large number of very high and low predictions that are
all correct. Whereas the predictions with scores near $0.5$ seem
equally as likely to be right or wrong. This is a similar trend to
what we might expect if the prediction scores were valid
probabilities. The next plot shows the ratio of correct predictions
in each bin and makes it a bit easier to make sense of the values in
the middle bins.</p>
<img src="prediction_probabilities/images/correct_ratio.png" class="img-fluid mx-auto">
<div class="figure-text-center"><b>Figure 2.)</b> Ratio of correct
predictions by $is\_xor$ prediction value bin (validation
dataset)</div>
<p class="para"> You might expect the ideal line to be $y=x$, but for
values $<0.5$ this is not the case. The prediction (the class with the
largest value) will always be $>0.5$ since this is a $K=2$ class
classification and the predicted scores always sum to $1$. I've opted to
just run with this symmetry and include the accuracy ratios for both
classes in one plot. So a $not\_xor$ prediction with a score of $0.9$
and an ideal accuracy ratio of $0.9$ is the same as a $is\_xor$ score of
$0.1$, which we can see in corresponds to the ideal ratio of $0.9$.
</p>
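<p class="para">Put another way, with two classes a perfectly calibrated
model would be correct with probability $\max(s, 1-s)$ when its $is\_xor$
score is $s$, which is the V-shaped ideal line in Figure 2. As a one-line
sketch:</p>
<pre><code>import numpy as np

# The predicted class is whichever score exceeds 0.5, so a perfectly
# calibrated model is correct with probability max(s, 1 - s) at is_xor score s.
s = np.linspace(0.0, 1.0, 101)
ideal_correct_ratio = np.maximum(s, 1.0 - s)   # 0.9 at s = 0.1 and at s = 0.9
</code></pre>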
<p class="para"> Figure 2. shows that in many cases the actual
proportion of correct predictions deviates significantly from
the <i>ideal</i> values that would be output if the model
predictions were probabilities. This model underestimates some of
its predictions: the model outputs a low score of around $0.80$, but
gets $100\%$ accuracy for those examples. These plots have been
generated from validation data on $50\times$ more data than the
training set to ensure we get a representative sample. Typically,
high accuracy is the primary focus and holding out so much data to
verify a model's calibration is not prioritized.</p>
<a name="footnote1_loc">
<p class="para">
Try messing around with different settings in the notebook and
you'll likely see that the graph above shows a relatively well
calibrated model<a href="#footnote1_exp" style='color: #229dff;'
target="_self"><sup>1</sup></a> (especially compared to one in
which you try to squeeze out as much accuracy as possible). What
I hope is clear is that if the outputs deviate from valid
probabilities even for such a simple model trained on a large
amount of uniform data, then it is unlikely that more complex
models and datasets would produce valid probability estimates as
outputs.</p>
<div class="animate center-text">
<video class="videoGif embed-responsive img-max" muted preload="metadata" disableRemotePlayback playsinline>
<source src="prediction_probabilities/gifs/correct_ratio30fps.mp4#t=0.1"
type="video/mp4" />
</video>
</div>
<div class="figure-text"><b>Animation 1.)</b> Ratio of correct
predictions by prediction value over time (validation set)<br>
<div class="figure-text-sub"><b>Bottom:</b> Accuracy over
time.</div>
</div>
<p class="para"> As mentioned earlier there are a number of factors that
affect how close prediction scores are to being valid probability
estimates of accuracy, the animation above shows how these estimates
vary over the course of training, but a full analysis of these
effects is beyond the scope of this post. If you're interested in
learning more, <a href="https://arxiv.org/abs/1706.04599" ,
target="_blank"> this paper</a> analyzes common models like ResNet,
digs deeper into the effect of various modeling choices, and
introduces a way to calibrate models to output scores that are
better probability estimates.</p>
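<p class="para">The simplest fix that paper proposes is temperature scaling:
divide the logits by a single temperature $T$ fit on held-out data before
applying the softmax. A minimal PyTorch sketch (<code>val_logits</code> and
<code>val_labels</code> are assumed to be held-out logits and labels; see the
paper for the full recipe):</p>
<pre><code>import torch

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Learn a single temperature T so that softmax(logits / T) is better
    calibrated (temperature scaling, Guo et al. 2017)."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log(T) so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# T = fit_temperature(val_logits, val_labels)
# calibrated_scores = torch.softmax(test_logits / T, dim=1)
</code></pre>
<p class="para">Because dividing by $T$ doesn't change which class has the
largest score, this leaves accuracy untouched and only rescales the
confidences.</p>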
<p class="para">
Feel free to reach out with any questions or comments on the tweet
below and follow
me <a href="https://twitter.com/tuckerkirven">@tuckerkirven</a> to
see announcements about other posts like this.
</p>
<blockquote class="twitter-tweet tw-align-center"><p lang="und" dir="ltr"><a href="https://twitter.com/hashtag/MachineLearning?src=hash&ref_src=twsrc%5Etfw">#MachineLearning</a> <a href="https://twitter.com/hashtag/NeuralNetworks?src=hash&ref_src=twsrc%5Etfw">#NeuralNetworks</a> <a href="https://t.co/UWPfMLQ314">https://t.co/UWPfMLQ314</a></p>— Tucker Kirven 🧢 (@tuckerkirven) <a href="https://twitter.com/tuckerkirven/status/1259725004595318784?ref_src=twsrc%5Etfw">May 11, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h4> Footnotes:</h4>
<p class="footnote"><a name="footnote1_exp" style="color:
#229dff">1. </a> It is often the case that while neural network
scores are not perfectly calibrated probability estimates, there is
a high correlation between scores and accuracy. In other words, lower
scores more often correspond to incorrect predictions and higher
scores to correct predictions. This just can't be relied upon to always
be the case.
<a href="#footnote1_loc" target="_self">Back to footnote source</a></p>
</div>
</div>
<!-- post-info -->
</div>
<!-- main-post -->
</div>
<!-- col-lg-8 col-md-12 -->
</div>
<!-- row -->
</div>
<!-- container -->
</section>
<!-- post-area -->
<!-- SCRIPTS -->
<script src="common-js/jquery-3.1.1.min.js"></script>
<script src="common-js/tether.min.js"></script>
<script src="common-js/bootstrap.js"></script>
<script src="common-js/scripts.js"></script>
<script src="js/gif.js"> </script>
</body>
</html>