-
Notifications
You must be signed in to change notification settings - Fork 0
/
machine_learning_fundamentals_notes_201801.txt
602 lines (529 loc) · 39.7 KB
/
machine_learning_fundamentals_notes_201801.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
INDEX:
1. == WEEK 2 - LINEAR REGRESSION ==
2. == WEEK 3 - LOGISTIC REGRESSION ==
3. == WEEK 4 & 5 - NEURAL NETWORKS ==
4. == WEEK 6 - APPLICATION OF LEARNING ALGORITHMS ==
5. == WEEK 7 - SUPPORT VECTOR MACHINES ==
6. == WEEK 8 - UNSUPERVISED LEARNING ==
7. == WEEK 9 - ANOMAOLY DETECTION ==
8. == WEEK 10 - GRADIENT DESCENT WITH LARGE DATASETS ==
9. == WEEK 11 - PHOTO OPTICAL CHARACTER RECOGNITION (OCR) ==
X. == ASSIGNMENT SOLUTION EXAMPLES ==
== WEEK 2 - LINEAR REGRESSION ==
Feature scaling:
> feature scaling (using range of values) & mean normalisation (or standard deviation), ie: (x1 - u1) / s1
> this assists with convergence speed
Learning rate:
> begin with low rate (ie: 10e-3), plot the change in cost function with each iteration and watch for convergence
> increase the learning rate with a 3 fold increase
Features & polynomial regression:
> We can improve our features and the form of our hypothesis function in a couple different ways.
> We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.
> Our hypothesis function need not be linear (a straight line) if that does not fit the data well. We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
eg: hθ(x)=θ0+θ1x1+θ2x2+θ3x3+θ4√(x4)
Using "Normal Equation" to solve to minimise cost function for linear regression:
> theta = pinv(X'.X).X'.Y
> the normal equation works by minimizing the cost function (J) by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero.
> in this case, feature scaling / normalisation is not required, as we are solving for the global minima
Minimising cost function for linear regression:
> if n (# features) is large (ie: >=10,000), then use gradient descent
> if n (# features) is small (ie: <=10,000), then use "normal equation"
Gradient descent algorithm:
> for all features, j:
θ(j):= θ(j) - alpha 1/m . sum{i=1:m}(hθ(x(sup i) - y(sup i)) . x(sub j)(sup i)
> where j = feature (each feature iterated), and i = training example (each example iterated)
> basically, all θ (feature parameters) are simultaneously updated at learning rate of alpha relative to changes in the point derivative change in the cost function
Vectorization:
> allows for improved efficiency in solving problems
> eg: Yhat = matrix X * vector θ
== WEEK 3 - LOGISTIC REGRESSION ==
Classification:
> classification is required when predicting a small number of discrete values
> linear regression can produce values << 0 and >> 1, even if training examples are such that y∈{0,1}
> outliers can drastically change linear regression classification and thus significantly impair prediction accuracy
> logistic regression models can be set up to produce a classification hypothesis such that 0 <= hθ(x) <= 1
> sigmoid function == logistic function = hθ(x) = g(z) = 1 / (1 + e^-z) = 1 / (1 + e^-θ'X)
> hθ(x) is the estimated probability that y = 1 on input x
Decision Boundary:
> hθ(x):= g(z):= g(θ'x) >= 0.5 where θ'x >= 0 (ie θ'.0 = 1 / (1 + e^-θ'x) = 1 / (1 + e^-0) = 1 / (1 + 1) = 1 / 2
> therefore, we can predict that y = 1 where hθ(x) >= 0.5 which is when θ'x >= 0.
> the decision boundary reflects the line dividing parameters in the feature space between the regions where the hypothesis predicts y = 1 from the region that predict where y = 0
> basically, the way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5
> non-linear decision boundaries are possible by using higher-order polynomials (ie: θ0 + θ1x2 + θ2x2 + θx1^2x2^2 ...)
Cost Function:
> we define the logistic regression cost function without a square, because this creates a non-convex cost function, which typically has multiple local optima (in contrast to a convex cost function which has a global minimum)
> writing the cost function in the following way ensures that it is convex for logistic regression
> J(θ):= 1/m ∑(i:m) Cost(hθ(x(i)),y(i))
> Cost(hθ(x),y):= -log(hθ(x)) where y = 1, (so, as hθ(x) --> 0, J(θ) --> infinity) AND
> Cost(hθ(x),y):= -log(1 - hθ(x)) where y = 0, (so, as hθ(x) --> 1, J(θ) ---> infinity)
> therefore, Cost(hθ(x),y):= -y.log(hθ(x)) - (1 - y).log(1 - hθ(x))
> therefore, J(θ) = -1/m . [∑(i:m) -y.log(hθ(x(i))) + (1 - y(i)).log(1 - hθ(x(i)))]
> Vectorised:
h=g(Xθ)
J(θ)=1/m⋅(−y'log(h)−(1−y)'log(1−h))
Gradient descent algorithm:
> simultaneously update, for all features, n, is exactly the same as the linear regression update rule:
θ(n):= θ(n) - alpha 1/m . ∑(i:m) (hθ(x(i) - y(i)) . x(n)(i))
> where n = feature (each feature iterated), and i = training example (each example iterated), and alpha is the learning rate
> Vectorised:
θ:=θ−α/m.X'(g(Xθ)−y(hat))
Feature scaling:
> Important in logistic regression also!
Advanced Descent Algorithms:
> some incomprehensible stuff (Conjugate gradient, BFGS, L-BFGS)
> use "fminunc" function in Octave
Multiclass Classification:
> a single class (one of k classes) is separated out from the others and assigned a positive class (y=1), while everything else is bundled as a negative class (y=0)
> thus, we end up with k logistic regression classifiers, predicting a single class relative to all others bundled
> we run all k binary logistic regression classifiers, and pick the classifier that maximises hθ(x)
Overfitting:
> Underfitting is known as having bias
> Overfitting is known as having variance
> We want models to "generalise" to new test data
> If there are too many features and very few training examples then overfitting can be a problem
Regularisation - Linear Regression:
> To avoid overfitting, we can apply a "regularisation parameter", λ (lambda), to the cost function, λ.∑(j:n)θj^2 - the λ determines how much our theta params are inflated in the cost function
> too large a λ will result in too high a pentalty, and err towards underfitting, while too small a λ will err towards overfitting (where there are many features used relative to the training set size)
>> this adds a penalty to the cost function for each θ, thus, the cost function tries to minimise / reduce the size of θ, while also trying to reduce the difference between predicted outcomes and actuals
> thus, J(θ) = 1/2m . ∑(i:m)(hθ(x(i)) - y(i))^2 . λ.∑(j:n)θj^2
> NOTE: we do not penalise θ0 (for the constant term) - we only penalise θ's for the features
Regularisation - Linear Regression - Gradient Descent:
> Minimise cost function by also subtracting the derivative of the regularisation equation,
> Thus: θ(j):= θ(j) - alpha 1/m . [ ∑(i:m)(hθ(x(i)) - y(i)).x(j)(i) + λ/m.θj ]
> which is also written as: θ(j):= θ(j)(1 - alpha.λ/m) - alpha.1/m . ∑(i:m)(hθ(x(i)) - y(i)).x(j)(i)
> Determine simultaneously for all features, j
> Remember that the regularisation component only applies to θ(1:j), and not θ0 (for the constant)
Regularisation - Logistic Regression:
> We add in the regularisation term, λ/2m.∑(j:n)θ(j)^2
> thus, cost functon:
J(θ) = -1/m . [ ∑(i:m) y(i).log(hθ(x(i))) + (1-y(i)).log(1-hθ(x(i))) ] + λ/2m.∑(j=1:n)θ(j)^2
Regularisation - Logistic Regression - Gradient Descent:
> The formula looks the same to regularisation of linear regression with gradient descent, above, however, noting that logistic regression takes a differently defined hθ(x):= 1 / (1 + e^-θ'x)
== WEEK 4 & 5 - NEURAL NETWORKS ==
* Logistic regression is good for discrete hypotheses (creating a definition boundary for straight lines and ellipses), but not for complex multi-feature non-linear problems
> eg: to solve complex non-linear boundaries, quadratic feature maps may be required, such as 50*50 pixel image = 2500 featurs = 3 million features, or 50 million features for a 100 * 100 pixel image)
> thus, we need something other than logistic regression
Model Representation:
* a neuron receives a bunch of inputs (dendrite pulses) from various feature units, and a bias node or unit (x0)
> the bias unit, x0, takes a value of 1, however it can still have a differing weight (theta(0))
* the neuron will output a signal (axonal output) according to its activation function (ie: sigmoid / logistic)
* the feature thetas (or, parameters) are typically referred to as "weights" in neural networks
* input units --> hidden layer(s) --> output layer
* the matrix of weights controlling function mapping from layer (j) to (j+1) will take the dimensional form of:
[# units in layer (j+1) * # inputs layer (j) + 1] --> the + 1 is for the bias weight,
ie: layer (j+1) rows for each unit, and layer (j) + 1 columns for input feature units including a bias
> a(i)(j+1) = g(z(i)(j)) = g(θ(i0)(j)x(0) + θ(i1)(j)x(1) + .... + θ(in)(j)x(n)
> activation of unit i in layer j is a function of theta (layer j, feature n)(layer j) * x(feature n)
* the hidden layers of nodes are sometimes called "activation units"
Forward Propagation:
> activation of inputs layer to hidden layers(s) to output layer
> a(i)(j) = g(x) = g(z(i)(j)) --> node (unit) i, layer j
> z(j+1) = θ(j).x(j) = θ(j).a(j)
> a(j+1) = g(z(j+1))
> vector representation: a(j) = x(j) = [x0 ; x1 ; ... ; xn] ; z(j) = [z(1)(j) + z(2)(j) + ... + z(n)(j)]
> z(j+1) = θ(j).a(j)
> θ(j) is a matrix with dimensions s(j) * n + 1, where s(j) = # activation nodes in layer j, and n = # features + 1 for bias node
==> consider the # activation nodes to be like the # of training examples, m
> a(j) is a vector with height n + 1
Cost function & Back Propagation Algorithm:
> L = number of layers; sl = number of units in layer l (not including bias unit / node); K = number of units in output layer
> l = current layer; j = node j in layer l; i = current training example (observation); m = number of training examples
> ie: a(sub j)(sup l) = activation of jth unit in layer l
> binary classification = 1 output unit (K=1); multi-class classification, where K >= 3 classes, (ie: vector output [0;0;1;0]
> The cost function is take across all output nodes, K
> Backpropagation algorithm
* is essentially the "gradient descent"
* delta, ð, reflects the error term of node j in layer l
* For output layer: ð(j)(l) = a(j)(l) = h(θ(k))(j) - y(j), so vector notation: a(L) - y
* For output layer: ie: the error term, delta, is the predicted value of unit a in layer l, minus the actual value of unit a in layer l
* For hidden layers:
** vector notn: ð(j) = θ(j)'*ð(j+1) .* g(prime)(z(j)), where g(prime)(z(3)) = a(3) .* (1 - a(3))
* For input layer: no calculation, as these are actuals (observed)
* Backpropagation refers to the fact that we calculate the error term for the output layer, then iterate backwards through the hidden layers
* ∆ (ie: capital delta) = ∆(ij)(l) = ∆(ij)(l) + a(l)ð(l+1)
* vector notn: ∆(l) = ∆(l) + ð(l+1)*a(l)'
> Calculation of the partial derivatives of J(theta) WRT theta(ij)(l) is: [observation i, node j, layer l]
* where j=0 (bias node): D(ij)(l) = 1/m ∆(ij)(l)
* where j<>0 (feature node): D(ij)(l) = 1/m ∆(ij)(l) + λθ(ij)(l)
> We iterate over the training examples, running ForwardProp for each example then BackProp for each example
BackProp in Practice:
> before a crash, I wrote a whole lot of stuff on unrolling matrices to a row vector ie:theta_vec = [theta_1(:);...;theta_L-1(:)]
* Implementation notes:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
> and about gradient checking, using epsilon as an offset to J(theta), two-sided, to estimate the gradient to make sure that BackProp is doing the correct thing
* Implementation notes:
epsilon = 1e-4;
for i = 1:n,
thetaPlus = theta;
thetaPlus(i) += epsilon;
thetaMinus = theta;
thetaMinus(i) -= epsilon;
gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;
> If theta is initialised to zero (or any value which is the same for all elements of theta), thetas from each feature (x, e 1:n) feeding into activation units of hidden layer will take on the same values following BackProp (gradient descent), and the activation units of the hidden layer will also take the same value!
* Thus, thetas are randomly initialised, eg:
If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Network Architecture:
> # input nodes = # features (dimensions)
> # output nodes = # classes ==> in a multi-class output, this is represented as a vector, eg: s(L) = 5, if y=3, [0;0;1;0;0]
> hidden layers: usually 1 layer is ok, and if >1, then ensure nodes in layer l == nodes in layer l+1
Calculate ForwardProp and BackProp for each training example, like so:
for i = 1:m,
Perform forward propagation and backpropagation using example (x(i),y(i))
(Get activations a(l) and delta terms d(l) for l = 2,...,L
== WEEK 6 - APPLICATION OF LEARNING ALGORITHMS ==
Model Selection:
> Generate hypotheses θ's for features across different models (ie: different order polynomials) on Training set (60%), choose the model with the lowest cost function on the Cross Validation set (20%), then estimate generalisation error of model on Test set (20%)
> Remember: Select model thetas using Training set, validate and select lowest cost function model on Cross Validation set, and finally, report on model generalisation / success using the Test set
Bias vs Variance:
> high bias (underfitting), high variance (overfitting)
> high bias: Training error and Cross Validation errors are high
> high variance: Training error is low and Cross Validation error is high (ie: J(theta,CV) >> J(theta,Training)
> Regularisation (lambda) value selection can be iterated across models, and cost function outcomes can be plotted across these models for Cross Validation and Training sets
* Low lambda will result in low penalties for thetas and thus more likely overfit the Training set, whereas large lambda will highly penalise thetas and likely result in an underfit (ie: just provide a bias term)
Learning Curves (Diagnostic):
> "Learning Curves" - plotting the cost functions for Training set and Cross Validation sets
> Helps detect if model is underfitted (bias) or overfitted (variance)
> Plot error by m (# of training examples used)
> Error for Training set:
* Is low for a small number of m (training examples)
* Increases for a higher number of m
> Error for Cross Validation set:
* Is high for small number of m (training examples) because the solution (thetas) does not generalise very well
* Decreases for a higher number of m
> In cases of high bias (underfit models), as m increases, Training set and Cross Validation set errors converge very early on, and are very similar (J(train) ~ J(CV))
* ie: a high bias model cannot be helped by increasing the # of training examples (m)
> In cases of high variance (overfit models), as m increases, Training set and Cross Validation set errors increasingly converge, and there is a large gap between them (J(train) << J(CV))
* ie: a high variation model can be helped by increasing the # of training examples (m)
> High Variance (OverFitting) can be fixed by:
* Increasing training examples (m)
* Using a smaller set of features
* Increasing lambda (regularisation parameter penalty)
> Low Variance (Underfitting) can be fixed by:
* Using additional features
* Adding polynomial features
* Decreasing lambda (regularisation parameter penalty)
Model Complexity Effects:
> Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
> Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
> In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.
Building a Spam Classifier:
How could you spend your time to improve the accuracy of this classifier?
> Collect lots of data (for example "honeypot" project but doesn't always work)
> Develop sophisticated features (for example: using email header data in spam emails)
> Develop algorithms to process your input in different ways (recognizing misspellings in spam).
It is difficult to tell which of the options will be most helpful.
Error Analysis:
The recommended approach to solving machine learning problems is to:
> Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
> Plot learning curves to decide if more data, more features, etc. are likely to help.
> Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
* Use a numerical evaluation such as % error rate on the Cross Validation dataset to determine accuracy change across model choices
Skewed Classes:
> Have more examples in one class than another class (ie: rare diseases)
> Use a confusion matrix to determine accuracy:
* Precision: of predicted cases, what % were true-positive = True Positive / # Predicted Positive = True Positive / (True Positive + False Positive)
* Recall: of actual cases, what % did we accurately predict = True Positive / # Actual Positive = True Positive / (True Positive + False Negative)
> y = 1 when predicting rare cases
> If we increase the threshold for predicting rare (cancer?) cases to say >0.7 for logistic regression, we will increase precision but have a lower recall ==> ie: we might avoid the trauma of a false report of cancer
> If we lower the threshold for predicting rare (cancer?) cases to say >0.3 for logistic regression, we will increase recall (fewer false negatives) but have a lower precision ==> ie: we might send off more people for testing than we need to
F1 Score:
> F1 Score = 2 * (P*R) / (P + R), where P = Precision & R = Recall
> F1 Score element of {0,....,1}
Large data rationale:
> Use a large number of features, so likely to create more complex prediction, creating a low bias algorithm with low cost function
> use a large number of training examples, so unlikely to overfit, creating a low variance algorithm with cost function on training similar to cost function on testing
== WEEK 7 - SUPPORT VECTOR MACHINES ==
SVM:
* An alternative take on logistic regression cost minimization formulua, such that:
** min (θ) C ∑(i=1:m) [ y(i) * cost1(θ'x(i)) + (1-y(i)) * cost0(θ'x(i)) ] + 1/2.∑(j=1:n).θ(j)^2 , where C = 1/λ
> the hypothesis, hθ(x) of SVMs does not output a probability, rather, a scalar 1 or 0
* SVM hθ(x) = 1 if θ'x >> 0
* SVM hθ(x) = 0 otherwise, ie: if θ'x << 0
> SVM has a built in buffer-zone, and rather than centering around 0, it will predict:
* if y=1, θ'x >= 1
* if y=0, θ'x <= -1
> The buffer acts as a "large margin" classifier, which means that the predicted line between classes has the largest margin between classes
> When C (1/λ) is not very large, the SVM does a better job of ignoring outliers and finding a line of best prediction (ie: will underfit / apply more regularisation)
> SVMs are convex functions, so always have a *global* minima
Kernels:
> For predicting a non-linear boundary, adding higher order polynomial features can help predict y =1, such that θ0 + θ1x1 + ... θ5x^2.. >= 0
> We can create features, f, which take the similarity function or kernel (Gaussian kernel),
ie: f1 = similarity(x,l(1)) = exp(-||x-l(1)||^2 / (2*ð)^2)
where ð = sigma; l = landmark, a proximity landmark; x = training example; ||foo|| is the distance between training example and the landmark
* if x ~ l(1), then f1 ~ e^0 ~ 1
* if x very farm from l(1), then f1 ~ e^-(large number) ~ 0
> Each landmark can be used to define a different feature
> Changes in sigma:
* Increase in sigma - reduces the negative exponent value, so a difference between x and l results in a more gradual drop of f to zero
* Decrease in sigma - increases the negative exponent value, so a difference between x and l results in a faster drop of f to zero
* Note that changes in sigma does not affect the max value that f obtains
> So, hypothesis function becomes:
* hθ(x) = θ0 + θ1f1 + θ2f2 + θ3f3 ... θnfn, for functions describing similarity b/w x and l
** ie: when x is close to a landmark (l), θ will be close to 1, and when it's further away, θ will be closer to 0
> Landmarks:
* Using landmarks and similarity functions gives the ability to predict very complex non-linear decision boundaries, and without creating complex polynomial features
* Landmarks are created based off values for x, so m examples becomes m landmarks, with a similarity function for each landmark, creating a vector of features f1--fm, where f0 = 1 for the bias/intercept term
* For each example, x(i)y(i), each features f1(i)-->fm(i) are calculated as the similarity relative to all other training examples, sim(x(i),l(1)) --> sim(x(i),l(m))
* This will become a feature vector, f(i):= [f0(i); f1(i); ...; fm(i)]
* Predict y = 1 if hθ(x):= θ'f >= 0,
** min (θ) C ∑(i=1:m) [ y(i) * cost1(θ'f(i)) + (1-y(i)) * cost0(θ'f(i)) ] + 1/2.∑(j=1:n).θ(j)^2 , where C = 1/λ
** we do not regularise on θ0
** Large C: lower bias, higher variance [less regularisation] --> overfit
** Small C: higher bias, lower variance [more regularisation] --> underfit
** Large Sigma squared: features f(i) vary more smoothly, higher bias, lower variance --> underfit
** Small Sigma squared: features f(i) vary more rapidly, lower bias, higher variance --> overfit
> Modelling decisions:
1. Choice of C and sigma squared
2. Choice of kernel
> Linear kernel (ie: dont create features, just use x) might be chosen where the # features, n, is very high, and number of training examples, m, is very low; the linear kernel returns a straight-line decision boundary
> Gaussian kernel: NOTE: remember to perform feature scaling before using a Gaussian kernel
> Polynomial kernel: includes a constant and choice on the polynomial term (exponent degree)
> Other kernels: string kernel, chi square kernel, histogram intersection kernel...
SVM vs Logistic Regression:
> If n (features) is high and m (training examples is low), use logistic regression or SVM with a linear kernel; ie: n=10k, m=1k
> If n is small and m is intermediate (ie: n=1k, m=10k), use SVM with Gaussian kernel
> If n is small (~1k) and m is large (50k+), Gaussian kernels struggle with this, so
* create additional features, and
* use logistic regression or SVM without a kernel (ie: linear kernel)
> NOTE: A neural network will work well in all conditions, but might talke longer to train
== WEEK 8 - UNSUPERVISED LEARNING ==
Clustering:
> K-means clustering is the most popular clustering algorithm
* Initialises with K cluster centroids, where K = number of desired clusters
** observations are assigned to the closest cluster; these values are then averaged and the centroids are assigned this value
** observations are again assigned to the closest cluster, and this step repeats, until convergence
* input K clusters with training set x(1)...x(m), noting there are no 'y' values
* process:
** randomly init K cluster centroids, μ1-μk
** for i:m (# training examples), c(i):=index (from 1 to K) of cluster centroid closest to x(i), and to which x(i) is assigned
** for k = 1:K, μk:= average of points assigned to cluster k
** μc(i) = average of points assigned to the cluster to which x(i) has been assigned
>> ie: if x(i) is assigned cluster 5, c(i) = 5, so μc(i) = μ5
* If a cluster ends up with zero examples assigned, it can either be removed / eliminated or randomly re-initialised
> Optimisation Obective:
* Minimise cost function J, finding c and μ to minimise the following:
min (c(1)...c(m), μ1...μK), J(c(1)...c(m),μ1...μK) = 1/m . ∑(i=1:m)||x(i) - μc(i)||^2, which is the norm of the difference between x(i) and uc(i)
** This cost function is also called the Distortion of the K-Means cost function
** The cost function should always converge to a minimum, and never increase (if it does, this means where may be a bug in the code)
> Random initialisation of cluster centroids
* Set K to < m (training examples)
* Randomly pick K training examples, and set cluster centroids μ1 ... μK to the value of these training examples
* Based on initialisation, there may be a local-optima solution, so to address this we can:
** Initialise K-means multiple times, ie: 100 iterations to generate cost function (distortion function, J) minima
** Pick cluster with lowest cost (distortion)
** If K is low (<10), this will make a big impact
** If K is large, multiple iterations might not be as effective and the initial solution will be OK
> Choosing K - the number of clusters
* Often this is manual - think about whether the solution is more useful in the real-world application
* The "elbow method" is to plot the Cost function J by # Clusters and where the drop-off in J becomes low, chose that # clusters
Dimensionality Reduction / Principal Component Analysis:
* Benefits:
** Reduces memory requirements for data storage (k such that >=99% of variance retained)
** Learning algorithms can run more rapidly (k such that >=99% of variance retained)
** Assist with visualisation (2 or 3 features from 50+)
* NOTE: Before performing PCA, undertake mean normalisation and feature scaling!
** ie: ( x(j) - μ(j) )/ s(j)
* PCA finds a directional vector, u(1), onto which to project features which minimises the projection error, reducing dimensions from 2 to 1
* Unlike linear-regression which minimises the distance between a line-of-best-fit and Y, PCA minimises the distance between both features values and the line, which is minimisation of the orthogonal distance
* In reducing the dimension from n to k, compute the "covariance matrix", Sigma matrix, and the eigenvectors or "Single Value Decomposition" of matrix Sigma (Sigma is an n x n matrix); take the first 7 columns of the U matrix
* PCA Algorithm:
** Mean normalise and feature scale
** Sigma = 1/m . ∑(i=1:m) x(i).x(i)' = (1/m) * X' * X; --> Covariance matrix
** [U,S,V] = svd(Sigma);
** Ureduce = U(:, 1:k); % n * k matrix
** z = Ureduce' * X; % k * 1 vector (ie: k * n . k * 1 = k * 1 vector)
> Reconstruction of X from compressed PCA representation can be approximated by:
** Xapprox = Ureduce * z; % n * 1 vector
> Choosing the Number of Principal Components
* PCA tries to minimise the average squared projection error,
Choose k to minimse projection error [eq 1]: 1/m . ∑(i=1:m) || X(i) - Xapprox(i) || ^2
Given variance [eq 2]: 1/m . ∑(i=1:m) || X(i) ||^2
** Choose k to be the smallest value so that [eq 1] / [eq 2] <= 0.1%, or, "99% of variance is retained" in the projection (which means we can use these projections to still effectively explain variation in the target)
** To solve this, call [U,S,V] = svd(Sigma);
** Then, calculate smallest k for which: ∑(i=1:k)Sii / ∑(i=1:n)Sii >= 0.99 (ie: 99% of variance retained)
* When performing PCA, create the mapping based on the training set only, then apply these same maps to the cross validation and test sets
* Before performing PCA, build the model with x(i), and see how it works. If it doesnt work well (ie: very slow, memory problem), then implement PCA and use z(i) ie: the compressed representation
* Note: do not use PCA to prevent overfitting - use regularisation instead.
== WEEK 9 - ANOMAOLY DETECTION ==
Density Estimation:
> for training set {x(1)...x(m)}, where x is an element of real numbers to need
* p(x) := π(j=1:n) p(x(j); mu(j); sigma(j)^2), where π reflects the product of the probabilities
* p(x(j)) is defined by the Gaussian probability distribution
* Declare an anomaly where p(x) < ε (epsilon) (value set based on experience?)
> model: p(x) = p(x(1); mu(1); sigma(1)^2) * ... * p(x(n); mu(n); sigma(n)^2)
Building Anomaly Detection System:
> Train: Build model, p(x), with "good" targets only, as the training data (60% split)
> Cross Validate: 20% of training set, using 50% of the "bad/anomalous" targets
* Predict y=1 (anomaly) if p (x) < ε (epsilon)
* Evaluate with true-+ve, false-+ve, true--ve, false--ve, Precision/Recall, F-score
* Using prediction accuracy is not ideal with skewed data (ie: rare anomolies) as prediction of non-anomalous data will be regarded as having high accuracy
* NOTE: We can use the cross validation set to choose a value for epsilon, ε
> Test: 20% of training set, using 50% of the "bad/anomalous" targets
Anomaly Detection vs Supervised Learning:
> Anomaly Detection when:
* very few positive (anomalous) examples
* lots of negative examples
* many types of anomalies, which are rare, and very difficult to predict
* examples: fraud, manufacturing (aircraft engines), machines monitored in data centre
> Supervised Learning when:
* large number of positive and negative examples
* positive examples can be explained by the algorithm, and will predictable appear in future test sets
* examples: email spam classification, weather prediction, cancer classification
Using non-Gaussian features:
> When creating the PDF for feature x(i), p(x(i); mu(i); sigma(i)^2)
* If x(i) is non-Gaussian, we can consider transforming it, ie:
* log(x(i))
* log(x(i)+c)
* sqrt(x(i))
* x(i)^(1/3)
> Consider creating features that would take on unusually large or small values in the event of an anomaly
Building multivariate Gaussian anomaly detection model:
> (1) Fit model p(x) by:
* mu = 1/m ∑(i=1:m) x(i)
* Sigma = 1/m ∑(i=1:m) (x(i) - mu) . (x(i) - mu)'
> (2) For new value x, compute:
* p(x) = 1 / ( 2pi^n/2 |∑|^1/2 ) . exp(-1/2 (x - mu)' inverted(Sigma) (x - mu) )
* flag anomaly if p(x) < ε
Original model vs Multivariate model:
> original: > model: p(x) = p(x(1); mu(1); sigma(1)^2) * ... * p(x(n); mu(n); sigma(n)^2)
> Use the original model when:
* manually create features to capture anomalies
* create unusual features
* computationally cheaper, and scales well to large n (features)
* works well with small training set m (# training examples)
> Use the multivariate model when:
* automatically capture correlations between features
* computationally more expensive
* must have m > n else Sigma is non-invertible, and generally, we want m >= 10n to be useful
Collaborative Filtering:
> Given x(1)...x(nm) we can estimate θ(1)...θ(nu)
> We can then use the θ's to estimate x's
> on and on, until we get convergence of the minimisation of errors
* We can instead merge the minimisation functions into a single function to solve for θ's (parameters) and x's (features)
* We do not solve for the intercept feature, x(0)
* x(i) and θ(i) are randomised to small values to serve as symmetry breaking (similar to the random initialization of a neural network’s parameters) and ensures the algorithm learns features x(1),…,x(nm) that are different from each other.
* θ(j, 1:n(users)), x(i, 1:n(movies)) --> the θ's are described by each user for representative movies, while the x's (features) are describing values attached to each movie
> Create vectorised representation by:
* X[Capital] = [x(1)'; x(2)'; ... ; x(nm)'] --> where each x represents the features of each movie for each user
* θ[Capital] = [θ(1)'; θ(2)'; ... ; θ(nu)'] --> where each θ represents the parameters of each movie for each user
* Thus, X[Capital] . θ[Capital]' to solve
> One can mean-normalise to impute missing values for various users
== WEEK 10 - GRADIENT DESCENT WITH LARGE DATASETS ==
> More data facilitates for more accurate models, but training on millions of examples can give rise to significant computational overhead
> To determine what m (# training exmples), one can plot a learning curve WRT m, and determine if there is high variance (overfitting) when m is small
> types of descent:
a) Batch gradient descent - use m examples
b) Stochastic gradient descent - use 1 examples
c) Mini batch gradient descent - use b examples (where 1 < b < m)
Stochastic Gradient Descent:
> "Batch gradient descent" is where the descent algorithm uses *all* of the training examples to calculate theta updates, and can be computationally expensive with large datasets
> Stochastic GD only deals with the cost of one training example at a time.
> The steps for SGD are (repeated in entirety up to 10 times):
* 1. Randomly shuffle dataset
* 2. Repeat for training examples i=1:m, :
* Repeat for features j=0:n, :
** θj = θj - alpha.(hθ(x(i)) - y(i) ) . xj(i)
> Unlike batch GD, stochastic GD is a bit more circuitous, and doesnt descend exactly to the local minima, but will head down towards it then oscillate around the global minimum (up or down in cost), although the θ will approach the global minima
> Convergence can be checked by:
* 1) during learning, compute cost J, then update θ
* 2) every 1,000 iterations, plot the cost J averaged over the last 1,000 examples processed
* Troubleshooting:
** If there is a lot of noise in the plot, try increasing the # examples averaged over
** If the curve is increasing, this suggests divergence, so perhaps use a smaller alpha (rate of descent reduction)
* To enhance descent toward the global minima, for SGD we can slowly decrease alpha to assist with convergence of θ, and this should reduce the oscillation
Mini-Batch Gradient Descent:
> b = mini-batch size, ie: # examples to use in gradient descent algorithm
> b is typically between {2,100}
> Repeat similarly to SGD, except in batches of 'b' examples, up to m examples
> Good vectorization implementation in MBGD can facilitate for speed greater than SGD
Online learning:
> Might be used for a constant stream of data (ie: whether user wants to pay for shipping, given origin / destination) to learn continuously, rather than store data as a constant dataset
* Update learning algorithm with Stochastic Gradient Descent then move on; optimise price offered based on this information
> Product search is another application, ie: someone searches for android phone, have 100 phones in store, offer 10 for sale
* Capture how many words in query match features of the phone, then capture whether the user clicks on teh phone (y=1 if clicked), the predict likelihood of clicking on phone, ie: "Predicted Click-Through-Rate"
> Other examples: special offers, customised news articles, product recommendations, collaborative filtering mechanism for CTR stuff...
Map-Reduce and Data Parallelism:
> Split the training dataset into portions and get different machines to run the calculations across the cost function and gradient descent, then unite the results later on
> Parallelise over machine cores, or across machines
== WEEK 11 - PHOTO OPTICAL CHARACTER RECOGNITION (OCR) ==
Photo OCR:
> Undertake the following steps (pipeline):
0. Input image
1. Text detection
* apply expansion-operator to highlight the pixels which likely contain text
* draw boxes around the identified text blocks with applicable aspect ratios (should be wider than taller)
2. Character Segmentation
* scan the text blocks, and look for natural white-space breaks between characters using a 1D sliding window
3. Character Recognition / Classification
4. Token correction (word correction)
> When thinking about which part of the pipeline to invest the most time into, we can approach it in the following way, a method called "Ceiling Analysis":
* Look at accuracy of entire system on test set
* Look at change in accuracy when 100% accurate answers are provided to Text Detection
* Look at change in accuracy when 100% accurate answers are provided to Character Segmentation
* Look at change in accuracy when 100% accurate answers are provided to Character Recognition
* This will help us narrow down to the value of adding more resources to particular areas of the pipeline
> With face recognition, ceiling analysis might undertake the following steps:
* Develop pipeline:
** Camera Image --> Remove Background --> Face Detection --> Eyes & Nose & Mouth Segmentation --> Logistic Regression --> Label
* Ceiling analysis:
** Accuracy add of background removal
** Accuracy add of face detection
** Accuracy add of eyes detection
** Accuracy add of nose detection
** Accuracy add of mouth detection
> Supervised learning for pedestrian detection
* build training dataset with small positive images (y=1) and negative images (y=0, no pedestrians present)
* Take the large input images (with lots of potential pedestrians) and then:
1. scan a window (small size) in the top left
2. scan / slide the window / patch across the input image, moving by a given step size / stride across the image
3. increase the size of the window / patch, and slide / scan across the input image, with the potential to detect larger-size pedestrians
> The above process applies to a photo OCR system
> Artificial data synthesis for photo OCR can be undertaken by:
* Taking characters from free font libraries and pasting them onto random backgrounds
* Apply various operators, such as torsion, blurring, affine distortion, scaling, distortion
* Take actual positive image, then use artificial distortions, such as warping and other transforming
* With audio, we can add background sounds to increase the training dataset artificially
* These distortions **should be representative of what we would observe in the test set!**
** Implementation Note:
1. before creating new data, ensure that we are using a low bias classifier first, by plotting learning curves; if not, increase the number of features / hidden units until we do (have a low bias classifier), and then start adding in extra training examples
2. think about how much work it would take to get 10x's the amount of data?
* consider artificial data synthesis (generate data from scratch, or using existing example and introduce distortions)
* collect the data and label them myself (ie: # eggs in a photo)
* crowd-source (ie: Amazon Mechanical Turk) - hire others to label the data for you, at low cost
== ASSIGNMENT SOLUTION EXAMPLES ==
https://github.com/liquidpie/andrew-ng-ml-solutions
https://github.com/kaleko/CourseraML
help: https://www.coursera.org/learn/machine-learning/discussions/all/threads/vgCyrQoMEeWv5yIAC00Eog?page=2
help: https://www.coursera.org/learn/machine-learning/profiles/4a373eb3f73b6c0a931e5ac2518d7298
== LINKS for Later Neural Network Review ==
http://eugenezhulenev.com/blog/2014/11/14/stock-price-prediction-with-big-data-and-machine-learning/
https://nicholastsmith.wordpress.com/2016/04/20/stock-market-prediction-using-multi-layer-perceptrons-with-tensorflow/
http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft
https://github.com/martin-gorner/tensorflow-rnn-shakespeare
https://www.tensorflow.org/versions/master/get_started/mnist/pros
https://www.springboard.com/blog/beginners-guide-neural-network-in-python-scikit-learn-0-18/
https://www.udemy.com/courses/search/?q=neural%20networks&src=sac&kw=neural&sort=highest-rated&courseLabel=7380
http://stackabuse.com/tensorflow-neural-network-tutorial/
http://pybrain.org/docs/tutorial/datasets.html
http://sujitpal.blogspot.com.au/2014/07/handwritten-digit-recognition-with.html
https://pandas.pydata.org/pandas-docs/stable/dsintro.html
http://glitter.im/
https://github.com/wassname/rl-portfolio-management
https://github.com/ZhengyaoJiang/PGPortfolio/blob/master/user_guide.md
https://guides.github.com/activities/hello-world/
https://research.googleblog.com/2016/08/improving-inception-and-image.html
https://cloud.google.com/ml-engine/docs/getting-started-training-prediction
https://finance.yahoo.com/quote/%5EGSPC/history?period1=1514552400&period2=1514638800&interval=1d&filter=history&frequency=1d
https://stackoverflow.com/questions/42744903/keras-lstm-error-when-checking-model-input-dimension#
https://keras.io/getting-started/sequential-model-guide/
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
http://cs231n.github.io/neural-networks-1/
https://github.com/apress/practical-ml-w-python (http://isbn.directory/book/9781484232064)