ML_cheat_sheet.txt
== Model features & general implementation notes ==
Feature scaling (range, high-low) & mean normalisation:
> ie: (x_i - u) / s, where u is the feature mean and s is the feature range (or standard deviation)
> assists with convergence speed
> use with linear and logistic regression
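> A minimal numpy sketch of this scaling (the helper name feature_scale is illustrative, not from the source):
    import numpy as np
    def feature_scale(X):
        # (x_i - u) / s : subtract the per-feature mean, divide by the per-feature range
        mu = X.mean(axis=0)
        s = X.max(axis=0) - X.min(axis=0)
        return (X - mu) / s, mu, s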
Learning rate (alpha):
> try a range of values, increasing roughly 3-fold each step (ie: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3)
> begin with a low rate, then increase, plotting convergence
> a higher alpha will increase convergence speed but might cause the parameter estimates to overshoot and bounce away from the minimum (the cost may even diverge)
> a lower alpha will train slowly
Train-Cross Validation-Test Split & Function:
> Generate model θ's (parameters) on the Training set (60% of data), choose the model with the lowest cost function on the Cross Validation set (20%), then estimate generalisation of the model on the Test set (20%)
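> A minimal numpy sketch of such a 60/20/20 split (assumes X and y are numpy arrays; names are illustrative):
    import numpy as np
    def train_cv_test_split(X, y, seed=0):
        # shuffle, then take 60% for training, 20% for cross validation, 20% for test
        m = X.shape[0]
        idx = np.random.default_rng(seed).permutation(m)
        i1, i2 = int(0.6 * m), int(0.8 * m)
        tr, cv, te = idx[:i1], idx[i1:i2], idx[i2:]
        return (X[tr], y[tr]), (X[cv], y[cv]), (X[te], y[te])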
Underfitting and Overfitting:
> Underfitting is known as having bias - typically, Training error and Cross Validation errors are high
> Overfitting is known as having variance - typically, Training error is low and Cross Validation error is high
> If there are too many features and very few training examples then overfitting can be a problem
> We want models to "generalise" to new test data
> For underfitting, we might want to:
* increase the # of features
* build more complex features (polynomial, etc)
* decrease regularisation coefficient
* Note: increasing m (training examples) typically will not help
> For overfitting, we might want to:
* decrease the # of features
* consider more specific feature selection
* increase regularisation coefficient
* increase m (training examples)
Regularisation rate (lambda):
> upwardly adjusts the cost function for the θ parameters
> can be used to correct underfitting and overfitting
> a higher lambda increases the penalty on θ, and trends toward underfitting
> a lower lambda decreases the penalty on θ, and trends toward overfitting
> Note: we do not penalise the intercept term (bias node) - we only penalise the features
> a range of lambda values can be iterated across models, and cost function outcomes can be plotted across these models for Cross Validation and Training sets to visualise underfitting or overfitting
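> A rough numpy sketch of a regularised squared-error cost that leaves the intercept term unpenalised (function name is illustrative):
    import numpy as np
    def regularised_cost(theta, X, y, lam):
        # mean squared error plus an L2 penalty; theta[0] (intercept) is not penalised
        m = len(y)
        err = X @ theta - y
        penalty = lam * np.sum(theta[1:] ** 2)
        return (err @ err + penalty) / (2 * m)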
Learning Curve model diagnostics:
> Plot cost function (error) by m (# of training examples used)
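> A rough sketch of computing learning-curve errors (train_fn and cost_fn are assumed placeholders for your own fitting and unregularised-cost routines):
    import numpy as np
    def learning_curve(X_train, y_train, X_cv, y_cv, train_fn, cost_fn):
        # train on the first i examples, then record training and cross validation error
        train_err, cv_err = [], []
        for i in range(1, len(y_train) + 1):
            theta = train_fn(X_train[:i], y_train[:i])
            train_err.append(cost_fn(theta, X_train[:i], y_train[:i]))
            cv_err.append(cost_fn(theta, X_cv, y_cv))
        return np.array(train_err), np.array(cv_err)  # plot both against i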
Feature extension / addition:
> extra features can be created through polynomial (cubic, quadratic, square) or other (tan, log, square root) transforms
> such features can assist with models with high bias (underfit), and to fit non-linear boundaries
Error Analysis in Model Build:
> Start with a simple algorithm, implement it quickly, and validate it early on your cross validation data.
> Plot learning curves to decide if more data, more features, or other approaches are likely to help.
> Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
Skewed Classes (ie: rare events):
> Occur when one class has far more examples than the other
> Use a confusion matrix to evaluate performance (raw accuracy alone is misleading):
* Precision: of predicted cases, what % were true-positive = True Positive / # Predicted Positive = True Positive / (True Positive + False Positive)
* Recall: of actual cases, what % did we accurately predict = True Positive / # Actual Positive = True Positive / (True Positive + False Negative)
F1 Score:
> F1 Score = 2 * (P*R) / (P + R), where P = Precision & R = Recall
> F1 Score ∈ [0, 1]
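> A minimal numpy sketch of these metrics (assumes 0/1 labels, with at least one predicted and one actual positive):
    import numpy as np
    def precision_recall_f1(y_true, y_pred):
        # counts from the confusion matrix for the positive (y=1) class
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)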
Large Data:
> if using a large number of features, be sure to use a large number of training examples, so as to not overfit on the training set
Which Model to Choose:
> If n (features) is high and m (training examples) is low, use logistic regression or SVM with a linear kernel; ie: n=10k, m=1k
> If n is small and m is intermediate (ie: n=1k, m=10k), use SVM with Gaussian kernel
> If n is small (~1k) and m is large (50k+), Gaussian kernels struggle with this, so
* create additional features, and
* use logistic regression or SVM without a kernel (ie: linear kernel)
> A neural network will work well in all conditions, but might take longer to train
Anomaly Detection vs Supervised Learning:
> Anomaly Detection when:
* very few positive (anomalous) examples
* lots of negative examples
* many types of anomalies, which are rare, and very difficult to predict
> Supervised Learning when:
* large number of positive and negative examples
* positive examples can be explained by the algorithm, and will predictably appear in future test sets
Batch, Mini-Batch and Stochastic Gradient Descent:
a) Batch gradient descent - use m examples
> "Batch gradient descent" is where the descent algorithm uses all of the training examples to calculate theta updates, and can be computationally expensive with large datasets
b) Stochastic gradient descent - use 1 example
> Stochastic GD only deals with the cost of one training example at a time.
> Randomly shuffle the dataset, then update the parameters using one training example at a time
> SGD doesn't converge exactly to the minimum, but will head down towards it and then oscillate around it
> Convergence can be checked by:
* during learning, compute cost J, then update θ
* every 1,000 iterations, plot the cost J averaged over the last 1,000 examples processed
* If there is a lot of noise in the plot, try increasing the # examples averaged over
* If the curve is increasing, this suggests divergence, so perhaps use a smaller alpha (learning rate)
* To enhance descent toward the global minimum, for SGD we can slowly decrease alpha over time to assist convergence of θ, and this should reduce the oscillation
c) Mini batch gradient descent - use b examples (where 1 < b < m)
> b is typically between 2 and 100
> Repeat similarly to SGD, except in batches of 'b' examples, up to m examples
> A well-vectorised implementation of MBGD can make it faster than SGD
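> A rough numpy sketch of mini-batch gradient descent for linear regression (alpha, b and the epoch count are illustrative defaults):
    import numpy as np
    def minibatch_gd(X, y, alpha=0.01, b=32, epochs=10, seed=0):
        # shuffle each epoch, then update theta on one batch of b examples at a time
        m, n = X.shape
        theta = np.zeros(n)
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            idx = rng.permutation(m)
            for start in range(0, m, b):
                batch = idx[start:start + b]
                grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
                theta -= alpha * grad
        return theta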
Map-Reduce and Data Parallelism:
> Split the training dataset into portions and get different machines to run the calculations across the cost function and gradient descent, then unite the results later on
> Parallelise over machine cores, or across machines
Ceiling Analysis:
> When thinking about which part of the machine learning pipeline to invest the most time into, we can use a method called "Ceiling Analysis" (illustrated here with the photo OCR pipeline):
* Look at accuracy of entire system on test set
* Look at change in accuracy when 100% accurate answers are provided to Text Detection
* Look at change in accuracy when 100% accurate answers are provided to Character Segmentation
* Look at change in accuracy when 100% accurate answers are provided to Character Recognition
* This will help us narrow down to the value of adding more resources to particular areas of the pipeline
== Linear regression ==
Background:
> models can produce predicted values << 0 and >> 1, even if training examples are such that y∈{0,1}
> outliers can drastically change linear regression classification and thus significantly impair prediction accuracy
> the cost function is convex, so it always has a single global minimum
Normal Equation:
The "normal equation" for linear regression solves for the global minima - feature scaling is not required
> θ = pinv(X'.X).X'.Y
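> A one-line numpy sketch of the normal equation (assumes X already includes a column of ones for the intercept):
    import numpy as np
    def normal_equation(X, y):
        # theta = pinv(X'X) X'y
        return np.linalg.pinv(X.T @ X) @ X.T @ y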
Implementation Notes:
> if n (# features) is large (ie: >=10,000), then use gradient descent
> if n (# features) is small (ie: <=10,000), then use "normal equation"
== Logistic regression ==
Background:
> classification is required when predicting a small number of discrete values
> models can be set up to produce a classification hypothesis such that 0 <= hθ(x) <= 1
> the decision boundary is the line in feature space dividing the region where the hypothesis predicts y = 1 from the region where it predicts y = 0
> good for discrete hypotheses (creating a definition boundary for straight lines and ellipses), but not for complex multi-feature non-linear problems
Implementation Notes:
> adjust the classification threshold (and hence the decision boundary) to influence sensitivity - increase it to increase precision (True Positive / # Predicted Positive), at the expense of lower recall (True Positive / # Actual Positive)
Multiclass Classification:
> a single class (one of k classes) is separated out from the others and assigned a positive class (y=1), while everything else is bundled as a negative class (y=0)
> we will end up with k logistic regression classifiers, predicting a single class relative to all others bundled
> we run all k binary logistic regression classifiers, and pick the classifier that maximises hθ(x)
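> A minimal numpy sketch of the one-vs-all prediction step (assumes Theta holds one row of fitted logistic regression parameters per class, and X includes the intercept column):
    import numpy as np
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    def predict_one_vs_all(Theta, X):
        # evaluate all k classifiers and pick the class with the highest h_theta(x)
        return np.argmax(sigmoid(X @ Theta.T), axis=1)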
== Neural Networks ==
Background:
> well suited to complex multi-feature non-linear problems
> forward propagation passes the inputs through the θ's (weights) layer by layer to compute the activations and the hypothesis
> backward propagation essentially functions as gradient descent, calculating the error term for the output layer, then iterating backwards through the hidden layers
Implementation Notes:
> If θ is initialised to zero (or to the same value for every element of θ), the weights from each feature feeding into the hidden-layer activation units will remain identical after BackProp, and the activation units of the hidden layer will all take the same value (symmetry is never broken)!
> Thus, the θ's are randomly initialised and scaled by a small value, epsilon (ie: Ɛ=1e-4)
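> A minimal numpy sketch of such random initialisation for one layer (the function name and epsilon default are illustrative):
    import numpy as np
    def rand_initialise_weights(l_in, l_out, epsilon=1e-4):
        # weights for a layer with l_in inputs (+1 bias) and l_out units,
        # drawn uniformly from [-epsilon, epsilon] to break symmetry
        return np.random.default_rng().uniform(-epsilon, epsilon, size=(l_out, l_in + 1))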
== Support Vector Machines ==
Background:
> the hypothesis, hθ(x), outputs 0 or 1 directly (rather than a probability)
> SVM has a built-in buffer zone, acting as a "large margin" classifier: the chosen decision boundary has the largest possible margin between the classes
> the cost function is convex, so it always has a single global minimum
Implementation Notes:
> Kernels (such as similarity or Gaussian Kernel) can help define non-linear boundaries
> Landmarks, based off values of x, can be used with a similarity function to create model features (the proximity of each example to each landmark)
> Coefficients for sigma (similarity function) and C (regularisation) can be tweaked to adjust model underfit or overfit
> Different kernels can be chosen to change model fit, such as:
* Linear kernel = straight-line decision boundary, which might be used when # features is high and # training examples is low
* Gaussian kernel / similarity function - NOTE: remember to perform feature scaling first!
* Polynomial kernel: will need to tweak the constant term and polynomial term / exponent
* Other kernels, such as string, chi-square, histogram intersection, etc
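> A minimal numpy sketch of the Gaussian kernel / similarity feature between one example and one landmark:
    import numpy as np
    def gaussian_kernel(x, landmark, sigma):
        # similarity = exp(-||x - l||^2 / (2 * sigma^2)); 1 when x is at the landmark, -> 0 far away
        diff = x - landmark
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))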
== Clustering ==
Background:
> K-Means randomly initialises K cluster centroids, assigns each training example to its closest centroid (Euclidean distance), then moves each centroid to the mean of its assigned examples, repeating until the assignments stabilise
Implementation Notes:
> Minimise the cost function, or Distortion, of K-Means, which is the average squared norm of the difference between each x(i) and u_c(i), the centroid of the cluster to which x(i) is assigned
> Cluster centroids are randomly initialised; clusters which end up with zero observations can be removed or re-initialised
> The cost function should always converge to a minimum, and never increase
> In considering random initialisation:
* there may be local optima
* therefore, consider running K-Means many times from different random initialisations, then choosing the clustering with the lowest cost (distortion)
> In considering K number of clusters:
* This is often manual - consider the choice of K in the context of the real-world application
* The "elbow method" approach is to plot the cost function J by # Clusters, and choose the # Clusters at the "elbow", where further increases give only a small drop in J
== Dimensionality Reduction ==
Background:
> Principally reduces memory requirements for data storage, and assists learning algorithms in running more rapidly
> PCA minimises the orthogonal (projection) distance between the feature values and the line/plane of fit
Implementation Notes:
> Before PCA, undertake mean normalisation and feature scaling!
> In reducing the feature dimensions from n to k, compute the "covariance matrix" (Sigma), then obtain its eigenvectors via the Singular Value Decomposition of Sigma
> In choosing k components, ensure that at least 99% of the variance is retained in the projection (ie: reduce the n features to the smallest k that still explains most of the variance)
> Create PCA mapping based on the training set only, then apply these same maps to the cross validation and test sets
> Before performing PCA, build the model with all n features - if it is too slow or runs into memory problems, then implement PCA and use the z features (the compressed representation)
> Note: do not use PCA to prevent overfitting - use regularisation instead.
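> A rough numpy sketch of PCA via the SVD, keeping enough components to retain 99% of the variance (assumes X is already mean-normalised and feature-scaled):
    import numpy as np
    def pca(X, variance_retained=0.99):
        m = X.shape[0]
        Sigma = (X.T @ X) / m                      # covariance matrix
        U, S, _ = np.linalg.svd(Sigma)             # singular value decomposition
        explained = np.cumsum(S) / np.sum(S)
        k = int(np.searchsorted(explained, variance_retained)) + 1
        Z = X @ U[:, :k]                           # compressed representation z
        return Z, U[:, :k], k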
== Anomaly Detection ==
Background:
> Density estimation can be used to predict outliers
> Consider building a multivariate Gaussian anomaly detection model, or if features are non-Gaussian, we can consider transforming them (log, log+constant, sqrt, **1/3...)
Implementation Notes:
> Train: Build model, p(x), with "good" targets only, as the training data (60% split)
> Cross Validate: 20% of the "good" examples, plus 50% of the "bad/anomalous" targets
* Predict y=1 (anomaly) if p(x) < ε (epsilon)
* Evaluate with true positives, false positives, true negatives, false negatives, Precision/Recall, and the F-score
* Using prediction accuracy is not ideal with skewed data (ie: rare anomalies), since always predicting "non-anomalous" already scores high accuracy
> We can use the cross validation set to choose a value for epsilon, ε
> Test: the remaining 20% of the "good" examples, plus the remaining 50% of the "bad/anomalous" targets
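> A rough numpy sketch of the density-estimation model (independent per-feature Gaussians; names and the epsilon threshold are illustrative):
    import numpy as np
    def fit_gaussian(X_good):
        # per-feature mean and variance estimated from the "good" training examples only
        return X_good.mean(axis=0), X_good.var(axis=0)
    def p_x(X, mu, sigma2):
        # product of the independent univariate Gaussian densities for each example
        probs = np.exp(-(X - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
        return probs.prod(axis=1)
    def predict_anomaly(X, mu, sigma2, epsilon):
        return (p_x(X, mu, sigma2) < epsilon).astype(int)  # y=1 (anomaly) if p(x) < epsilon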
== Supervised Learning - Image Object Detection ==
Implementation Notes:
> build training dataset with small positive images (y=1) and negative images (y=0, no pedestrians present)
* scan a window (small size) in the top left
* scan / slide the window / patch across the input image, moving by a given step size / stride across the image
* increase the size of the window / patch, and slide / scan across the input image, with the potential to detect larger-size pedestrians
> Artificial data synthesis for photo OCR can be undertaken by:
* Taking characters from free font libraries and pasting them onto random backgrounds
* Apply various operators, such as rotation/torsion, blurring, affine distortion and scaling
* Take actual positive image, then use artificial distortions, such as warping and other transforming
* With audio, we can add background sounds to increase the training dataset artificially
* These distortions **should be representative of what we would observe in the test set!**
> before creating new data, ensure that we are using a low-bias classifier first, by plotting learning curves; if not, increase the number of features / hidden units until we have a low-bias classifier, and then start adding in extra training examples
> think about how much work it would take to get 10x the amount of data:
* consider artificial data synthesis (generate data from scratch, or using existing example and introduce distortions)
* collect the data and label them myself (ie: # eggs in a photo)
* crowd-source (ie: Amazon Mechanical Turk) - hire others to label the data for you, at low cost