machine_learning_fundamentals_notes_201801.txt

INDEX:
1.  == WEEK 2 - LINEAR REGRESSION ==
2.  == WEEK 3 - LOGISTIC REGRESSION ==
3.  == WEEK 4 & 5 - NEURAL NETWORKS ==
4.  == WEEK 6 - APPLICATION OF LEARNING ALGORITHMS ==
5.  == WEEK 7 - SUPPORT VECTOR MACHINES ==
6.  == WEEK 8 - UNSUPERVISED LEARNING ==
7.  == WEEK 9 - ANOMAOLY DETECTION ==
8.  == WEEK 10 - GRADIENT DESCENT WITH LARGE DATASETS ==
9.  == WEEK 11 - PHOTO OPTICAL CHARACTER RECOGNITION (OCR) ==


X. == ASSIGNMENT SOLUTION EXAMPLES ==


== WEEK 2 - LINEAR REGRESSION ==
Feature scaling:
> feature scaling (using range of values) & mean normalisation (or standard deviation), ie: (x1 - u1) / s1
> this assists with convergence speed

Learning rate:
> begin with low rate (ie: 10e-3), plot the change in cost function with each iteration and watch for convergence
> increase the learning rate with a 3 fold increase

Features & polynomial regression:
> We can improve our features and the form of our hypothesis function in a couple different ways.
> We can combine multiple features into one. For example, we can combine x1 and x2 into a new feature x3 by taking x1⋅x2.
> Our hypothesis function need not be linear (a straight line) if that does not fit the data well.  We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
eg: hθ(x)=θ0+θ1x1+θ2x2+θ3x3+θ4√(x4)

Using "Normal Equation" to solve to minimise cost function for linear regression:
> theta = pinv(X'.X).X'.Y
> the normal equation works by minimizing the cost function (J) by explicitly taking its derivatives with respect to the θj ’s, and setting them to zero.
> in this case, feature scaling / normalisation is not required, as we are solving for the global minima

Minimising cost function for linear regression:
> if n (# features) is large (ie: >=10,000), then use gradient descent
> if n (# features) is small (ie: <=10,000), then use "normal equation"

Gradient descent algorithm:
> for all features, j: 
θ(j):= θ(j) - alpha 1/m . sum{i=1:m}(hθ(x(sup i) - y(sup i)) . x(sub j)(sup i) 
> where j = feature (each feature iterated), and i = training example (each example iterated)
> basically, all θ (feature parameters) are simultaneously updated at learning rate of alpha relative to changes in the point derivative change in the cost function

Vectorization:
> allows for improved efficiency in solving problems
> eg: Yhat = matrix X * vector θ


== WEEK 3 - LOGISTIC REGRESSION ==
Classification:
> classification is required when predicting a small number of discrete values
> linear regression can produce values << 0 and >> 1, even if training examples are such that y∈{0,1}
> outliers can drastically change linear regression classification and thus significantly impair prediction accuracy
> logistic regression models can be set up to produce a classification hypothesis such that 0 <= hθ(x) <= 1
> sigmoid function == logistic function = hθ(x) = g(z) = 1 / (1 + e^-z) = 1 / (1 + e^-θ'X)
> hθ(x) is the estimated probability that y = 1 on input x

Decision Boundary:
> hθ(x):= g(z):= g(θ'x) >= 0.5 where θ'x >= 0  (ie θ'.0 = 1 / (1 + e^-θ'x) = 1 / (1 + e^-0) = 1 / (1 + 1) = 1 / 2
> therefore, we can predict that y = 1 where hθ(x) >= 0.5 which is when θ'x >= 0.
> the decision boundary reflects the line dividing parameters in the feature space between the regions where the hypothesis predicts y = 1 from the region that predict where y = 0
> basically, the way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5
> non-linear decision boundaries are possible by using higher-order polynomials (ie: θ0 + θ1x2 + θ2x2 + θx1^2x2^2 ...)

Cost Function:
> we define the logistic regression cost function without a square, because this creates a non-convex cost function, which typically has multiple local optima (in contrast to a convex cost function which has a global minimum)
> writing the cost function in the following way ensures that it is convex for logistic regression
> J(θ):= 1/m ∑(i:m) Cost(hθ(x(i)),y(i))
> Cost(hθ(x),y):= -log(hθ(x)) where y = 1, (so, as hθ(x) --> 0, J(θ) --> infinity)  AND
> Cost(hθ(x),y):= -log(1 - hθ(x)) where y = 0, (so, as hθ(x) --> 1, J(θ) ---> infinity)
> therefore, Cost(hθ(x),y):= -y.log(hθ(x)) - (1 - y).log(1 - hθ(x))
> therefore, J(θ) = -1/m . [∑(i:m) -y.log(hθ(x(i))) + (1 - y(i)).log(1 - hθ(x(i)))]
> Vectorised:
h=g(Xθ)
J(θ)=1/m⋅(−y'log(h)−(1−y)'log(1−h))

Gradient descent algorithm:
> simultaneously update, for all features, n, is exactly the same as the linear regression update rule:
θ(n):= θ(n) - alpha 1/m . ∑(i:m) (hθ(x(i) - y(i)) . x(n)(i))
> where n = feature (each feature iterated), and i = training example (each example iterated), and alpha is the learning rate
> Vectorised:
θ:=θ−α/m.X'(g(Xθ)−y(hat))

Feature scaling:
> Important in logistic regression also!

Advanced Descent Algorithms:
> some incomprehensible stuff (Conjugate gradient, BFGS, L-BFGS)
> use "fminunc" function in Octave

Multiclass Classification:
> a single class (one of k classes) is separated out from the others and assigned a positive class (y=1), while everything else is bundled as a negative class (y=0)
> thus, we end up with k logistic regression classifiers, predicting a single class relative to all others bundled
> we run all k binary logistic regression classifiers, and pick the classifier that maximises hθ(x)

Overfitting:
> Underfitting is known as having bias
> Overfitting is known as having variance
> We want models to "generalise" to new test data
> If there are too many features and very few training examples then overfitting can be a problem 


Regularisation - Linear Regression:
> To avoid overfitting, we can apply a "regularisation parameter", λ (lambda), to the cost function, λ.∑(j:n)θj^2 - the λ determines how much our theta params are inflated in the cost function
> too large a λ will result in too high a pentalty, and err towards underfitting, while too small a λ will err towards overfitting (where there are many features used relative to the training set size)
>> this adds a penalty to the cost function for each θ, thus, the cost function tries to minimise / reduce the size of θ, while also trying to reduce the difference between predicted outcomes and actuals
> thus, J(θ) = 1/2m . ∑(i:m)(hθ(x(i)) - y(i))^2 . λ.∑(j:n)θj^2
> NOTE: we do not penalise θ0 (for the constant term) - we only penalise θ's for the features

Regularisation - Linear Regression - Gradient Descent:
> Minimise cost function by also subtracting the derivative of the regularisation equation, 
> Thus: θ(j):= θ(j) - alpha 1/m . [ ∑(i:m)(hθ(x(i)) - y(i)).x(j)(i) + λ/m.θj ]
> which is also written as: θ(j):= θ(j)(1 - alpha.λ/m) - alpha.1/m . ∑(i:m)(hθ(x(i)) - y(i)).x(j)(i)
> Determine simultaneously for all features, j
> Remember that the regularisation component only applies to θ(1:j), and not θ0 (for the constant)

Regularisation - Logistic Regression:
> We add in the regularisation term, λ/2m.∑(j:n)θ(j)^2
> thus, cost functon: 
J(θ) = -1/m . [ ∑(i:m) y(i).log(hθ(x(i))) + (1-y(i)).log(1-hθ(x(i))) ] + λ/2m.∑(j=1:n)θ(j)^2

Regularisation - Logistic Regression - Gradient Descent:
> The formula looks the same to regularisation of linear regression with gradient descent, above, however, noting that logistic regression takes a differently defined hθ(x):= 1 / (1 + e^-θ'x)

== WEEK 4 & 5 - NEURAL NETWORKS ==
* Logistic regression is good for discrete hypotheses (creating a definition boundary for straight lines and ellipses), but not for complex multi-feature non-linear problems
> eg: to solve complex non-linear boundaries, quadratic feature maps may be required, such as 50*50 pixel image = 2500 featurs = 3 million features, or 50 million features for a 100 * 100 pixel image)
> thus, we need something other than logistic regression

Model Representation:
* a neuron receives a bunch of inputs (dendrite pulses) from various feature units, and a bias node or unit (x0) 
> the bias unit, x0, takes a value of 1, however it can still have a differing weight (theta(0))
* the neuron will output a signal (axonal output) according to its activation function (ie: sigmoid / logistic)
* the feature thetas (or, parameters) are typically referred to as "weights" in neural networks
* input units --> hidden layer(s) --> output layer
* the matrix of weights controlling function mapping from layer (j) to (j+1) will take the dimensional form of: 
  [# units in layer (j+1) *  # inputs layer (j) + 1]  --> the + 1 is for the bias weight, 
  ie: layer (j+1) rows for each unit, and layer (j) + 1 columns for input feature units including a bias
> a(i)(j+1) =  g(z(i)(j)) = g(θ(i0)(j)x(0) + θ(i1)(j)x(1) + .... + θ(in)(j)x(n)
> activation of unit i in layer j is a function of theta (layer j, feature n)(layer j) * x(feature n)
* the hidden layers of nodes are sometimes called "activation units"

Forward Propagation:
> activation of inputs layer to hidden layers(s) to output layer
> a(i)(j) = g(x) = g(z(i)(j)) --> node (unit) i, layer j
> z(j+1) = θ(j).x(j) = θ(j).a(j)
> a(j+1) = g(z(j+1))
> vector representation: a(j) = x(j) = [x0 ; x1 ; ... ; xn] ; z(j) = [z(1)(j) + z(2)(j) + ... + z(n)(j)]
> z(j+1) = θ(j).a(j)
> θ(j) is a matrix with dimensions s(j) * n + 1, where s(j) = # activation nodes in layer j, and n = # features + 1 for bias node
==> consider the # activation nodes to be like the # of training examples, m
> a(j) is a vector with height n + 1

Cost function & Back Propagation Algorithm:
> L = number of layers; sl = number of units in layer l (not including bias unit / node); K = number of units in output layer
> l = current layer; j = node j in layer l; i = current training example (observation); m = number of training examples
> ie: a(sub j)(sup l) = activation of jth unit in layer l 
> binary classification = 1 output unit (K=1); multi-class classification, where K >= 3 classes, (ie: vector output [0;0;1;0]
> The cost function is take across all output nodes, K
> Backpropagation algorithm
* is essentially the "gradient descent"
* delta, ð, reflects the error term of node j in layer l
* For output layer: ð(j)(l) = a(j)(l) = h(θ(k))(j) - y(j), so vector notation: a(L) - y 
* For output layer: ie: the error term, delta, is the predicted value of unit a in layer l, minus the actual value of unit a in layer l
* For hidden layers: 
  ** vector notn: ð(j) = θ(j)'*ð(j+1) .* g(prime)(z(j)), where g(prime)(z(3)) = a(3) .* (1 - a(3))
* For input layer: no calculation, as these are actuals (observed)
* Backpropagation refers to the fact that we calculate the error term for the output layer, then iterate backwards through the hidden layers 
* ∆ (ie: capital delta) = ∆(ij)(l) = ∆(ij)(l) + a(l)ð(l+1)
* vector notn: ∆(l) = ∆(l) + ð(l+1)*a(l)'
> Calculation of the partial derivatives of J(theta) WRT theta(ij)(l) is:  [observation i, node j, layer l]
* where j=0 (bias node):     D(ij)(l) = 1/m ∆(ij)(l) 
* where j<>0 (feature node): D(ij)(l) = 1/m ∆(ij)(l) + λθ(ij)(l)
> We iterate over the training examples, running ForwardProp for each example then BackProp for each example

BackProp in Practice:
> before a crash, I wrote a whole lot of stuff on unrolling matrices to a row vector ie:theta_vec = [theta_1(:);...;theta_L-1(:)]
* Implementation notes:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
> and about gradient checking, using epsilon as an offset to J(theta), two-sided, to estimate the gradient to make sure that BackProp is doing the correct thing
* Implementation notes:
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon)
end;
> If theta is initialised to zero (or any value which is the same for all elements of theta), thetas from each feature (x, e 1:n) feeding into activation units of hidden layer will take on the same values following BackProp (gradient descent), and the activation units of the hidden layer will also take the same value!
* Thus, thetas are randomly initialised, eg:
If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Network Architecture:
> # input nodes = # features (dimensions)
> # output nodes = # classes ==> in a multi-class output, this is represented as a vector, eg: s(L) = 5, if y=3, [0;0;1;0;0]
> hidden layers: usually 1 layer is ok, and if >1, then ensure nodes in layer l == nodes in layer l+1

Calculate ForwardProp and BackProp for each training example, like so:
for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L

== WEEK 6 - APPLICATION OF LEARNING ALGORITHMS ==
Model Selection:
> Generate hypotheses θ's for features across different models (ie: different order polynomials) on Training set (60%), choose the model with the lowest cost function on the Cross Validation set (20%), then estimate generalisation error of model on Test set (20%)
> Remember: Select model thetas using Training set, validate and select lowest cost function model on Cross Validation set, and finally, report on model generalisation / success using the Test set

Bias vs Variance:
> high bias (underfitting), high variance (overfitting)
> high bias: Training error and Cross Validation errors are high
> high variance: Training error is low and Cross Validation error is high (ie: J(theta,CV) >> J(theta,Training)
> Regularisation (lambda) value selection can be iterated across models, and cost function outcomes can be plotted across these models for Cross Validation and Training sets
* Low lambda will result in low penalties for thetas and thus more likely overfit the Training set, whereas large lambda will highly penalise thetas and likely result in an underfit (ie: just provide a bias term)

Learning Curves (Diagnostic):
> "Learning Curves" - plotting the cost functions for Training set and Cross Validation sets
> Helps detect if model is underfitted (bias) or overfitted (variance)
> Plot error by m (# of training examples used)
> Error for Training set:
* Is low for a small number of m (training examples)
* Increases for a higher number of m
> Error for Cross Validation set:
* Is high for small number of m (training examples) because the solution (thetas) does not generalise very well
* Decreases for a higher number of m
> In cases of high bias (underfit models), as m increases, Training set and Cross Validation set errors converge very early on, and are very similar (J(train) ~ J(CV))
    * ie: a high bias model cannot be helped by increasing the # of training examples (m)
> In cases of high variance (overfit models), as m increases, Training set and Cross Validation set errors increasingly converge, and there is a large gap between them (J(train) << J(CV))
    * ie: a high variation model can be helped by increasing the # of training examples (m)
> High Variance (OverFitting) can be fixed by:
    * Increasing training examples (m)
    * Using a smaller set of features
    * Increasing lambda (regularisation parameter penalty)
> Low Variance (Underfitting) can be fixed by:
    * Using additional features
    * Adding polynomial features
    * Decreasing lambda (regularisation parameter penalty)

Model Complexity Effects:
> Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
> Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
> In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.

Building a Spam Classifier:
How could you spend your time to improve the accuracy of this classifier?
> Collect lots of data (for example "honeypot" project but doesn't always work)
> Develop sophisticated features (for example: using email header data in spam emails)
> Develop algorithms to process your input in different ways (recognizing misspellings in spam).
It is difficult to tell which of the options will be most helpful.

Error Analysis:
The recommended approach to solving machine learning problems is to:
> Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
> Plot learning curves to decide if more data, more features, etc. are likely to help.
> Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
* Use a numerical evaluation such as % error rate on the Cross Validation dataset to determine accuracy change across model choices

Skewed Classes:
> Have more examples in one class than another class (ie: rare diseases)
> Use a confusion matrix to determine accuracy:
* Precision: of predicted cases, what % were true-positive = True Positive / # Predicted Positive = True Positive / (True Positive + False Positive)
* Recall: of actual cases, what % did we accurately predict = True Positive / # Actual Positive = True Positive / (True Positive + False Negative)
> y = 1 when predicting rare cases
> If we increase the threshold for predicting rare (cancer?) cases to say >0.7 for logistic regression, we will increase precision but have a lower recall ==> ie: we might avoid the trauma of a false report of cancer
> If we lower the threshold for predicting rare (cancer?) cases to say >0.3 for logistic regression, we will increase recall (fewer false negatives) but have a lower precision ==> ie: we might send off more people for testing than we need to

F1 Score:
> F1 Score = 2 * (P*R) / (P + R), where P = Precision & R = Recall
> F1 Score element of {0,....,1}

Large data rationale:
> Use a large number of features, so likely to create more complex prediction, creating a low bias algorithm with low cost function
> use a large number of training examples, so unlikely to overfit, creating a low variance algorithm with cost function on training similar to cost function on testing 

== WEEK 7 - SUPPORT VECTOR MACHINES ==
SVM:
* An alternative take on logistic regression cost minimization formulua, such that:
** min (θ) C ∑(i=1:m) [ y(i) * cost1(θ'x(i)) + (1-y(i)) * cost0(θ'x(i)) ] + 1/2.∑(j=1:n).θ(j)^2 , where C = 1/λ
> the hypothesis, hθ(x) of SVMs does not output a probability, rather, a scalar 1 or 0 
* SVM hθ(x) = 1 if θ'x >> 0
* SVM hθ(x) = 0 otherwise, ie: if θ'x << 0
> SVM has a built in buffer-zone, and rather than centering around 0, it will predict:
* if y=1, θ'x >= 1
* if y=0, θ'x <= -1
> The buffer acts as a "large margin" classifier, which means that the predicted line between classes has the largest margin between classes
> When C (1/λ) is not very large, the SVM does a better job of ignoring outliers and finding a line of best prediction (ie: will underfit / apply more regularisation)
> SVMs are convex functions, so always have a *global* minima

Kernels:
> For predicting a non-linear boundary, adding higher order polynomial features can help predict y =1, such that θ0 + θ1x1 + ... θ5x^2.. >= 0
> We can create features, f, which take the similarity function or kernel (Gaussian kernel), 
ie: f1 = similarity(x,l(1)) = exp(-||x-l(1)||^2 / (2*ð)^2)
where ð = sigma; l = landmark, a proximity landmark; x = training example; ||foo|| is the distance between training example and the landmark
* if x ~ l(1), then f1 ~ e^0 ~ 1
* if x very farm from l(1), then f1 ~ e^-(large number) ~ 0
> Each landmark can be used to define a different feature
> Changes in sigma:
* Increase in sigma - reduces the negative exponent value, so a difference between x and l results in a more gradual drop of f to zero
* Decrease in sigma - increases the negative exponent value, so a difference between x and l results in a faster drop of f to zero
* Note that changes in sigma does not affect the max value that f obtains
> So, hypothesis function becomes:
* hθ(x) = θ0 + θ1f1 + θ2f2 + θ3f3 ... θnfn, for functions describing similarity b/w x and l
** ie: when x is close to a landmark (l), θ will be close to 1, and when it's further away, θ will be closer to 0
> Landmarks:
* Using landmarks and similarity functions gives the ability to predict very complex non-linear decision boundaries, and without creating complex polynomial features
* Landmarks are created based off values for x, so m examples becomes m landmarks, with a similarity function for each landmark, creating a vector of features f1--fm, where f0 = 1 for the bias/intercept term
* For each example, x(i)y(i), each features f1(i)-->fm(i) are calculated as the similarity relative to all other training examples, sim(x(i),l(1)) --> sim(x(i),l(m))
* This will become a feature vector, f(i):= [f0(i); f1(i); ...; fm(i)]
* Predict y = 1 if hθ(x):= θ'f >= 0, 
** min (θ) C ∑(i=1:m) [ y(i) * cost1(θ'f(i)) + (1-y(i)) * cost0(θ'f(i)) ] + 1/2.∑(j=1:n).θ(j)^2 , where C = 1/λ
** we do not regularise on θ0
** Large C: lower bias, higher variance [less regularisation] --> overfit
** Small C: higher bias, lower variance [more regularisation] --> underfit
** Large Sigma squared: features f(i) vary more smoothly, higher bias, lower variance --> underfit
** Small Sigma squared: features f(i) vary more rapidly, lower bias, higher variance --> overfit
> Modelling decisions:
1. Choice of C and sigma squared
2. Choice of kernel
> Linear kernel (ie: dont create features, just use x) might be chosen where the # features, n, is very high, and number of training examples, m, is very low; the linear kernel returns a straight-line decision boundary
> Gaussian kernel: NOTE: remember to perform feature scaling before using a Gaussian kernel
> Polynomial kernel: includes a constant and choice on the polynomial term (exponent degree)
> Other kernels: string kernel, chi square kernel, histogram intersection kernel...

SVM vs Logistic Regression:
> If n (features) is high and m (training examples is low), use logistic regression or SVM with a linear kernel; ie: n=10k, m=1k
> If n is small and m is intermediate (ie: n=1k, m=10k), use SVM with Gaussian kernel
> If n is small (~1k) and m is large (50k+), Gaussian kernels struggle with this, so 
    * create additional features, and 
    * use logistic regression or SVM without a kernel (ie: linear kernel)
> NOTE: A neural network will work well in all conditions, but might talke longer to train


== WEEK 8 - UNSUPERVISED LEARNING ==
Clustering:
> K-means clustering is the most popular clustering algorithm
* Initialises with K cluster centroids, where K = number of desired clusters
** observations are assigned to the closest cluster; these values are then averaged and the centroids are assigned this value
** observations are again assigned to the closest cluster, and this step repeats, until convergence
* input K clusters with training set x(1)...x(m), noting there are no 'y' values
* process:
    ** randomly init K cluster centroids, μ1-μk
    ** for i:m (# training examples), c(i):=index (from 1 to K) of cluster centroid closest to x(i), and to which x(i) is assigned
    ** for k = 1:K, μk:= average of points assigned to cluster k
    ** μc(i) = average of points assigned to the cluster to which x(i) has been assigned
        >> ie: if x(i) is assigned cluster 5, c(i) = 5, so μc(i) = μ5
* If a cluster ends up with zero examples assigned, it can either be removed / eliminated or randomly re-initialised
> Optimisation Obective:
* Minimise cost function J, finding c and μ to minimise the following:
min (c(1)...c(m), μ1...μK), J(c(1)...c(m),μ1...μK) = 1/m . ∑(i=1:m)||x(i) - μc(i)||^2, which is the norm of the difference between x(i) and uc(i)
** This cost function is also called the Distortion of the K-Means cost function
** The cost function should always converge to a minimum, and never increase (if it does, this means where may be a bug in the code)
> Random initialisation of cluster centroids
* Set K to < m (training examples)
* Randomly pick K training examples, and set cluster centroids μ1 ... μK to the value of these training examples
* Based on initialisation, there may be a local-optima solution, so to address this we can:
    ** Initialise K-means multiple times, ie: 100 iterations to generate cost function (distortion function, J) minima
    ** Pick cluster with lowest cost (distortion)
    ** If K is low (<10), this will make a big impact
    ** If K is large, multiple iterations might not be as effective and the initial solution will be OK
> Choosing K - the number of clusters
* Often this is manual - think about whether the solution is more useful in the real-world application
* The "elbow method" is to plot the Cost function J by # Clusters and where the drop-off in J becomes low, chose that # clusters

Dimensionality Reduction / Principal Component Analysis:
* Benefits:
    ** Reduces memory requirements for data storage (k such that >=99% of variance retained)
    ** Learning algorithms can run more rapidly (k such that >=99% of variance retained)
    ** Assist with visualisation (2 or 3 features from 50+)
* NOTE: Before performing PCA, undertake mean normalisation and feature scaling!
    ** ie: ( x(j) - μ(j) )/ s(j)
* PCA finds a directional vector, u(1), onto which to project features which minimises the projection error, reducing dimensions from 2 to 1
* Unlike linear-regression which minimises the distance between a line-of-best-fit and Y, PCA minimises the distance between both features values and the line, which is minimisation of the orthogonal distance 
* In reducing the dimension from n to k, compute the "covariance matrix", Sigma matrix, and the eigenvectors or "Single Value Decomposition" of matrix Sigma (Sigma is an n x n matrix); take the first 7 columns of the U matrix
* PCA Algorithm:
    ** Mean normalise and feature scale
    ** Sigma = 1/m . ∑(i=1:m) x(i).x(i)' = (1/m) * X' * X;  --> Covariance matrix
    ** [U,S,V] = svd(Sigma);
    ** Ureduce = U(:, 1:k);  % n * k matrix
    ** z = Ureduce' * X; % k * 1 vector  (ie: k * n . k * 1 = k * 1 vector)
> Reconstruction of X from compressed PCA representation can be approximated by:
    ** Xapprox = Ureduce * z; % n * 1 vector
> Choosing the Number of Principal Components
* PCA tries to minimise the average squared projection error, 
    Choose k to minimse projection error [eq 1]: 1/m . ∑(i=1:m) || X(i) - Xapprox(i) || ^2
    Given variance [eq 2]: 1/m . ∑(i=1:m) || X(i) ||^2
    ** Choose k to be the smallest value so that [eq 1] / [eq 2] <= 0.1%, or, "99% of variance is retained" in the projection (which means we can use these projections to still effectively explain variation in the target)
    ** To solve this, call [U,S,V] = svd(Sigma);
    ** Then, calculate smallest k for which: ∑(i=1:k)Sii / ∑(i=1:n)Sii >= 0.99 (ie: 99% of variance retained)
* When performing PCA, create the mapping based on the training set only, then apply these same maps to the cross validation and test sets
* Before performing PCA, build the model with x(i), and see how it works.  If it doesnt work well (ie: very slow, memory problem), then implement PCA and use z(i) ie: the compressed representation
* Note: do not use PCA to prevent overfitting - use regularisation instead.


== WEEK 9 - ANOMAOLY DETECTION ==
Density Estimation:
> for training set {x(1)...x(m)}, where x is an element of real numbers to need
* p(x) := π(j=1:n) p(x(j); mu(j); sigma(j)^2), where π reflects the product of the probabilities
* p(x(j)) is defined by the Gaussian probability distribution
* Declare an anomaly where p(x) < ε (epsilon) (value set based on experience?)
> model: p(x) = p(x(1); mu(1); sigma(1)^2) * ... * p(x(n); mu(n); sigma(n)^2)

Building Anomaly Detection System:
> Train: Build model, p(x), with "good" targets only, as the training data (60% split)
> Cross Validate: 20% of training set, using 50% of the "bad/anomalous" targets
* Predict y=1 (anomaly) if p (x) < ε (epsilon)
* Evaluate with true-+ve, false-+ve, true--ve, false--ve, Precision/Recall, F-score
* Using prediction accuracy is not ideal with skewed data (ie: rare anomolies) as prediction of non-anomalous data will be regarded as having high accuracy
* NOTE: We can use the cross validation set to choose a value for epsilon, ε 
> Test: 20% of training set, using 50% of the "bad/anomalous" targets

Anomaly Detection vs Supervised Learning:
> Anomaly Detection when:
    * very few positive (anomalous) examples
    * lots of negative examples
    * many types of anomalies, which are rare, and very difficult to predict
    * examples: fraud, manufacturing (aircraft engines), machines monitored in data centre
> Supervised Learning when:
    * large number of positive and negative examples
    * positive examples can be explained by the algorithm, and will predictable appear in future test sets
    * examples: email spam classification, weather prediction, cancer classification

Using non-Gaussian features:
> When creating the PDF for feature x(i), p(x(i); mu(i); sigma(i)^2)
* If x(i) is non-Gaussian, we can consider transforming it, ie: 
    * log(x(i))
    * log(x(i)+c)
    * sqrt(x(i))
    * x(i)^(1/3)
> Consider creating features that would take on unusually large or small values in the event of an anomaly

Building multivariate Gaussian anomaly detection model:
> (1) Fit model p(x) by:
* mu = 1/m ∑(i=1:m) x(i)
* Sigma = 1/m ∑(i=1:m) (x(i) - mu) . (x(i) - mu)'
> (2) For new value x, compute:
* p(x) = 1 / ( 2pi^n/2 |∑|^1/2 ) . exp(-1/2 (x - mu)' inverted(Sigma) (x - mu) )
* flag anomaly if p(x) < ε

Original model vs Multivariate model:
> original: > model: p(x) = p(x(1); mu(1); sigma(1)^2) * ... * p(x(n); mu(n); sigma(n)^2)
> Use the original model when:
* manually create features to capture anomalies
* create unusual features
* computationally cheaper, and scales well to large n (features)
* works well with small training set m (# training examples)
> Use the multivariate model when:
* automatically capture correlations between features
* computationally more expensive
* must have m > n else Sigma is non-invertible, and generally, we want m >= 10n to be useful

Collaborative Filtering:
> Given x(1)...x(nm) we can estimate θ(1)...θ(nu)
> We can then use the θ's to estimate x's
> on and on, until we get convergence of the minimisation of errors
* We can instead merge the minimisation functions into a single function to solve for θ's (parameters) and x's (features)
* We do not solve for the intercept feature, x(0)
* x(i) and θ(i) are randomised to small values to serve as symmetry breaking (similar to the random initialization of a neural network’s parameters) and ensures the algorithm learns features x(1),…,x(nm) that are different from each other.
* θ(j, 1:n(users)), x(i, 1:n(movies)) --> the θ's are described by each user for representative movies, while the x's (features) are describing values attached to each movie
> Create vectorised representation by:
* X[Capital] = [x(1)'; x(2)'; ... ; x(nm)'] --> where each x represents the features of each movie for each user
* θ[Capital] = [θ(1)'; θ(2)'; ... ; θ(nu)'] --> where each θ represents the parameters of each movie for each user
* Thus, X[Capital] . θ[Capital]' to solve
> One can mean-normalise to impute missing values for various users 


== WEEK 10 - GRADIENT DESCENT WITH LARGE DATASETS ==
> More data facilitates for more accurate models, but training on millions of examples can give rise to significant computational overhead
> To determine what m (# training exmples), one can plot a learning curve WRT m, and determine if there is high variance (overfitting) when m is small
> types of descent:
a) Batch gradient descent       - use m examples
b) Stochastic gradient descent  - use 1 examples
c) Mini batch gradient descent  - use b examples (where 1 < b <  m)


Stochastic Gradient Descent:
> "Batch gradient descent" is where the descent algorithm uses *all* of the training examples to calculate theta updates, and can be computationally expensive with large datasets
> Stochastic GD only deals with the cost of one training example at a time.
> The steps for SGD are (repeated in entirety up to 10 times):
* 1. Randomly shuffle dataset
* 2. Repeat for training examples i=1:m, :
    * Repeat for features j=0:n, :
        ** θj = θj - alpha.(hθ(x(i)) - y(i) ) . xj(i)
> Unlike batch GD, stochastic GD is a bit more circuitous, and doesnt descend exactly to the local minima, but will head down towards it then oscillate around the global minimum (up or down in cost), although the θ will approach the global minima
> Convergence can be checked by: 
    * 1) during learning, compute cost J, then update θ
    * 2) every 1,000 iterations, plot the cost J averaged over the last 1,000 examples processed
    * Troubleshooting:
        ** If there is a lot of noise in the plot, try increasing the # examples averaged over
        ** If the curve is increasing, this suggests divergence, so perhaps use a smaller alpha (rate of descent reduction)
    * To enhance descent toward the global minima, for SGD we can slowly decrease alpha to assist with convergence of θ, and this should reduce the oscillation

Mini-Batch Gradient Descent:
> b = mini-batch size, ie: # examples to use in gradient descent algorithm
> b is typically between {2,100}
> Repeat similarly to SGD, except in batches of 'b' examples, up to m examples
> Good vectorization implementation in MBGD can facilitate for speed greater than SGD

Online learning:
> Might be used for a constant stream of data (ie: whether user wants to pay for shipping, given origin / destination) to learn continuously, rather than store data as a constant dataset
* Update learning algorithm with Stochastic Gradient Descent then move on; optimise price offered based on this information
> Product search is another application, ie: someone searches for android phone, have 100 phones in store, offer 10 for sale
* Capture how many words in query match features of the phone, then capture whether the user clicks on teh phone (y=1 if clicked), the predict likelihood of clicking on phone, ie: "Predicted Click-Through-Rate"
> Other examples: special offers, customised news articles, product recommendations, collaborative filtering mechanism for CTR stuff...

Map-Reduce and Data Parallelism:
> Split the training dataset into portions and get different machines to run the calculations across the cost function and gradient descent, then unite the results later on
> Parallelise over machine cores, or across machines


== WEEK 11 - PHOTO OPTICAL CHARACTER RECOGNITION (OCR) ==

Photo OCR:
> Undertake the following steps (pipeline):
0. Input image
1. Text detection
    * apply expansion-operator to highlight the pixels which likely contain text
    * draw boxes around the identified text blocks with applicable aspect ratios (should be wider than taller)
2. Character Segmentation
    * scan the text blocks, and look for natural white-space breaks between characters using a 1D sliding window
3. Character Recognition / Classification
4. Token correction (word correction)
> When thinking about which part of the pipeline to invest the most time into, we can approach it in the following way, a method called "Ceiling Analysis":
    * Look at accuracy of entire system on test set
    * Look at change in accuracy when 100% accurate answers are provided to Text Detection
    * Look at change in accuracy when 100% accurate answers are provided to Character Segmentation
    * Look at change in accuracy when 100% accurate answers are provided to Character Recognition
* This will help us narrow down to the value of adding more resources to particular areas of the pipeline
> With face recognition, ceiling analysis might undertake the following steps: 
    * Develop pipeline:
        ** Camera Image --> Remove Background --> Face Detection --> Eyes & Nose & Mouth Segmentation --> Logistic Regression --> Label
    * Ceiling analysis: 
        ** Accuracy add of background removal
        ** Accuracy add of face detection
        ** Accuracy add of eyes detection
        ** Accuracy add of nose detection
        ** Accuracy add of mouth detection


> Supervised learning for pedestrian detection
* build training dataset with small positive images (y=1) and negative images (y=0, no pedestrians present)
* Take the large input images (with lots of potential pedestrians) and then:
    1. scan a window (small size) in the top left
    2. scan / slide the window / patch across the input image, moving by a given step size / stride across the image
    3. increase the size of the window / patch, and slide / scan across the input image, with the potential to detect larger-size pedestrians
> The above process applies to a photo OCR system
> Artificial data synthesis for photo OCR can be undertaken by:
* Taking characters from free font libraries and pasting them onto random backgrounds
* Apply various operators, such as torsion, blurring, affine distortion, scaling, distortion
* Take actual positive image, then use artificial distortions, such as warping and other transforming
* With audio, we can add background sounds to increase the training dataset artificially
* These distortions **should be representative of what we would observe in the test set!**
** Implementation Note: 
1. before creating new data, ensure that we are using a low bias classifier first, by plotting learning curves; if not, increase the number of features / hidden units until we do (have a low bias classifier), and then start adding in extra training examples
2. think about how much work it would take to get 10x's the amount of data?
    * consider artificial data synthesis (generate data from scratch, or using existing example and introduce distortions)
    * collect the data and label them myself (ie: # eggs in a photo)
    * crowd-source (ie: Amazon Mechanical Turk) - hire others to label the data for you, at low cost


== ASSIGNMENT SOLUTION EXAMPLES ==
https://github.com/liquidpie/andrew-ng-ml-solutions
https://github.com/kaleko/CourseraML
help: https://www.coursera.org/learn/machine-learning/discussions/all/threads/vgCyrQoMEeWv5yIAC00Eog?page=2
help: https://www.coursera.org/learn/machine-learning/profiles/4a373eb3f73b6c0a931e5ac2518d7298 


== LINKS for Later Neural Network Review ==
http://eugenezhulenev.com/blog/2014/11/14/stock-price-prediction-with-big-data-and-machine-learning/
https://nicholastsmith.wordpress.com/2016/04/20/stock-market-prediction-using-multi-layer-perceptrons-with-tensorflow/
http://jspauld.com/post/35126549635/how-i-made-500k-with-machine-learning-and-hft
https://github.com/martin-gorner/tensorflow-rnn-shakespeare
https://www.tensorflow.org/versions/master/get_started/mnist/pros
https://www.springboard.com/blog/beginners-guide-neural-network-in-python-scikit-learn-0-18/
https://www.udemy.com/courses/search/?q=neural%20networks&src=sac&kw=neural&sort=highest-rated&courseLabel=7380
http://stackabuse.com/tensorflow-neural-network-tutorial/
http://pybrain.org/docs/tutorial/datasets.html
http://sujitpal.blogspot.com.au/2014/07/handwritten-digit-recognition-with.html
https://pandas.pydata.org/pandas-docs/stable/dsintro.html
http://glitter.im/
https://github.com/wassname/rl-portfolio-management
https://github.com/ZhengyaoJiang/PGPortfolio/blob/master/user_guide.md
https://guides.github.com/activities/hello-world/
https://research.googleblog.com/2016/08/improving-inception-and-image.html
https://cloud.google.com/ml-engine/docs/getting-started-training-prediction
https://finance.yahoo.com/quote/%5EGSPC/history?period1=1514552400&period2=1514638800&interval=1d&filter=history&frequency=1d
https://stackoverflow.com/questions/42744903/keras-lstm-error-when-checking-model-input-dimension#
https://keras.io/getting-started/sequential-model-guide/
https://elitedatascience.com/keras-tutorial-deep-learning-in-python
http://cs231n.github.io/neural-networks-1/
https://github.com/apress/practical-ml-w-python (http://isbn.directory/book/9781484232064)