Machine learning is improving at a task `T`, w.r.t. a performance measure `P`, based on experience `E`.
We try to fit a straight line to some data set, making these assumptions:
- There is a relationship between the variables
- This relationship is linear
- This relationship will hold in the future
We have some inputs `X = {x₁, x₂, ..., xₙ}` and their corresponding labels `T = {t₁, t₂, ..., tₙ}`. We want to predict the `t` for any `x`.
We can approximate the data using a linear equation:
t = mx + c
where `m` and `c` are learnt parameters.
We need to derive the `m` and `c` (a.k.a. `w₁` and `w₂`) that minimise a loss function.
We can vectorise this by converting `m` and `c` into a vector of weights `W = {c, m}`, while the input becomes `X = {1, x}`. Therefore, we can do `t = W.t ⋅ X`. This has the benefit of us being able to add more input attributes to `X` without having to restructure our algorithms.
The loss function estimates how poorly our model performs on the dataset. Here we use the squared difference:
L = avg((t - t')²)
where `t` is the ground truth and `t'` is our prediction.
Alternatively, we can try to maximise the likelihood of the predictions being correct:
L = p(t|X, w, σ²) = mult(N(w.t⋅xₙ, σ²) for xₙ in X)
where
- `X` is the input data
- `w` is the learnt weights
- `σ²` is the learnt variance

Deriving these weights gives us the same results as least squares.
We need to find `dL/dW = 0`, i.e. what values of `W` give us the minimum value of our loss function.
TODO: How do we know that this is the global minimum, and not just a local minimum/maximum?
We can derive from the loss function `L` that the minimum is given by:
W' = (X.t ⋅ X)⁻¹ * X.t ⋅ t
where `W'` is our approximation for the perfect values of `W`.
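A minimal numpy sketch of this closed-form fit (the toy `x`/`t` values are made up for illustration; the column of ones plays the role of the `1` in `X = {1, x}`):

```python
import numpy as np

# Toy 1-D data set (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Design matrix X = [1, x], one row per example
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares: W' = (X.t @ X)^-1 @ X.t @ t
# (np.linalg.solve is used instead of an explicit inverse for stability)
W = np.linalg.solve(X.T @ X, X.T @ t)

predictions = X @ W                      # t' = X @ W
loss = np.mean((t - predictions) ** 2)   # L = avg((t - t')²)
print(W, loss)
```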
We can make non-linear predictions by adding extra variables to our `X` input vector. If we add `x²`, `x³`, etc. to `X`, then we allow the model to make non-linear predictions from a single input value `x`. Therefore, `X` becomes `[x⁰, x¹, x², ..., xⁿ]` or `[1, x, x², ..., xⁿ]` for some complexity `n`.
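A sketch of this basis expansion, reusing the same closed-form fit as above (the degree and data values are made up):

```python
import numpy as np

def poly_design_matrix(x, n):
    """Build [x^0, x^1, ..., x^n] for each input value."""
    return np.column_stack([x ** p for p in range(n + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.0, 0.5, 2.0, 4.5, 9.0])

X = poly_design_matrix(x, n=3)          # complexity n = 3
W = np.linalg.solve(X.T @ X, X.T @ t)   # same closed-form least squares as before
print(X @ W)                            # predictions are now non-linear in x
```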
The more we increase `n`, the more our model can fit the data exactly. This has two main effects:
- The model becomes very good at predicting data it has seen before
- The model becomes very bad at predicting data it hasn't seen before
This is called overfitting, and should be avoided. Therefore, we want to pick an `n` that is large enough to model the complexity of the data, but not so big that it overfits.
Regularising is a technique to stop a model from overfitting its training data.
One way to regularise is to append another term to the cost function:
L' = L + λσ(W)
where `λ` is the weighting of our regularisation, and `σ` is some function that combines the values of `W`. For L1 regularisation, we have `σ(W) = avg(abs(W))`, and for L2 regularisation, we have `σ(W) = avg(W²)`.
Intuition: model complexity (and thus overfitting) increases as `W` increases. If we punish high values of `W`, then the model will only increase `W` (and therefore complexity) if it is worth the gain in the original loss function.
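For L2 regularisation with the squared-error loss there is still a closed form (ridge regression); a sketch, with `lam` standing in for `λ` (the exact scaling of `lam` depends on whether the loss uses `sum` or `avg`, and often the bias weight is left out of the penalty):

```python
import numpy as np

def fit_ridge(X, t, lam):
    """L2-regularised least squares: W = (X.t @ X + lam*I)^-1 @ X.t @ t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Larger lam pushes the weights towards zero, trading training fit for
# lower model complexity, e.g.:
# W = fit_ridge(X, t, lam=0.1)
```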
- Add some random noise to make linear predictions look like real data
- Use additive noise: `tₙ = f(xₙ; W) + ε`
  - where `p(ε) = N(0, σ²)`
- Model now determined by `W` and `σ`
- For each `(t, x)` we have a likelihood of `t` being observed, given by: `p(t | x, W, σ²)`
- We now want to maximise this function
- TODO: Go over the maths for this part
avg(d(truth[n] == prediction[n]))
where `d(b: bool) = 1 if b else 0`. Can be used to get classification accuracy.
- Create a table of True Positive, True Negative, False Positive, False Negative
- Can be expanded for >2 classes
Sensitivity = TP / (TP + FN) = TP / (All Positive)
Specificity = TN / (TN + FP) = TN / (All Negative)
- Plot sensitivity and complementary specificity (`1 - Spec`) on a graph
- Modify the model in some way (e.g. change threshold)
- Check which modification has a point in the ROC graph closest to the top left corner (i.e. highest sensitivity and specificity)
- Get the area under the ROC curve
- Higher is better
- That way we get a performance metric that is independent of the variable we changed during ROC analysis (e.g. we might change threshold at run time)
Do one-against-all ROC analyses, so if you have 3 classes, you have 3 ROC graphs
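A rough sketch of these metrics for a binary classifier (assumes boolean `truth` arrays and real-valued `scores`; for more classes, repeat it one-against-all as noted above):

```python
import numpy as np

def sensitivity_specificity(truth, predicted):
    """truth/predicted are boolean arrays, True = positive class."""
    tp = np.sum(predicted & truth)
    tn = np.sum(~predicted & ~truth)
    fp = np.sum(predicted & ~truth)
    fn = np.sum(~predicted & truth)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(truth, scores, thresholds):
    """Sweep the decision threshold to trace the ROC curve, then estimate
    the area under it with the trapezoid rule (higher is better)."""
    points = []
    for th in thresholds:
        sens, spec = sensitivity_specificity(truth, scores >= th)
        points.append((1 - spec, sens))   # x = 1 - specificity, y = sensitivity
    points.sort()
    xs, ys = zip(*points)
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))
```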
- Take a percentage of the dataset and don't show it to the model (e.g. 80/20)
- Reduces training data
- Reduces representativeness, as you've lost some data
- Separate data into `K` sets
- Train once with `K-1` of the sets, validated on the remaining set
- Repeat, cycling the validation set
Special case of k-fold where K = N
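A sketch of generating the `K` folds as index sets (the `seed` is arbitrary):

```python
import numpy as np

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, cycling which fold is held out."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_examples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# k = n_examples gives leave-one-out as the special case noted above
```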
- No training, just store training examples and use those
- Lazy, does all work on inference
- Local approximation / generalisation
- Store all `N` training examples
- Choose parameter `k < N`
- Find the `k` nearest instances in the training set
- Find the most common class in these `k` nearest
  - In case of a tie, assign randomly from the tied classes
- Same as above, but average the `k` nearest labels
- Could use any distance measurement: Euclidean, Manhattan, cosine angle
  - Cosine angle measures the difference in angles, so we essentially ignore the "magnitude" of the vectors
- Class imbalance: if you have a class with only 5 instances and a `k` of 11, you will never classify this class
- Classification performance cost is high, even though training performance cost is non-existent
- There may be irrelevant attributes in the dataset
- Not too many attributes (~20)
  - Because kNN takes each variable into account with equal weighting, and more attributes implies more attributes that don't mean anything
- Lots of training data
- Can learn complex target functions (lots of wiggly class boundaries)
- Weight the effect of each label by the inverse of the distance to that instance
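A minimal kNN classification sketch using Euclidean distance and an unweighted majority vote (ties are broken arbitrarily here rather than randomly; the distance weighting above could be added by scaling each vote):

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training examples."""
    dists = np.linalg.norm(train_X - query, axis=1)  # distance to every stored example
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```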
- Generalisation vs overfitting
- Error of a model consists of two components: `M = B² + V`
  - `B²` is the bias squared: the mismatch between the model and the actual process that generated the data
  - `V` is the variance: a more complex model has higher variance
- Non-prob models give hard classifications, `C=1, C=2...`
- While prob models give percentage likeliness, `C = 10% 1, 90% 2`
- Provides a lot more information
- Important when the cost of misclassification is high
P(x|y) = P(y|x) * P(x) / P(y)
- `P(x)` is the prior belief: the probability of `x` occurring, without any other data
- `P(y|x)` is the class-conditional likelihood: the probability of `y` occurring given `x` occurred
- `P(y)` is the data evidence: the marginal probability of the observed data `y`
  - `= Σ(P(y && x))` for all `x`
  - `= Σ(P(y|x)P(x))` for all `x`
- `P(x|y)` is the posterior probability: the probability of `x` occurring after seeing the new data `y`
- Pick the class that optimises the posterior probability
- For the "maximum a posteriori hypothesis":
  - `argmax_c P(c|x)`, where `c` is the class, and `x` is the observed data
  - `= argmax_c P(x|c) * P(c) / P(x)`
  - `= argmax_c P(x|c) * P(c)`, because `P(x)` is independent of `c`, and can be omitted
- For the "maximum likelihood hypothesis":
  - We omit `P(c)`, therefore assuming all classes are equally likely to occur
  - The equation becomes: `argmax_c P(x|c)`
- We're now classifying a list of data `X = {x₁, x₂...}`
- Maximum a posteriori now becomes: `argmax_c P(x₁, x₂... | c) * P(c)`
- The naive assumption is that all attributes in `X` are conditionally independent, thus: `P(X|c) = P(x₁|c) * P(x₂|c) * ...`
  - Although, theoretically, this assumption is often violated...
  - It simplifies the process, and works well in practice
- Example:
  - If we have attributes "has runny nose", "has temperature", and want to predict "has flu", the equation becomes: `P(flu|RN, TMP) ∝ P(RN|flu) * P(TMP|flu) * P(flu)`
- We calculate the mean (`μ`) and variance (`σ²`) for each attribute in `X`
- Then, `P(x|μ, σ²)` is given by the Gaussian formula
- Then, we can calculate the MAP and ML hypotheses
- We can do a multi-variate Gaussian where `μ` is a vector and `Σ` is a matrix; this is non-naive
  - We can also use the naive Bayes assumption here to deal with small amounts of training data
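A sketch of the naive (per-attribute) Gaussian version, computed in log space to avoid underflow (the `1e-9` added to the variance is just to avoid division by zero):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per class: prior P(c), plus a mean and variance for every attribute."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model

def predict_map(model, x):
    """MAP hypothesis: argmax_c P(x|c) * P(c), in log space."""
    def log_posterior(prior, mu, var):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik   # drop np.log(prior) for the ML hypothesis
    return max(model, key=lambda c: log_posterior(*model[c]))
```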
- Given some data `X = {x₁, x₂...}`, can we group similar ones together?
- Similarity
  - We measure similarity as closeness, i.e. `(x₁ - x₂)²`
- We want to find a set of `K` parameters (the cluster centers)
- Each cluster is represented by its center
  - Say cluster `μ` contains `X = {x₁, x₂..}`
  - We represent it as the average of `X`
- Algorithm:
  1. Randomly initialise `K` clusters `C = {μ₁, μ₂...}`
  2. Assign each of `X` to its closest cluster in `C`
  3. Update each cluster in `C` to be represented by the average of its new members
  4. If the members of each `C` do not change, finish
     - Otherwise, go to step 2
- Alternate algorithm:
  1. Randomly set all in `X` to a random cluster in `C`
     - `C` originally has no center until `X`s are assigned to it
  2. Update each cluster in `C` to be the average of its assigned elements
  3. Continue from step 2 in the original algorithm
- We can evaluate the performance of a run of K-means by calculating the intra-group distance estimate: `D = avg(for c in C, for x in c: (x - c)²)`
- K-means will find local minima for `D`
  - We can find a better `D` by re-running K-means several times, and picking the clustering with the best `D`
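A sketch of the first algorithm, returning the intra-group distance `D` so repeated runs can be compared (initialising from randomly chosen data points is one common choice):

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Plain K-means: returns centers, assignments and intra-group distance D."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # step 1: random init
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                             # step 2: closest cluster
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])  # step 3: update means
        if np.allclose(new_centers, centers):                     # step 4: converged
            break
        centers = new_centers
    D = np.mean(np.sum((X - centers[assign]) ** 2, axis=1))       # intra-group distance
    return centers, assign, D

# Re-run with different seeds and keep the run with the smallest D
```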
- How to choose `K`?
  - When we increase `K`, `D` decreases, but we do not get better results
  - Best to increase `K` until you stop seeing large improvements in `D`
- If we have non-linear groups, it is hard to cluster them with linear K-means comparison metrics
- So we can project the data into some new space, perhaps by squaring the elements: `φ(x) = x²`
  - The kernel function is then `k(x₁, x₂) = φ(x₁).t ⋅ φ(x₂)`
- Some kernel functions:
  - Linear kernel: `k(x₁, x₂) = x₁.t ⋅ x₂`
  - Gaussian kernel: `k(x₁, x₂) = exp{-γ||x₁ - x₂||²}`
  - Polynomial kernel: `k(x₁, x₂) = (x₁.t ⋅ x₂ + c)^β`
- For K-means, whenever we compare two elements we now use this kernel function
- When calculating the distance to a cluster, we average the cluster then use the kernel function
- We can use the kernel trick to estimate the distance between an element of `X` and a cluster in `C`
- Kernel trick:
  - `D(x, μ) = (x - μ).t ⋅ (x - μ)`, where `μ = avg(for x' in μ)`
  - Expanding the brackets:
    - `D(x, μ) = (x.t ⋅ x) - 2*avg((x'.t ⋅ x) for x' in μ) + avg((x'.t ⋅ x'') for x' in μ, for x'' in μ)`
  - And then replace each dot product with the kernel function:
    - `D(x, μ) = k(x, x) - 2*avg(k(x', x) for x' in μ) + avg(k(x', x'') for x' in μ, for x'' in μ)`
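A sketch of that kernelised distance, here with a Gaussian (RBF) kernel (`gamma` is an arbitrary choice):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_distance(x, cluster, kernel=rbf_kernel):
    """D(x, μ) computed only through kernel evaluations, never through φ(x)."""
    k_xx = kernel(x, x)
    k_xc = np.mean([kernel(xp, x) for xp in cluster])
    k_cc = np.mean([kernel(xp, xpp) for xp in cluster for xpp in cluster])
    return k_xx - 2 * k_xc + k_cc
```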
- Sensitive to initialisation
- Computationally expensive, having to calculate `n²` distances each iteration
- Need to run repeatedly
- Makes hard assumptions: each element is 100% belonging to 1 cluster
- Agglomerative (bottom-up) approach
- Each object starts as a singleton cluster
- The two most similar clusters are merged iteratively
- Stops when all objects are in the same cluster
- Divisive (top-down) approach
- Each object starts in same cluster
- Outsider objects from least cohesive cluster are removed iteratively
- Stops when all objects are in singleton clusters
- Measuring similarity between clusters:
  - Smallest distance between elements
    - Clusters can get very large
  - Largest distance between elements
    - Small, round clusters
  - Average distance between elements
    - Compromise between small and large
Algorithm:
1. Each object is a singleton cluster
2. Compute distance between all pairs of clusters
3. Merge the closest pair
4. If there is more than one cluster, go to step 2
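A sketch of this agglomerative algorithm, stopping at a chosen number of clusters instead of merging all the way to one (`linkage=min` gives the smallest-distance rule; swap in `max` or a mean for the other rules; the brute-force pair search is slow but keeps the steps explicit):

```python
import numpy as np

def agglomerative(X, target_clusters=1, linkage=min):
    """Bottom-up clustering: start with singletons, repeatedly merge the
    closest pair of clusters until `target_clusters` remain."""
    clusters = [[i] for i in range(len(X))]          # each object is a singleton

    def cluster_dist(a, b):
        return linkage(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        # distance between every pair of current clusters
        pairs = [(cluster_dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)                          # merge the closest pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```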
- Can tune the level of clustering more easily
- More efficient than K-means
- Generative classifiers generate a model for each class
- e.g. Bayesian classifiers have some data for existing classes, and for new data try to see which class model fits best
- Discriminative classifiers attempt to define where these classes cross over
- TODO: These definitions seem a bit weak, are they correct?
- We want to find a linear hyperplane that separates the two classes
  - Formalised as: `w.t ⋅ x + b = 0`
- We can then find the best `w` and `b` that optimise this decision boundary
  - We end up having to perform `argmin_w 1/2 w.t ⋅ w`, with the constraint that `tₙ(w.t ⋅ xₙ + b) ≥ 1`
  - See slides, I ain't copying that out
- Called SVMs due to the vectors closest to the decision boundary that "support" it
- Allow some data points to lie closer to (or even over) the decision boundary
- This is done by relaxing the constraint to: `tₙ(w.t ⋅ xₙ + b) ≥ 1 - ξₙ` and `ξₙ ≥ 0`
- The optimisation problem becomes: `argmin_w 1/2 w.t ⋅ w + C*sum(ξₙ)`
  - `C` controls the "softness" of the boundary
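The constrained problem above is normally handed to a QP solver; an equivalent unconstrained view is minimising `1/2 w.t ⋅ w + C*sum(max(0, 1 - tₙ(w.t ⋅ xₙ + b)))`, which a plain sub-gradient descent sketch can illustrate (labels must be -1/+1; the learning rate and iteration count are arbitrary):

```python
import numpy as np

def train_linear_svm(X, t, C=1.0, lr=0.001, iters=1000):
    """Sub-gradient descent on the hinge-loss form of the soft-margin SVM.
    t must contain labels -1 or +1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = t * (X @ w + b)
        violated = margins < 1                # points inside the margin (ξₙ > 0)
        grad_w = w - C * (t[violated][:, None] * X[violated]).sum(axis=0)
        grad_b = -C * t[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Predict with sign(w.t ⋅ x + b); a larger C gives a "harder" margin
```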
- Project into another space: `x → φ(x)`
- But we can just use the kernel trick: `x₁.t ⋅ x₂ → k(x₁, x₂)`
- Gaussian kernel: `exp{-β||x₁ - x₂||²}`
  - `β` controls model complexity
- If we have `N'` classes, have `N'` different classifiers that do 1-vs-rest classification
  - Each SVM returns a confidence score; we pick the one that gives the most confidence
- Can also train `N'(N' - 1) / 2` different binary classifiers (1-vs-1)
  - Assign according to maximum voting
  - Addresses class imbalance in the data
- Dimensionality is the number of features in the data
- Gradually add features, seeing how it affects performance
- Make new features by merging old features (linearly/non-linearly)
  - `X = Y⋅W`, where
    - `Y` is `(N x M)`, the original data with high dimensionality `M`
    - `W` is `(M x D)`, the matrix that reduces dimensionality to `D`
    - `X` is `(N x D)`, the data with reduced dimensionality `D`
- We want to preserve the interesting aspects of the data
- Define the columns of `W` one by one
  - This means we define one output dimension at a time
  - `xₙ = Y⋅wₙ` gives the `n`th output dimension
  - Each `wₙ` is defined so that
    - The variance of `xₙ` is maximised
    - The new column `wₙ` is orthogonal to previous columns
- `W` is the collection of eigenvectors of the covariance matrix `Σ` of `Y`
- Calculating `Σ`
  - `Σₘₙ = avg((yₘ - μₘ)(yₙ - μₙ) for y in Y)`
  - If we set `Y' = Y - avg(Y)`, then `Σ = (1/N)(Y'.t ⋅ Y')`
- Calculating `W`
  - `wₙ.t ⋅ wₘ = 0` when `n ≠ m`
  - Choose `wₙ` to maximise variance: `argmax_w var(Y'w)`
    - `var(Y'w) = (Y'w).t ⋅ (Y'w) = w.t ⋅ Y'.t ⋅ Y' ⋅ w = w.t ⋅ NΣ ⋅ w`
    - So we solve `argmax_w w.t ⋅ Σ ⋅ w`
  - Must have the constraint `w.t ⋅ w = 1`, otherwise we can just keep increasing `w` to increase variance
  - Use the Lagrange multiplier trick to add the constraint in: `argmax_w (w.t⋅Σ⋅w - λ(w.t⋅w - 1))`
  - Differentiate w.r.t. `w` and set to 0: `2Σw - 2λw = 0`, i.e. `Σw = λw`
  - Solve using the eigenvalue/eigenvector method (not in this module)
- Aside about eigenvectors
  - `Σw = λw` means that multiplying `w` by the matrix `Σ` only scales it by a scalar factor
  - `w` is called an eigenvector, and `λ` is the eigenvalue
- Solving `Σw = λw` gives us `M` eigenvectors and `M` eigenvalues
  - The one with the highest `λ` has the highest variance (we can see this by multiplying both sides by `w.t`)
- In summary: the principal components of `Y` are the eigenvectors `{w₁, w₂...}`, ordered by their eigenvalues `{λ₁, λ₂...}`
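Putting the whole recipe together as a sketch (centre the data, build `Σ`, take the top-`D` eigenvectors, project):

```python
import numpy as np

def pca(Y, D):
    """Project N x M data down to N x D using the top-D principal components."""
    Y_centered = Y - Y.mean(axis=0)               # Y' = Y - avg(Y)
    cov = (Y_centered.T @ Y_centered) / len(Y)    # Σ = (1/N)(Y'.t ⋅ Y')
    eigvals, eigvecs = np.linalg.eigh(cov)        # solve Σw = λw (Σ is symmetric)
    order = np.argsort(eigvals)[::-1]             # sort by eigenvalue, largest first
    W = eigvecs[:, order[:D]]                     # W is M x D
    return Y_centered @ W                         # X = Y'·W, the reduced data
```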
- Combine weak classifiers to make a strong classifier
- Boosting
- Train a new classifier focusing on training examples misclassified by the earlier model
- Bagging (bootstrap aggregation)
- Generate new training data as a random subset of original data
- Train new classifier on this subset
- For each split node `n`, go right if `fₙ(x) ≤ thₙ`, and go left otherwise, where
  - `x` is the `N`-dimensional attribute vector
  - `fₙ` is of the form `f(x) = w₀x₀ + w₁x₁ + ... + wₙxₙ`
  - `thₙ` is the threshold value
- For each leaf node, return a probability distribution across all classes
- Select a subset of the data `X`
- Create a set of random features `f`
  - `f(x) = ax₁ + bx₂` for 2D data, where `a, b` are chosen randomly
- Try several combinations of `f` and `th`
  - Separate `X` according to `f` and `th` into `Xₗ, Xᵣ`
  - Pick the `f` and `th` that maximise the information gain of this split: `IG(Xₗ, Xᵣ) = E(Xₗ + Xᵣ) - (|Xₗ|/|X|)E(Xₗ) - (|Xᵣ|/|X|)E(Xᵣ)`
    - where `E = -sum(P(c)log₂(P(c)) for c in classes)` is the entropy
- Recurse for the left and right children
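A sketch of the entropy and information gain calculations used to score a candidate split:

```python
import numpy as np

def entropy(labels):
    """E = -sum(P(c) * log2(P(c))) over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(left_labels, right_labels):
    """IG of splitting a parent node into left/right children."""
    parent = np.concatenate([left_labels, right_labels])
    n_l, n_r, n = len(left_labels), len(right_labels), len(parent)
    return (entropy(parent)
            - (n_l / n) * entropy(left_labels)
            - (n_r / n) * entropy(right_labels))

# Training a node = trying several random (f, th) pairs and keeping the one
# whose split maximises information_gain(X_left, X_right)
```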
- Generate `n` binary decision trees
  - Each one uses a random subset of the training data
    - Subsets can overlap
    - Increases runtime
- When evaluating, run all and average the response
- We can draw an approximate point cloud based off its mean and covariance matrix
- We will assume two dimensions
- Uniform variance and no covariance:
  - e.g. `[[a, 0], [0, a]]`
  - Circle point cloud, where `a` is proportional to its width
- Non-uniform variance and no covariance
  - e.g. `[[a, 0], [0, b]]`
  - Ellipse point cloud, where `a` is proportional to its width on the `x`-axis, and `b` is proportional to its height on the `y`-axis
  - Ellipse is always axis-aligned
- Uniform variance and covariance
  - e.g. `[[a, c], [c, a]]`
  - Ellipse point cloud, where `a` is proportional to its width on both axes, and `c` is how "thin" the ellipse is
  - Aligned with the `x = y` line
    - Unless `c` is negative, then aligned with `x = -y`
- Non-uniform variance and covariance
  - e.g. `[[a, c], [c, b]]`
  - Ellipse point cloud, where `a` is proportional to its width on the `x`-axis, `b` is proportional to its height on the `y`-axis, and `c` is how "thin" the ellipse is
  - Assuming `b` is positive, it is more aligned to the `x`-axis if `a > b`, and vice versa
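A quick way to check these shapes is to sample from each covariance matrix and compare the empirical covariance (the values of `a`, `b`, `c` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 3.0, 1.0, 0.8
covariances = {
    "uniform, no covariance":       [[a, 0], [0, a]],   # circle
    "non-uniform, no covariance":   [[a, 0], [0, b]],   # axis-aligned ellipse
    "uniform, with covariance":     [[a, c], [c, a]],   # ellipse along x = y
    "non-uniform, with covariance": [[a, c], [c, b]],   # tilted ellipse
}
for name, cov in covariances.items():
    points = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)
    # the empirical covariance of the samples is close to the matrix above;
    # scatter-plotting `points` shows the circle/ellipse shapes described
    print(name, np.cov(points.T).round(1))
```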
- Without the naive Bayes assumption
  - Identical variance, no covariance
    - Linear decision boundary
  - Identical variance, identical covariance
    - Linear decision boundary
  - Non-identical variance, non-identical covariance
    - Non-linear decision boundary
- With the naive Bayes assumption
  - Identical variance
    - Linear decision boundary
  - Non-identical variance
    - Non-linear decision boundary
- Linear regression
- Bayesian classification (maximum a posteriori)
- Bayesian classification (maximum likelihood)
  - Better when classes are equally distributed
- Bayesian classification (naive Bayes)
  - Better for low amounts of data
- K nearest neighbours
  - Worse for a high number of attributes
    - Because it is more likely to include bad (uninformative) attributes
- Support vector machines
  - Better for low amounts of data
  - Better for dealing with outliers
- Randomised decision forests
- K-means clustering
- Hierarchical agglomerative clustering