Beauty in Randomness
Stochastic Gradient Descent (SGD):
- The main algorithm behind Backpropagation, which enables Neural Networks to learn
- This codebase shows SGD in a relaxed context (finding the average of an array of numbers)
Contents:
- Problem Introduction
- SGD High Level Intuition
- Defining Terms
- Why Stochastic?
- SGD Algorithm
- SGD in Python
- Plotting SGD With Varying Ranges
**Problem Introduction**

Although this isn't your average problem, perhaps your `avg()` function is broken and you don't know that `avg() = sum of data / # of data points`. However, for each `guess` you make, you are given (1) a randomly sampled `datapoint` and (2) the distance to that randomly sampled `datapoint`.
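To make this setup concrete, here is a minimal sketch of the feedback available on each guess. The helper `sample_and_distance` is an illustrative name, not a function from the codebase.

```python
import numpy as np

def sample_and_distance(guess: float, dataset: list) -> tuple:
    """Hypothetical oracle: returns (1) a random datapoint and (2) the signed distance of the guess to it."""
    datapoint = float(np.random.choice(dataset))
    return datapoint, guess - datapoint

# We never see the whole dataset at once -- only one sample and one distance per guess.
point, distance = sample_and_distance(guess=0.0, dataset=[1, 2, 3])
```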
Note: we will abbreviate Stochastic Gradient Descent as SGD
**SGD High Level Intuition**

With the information of (1) a random sample and (2) the distance of our guess to that sample, let us try to tweak our guess so that the distance shrinks.

At a high level, SGD minimizes the loss (the difference between a predicted and an actual value) by choosing datapoints randomly and changing the guess by the gradient, or amount of loss from that random point, over many training iterations.
Traveling Analogy:
- We are trying to walk towards USC, however we don't have the address--but hope is not lost, we have a guide!
- After each step, the guide will tell us how close we are to a random classroom
- As the traveler, we will incrementally take small steps and guess towards where we think that random classroom is (since the classroom is at USC), adjusting our step size/direction based on what the guide tells us
SGD Process | Traveling Analogy | Average Problem |
---|---|---|
Target "Ground Truth" Value | USC | Average of the dataset |
Stochastic | Guide randomly choosing a classroom for us to walk towards | Choosing a random datapoint in the array |
Gradient | Guide telling us how far away we are from the classroom | Distance of guess from random datapoint |
Descent | Iteratively walking closer to campus by tweaking our path after each answer from the guide | Changing our guess with respect to the magnitude and direction of the loss |
**Defining Terms**

**Loss Function**
- Intuition: How far away are we from our desired outcome
- Generally, a Loss Function follows the pattern of `Predicted Value By Algorithm - Actual Value in Training Dataset`
  - Note: This is sometimes called an Objective Function
**Stochastic**
- Randomly choosing a datapoint at each training iteration
- Our intuition is that the desired value (the average) is represented by each datapoint
  - Each datapoint has its own probability of being chosen, and the desired value will reflect how prevalent this datapoint is
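A quick sketch of that intuition: when datapoints are sampled uniformly at random, their running mean approaches the dataset's true average. The printed value below is what we would expect, not recorded output.

```python
import numpy as np

dataset = [1, 2, 3]
samples = np.random.choice(dataset, size=10_000)  # every datapoint is equally likely to be drawn
print(samples.mean())  # should land close to the true average of 2
```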
**Gradient**
- Intuition: A relationship between each input parameter and the loss function
- Mathematically represented as a set of numbers describing how each guess changes the loss
  - In order to model the relationship of the input parameters to the Loss Function, each component in the vector is a partial derivative w.r.t. the loss function
  - More information on Partial Derivatives and Vectors
**Descent**
- Intuition: If we now know the relationship between an input parameter and the loss function, we can tweak certain parameters to minimize the loss function
- Iteratively walk in the direction that minimizes the loss (negative to the gradient)
**Why Stochastic?**
- Intuition: If at every step towards USC we calculated our distance from every classroom, we would eventually get to USC, but this would take many extra calculations.
  - Instead, by stochastically (randomly) choosing a classroom, we can get a good approximation of where we need to go
- Since the Gradient portion of Gradient Descent requires taking the partial derivative of the Loss function:
  - (1) this requires the Loss function to be differentiable
  - (2) it may be extremely computationally costly when the loss function takes in many parameters (usually a large weight vector representing the features of the data)
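To make the cost difference concrete, here is a rough sketch (not from the codebase) contrasting a full-batch update, which touches every datapoint, with a stochastic update, which touches only one:

```python
import numpy as np

def full_batch_loss(guess: float, dataset: list) -> float:
    # Touches every datapoint on every iteration -- expensive when the dataset is large.
    return float(np.mean([guess - x for x in dataset]))

def stochastic_loss(guess: float, dataset: list) -> float:
    # Touches a single random datapoint -- a cheap but noisy approximation of the full-batch loss.
    return float(guess - np.random.choice(dataset))
```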
**SGD Algorithm**
1) `Loss = Guess - Random Sampled Point`
2) `Guess.next = Guess.current - learning_rate * Loss`
3) `Guess.current <- Guess.next`, and repeat
Example:
- `input array = [1, 2, 3]` | `average = (1 + 2 + 3) / 3 = 2`
- `learning_rate = 0.5` (quite high, but illustrative for our purposes)
Iteration | Guess.current | Stochastic Datapoint | Loss | Guess.next |
---|---|---|---|---|
1 | 0 | 3 | Loss = 0 - 3 = -3 | 0 - (0.5)(-3) = 1.5 |
2 | 1.5 | 2 | Loss = 1.5 - 2 = -0.5 | 1.5 - (0.5)(-0.5) = 1.75 |
3 | 1.75 | 1 | Loss = 1.75 - 1 = 0.75 | 1.75 - (0.5)(0.75) = 1.375 |
4 | 1.375 | 3 | Loss = 1.375 - 3 = -1.625 | 1.375 - (0.5)(-1.625) = 2.1875 |

After 4 iterations, we have approximated the average to be ~2.19, which is already close to the actual average of 2.
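The table can be reproduced by hard-coding the sequence of sampled datapoints; a small sketch:

```python
learning_rate = 0.5
guess = 0.0
for datapoint in [3, 2, 1, 3]:            # the "random" samples from the table, in order
    loss = guess - datapoint              # loss against a single sampled point
    guess = guess - learning_rate * loss  # move the guess against the loss
    print(guess)                          # prints 1.5, 1.75, 1.375, 2.1875
```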
- See in iteration 3 that the `guess` value was approaching the correct value of 2 but decreased because the datapoint 1 was chosen
- Since the learning rate is 0.5, the guess is only tweaked by a little bit (imagine if the `learning_rate` approached 1.0; then the guess would oscillate--see the sketch further below)
  - However, the same is true for values that increase the guess; the intuition is that the `learning_rate` dampens the effect of each stochastic datapoint
- Thus, the Learning Rate is a speed vs. accuracy tradeoff, as seen in the Outlier Dataset, which takes 600 training iterations to descend on the correct average value
- When the loss is negative, our guess is below the sampled datapoint
  - Thus, we should increase the guess to shrink the loss (the update subtracts a negative quantity, so the guess goes up)
- When the loss is positive, our guess is above the sampled datapoint
  - Thus, we should go in the opposite direction and decrease the guess to shrink the loss

Note: the loss is based off of a single input/output pair, which we called our randomly sampled datapoint, not every single value. Thus, the change we make based on the sign of the loss may be in the wrong direction, but the `learning_rate` dampens the effect of the wrong-direction datapoints.

Then, if the data is highly spread out, we can see how the average will be "pulled" by extreme values, but after enough sampling iterations each number should have an approximately equal effect on the guess.
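To illustrate the oscillation mentioned above, here is a small sketch that cycles through the datapoints deterministically (rather than sampling them) purely to isolate the effect of the learning rate:

```python
def run(dataset: list, learning_rate: float, n_steps: int = 30) -> float:
    guess = 0.0
    for i in range(n_steps):
        datapoint = dataset[i % len(dataset)]            # deterministic cycle, for illustration only
        guess = guess - learning_rate * (guess - datapoint)
    return guess

# With learning_rate = 1.0 the guess jumps exactly onto each datapoint and never settles;
# with a smaller learning rate it damps toward the average of [3, 4, 5], which is 4.
print(run([3, 4, 5], learning_rate=1.0))  # ends on whichever datapoint was visited last
print(run([3, 4, 5], learning_rate=0.1))  # ends near 4
```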
**SGD in Python**
- Use `np.random.choice(list, # of items)` to extract the stochastic datapoint
import numpy as np

def get_loss(guess: float, dataset: list) -> float:
    """
    guess: a starting point on where to approach the true average
    dataset: a list representing all the numbers in the dataset
    """
    # np.random.choice(dataset, 1) returns a one-element array, so take its single entry
    return float(guess - np.random.choice(dataset, 1)[0])
- Let the next guess be tweaked by a fraction of the loss (based on the `learning_rate`)
def get_next_step(guess: float, loss: float, learning_rate: float) -> float:
    """
    Returns the next guess as a small step towards minimizing the loss.
    Since `loss` is defined as how far the guess is off from the true average
    (i.e. `guess - random datapoint`), we move the guess against the loss,
    scaled by the learning rate.
    """
    return float(guess - loss * learning_rate)
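Putting the two helpers together, here is a minimal training loop, assuming `get_loss` and `get_next_step` as defined above are in scope; the names `train_sgd`, `initial_guess`, and `n_iterations` are illustrative, not taken from the codebase.

```python
def train_sgd(dataset: list, initial_guess: float = 0.0,
              learning_rate: float = 0.5, n_iterations: int = 50) -> list:
    """Runs SGD on the averaging problem and returns the guess at every iteration."""
    guesses = [initial_guess]
    guess = initial_guess
    for _ in range(n_iterations):
        loss = get_loss(guess, dataset)                    # distance to one random datapoint
        guess = get_next_step(guess, loss, learning_rate)  # small step against the loss
        guesses.append(guess)
    return guesses

# Example usage: the later guesses should hover around the true average, 2.
history = train_sgd([1, 2, 3], learning_rate=0.5, n_iterations=50)
print(history[-5:])
```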
**Plotting SGD With Varying Ranges**
- Data is plotted using `matplotlib` and displays the datapoints (guesses) from all training iterations
- Data is decimated, a concept from Computer Vision: only every 10th pixel, or in this case every 10th guess, is kept to make table plotting easier
  - The decimated table is printed using the `tabulate` library
def decimate_array(long_list: list, interval: int = 10) -> list:
    """Returns every `interval`-th element of the input list"""
    short_list = []
    for i, val in enumerate(long_list):
        if i % interval == 0:
            short_list.append(val)
    return short_list
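As a quick sketch of how the decimated guesses might be tabulated, assuming the `train_sgd` loop sketched earlier and the third-party `tabulate` package referenced above:

```python
from tabulate import tabulate

history = train_sgd([3, 4, 5], learning_rate=0.5, n_iterations=50)
decimated = decimate_array(history, interval=10)

# Pair each kept guess with the iteration it came from (every 10th iteration).
rows = [(i * 10, guess) for i, guess in enumerate(decimated)]
print(tabulate(rows, headers=["Iteration", "Guess"]))
```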
Modeling Stochastic Gradient Descent with datasets containing varying ranges and random initializations to solve the task of finding the average of a dataset.
Below are sample datasets with varying degrees of spread:
low_range = [3, 4, 5]
medium_range = [3, 4, 5, 30]
high_range = [3, 4, 5, 300]
Range Type | Dataset | # of Training Iterations Needed |
---|---|---|
Tightly Clustered | [3, 4, 5] | 50 |
Small Outlier | [3, 4, 5, 30] | 50 |
[Large Outlier](#large-outlier) | [3, 4, 5, 300] | 600 |
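A hedged sketch of how such an experiment might be run with the `train_sgd` loop sketched earlier; the learning rate here is illustrative, and the iteration counts in the table come from the original experiments, not from this snippet.

```python
datasets = {
    "Tightly Clustered": ([3, 4, 5], 50),
    "Small Outlier":     ([3, 4, 5, 30], 50),
    "Large Outlier":     ([3, 4, 5, 300], 600),
}

for name, (data, n_iterations) in datasets.items():
    history = train_sgd(data, learning_rate=0.5, n_iterations=n_iterations)
    true_average = sum(data) / len(data)
    print(f"{name}: final guess = {history[-1]:.2f}, true average = {true_average:.2f}")
```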