Beauty in Randomness
Stochastic Gradient Descent (SGD):
- The main algorithm behind Backpropagation, which enables Neural Networks to learn
- This codebase shows SGD in a relaxed context (finding the average of an array of numbers)
Contents:
- Problem Introduction
- SGD High Level Intuition
- Defining Terms
- Why Stochastic?
- SGD Algorithm
- SGD in Python
- Plotting SGD With Varying Ranges
**Problem Introduction**

Although this isn't your average problem, perhaps your `avg()` function is broken and you don't know that `avg() = sum of data / # of data points`. However, for each `guess` you make, you are given (1) a randomly sampled `datapoint` and (2) the distance to that randomly sampled `datapoint`.
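To make this setup concrete, here is a minimal sketch of the feedback available on each guess. The helper `sample_and_distance` is an illustrative name, not a function from the codebase.

```python
import numpy as np

def sample_and_distance(guess: float, dataset: list) -> tuple:
    """Hypothetical oracle: returns (1) a random datapoint and (2) the signed distance of the guess to it."""
    datapoint = float(np.random.choice(dataset))
    return datapoint, guess - datapoint

# We never see the whole dataset at once -- only one sample and one distance per guess.
point, distance = sample_and_distance(guess=0.0, dataset=[1, 2, 3])
```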
Note: we will abbreviate Stochastic Gradient Descent as SGD
**SGD High Level Intuition**

With the information of (1) a random sample and (2) the distance of our guess to that sample, let us try to tweak our guess so that the distance shrinks.

At a high level, SGD minimizes the loss (the difference between a predicted and an actual value) by choosing datapoints randomly and changing the guess by the gradient, or amount of loss from that random point, over many training iterations.
Traveling Analogy:
- We are trying to walk towards USC, however we don't have the address--but hope is not lost, we have a guide!
- After each step, the guide will tell us how close we are to a random classroom
- As the traveler, we will incrementally take small steps and guess towards where we think that random classroom is (since the classroom is at USC), adjusting our step size/direction based on what the guide tells us
SGD Process | Traveling Analogy | Average Problem |
---|---|---|
Target "Ground Truth" Value | USC | Average of the dataset |
Stochastic | Guide randomly choosing a classroom for us to walk towards | Choosing a random datapoint in the array |
Gradient | Guide telling us how far away we are from the classroom | Distance of guess from random datapoint |
Descent | Iteratively walking closer to campus by tweaking our path after each answer from the guide | Changing our guess with respect to the magnitude and direction of the loss |
**Defining Terms**

**Loss Function**
- Intuition: How far away are we from our desired outcome
- Generally, a Loss Function follows the pattern of `Predicted Value By Algorithm - Actual Value in Training Dataset`
  - Note: This is sometimes called an Objective Function
**Stochastic**
- Randomly choosing a datapoint at each training iteration
- Our intuition is that the desired value (the average) is represented by each datapoint
  - Each datapoint has its own probability of being chosen, and the desired value will reflect how prevalent this datapoint is
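A quick sketch of that intuition: when datapoints are sampled uniformly at random, their running mean approaches the dataset's true average. The printed value below is what we would expect, not recorded output.

```python
import numpy as np

dataset = [1, 2, 3]
samples = np.random.choice(dataset, size=10_000)  # every datapoint is equally likely to be drawn
print(samples.mean())  # should land close to the true average of 2
```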
**Gradient**
- Intuition: A relationship between each input parameter and the loss function
- Mathematically represented as a set of numbers describing how each guess changes the loss
  - In order to model the relationship of the input parameters to the Loss Function, each component in the vector is a partial derivative w.r.t. the loss function
  - More information on Partial Derivatives and Vectors
**Descent**
- Intuition: If we now know the relationship between an input parameter and the loss function, we can tweak certain parameters to minimize the loss function
- Iteratively walk in the direction that minimizes the loss (negative to the gradient)
**Why Stochastic?**
- Intuition: If at every step towards USC we calculated our distance from every classroom, we would eventually get to USC, but this would take many extra calculations.
  - Instead, by stochastically (randomly) choosing a classroom, we can get a good approximation of where we need to go
- Since the Gradient portion of Gradient Descent requires taking the partial derivative of the Loss function:
  - (1) this requires the Loss function to be differentiable
  - (2) it may be extremely computationally costly when the loss function takes in many parameters (usually a large weight vector representing the features of the data)
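To make the cost difference concrete, here is a rough sketch (not from the codebase) contrasting a full-batch update, which touches every datapoint, with a stochastic update, which touches only one:

```python
import numpy as np

def full_batch_loss(guess: float, dataset: list) -> float:
    # Touches every datapoint on every iteration -- expensive when the dataset is large.
    return float(np.mean([guess - x for x in dataset]))

def stochastic_loss(guess: float, dataset: list) -> float:
    # Touches a single random datapoint -- a cheap but noisy approximation of the full-batch loss.
    return float(guess - np.random.choice(dataset))
```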
**SGD Algorithm**
1) `Loss = Guess - Random Sampled Point`
2) `Guess.next = Guess.current - learning_rate * Loss`
3) `Guess.current <- Guess.next`, and repeat
Example:
- `input array = [1, 2, 3]` | `average = (1 + 2 + 3) / 3 = 2`
- `learning_rate = 0.5` (quite high, but illustrative for our purposes)
Iteration | Guess.current | Stochastic Datapoint | Loss | Guess.next |
---|---|---|---|---|
1 | 0 | 3 | Loss = 0 - 3 = -3 | 0 - (0.5)(-3) = 1.5 |
2 | 1.5 | 2 | Loss = 1.5 - 2 = -0.5 | 1.5 - (0.5)(-0.5) = 1.75 |
3 | 1.75 | 1 | Loss = 1.75 - 1 = 0.75 | 1.75 - (0.5)(0.75) = 1.375 |
4 | 1.375 | 3 | Loss = 1.375 - 3 = -1.625 | 1.375 - (0.5)(-1.625) = 2.1875 |

After 4 iterations, we have approximated the average to be ~2.19, which is already close to the actual average of 2.
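The table can be reproduced by hard-coding the sequence of sampled datapoints; a small sketch:

```python
learning_rate = 0.5
guess = 0.0
for datapoint in [3, 2, 1, 3]:            # the "random" samples from the table, in order
    loss = guess - datapoint              # loss against a single sampled point
    guess = guess - learning_rate * loss  # move the guess against the loss
    print(guess)                          # prints 1.5, 1.75, 1.375, 2.1875
```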
- See in iteration 3 that the `guess` value was approaching the correct value of 2 but decreased because the datapoint 1 was chosen
- Since the learning rate is 0.5, the guess is only tweaked by a little bit (imagine if the `learning_rate` approached 1.0; then the guess would oscillate--see the sketch further below)
  - However, the same is true for values that increase the guess; the intuition is that the `learning_rate` dampens the effect of each stochastic datapoint
- Thus, the Learning Rate is a speed vs. accuracy tradeoff, as seen in the Outlier Dataset, which takes 600 training iterations to descend on the correct average value
- When the loss is negative, our guess is below the sampled datapoint
  - Thus, we should increase the guess to shrink the loss (the update subtracts a negative quantity, so the guess goes up)
- When the loss is positive, our guess is above the sampled datapoint
  - Thus, we should go in the opposite direction and decrease the guess to shrink the loss

Note: the loss is based off of a single input/output pair, which we called our randomly sampled datapoint, not every single value. Thus, the change we make based on the sign of the loss may be in the wrong direction, but the `learning_rate` dampens the effect of the wrong-direction datapoints.

Then, if the data is highly spread out, we can see how the average will be "pulled" by extreme values, but after enough sampling iterations each number should have an approximately equal effect on the guess.
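To illustrate the oscillation mentioned above, here is a small sketch that cycles through the datapoints deterministically (rather than sampling them) purely to isolate the effect of the learning rate:

```python
def run(dataset: list, learning_rate: float, n_steps: int = 30) -> float:
    guess = 0.0
    for i in range(n_steps):
        datapoint = dataset[i % len(dataset)]            # deterministic cycle, for illustration only
        guess = guess - learning_rate * (guess - datapoint)
    return guess

# With learning_rate = 1.0 the guess jumps exactly onto each datapoint and never settles;
# with a smaller learning rate it damps toward the average of [3, 4, 5], which is 4.
print(run([3, 4, 5], learning_rate=1.0))  # ends on whichever datapoint was visited last
print(run([3, 4, 5], learning_rate=0.1))  # ends near 4
```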
**SGD in Python**
- Use `np.random.choice(list, # of items)` to extract the stochastic datapoint
import numpy as np

def get_loss(guess: float, dataset: list) -> float:
    """
    guess: a starting point on where to approach the true average
    dataset: a list representing all the numbers in the dataset
    """
    # np.random.choice(dataset, 1) returns a one-element array, so take its single entry
    return float(guess - np.random.choice(dataset, 1)[0])
- Let the next guess be tweaked by a fraction of the loss (based on the `learning_rate`)
def get_next_step(guess: float, loss: float, learning_rate: float) -> float:
    """
    Returns the next guess as a small step towards minimizing the loss.
    Since `loss` is defined as how far the guess is off from the true average
    (i.e. `guess - random datapoint`), we move the guess against the loss,
    scaled by the learning rate.
    """
    return float(guess - loss * learning_rate)
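Putting the two helpers together, here is a minimal training loop, assuming `get_loss` and `get_next_step` as defined above are in scope; the names `train_sgd`, `initial_guess`, and `n_iterations` are illustrative, not taken from the codebase.

```python
def train_sgd(dataset: list, initial_guess: float = 0.0,
              learning_rate: float = 0.5, n_iterations: int = 50) -> list:
    """Runs SGD on the averaging problem and returns the guess at every iteration."""
    guesses = [initial_guess]
    guess = initial_guess
    for _ in range(n_iterations):
        loss = get_loss(guess, dataset)                    # distance to one random datapoint
        guess = get_next_step(guess, loss, learning_rate)  # small step against the loss
        guesses.append(guess)
    return guesses

# Example usage: the later guesses should hover around the true average, 2.
history = train_sgd([1, 2, 3], learning_rate=0.5, n_iterations=50)
print(history[-5:])
```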
**Plotting SGD With Varying Ranges**
- Data is plotted using `matplotlib` and displays the datapoints (guesses) from all training iterations
- Data is decimated, a concept from Computer Vision: only every 10th pixel, or in this case every 10th guess, is kept to make table plotting easier
  - The decimated table is printed using the `tabulate` library
def decimate_array(long_list: list, interval: int = 10) -> list:
    """Returns every `interval`-th element of the input list"""
    short_list = []
    for i, val in enumerate(long_list):
        if i % interval == 0:
            short_list.append(val)
    return short_list
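As a quick sketch of how the decimated guesses might be tabulated, assuming the `train_sgd` loop sketched earlier and the third-party `tabulate` package referenced above:

```python
from tabulate import tabulate

history = train_sgd([3, 4, 5], learning_rate=0.5, n_iterations=50)
decimated = decimate_array(history, interval=10)

# Pair each kept guess with the iteration it came from (every 10th iteration).
rows = [(i * 10, guess) for i, guess in enumerate(decimated)]
print(tabulate(rows, headers=["Iteration", "Guess"]))
```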
Modeling Stochastic Gradient Descent with datasets containing varying ranges and random initializations to solve the task of finding the average of a dataset.
Below are sample datasets with varying degrees of spread:
low_range = [3, 4, 5]
medium_range = [3, 4, 5, 30]
high_range = [3, 4, 5, 300]
Range Type | Dataset | # of Training Iterations Needed |
---|---|---|
Tightly Clustered | [3, 4, 5] | 50 |
Small Outlier | [3, 4, 5, 30] | 50 |
[Large Outlier](#large-outlier) | [3, 4, 5, 300] | 600 |
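A hedged sketch of how such an experiment might be run with the `train_sgd` loop sketched earlier; the learning rate here is illustrative, and the iteration counts in the table come from the original experiments, not from this snippet.

```python
datasets = {
    "Tightly Clustered": ([3, 4, 5], 50),
    "Small Outlier":     ([3, 4, 5, 30], 50),
    "Large Outlier":     ([3, 4, 5, 300], 600),
}

for name, (data, n_iterations) in datasets.items():
    history = train_sgd(data, learning_rate=0.5, n_iterations=n_iterations)
    true_average = sum(data) / len(data)
    print(f"{name}: final guess = {history[-1]:.2f}, true average = {true_average:.2f}")
```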