---
layout: page
title: Lecture 4 Recap - Linear Classification
mathjax: true
weight: 0
---
<section class="main-container text">
<div class="main">
<h4>Date: February 5, 2020 (<a href="https://docs.google.com/forms/d/e/1FAIpQLSfFny0H7gR64WhBOGT-p-8cX0mnWybx36f_FdQzg-iEmDATRQ/viewform" target="_blank">Concept Check</a>, <a href="https://docs.google.com/forms/d/e/1FAIpQLSfFny0H7gR64WhBOGT-p-8cX0mnWybx36f_FdQzg-iEmDATRQ/viewanalytics" target="_blank">Class Responses</a>, <a href="{{ site.baseurl }}/ccsolutions" target="_blank">Solutions</a>)</h4>
<h4>Relevant Textbook Sections: 3.1 - 3.5</h4>
<h4>Cube: Supervised, Discrete, Nonprobabilistic</h4>
<br>
<h3>Lecture 4 Summary</h3>
<ul>
<li><a href="#recap4_1">Introduction</a></li>
<li><a href="#recap4_2">Classification vs Regression</a></li>
<li><a href="#recap4_3">Linear Classification</a></li>
<li><a href="#recap4_4">Metrics</a></li>
</ul>
<h2 id="recap2_1">Relevant Videos</h2>
<ul>
<li><a href="https://www.youtube.com/watch?v=Y6DA7XFh_io&list=PLrH1CxyJ7Vqb-pHzfUClJNXBDAKajHE74&index=5&t=0s">Gradient descent</a></li>
<li><a href="https://www.youtube.com/watch?v=iUzy4GEmSN4&list=PLrH1CxyJ7Vqb-pHzfUClJNXBDAKajHE74&index=6&t=0s">Binary Linear Classification</a></li>
</ul>
<br>
<h2 id="recap4_1">Introduction</h2>
In the previous lecture, we covered probabilistic regression. As a recap, in probabilistic regression, a generative model
allows us to perform different types of inference, such as posterior inference over the weight parameters as well as
the posterior predictive for new data.
<br>
<br>
In this lecture, we covered linear classification. The goal of classification is to predict a discrete category $y$ given $\mathbf{x}$,
rather than a continuous $y$.
<h2 id="recap4_2">Classification vs Regression</h2>
Conceptually, classification is not too different from regression. We follow the same general steps:
<ol>
<li>Choose a model (linear vs non-linear boundary)</li>
<li>Choose a loss function
<br>
We will write $y \in \{C_1,\cdots,C_K\}$.
<br>
Depending on the problem, we encode $y$ as $0/1$, $+/-$, or a one-hot vector such as
$\begin{pmatrix} 0 & 0 & 1 & 0 \end{pmatrix}$
</li>
</ol>
We can still use KNN for classification by returning the majority vote over the labels of the $k$ nearest neighbors of $\mathbf{x}$, as in the sketch below.
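<br><br>
As a concrete illustration (not code from lecture), here is a minimal numpy sketch of KNN classification by majority vote, assuming Euclidean distance; the function name and interface are hypothetical:
<pre><code>import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # return the majority label among those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
</code></pre>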
<h2 id="recap4_3">Linear Classification</h2>
<h3>Choose a Model: Linear Boundary</h3>
We introduce a new parametric model. It is simple, but we can apply a basis $\phi$ to $\mathbf{x}$ to obtain more complex boundaries of separation.
$$\hat{y} = \text{sign}(\mathbf{w}^T\mathbf{x} + w_0)$$
Before deciding on a loss, let's just understand this model and what it does:
Consider the decision boundary $\mathbf{w}^T\mathbf{x} + w_0 = 0$:
<br>
In the 2D case:
<br>
$$\begin{align*}
w_1x_1 + w_2x_2 + w_0 &= 0 \\
x_2 &= -\frac{w_1}{w_2}x_1 - \frac{w_0}{w_2}
\end{align*}$$
This is the equation of a line, so we have a linear boundary!
<br>
<br>
Generalizing: consider a vector $\mathbf{s} = \mathbf{x_2} - \mathbf{x_1}$ connecting two points $\mathbf{x_1}$ and $\mathbf{x_2}$ on the boundary:
$$\begin{align*}
\mathbf{s}\cdot \mathbf{w} &= \mathbf{x_2}\cdot \mathbf{w} - \mathbf{x_1} \cdot \mathbf{w} \\
&= \mathbf{x_2}\cdot \mathbf{w} + w_0 - \mathbf{x_1} \cdot \mathbf{w} - w_0\\
&= 0 - 0 = 0
\end{align*}
$$
$\mathbf{s}$ is orthogonal to $\mathbf{w}$.
<br>
<br>
This implies that $\mathbf{w}$ is orthogonal to the boundary; the bias $w_0$ sets the offset of the boundary from the origin.
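<br><br>
As a small sketch (an illustration, not code from lecture), this prediction rule can be written in numpy as follows, assuming labels in $\{-1, +1\}$:
<pre><code>import numpy as np

def predict(w, w0, X):
    # hat{y} = sign(w^T x + w_0) for each row x of X
    # returns +1 / -1 (0 only for points exactly on the boundary)
    return np.sign(X @ w + w0)
</code></pre>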
<h3>Choose a Loss Function: Hinge Loss</h3>
Let's consider the $0/1$ function:
$$
\ell_{0/1}(z) =
\left\{ \begin{array}{cc}
1 \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right.
$$
and the loss function
$$\mathcal{L}(\textbf{w}) = \sum_{n=1}^N \ell_{0/1}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right)$$
which penalizes us whenever the signs of $y_n$ and $\mathbf{w}^T\mathbf{x}_n + w_0$ do not match.
<br>
There is, however, an issue with this loss: it has an uninformative gradient. The gradient is zero almost everywhere, so it tells us only whether we are right or wrong, not how to improve.
<br><br>
Let us now consider the hinge loss, or linear rectifier function,
$$
\ell_{\text{hinge}}(z) =
\left\{ \begin{array}{cc}
z \quad& z > 0 \\
0 \quad& \text{else}
\end{array} \right. = \max(0,z)
$$
and the loss function
$$\begin{align*}
\mathcal{L}(\textbf{w}) &= \sum_{n=1}^N \ell_{\text{hinge}}\left(-y_n(\mathbf{w}^T\mathbf{x}_n + w_0)\right) \\
&= -\sum_{m \in S}y_m(\mathbf{w}^T\mathbf{x}_m + w_0)
\end{align*}
$$
where the set $S$ consists of all $n$ such that $\text{sign}(y_n) \neq \text{sign}(\mathbf{w}^T\mathbf{x}_n + w_0)$.
<br><br>
Now, we can take gradients!
$$\frac{\partial}{\partial \mathbf{w}}\mathcal{L}(\textbf{w}) = -\sum_{m \in S}y_m\mathbf{x}_m$$
Note: we have absorbed the bias term $w_0$ into $\mathbf{w}$ here (by appending a constant $1$ feature to each $\mathbf{x}$).
<br><br>
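A minimal numpy sketch of this full-batch gradient (an illustration; it assumes labels in $\{-1,+1\}$ and the bias-absorbed convention above; the function name is hypothetical):
<pre><code>import numpy as np

def hinge_loss_grad(w, X, y):
    # X: (N, D+1) with a column of ones appended so w_0 is absorbed into w
    # y: (N,) labels in {-1, +1}
    margins = y * (X @ w)
    S = margins < 0                      # misclassified points (sign mismatch)
    # dL/dw = -sum_{m in S} y_m x_m
    return -(X[S] * y[S][:, None]).sum(axis=0)
</code></pre>
<br><br>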
<h4>How to solve for $\mathbf{w}^*$</h4>
We can use <strong>stochastic</strong> gradient descent to optimize $\mathbf{w}$: use a <strong>mini-batch</strong> of
our data (good if the dataset is large; the gradient is noisier, though!).
<br><br>
What if we took just <strong>one</strong> (incorrectly classified) datum:
$$\mathcal{L}^{(i)}(\mathbf{w}) = -y_i\mathbf{w}^T\mathbf{x}_i$$ and
$$\mathbf{w} \leftarrow \mathbf{w} + \eta y_i \mathbf{x}_i$$
This is the 1958 <strong>Perceptron</strong> algorithm: if $\hat{y}_n = y_n$, do nothing; otherwise, apply the update above. Repeat until there are no errors.
This converges if the data are linearly separable in the feature space.
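<br><br>
A minimal sketch of this procedure (an illustration, not the exact code from lecture), assuming labels in $\{-1,+1\}$ and a constant $1$ feature appended to each $\mathbf{x}$ so the bias is absorbed:
<pre><code>import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    # X: (N, D+1) with a column of ones appended (bias absorbed into w)
    # y: (N,) labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:       # misclassified (or on the boundary)
                w += eta * y_i * x_i       # w <- w + eta * y_i * x_i
                errors += 1
        if errors == 0:                    # no mistakes this pass: converged
            break
    return w
</code></pre>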
<!-- <h3>A Different Loss : Fisher's Discriminant</h3>
If $\mathbf{w}$ projects $\mathbf{x}$ into 1-D, why not explicitly seek clustering in that space?
<br><br>
We define empirical means and variances
$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{y_n \in C_1}\mathbf{x}_n,
~~~ \mathbf{S}_1 = \frac{1}{N_1}\sum_{y_n \in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T \\
\mathbf{m}_2 = \frac{1}{N_2}\sum_{y_n \in C_2}\mathbf{x}_n,
~~~ \mathbf{S}_2 = \frac{1}{N_2}\sum_{y_n \in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T
$$
After taking $z = \mathbf{w}^T\mathbf{x}$, we have the means and variances of a <strong>scalar</strong> $z$
$$m_1^\prime = \mathbf{w}^T\mathbf{m}_1, ~~~ v_1 = \mathbf{w}^T\mathbf{S}_1\mathbf{w} \\
m_2^\prime = \mathbf{w}^T\mathbf{m}_2, ~~~ v_2 = \mathbf{w}^T\mathbf{S}_2\mathbf{w}
$$
We then define an objective
$$\begin{align*}
\mathcal{L} &= -\frac{(m_1^\prime - m_2^\prime)^2}{v_1 + v_2} \\
&= -\frac{\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T\mathbf{w}}
{\mathbf{w}^T(\mathbf{S}_1 + \mathbf{S}_2)\mathbf{w}}
\end{align*}
$$
Intuitively, we want the means to be far from each other and the variances to be small.
<br><br>
Let $\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1-\mathbf{m}_2)^T$ and
$\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
<br><br>
Taking the gradient of $\mathcal{L}(\mathbf{w})$ with respect to $\mathbf{w}$, we have
$$
\nabla\mathcal{L}(\mathbf{w}) = \frac{(2\mathbf{S}_B\mathbf{w})(\mathbf{w}^T\mathbf{S}_w\mathbf{w})
- (2\mathbf{S}_w\mathbf{w})(\mathbf{w}^T\mathbf{S}_B\mathbf{w})}{(\mathbf{w}^T\mathbf{S}_w\mathbf{w})^2}
$$
Setting it to $0$,
$$
\begin{align*}
\nabla\mathcal{L}(\mathbf{w}) &= 0 \\
\mathbf{S}_B\mathbf{w}(\mathbf{w}^T\mathbf{S}_w\mathbf{w}) &= \mathbf{S}_w\mathbf{w}(\mathbf{w}^T\mathbf{S}_B\mathbf{w})
\end{align*}
$$
$(\mathbf{w}^T\mathbf{S}_w\mathbf{w})$ and $(\mathbf{w}^T\mathbf{S}_B\mathbf{w})$ are scale factors and don't change direction.
$\mathbf{S}_B\mathbf{w}$ is proportional to $(\mathbf{m}_1 - \mathbf{m}_2)$.
<br>
<br>
Therefore,
$$\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$$
We start with the difference of means ($\mathbf{m}_1 - \mathbf{m}_2$) and rotate based on variance ($\mathbf{S}_w^{-1}$). -->
<h2 id="recap4_4">Metrics</h2>
There are four possible outcomes when comparing a prediction $\hat{y}$ to the true label $y$:
<table class="note-table">
<tr><th>Outcome</th><th>$y$</th><th>$\hat{y}$</th></tr>
<tr><td>True Positive</td><td>1</td><td>1</td></tr>
<tr><td>False Positive</td><td>0</td><td>1</td></tr>
<tr><td>True Negative</td><td>0</td><td>0</td></tr>
<tr><td>False Negative</td><td>1</td><td>0</td></tr>
</table>
<br>
These counts can be combined to compute different kinds of rates and summary metrics, such as
$$\begin{align*}
\text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \\
\text{accuracy} &= \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}\\ \\
\text{true positive rate} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{false positive rate} &= \frac{\text{FP}}{\text{FP} + \text{TN}} \\ \\
\text{recall} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \\
\text{F1} &= \frac{2\cdot \text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}
\end{align*}
$$
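As a small sketch (an illustration, assuming $0/1$-encoded labels and numpy arrays; the function name is hypothetical), these quantities can be computed directly from the predictions:
<pre><code>import numpy as np

def classification_metrics(y_true, y_pred):
    # y_true, y_pred: (N,) arrays of 0/1 labels
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)               # equals the true positive rate
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    fpr       = fp / (fp + tn)
    f1        = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "fpr": fpr, "f1": f1}
</code></pre>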
</div>
</section>