diff --git a/lecture-notes/lecture10-clustering.ipynb b/lecture-notes/lecture10-clustering.ipynb
index a451a7a..6ae86f3 100644
--- a/lecture-notes/lecture10-clustering.ipynb
+++ b/lecture-notes/lecture10-clustering.ipynb
@@ -57,75 +57,21 @@
 }
 },
 "source": [
- "## 10.1.2. Clustering\n",
+ "Two related problems we have looked at in the area of unsupervised learning are __clustering__ and __density estimation__:\n",
 "\n",
- "Clustering is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
+ "* __Clustering__ is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
 "\n",
- "\n",
- "* Datapoints in a cluster are more similar to each other than to points in other clusters\n",
- "\n",
- "* Clusters are usually defined by their centers, and potentially by other shape parameters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.1.3. Review: $K$-Means\n",
- "\n",
- "$K$-Means is the simplest example of a clustering algorithm.\n",
- "Starting from random centroids, we repeat until convergence:\n",
- "\n",
- "1. Update each cluster: assign each point to its closest centroid.\n",
- "\n",
- "2. Set each centroid to be the center of its cluster."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "subslide"
- }
- },
- "source": [
- "This is best illustrated visually - see [Wikipedia](https://commons.wikimedia.org/wiki/File:K-means_convergence.gif)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "subslide"
- }
- },
- "source": [
- "$K$-Means has a number of limitations:\n",
- "\n",
- "* Clustering can get stuck in local minima\n",
+ " * Datapoints in a cluster are more similar to each other than to points in other clusters\n",
+ " \n",
+ " * Clusters are usually defined by their centers, and potentially by other shape parameters.\n",
 "\n",
- "* Measuring clustering quality is hard and relies on heuristics\n",
+ " * A simple clustering algorithm we have looked at is $K$-Means\n",
 "\n",
- "* Cluster assignment is binary and doesn't estimate confidence"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.1.4. Review: Density Estimation\n",
+ "* __Density Estimation__ is the problem of learning a model $P_\theta$,\n",
+ " $$P_\theta(x) : \mathcal{X} \to [0,1],$$\n",
+ " on an unsupervised dataset $\mathcal{D}$ to approximate the true data distribution $P_\text{data}$.\n",
 "\n",
- "An unsupervised probabilistic model is a probability distribution\n",
- "$$P_\theta(x) : \mathcal{X} \to [0,1].$$\n",
- "Probabilistic models often have *parameters* $\theta \in \Theta$."
+ " * If we successfully learn $P_\theta \approx P_\text{data}$, then we can use $P_\theta$ to solve many downstream tasks, including generation, outlier detection, and also __clustering__."
 ]
 },
 {
@@ -136,18 +82,7 @@
 }
 },
 "source": [
- "The task of density estimation is to learn a $P_\theta$ on an unsupervised dataset $\mathcal{D}$ to approximate the true data distribution $P_\text{data}$."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "fragment"
- }
- },
- "source": [
- "We will use density estimation for clustering; first, we need a model $P_\theta$."
+ "In this lecture, we will introduce __Gaussian mixture models__ and the __expectation maximization__ algorithm used to learn them. The expectation maximization algorithm will involve both density estimation and clustering."
 ]
 },
 {
@@ -158,9 +93,9 @@
 }
 },
 "source": [
- "## 10.1.5. Gaussian Mixture Models (GMM)\n",
+ "## 10.1.2. Gaussian Mixture Models (GMM)\n",
 "\n",
- "Gaussian mixtures define a model of the form:\n",
+ "A Gaussian mixture is a probabilistic model of the form:\n",
 "$$P_\theta (x,z) = P_\theta (x | z) P_\theta (z)$$\n",
 "\n",
 "* $z \in \mathcal{Z} = \{1,2,\ldots,K\}$ is discrete and follows a categorical distribution $P_\theta(z=k) = \phi_k$.\n",
@@ -212,7 +147,7 @@
 }
 },
 "source": [
- "## 10.1.6. Clustering With Known Cluster Assignments\n",
+ "## 10.1.3. Clustering With Known Cluster Assignments\n",
 "\n",
 "Let's first think about how we would do clustering if the identity of each cluster is known.\n",
 "\n",
@@ -552,7 +487,7 @@
 }
 },
 "source": [
- "## 10.1.7. From Labeled to Unlabeled Clustering\n",
+ "## 10.1.4. From Labeled to Unlabeled Clustering\n",
 "\n",
 "We will now talk about how to train a GMM clustering model from unlabeled data."
 ]
@@ -579,7 +514,7 @@
 }
 },
 "source": [
- "### 10.1.7.1. Maximum Marginal Likelihood Learning\n",
+ "### 10.1.4.1. Maximum Marginal Likelihood Learning\n",
 "\n",
 "Maximum marginal (log-)likelihood is a way of learning a probabilistic model on an unsupervised dataset $\mathcal{D}$ by maximizing:\n",
 "$$\n",
@@ -680,7 +615,7 @@
 }
 },
 "source": [
- "### 10.1.7.2. Recovering Clusters from GMMs\n",
+ "### 10.1.4.2. Recovering Clusters from GMMs\n",
 "\n",
 "Given a trained GMM model $P_\theta (x,z) = P_\theta (x | z) P_\theta (z)$, it's easy to compute the *posterior* probability\n",
 "\n",
@@ -710,7 +645,7 @@
 }
 },
 "source": [
- "## 10.1.8. Beyond Gaussian Mixtures\n",
+ "## 10.1.5. Beyond Gaussian Mixtures\n",
 "\n",
 "We will focus on Gaussian mixture models in this lecture, but there exist many other kinds of clustering:\n",
 "\n",
@@ -869,7 +804,7 @@
 "## 10.3.1. Deriving the E-Step\n",
 "\n",
 "In the E-step, we compute the posterior for each data point $x$ as follows\n",
- " $$P_\theta(z = k\mid x) = \frac{P_\theta(z=k, x)}{P_\theta(x)} = \frac{P_\theta(x | z=k) P_\theta(z=k)}{\sum_{l=1}^K P_\theta(x | z=l) P_\theta(z=l)}$$\n",
+ " $$P_\theta(z = k\mid x) = \frac{P_\theta(z=k, x)}{P_\theta(x)} = \frac{P_\theta(x | z=k) P_\theta(z=k)}{\sum_{l=1}^K P_\theta(x | z=l) P_\theta(z=l)} = \frac{\mathcal{N}(x; \mu_k, \Sigma_k) \cdot \phi_k}{\sum_{l=1}^K \mathcal{N}(x; \mu_l, \Sigma_l) \cdot \phi_l}$$\n",
 "$P_\theta(z = k\mid x)$ defines a vector of probabilities that $x$ originates from component $k$ given the current set of parameters $\theta$"
 ]
 },
 {
@@ -901,7 +836,7 @@
 }
 },
 "source": [
- "We will start with $P_\theta(x\mid z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$. We have to find $\mu_k, \Sigma_k$ that optimize\n",
+ "We will start with $P_\theta(x\mid z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$. We have to find $\mu_k^*, \Sigma_k^*$ that optimize\n",
 "$$\n",
 "\max_\theta \sum_{x^{(i)} \in D} P(z=k|x^{(i)}) \log P_\theta(x^{(i)}|z=k)\n",
 "$$\n",
@@ -918,11 +853,12 @@
 "source": [
 "Similar to how we did this in the supervised regime, we compute the derivative, set it to zero, and obtain closed form solutions:\n",
 "\begin{align*}\n",
- "\mu_k & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\n",
- "\Sigma_k & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top}{n_k} \\\n",
- "n_k & = \sum_{i=1}^n P(z=k|x^{(i)}) \\\n",
+ "\mu_k^* & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\n",
+ "\Sigma_k^* & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \mu_k^*)(x^{(i)} - \mu_k^*)^\top}{n_k} \\\n",
 "\end{align*}\n",
- "Intuitively, the optimal mean and covariance are the emprical mean and convaraince of the dataset $\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
+ "where $n_k = \sum_{i=1}^n P(z=k|x^{(i)})$ is the effective number of points assigned to component $k$.\n",
+ "\n",
+ "Intuitively, the optimal mean and covariance are the __empirical__ mean and covariance of the dataset $\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
 ]
 },
 {
@@ -933,11 +869,11 @@
 }
 },
 "source": [
- "Similarly, we can show that the class priors are\n",
+ "Similarly, we can show that the optimal class priors $\phi_k^*$ are\n",
 "\begin{align*}\n",
- "\phi_k & = \frac{n_k}{n} \\\n",
- "n_k & = \sum_{i=1}^n P(z=k|x^{(i)})\n",
- "\end{align*}"
+ "\phi_k^* & = \frac{n_k}{n} \\\n",
+ "\end{align*}\n",
+ "Intuitively, the optimal $\phi_k^*$ is just the proportion of data points assigned to class $k$."
 ]
 },
 {
@@ -956,7 +892,7 @@
 "\n",
 "1. (__E-Step__) For each $x^{(i)} \in \mathcal{D}$ compute $P_{\theta_t}(z|x^{(i)})$\n",
 "\n",
- "2. (__M-Step__) Compute parameters $\mu_k, \Sigma_k, \phi_k$ using the above formulas"
+ "2. (__M-Step__) Compute the optimal parameters $\mu_k^*, \Sigma_k^*, \phi_k^*$ using the above formulas"
 ]
 },
 {
@@ -1385,36 +1321,6 @@
 "source": [
 "The above figure shows that when the number of clusters is below 4, adding more clusters helps reduce both training and holdout sets' negative log-likelihood. Nevertheless, when the number of clusters is above 4, adding more clusters reduces the training set's negative log-likelihood but increases the holdset's negative log-likelihood. The former situation represents underfitting -- you can make the model perform better on both training and holdout sets by making the model more expressive. The latter situation represents overfitting -- making the model more expressive makes the performance on the holdout set worse."
 ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "skip"
- }
- },
- "source": [
- "__Warning__: This process doesn't work as well as in supervised learning.\n",
- "\n",
- "For example, detecting overfitting with larger datasets will be paradoxically harder (try it!)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.4.4. Summary\n",
- "\n",
- "Generalization is important for supervised and unsupervised learning. In this section, we talk about generalization in GMMs (unsupervised learning). The takeaways are:\n",
- "\n",
- "* A probabilistic model can detect overfitting by comparing the likelihood of training data vs. that of holdout data.\n",
- "\n",
- "* We can reduce overfitting by making the model less expressive."
- ]
- }
 }
 ],
 "metadata": {