Modify lecture notes for lecture 10 #2
topwasu committed Oct 20, 2022
1 parent 97eefab commit 1f5eeb8
Showing 1 changed file with 30 additions and 124 deletions.
154 changes: 30 additions & 124 deletions lecture-notes/lecture10-clustering.ipynb
@@ -57,75 +57,21 @@
}
},
"source": [
"## 10.1.2. Clustering\n",
"Two related problems in the area of unsupervised learning we have looked at are __clustering__ and __density estimation__\n",
"\n",
"Clustering is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
"* __Clustering__ is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
"\n",
"<!-- * A cluster $C_k \\subseteq \\mathcal{X}$ can be thought of as a subset of the space $\\mathcal{X}$. -->\n",
"* Datapoints in a cluster are more similar to each other than to points in other clusters\n",
"\n",
"* Clusters are usually defined by their centers, and potentially by other shape parameters."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 10.1.3. Review: $K$-Means\n",
"\n",
"$K$-Means is the simplest example of a clustering algorithm.\n",
"Starting from random centroids, we repeat until convergence:\n",
"\n",
"1. Update each cluster: assign each point to its closest centroid.\n",
"\n",
"2. Set each centroid to be the center of its cluster."
]
},
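Below is a minimal NumPy sketch of this loop, to make the two steps concrete. It is an illustration only; the function name `kmeans` and the use of NumPy here are our own choices for this sketch, not code taken from the lecture.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) array, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with K randomly chosen datapoints.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return centroids, labels
```

For simplicity this sketch does not handle empty clusters or multiple random restarts, both of which a practical implementation would add.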
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"This is best illustrated visually - see [Wikipedia](https://commons.wikimedia.org/wiki/File:K-means_convergence.gif)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"$K$-Means has a number of limitations:\n",
"\n",
"* Clustering can get stuck in local minima\n",
" * Datapoints in a cluster are more similar to each other than to points in other clusters\n",
" \n",
" * Clusters are usually defined by their centers, and potentially by other shape parameters.\n",
"\n",
"* Measuring clustering quality is hard and relies on heuristics\n",
" * A simple clustering algorithm we have looked at is $K$-Means\n",
"\n",
"* Cluster assignment is binary and doesn't estimate confidence"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 10.1.4. Review: Density Estimation\n",
"* __Density Estimation__ is the problem of learning $P_\\theta$\n",
" $$P_\\theta(x) : \\mathcal{X} \\to [0,1].$$\n",
" on an unsupervised dataset $\\mathcal{D}$ to approximate the true data distribution $P_\\text{data}$.\n",
"\n",
"An unsupervised probabilistic model is a probability distribution\n",
"$$P_\\theta(x) : \\mathcal{X} \\to [0,1].$$\n",
"Probabilistic models often have *parameters* $\\theta \\in \\Theta$."
" * If we successfully learn $P_\\theta \\approx P_\\text{data}$, then we can use $P_\\theta$ to solve many downstream tasks, including generation, outlier detection, and also __clustering__."
]
},
{
@@ -136,18 +82,7 @@
}
},
"source": [
"The task of density estimation is to learn a $P_\\theta$ on an unsupervised dataset $\\mathcal{D}$ to approximate the true data distribution $P_\\text{data}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We will use density estimation for clustering; first, we need a model $P_\\theta$."
"In this lecture, we will introduce __Gaussian mixture models__, and the __expectation maximization__ algorithm to learn the models. The expectation maximization algorithm will involve both density estimation and clustering."
]
},
{
@@ -158,9 +93,9 @@
}
},
"source": [
"## 10.1.5. Gaussian Mixture Models (GMM)\n",
"## 10.1.2. Gaussian Mixture Models (GMM)\n",
"\n",
"Gaussian mixtures define a model of the form:\n",
"Gaussian mixtures is a probabilistic model of the form:\n",
"$$P_\\theta (x,z) = P_\\theta (x | z) P_\\theta (z)$$\n",
"\n",
"* $z \\in \\mathcal{Z} = \\{1,2,\\ldots,K\\}$ is discrete and follows a categorical distribution $P_\\theta(z=k) = \\phi_k$.\n",
@@ -212,7 +147,7 @@
}
},
"source": [
"## 10.1.6. Clustering With Known Cluster Assignments\n",
"## 10.1.3. Clustering With Known Cluster Assignments\n",
"\n",
"Let's first think about how we would do clustering if the identity of each cluster is known.\n",
"\n",
@@ -552,7 +487,7 @@
}
},
"source": [
"## 10.1.7. From Labeled to Unlabeled Clustering\n",
"## 10.1.4. From Labeled to Unlabeled Clustering\n",
"\n",
"We will now talk about how to train a GMM clustering model from unlabeled data."
]
@@ -579,7 +514,7 @@
}
},
"source": [
"### 10.1.7.1. Maximum Marginal Likelihood Learning\n",
"### 10.1.4.1. Maximum Marginal Likelihood Learning\n",
"\n",
"Maximum marginal (log-)likelihood is a way of learning a probabilistic model on an unsupervised dataset $\\mathcal{D}$ by maximizing:\n",
"$$\n",
@@ -680,7 +615,7 @@
}
},
"source": [
"### 10.1.7.2. Recovering Clusters from GMMs\n",
"### 10.1.4.2. Recovering Clusters from GMMs\n",
"\n",
"Given a trained GMM model $P_\\theta (x,z) = P_\\theta (x | z) P_\\theta (z)$, it's easy to compute the *posterior* probability\n",
"\n",
@@ -710,7 +645,7 @@
}
},
"source": [
"## 10.1.8. Beyond Gaussian Mixtures\n",
"## 10.1.5. Beyond Gaussian Mixtures\n",
"\n",
"We will focus on Gaussian mixture models in this lecture, but there exist many other kinds of clustering:\n",
"\n",
@@ -869,7 +804,7 @@
"## 10.3.1. Deriving the E-Step\n",
"\n",
"In the E-step, we compute the posterior for each data point $x$ as follows\n",
" $$P_\\theta(z = k\\mid x) = \\frac{P_\\theta(z=k, x)}{P_\\theta(x)} = \\frac{P_\\theta(x | z=k) P_\\theta(z=k)}{\\sum_{l=1}^K P_\\theta(x | z=l) P_\\theta(z=l)}$$\n",
" $$P_\\theta(z = k\\mid x) = \\frac{P_\\theta(z=k, x)}{P_\\theta(x)} = \\frac{P_\\theta(x | z=k) P_\\theta(z=k)}{\\sum_{l=1}^K P_\\theta(x | z=l) P_\\theta(z=l)}= \\frac{\\mathcal{N}(x; \\mu_k, \\Sigma_k) \\cdot \\phi_k}{\\sum_{l=1}^K \\mathcal{N}(x; \\mu_l, \\Sigma_l) \\cdot \\phi_l}$$\n",
"$P_\\theta(z = k\\mid x)$ defines a vector of probabilities that $x$ originates from component $k$ given the current set of parameters $\\theta$"
]
},
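As a rough illustration of the E-step formula above, here is how the posterior (the *responsibilities*) could be computed with NumPy and SciPy. The function name `e_step` and the array layouts for `phi`, `mus`, and `Sigmas` are assumptions made for this sketch, not part of the lecture code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mus, Sigmas):
    """Compute responsibilities P(z=k | x) for each row of X.

    X: (n, d) data, phi: (K,) mixing weights,
    mus: (K, d) component means, Sigmas: (K, d, d) component covariances.
    """
    K = len(phi)
    # Numerator of the posterior: N(x; mu_k, Sigma_k) * phi_k for each component k.
    joint = np.stack(
        [phi[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k]) for k in range(K)],
        axis=1,
    )  # shape (n, K)
    # Denominator: the marginal P(x), i.e. the sum over all K components.
    return joint / joint.sum(axis=1, keepdims=True)
```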
@@ -901,7 +836,7 @@
}
},
"source": [
"We will start with $P_\\theta(x\\mid z=k) = \\mathcal{N}(x; \\mu_k, \\Sigma_k)$. We have to find $\\mu_k, \\Sigma_k$ that optimize\n",
"We will start with $P_\\theta(x\\mid z=k) = \\mathcal{N}(x; \\mu_k, \\Sigma_k)$. We have to find $\\mu_k^*, \\Sigma_k^*$ that optimize\n",
"$$\n",
"\\max_\\theta \\sum_{x^{(i)} \\in D} P(z=k|x^{(i)}) \\log P_\\theta(x^{(i)}|z=k)\n",
"$$\n",
@@ -918,11 +853,12 @@
"source": [
"Similar to how we did this in the supervised regime, we compute the derivative, set it to zero, and obtain closed form solutions:\n",
"\\begin{align*}\n",
"\\mu_k & = \\frac{\\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\\\n",
"\\Sigma_k & = \\frac{\\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \\mu_k)(x^{(i)} - \\mu_k)^\\top}{n_k} \\\\\n",
"n_k & = \\sum_{i=1}^n P(z=k|x^{(i)}) \\\\\n",
"\\mu_k^* & = \\frac{\\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\\\n",
"\\Sigma_k^* & = \\frac{\\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \\mu_k^*)(x^{(i)} - \\mu_k^*)^\\top}{n_k} \\\\\n",
"\\end{align*}\n",
"Intuitively, the optimal mean and covariance are the emprical mean and convaraince of the dataset $\\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
"where $n_k = \\sum_{i=1}^n P(z=k|x^{(i)})$\n",
"\n",
"Intuitively, the optimal mean and covariance are the __empirical__ mean and convaraince of the dataset $\\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
]
},
{
@@ -933,11 +869,11 @@
}
},
"source": [
"Similarly, we can show that the class priors are\n",
"Similarly, we can show that the optimal class priors $\\phi_k^*$ are\n",
"\\begin{align*}\n",
"\\phi_k & = \\frac{n_k}{n} \\\\\n",
"n_k & = \\sum_{i=1}^n P(z=k|x^{(i)})\n",
"\\end{align*}"
"\\phi_k^* & = \\frac{n_k}{n} \\\\\n",
"\\end{align*}\n",
"Intuitively, the optimal $\\phi_k^*$ is just the proportion of data points with class $k$"
]
},
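Putting the closed-form updates together, here is a hedged sketch of the M-step. The helper name `m_step` and the `(n, K)` responsibility-matrix layout are our assumptions for this sketch, not part of the lecture code.

```python
import numpy as np

def m_step(X, resp):
    """Update (phi, mus, Sigmas) from responsibilities resp[i, k] = P(z=k | x^(i))."""
    n, d = X.shape
    n_k = resp.sum(axis=0)                       # effective number of points per component
    phi = n_k / n                                # optimal class priors phi_k^*
    mus = (resp.T @ X) / n_k[:, None]            # weighted empirical means mu_k^*
    Sigmas = np.zeros((len(n_k), d, d))
    for k in range(len(n_k)):
        diff = X - mus[k]                        # (n, d) deviations from the k-th mean
        Sigmas[k] = (resp[:, k, None] * diff).T @ diff / n_k[k]  # weighted covariance Sigma_k^*
    return phi, mus, Sigmas
```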
{
Expand All @@ -956,7 +892,7 @@
"\n",
"1. (__E-Step__) For each $x^{(i)} \\in \\mathcal{D}$ compute $P_{\\theta_t}(z|x^{(i)})$\n",
"\n",
"2. (__M-Step__) Compute parameters $\\mu_k, \\Sigma_k, \\phi_k$ using the above formulas"
"2. (__M-Step__) Compute optimal parameters $\\mu_k^*, \\Sigma_k^*, \\phi_k^*$ using the above formulas"
]
},
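A sketch of the full loop, reusing the `e_step` and `m_step` helpers sketched above. The initialization scheme below is our own assumption; alternatives such as initializing from a K-Means solution are also common.

```python
import numpy as np

def fit_gmm(X, K, n_iters=50, seed=0):
    """Run EM for a Gaussian mixture with K components on data X of shape (n, d)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(K, 1.0 / K)                         # uniform priors
    mus = X[rng.choice(n, size=K, replace=False)]     # random datapoints as initial means
    Sigmas = np.stack([np.eye(d) for _ in range(K)])  # identity covariances
    for _ in range(n_iters):
        resp = e_step(X, phi, mus, Sigmas)    # E-step: posteriors P(z | x) under theta_t
        phi, mus, Sigmas = m_step(X, resp)    # M-step: closed-form parameter updates
    return phi, mus, Sigmas
```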
{
@@ -1385,36 +1321,6 @@
"source": [
"The above figure shows that when the number of clusters is below 4, adding more clusters helps reduce both training and holdout sets' negative log-likelihood. Nevertheless, when the number of clusters is above 4, adding more clusters reduces the training set's negative log-likelihood but increases the holdset's negative log-likelihood. The former situation represents underfitting -- you can make the model perform better on both training and holdout sets by making the model more expressive. The latter situation represents overfitting -- making the model more expressive makes the performance on the holdout set worse."
]
},
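One way to reproduce this kind of train/holdout comparison is with `sklearn.mixture.GaussianMixture`, whose `score` method returns the average log-likelihood per sample. The data below is a hypothetical stand-in (four synthetic Gaussian blobs), not the dataset used for the figure above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: four well-separated 2D Gaussian blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 0], [0, 5], [5, 5]])
X = np.concatenate([c + rng.normal(size=(200, 2)) for c in centers])
X_train, X_holdout = train_test_split(X, test_size=0.3, random_state=0)

for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # score() is the average log-likelihood per sample; negate it to get the
    # negative log-likelihood compared in the figure above.
    print(k, -gmm.score(X_train), -gmm.score(X_holdout))
```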
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"__Warning__: This process doesn't work as well as in supervised learning.\n",
"\n",
"For example, detecting overfitting with larger datasets will be paradoxically harder (try it!)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## 10.4.4. Summary\n",
"\n",
"Generalization is important for supervised and unsupervised learning. In this section, we talk about generalization in GMMs (unsupervised learning). The takeaways are:\n",
"\n",
"* A probabilistic model can detect overfitting by comparing the likelihood of training data vs. that of holdout data.\n",
"\n",
"* We can reduce overfitting by making the model less expressive."
]
}
],
"metadata": {
