diff --git a/lecture-notes/lecture10-clustering.ipynb b/lecture-notes/lecture10-clustering.ipynb
index a451a7a..6ae86f3 100644
--- a/lecture-notes/lecture10-clustering.ipynb
+++ b/lecture-notes/lecture10-clustering.ipynb
@@ -57,75 +57,21 @@
 }
 },
 "source": [
- "## 10.1.2. Clustering\n",
+ "Two related problems we have looked at in the area of unsupervised learning are __clustering__ and __density estimation__:\n",
 "\n",
- "Clustering is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
+ "* __Clustering__ is the problem of identifying distinct components in the data. Usually, we apply clustering when we assume that the data will have a certain structure, specifically:\n",
 "\n",
- "\n",
- "* Datapoints in a cluster are more similar to each other than to points in other clusters\n",
- "\n",
- "* Clusters are usually defined by their centers, and potentially by other shape parameters."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.1.3. Review: $K$-Means\n",
- "\n",
- "$K$-Means is the simplest example of a clustering algorithm.\n",
- "Starting from random centroids, we repeat until convergence:\n",
- "\n",
- "1. Update each cluster: assign each point to its closest centroid.\n",
- "\n",
- "2. Set each centroid to be the center of its cluster."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "subslide"
- }
- },
- "source": [
- "This is best illustrated visually - see [Wikipedia](https://commons.wikimedia.org/wiki/File:K-means_convergence.gif)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "subslide"
- }
- },
- "source": [
- "$K$-Means has a number of limitations:\n",
- "\n",
- "* Clustering can get stuck in local minima\n",
+ " * Datapoints in a cluster are more similar to each other than to points in other clusters\n",
+ " \n",
+ " * Clusters are usually defined by their centers, and potentially by other shape parameters.\n",
 "\n",
- "* Measuring clustering quality is hard and relies on heuristics\n",
+ " * A simple clustering algorithm we have looked at is $K$-Means\n",
 "\n",
- "* Cluster assignment is binary and doesn't estimate confidence"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.1.4. Review: Density Estimation\n",
+ "* __Density Estimation__ is the problem of learning a model $P_\theta$,\n",
+ " $$P_\theta(x) : \mathcal{X} \to [0,1],$$\n",
+ " on an unsupervised dataset $\mathcal{D}$ to approximate the true data distribution $P_\text{data}$.\n",
 "\n",
- "An unsupervised probabilistic model is a probability distribution\n",
- "$$P_\theta(x) : \mathcal{X} \to [0,1].$$\n",
- "Probabilistic models often have *parameters* $\theta \in \Theta$."
+ " * If we successfully learn $P_\theta \approx P_\text{data}$, then we can use $P_\theta$ to solve many downstream tasks, including generation, outlier detection, and also __clustering__."
 ]
 },
 {
@@ -136,18 +82,7 @@
 }
 },
 "source": [
- "The task of density estimation is to learn a $P_\theta$ on an unsupervised dataset $\mathcal{D}$ to approximate the true data distribution $P_\text{data}$."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "fragment"
- }
- },
- "source": [
- "We will use density estimation for clustering; first, we need a model $P_\theta$."
+ "In this lecture, we will introduce __Gaussian mixture models__ and the __expectation maximization__ algorithm used to learn them. The expectation maximization algorithm will involve both density estimation and clustering."
 ]
 },
 {
@@ -158,9 +93,9 @@
 }
 },
 "source": [
- "## 10.1.5. Gaussian Mixture Models (GMM)\n",
+ "## 10.1.2. Gaussian Mixture Models (GMM)\n",
 "\n",
- "Gaussian mixtures define a model of the form:\n",
+ "A Gaussian mixture is a probabilistic model of the form:\n",
 "$$P_\theta (x,z) = P_\theta (x | z) P_\theta (z)$$\n",
 "\n",
 "* $z \in \mathcal{Z} = \{1,2,\ldots,K\}$ is discrete and follows a categorical distribution $P_\theta(z=k) = \phi_k$.\n",
@@ -212,7 +147,7 @@
 }
 },
 "source": [
- "## 10.1.6. Clustering With Known Cluster Assignments\n",
+ "## 10.1.3. Clustering With Known Cluster Assignments\n",
 "\n",
 "Let's first think about how we would do clustering if the identity of each cluster is known.\n",
 "\n",
@@ -552,7 +487,7 @@
 }
 },
 "source": [
- "## 10.1.7. From Labeled to Unlabeled Clustering\n",
+ "## 10.1.4. From Labeled to Unlabeled Clustering\n",
 "\n",
 "We will now talk about how to train a GMM clustering model from unlabeled data."
 ]
@@ -579,7 +514,7 @@
 }
 },
 "source": [
- "### 10.1.7.1. Maximum Marginal Likelihood Learning\n",
+ "### 10.1.4.1. Maximum Marginal Likelihood Learning\n",
 "\n",
 "Maximum marginal (log-)likelihood is a way of learning a probabilistic model on an unsupervised dataset $\mathcal{D}$ by maximizing:\n",
 "$$\n",
@@ -680,7 +615,7 @@
 }
 },
 "source": [
- "### 10.1.7.2. Recovering Clusters from GMMs\n",
+ "### 10.1.4.2. Recovering Clusters from GMMs\n",
 "\n",
 "Given a trained GMM model $P_\theta (x,z) = P_\theta (x | z) P_\theta (z)$, it's easy to compute the *posterior* probability\n",
 "\n",
@@ -710,7 +645,7 @@
 }
 },
 "source": [
- "## 10.1.8. Beyond Gaussian Mixtures\n",
+ "## 10.1.5. Beyond Gaussian Mixtures\n",
 "\n",
 "We will focus on Gaussian mixture models in this lecture, but there exist many other kinds of clustering:\n",
 "\n",
@@ -869,7 +804,7 @@
 "## 10.3.1. Deriving the E-Step\n",
 "\n",
 "In the E-step, we compute the posterior for each data point $x$ as follows\n",
- " $$P_\theta(z = k\mid x) = \frac{P_\theta(z=k, x)}{P_\theta(x)} = \frac{P_\theta(x | z=k) P_\theta(z=k)}{\sum_{l=1}^K P_\theta(x | z=l) P_\theta(z=l)}$$\n",
+ " $$P_\theta(z = k\mid x) = \frac{P_\theta(z=k, x)}{P_\theta(x)} = \frac{P_\theta(x | z=k) P_\theta(z=k)}{\sum_{l=1}^K P_\theta(x | z=l) P_\theta(z=l)} = \frac{\mathcal{N}(x; \mu_k, \Sigma_k) \cdot \phi_k}{\sum_{l=1}^K \mathcal{N}(x; \mu_l, \Sigma_l) \cdot \phi_l}$$\n",
 "$P_\theta(z = k\mid x)$ defines a vector of probabilities that $x$ originates from component $k$ given the current set of parameters $\theta$"
 ]
 },
 {
@@ -901,7 +836,7 @@
 }
 },
 "source": [
- "We will start with $P_\theta(x\mid z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$. We have to find $\mu_k, \Sigma_k$ that optimize\n",
+ "We will start with $P_\theta(x\mid z=k) = \mathcal{N}(x; \mu_k, \Sigma_k)$. We have to find $\mu_k^*, \Sigma_k^*$ that optimize\n",
 "$$\n",
 "\max_\theta \sum_{x^{(i)} \in D} P(z=k|x^{(i)}) \log P_\theta(x^{(i)}|z=k)\n",
 "$$\n",
@@ -918,11 +853,12 @@
 "source": [
 "Similar to how we did this in the supervised regime, we compute the derivative, set it to zero, and obtain closed form solutions:\n",
 "\begin{align*}\n",
- "\mu_k & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\n",
- "\Sigma_k & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^\top}{n_k} \\\n",
- "n_k & = \sum_{i=1}^n P(z=k|x^{(i)}) \\\n",
+ "\mu_k^* & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) x^{(i)}}{n_k} \\\n",
+ "\Sigma_k^* & = \frac{\sum_{i=1}^n P(z=k|x^{(i)}) (x^{(i)} - \mu_k^*)(x^{(i)} - \mu_k^*)^\top}{n_k} \\\n",
 "\end{align*}\n",
- "Intuitively, the optimal mean and covariance are the emprical mean and convaraince of the dataset $\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
+ "where $n_k = \sum_{i=1}^n P(z=k|x^{(i)})$ is the effective number of points assigned to component $k$.\n",
+ "\n",
+ "Intuitively, the optimal mean and covariance are the __empirical__ mean and covariance of the dataset $\mathcal{D}$ when each element $x^{(i)}$ has a weight $P(z=k|x^{(i)})$."
 ]
 },
 {
@@ -933,11 +869,11 @@
 }
 },
 "source": [
- "Similarly, we can show that the class priors are\n",
+ "Similarly, we can show that the optimal class priors $\phi_k^*$ are\n",
 "\begin{align*}\n",
- "\phi_k & = \frac{n_k}{n} \\\n",
- "n_k & = \sum_{i=1}^n P(z=k|x^{(i)})\n",
- "\end{align*}"
+ "\phi_k^* & = \frac{n_k}{n} \\\n",
+ "\end{align*}\n",
+ "Intuitively, the optimal $\phi_k^*$ is just the proportion of data points assigned to class $k$."
 ]
 },
 {
@@ -956,7 +892,7 @@
 "\n",
 "1. (__E-Step__) For each $x^{(i)} \in \mathcal{D}$ compute $P_{\theta_t}(z|x^{(i)})$\n",
 "\n",
- "2. (__M-Step__) Compute parameters $\mu_k, \Sigma_k, \phi_k$ using the above formulas"
+ "2. (__M-Step__) Compute the optimal parameters $\mu_k^*, \Sigma_k^*, \phi_k^*$ using the above formulas"
 ]
 },
 {
@@ -1385,36 +1321,6 @@
 "source": [
 "The above figure shows that when the number of clusters is below 4, adding more clusters helps reduce both training and holdout sets' negative log-likelihood. Nevertheless, when the number of clusters is above 4, adding more clusters reduces the training set's negative log-likelihood but increases the holdset's negative log-likelihood. The former situation represents underfitting -- you can make the model perform better on both training and holdout sets by making the model more expressive. The latter situation represents overfitting -- making the model more expressive makes the performance on the holdout set worse."
 ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "skip"
- }
- },
- "source": [
- "__Warning__: This process doesn't work as well as in supervised learning.\n",
- "\n",
- "For example, detecting overfitting with larger datasets will be paradoxically harder (try it!)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "slideshow": {
- "slide_type": "slide"
- }
- },
- "source": [
- "## 10.4.4. Summary\n",
- "\n",
- "Generalization is important for supervised and unsupervised learning. In this section, we talk about generalization in GMMs (unsupervised learning). The takeaways are:\n",
- "\n",
- "* A probabilistic model can detect overfitting by comparing the likelihood of training data vs. that of holdout data.\n",
- "\n",
- "* We can reduce overfitting by making the model less expressive."
- ]
- }
 }
 ],
 "metadata": {