Commit

Minor updates to week 3 materials (#123)
HongleiXie authored Oct 7, 2024
1 parent 2bb2203 commit 082e1c0
Showing 5 changed files with 52 additions and 1,230 deletions.
1,263 changes: 35 additions & 1,228 deletions 01_materials/notebooks/Stat_Inference.ipynb

Large diffs are not rendered by default.

Binary file modified 01_materials/slides/Clustering.pdf
Binary file not shown.
Binary file modified 01_materials/slides/Stat_Inference.pdf
Binary file not shown.
17 changes: 16 additions & 1 deletion 03_instructional_team/markdown_slides/Clustering.md
@@ -94,16 +94,31 @@ Visualizing the relationship between flipper length and bill length, we can obse
---
##### WSSD
1. Calculate the cluster centers by taking the mean of each variable for all data points in a cluster.
2. Measure the sum of squared distances between each data point and its cluster center.
For example, suppose we have a cluster containing 4 observations, and we are using two variables, $x$ and $y$, to cluster the data. Then we would compute the coordinates of the cluster center, $\mu_x$ and $\mu_y$, by
$$
\mu_x = \frac{1}{4}(x_1+x_2+x_3+x_4) \quad \mu_y = \frac{1}{4}(y_1+y_2+y_3+y_4)
$$
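
As a minimal sketch of step 1, the snippet below computes $\mu_x$ and $\mu_y$ for a cluster of 4 observations. The data values and the helper name `cluster_center` are made up for illustration and do not come from the course materials.

```python
# Hypothetical cluster of 4 observations, each an (x, y) pair (made-up values).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (2.0, 3.0)]


def cluster_center(points):
    """Return (mu_x, mu_y): the mean of each variable over the cluster."""
    n = len(points)
    mu_x = sum(x for x, _ in points) / n
    mu_y = sum(y for _, y in points) / n
    return mu_x, mu_y


mu_x, mu_y = cluster_center(points)
print(mu_x, mu_y)  # 2.0 3.0
```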
---
##### WSSD
2. Measure the sum of squared distances between each data point and its cluster center.
![bg right:45% w:600](./images/wssd.png)
WSSD is computed by summing the squared Euclidean distances between each data point and the cluster center.
$$
\begin{split}
\text{WSSD} = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right)\\
+ \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)
\end{split}
$$
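
The formula above can be sketched in a few lines of Python. This is an illustrative example under assumed inputs: the four points and the function name `wssd` are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def wssd(points):
    """Sum of squared Euclidean distances from each point to the cluster center."""
    n = len(points)
    # Step 1: cluster center = mean of each variable.
    mu_x = sum(x for x, _ in points) / n
    mu_y = sum(y for _, y in points) / n
    # Step 2: add up the squared distance of every point to that center.
    return sum((x - mu_x) ** 2 + (y - mu_y) ** 2 for x, y in points)


# Hypothetical cluster of 4 observations (made-up values); its center is (2, 3).
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (2.0, 3.0)]
print(wssd(points))  # 2 + 1 + 1 + 0 = 4.0
```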
---

##### WSSD
- A larger WSSD indicates that the cluster is more spread out, as it means data points are farther from the cluster center.

- To obtain the total WSSD, sum the WSSD values of all clusters. This is the same as adding up, over all observations, the squared distance from each observation to its own cluster center.
![bg right:50% w:600](./images/all_wssd.png)
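
One possible way to compute the total WSSD is sketched below. The cluster labels, data values, and function names (`wssd`, `total_wssd`) are assumptions made for this example only.

```python
def wssd(points):
    """WSSD of a single cluster: squared distances to the cluster's own center."""
    n = len(points)
    mu_x = sum(x for x, _ in points) / n
    mu_y = sum(y for _, y in points) / n
    return sum((x - mu_x) ** 2 + (y - mu_y) ** 2 for x, y in points)


def total_wssd(clusters):
    """Total WSSD: the sum of the per-cluster WSSD values."""
    return sum(wssd(points) for points in clusters.values())


# Hypothetical clustering with two clusters (made-up values).
clusters = {
    "A": [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (2.0, 3.0)],  # WSSD = 4.0
    "B": [(8.0, 8.0), (10.0, 8.0)],                         # WSSD = 2.0
}
print(total_wssd(clusters))  # 6.0
```

A smaller total WSSD means the observations sit closer to their assigned cluster centers, which is the quantity a clustering aims to keep low.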

---

##### Clustering algorithm
- The K-means algorithm starts by choosing $K$ and randomly assigning each observation to one of the $K$ clusters.
- Here, each data point is assigned to 1 of 3 clusters:
2 changes: 1 addition & 1 deletion 03_instructional_team/markdown_slides/Stat_Inference.md
@@ -42,7 +42,7 @@ Applying Statistical Concepts
- Instead, we use a **sample**, a subset of the population, to estimate the population parameter.
- **Sample estimate**: A numerical characteristic of the sample that approximates the population parameter.
- **Statistical inference**: Using a sample to make conclusions about the broader population.
![bg right:40% w:550](./images/population_vs_sample.png)
![bg right:40% w:400](./images/population_vs_sample.png)
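
To make the distinction between a population parameter and a sample estimate concrete, here is a small simulation. It is only a sketch: the population is synthetic (drawn from a normal distribution with an arbitrary mean and spread), and the sample size of 200 is an assumption for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the example is reproducible

# Synthetic "population" of 10,000 values (made-up data, not a real dataset).
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)  # the population parameter

# A sample is a subset of the population, drawn without replacement.
sample = rng.sample(population, 200)
sample_mean = statistics.mean(sample)  # the sample estimate

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean will generally be close to, but not exactly equal to, the population mean; quantifying that gap is what statistical inference is about.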

---
##### Example dataset

0 comments on commit 082e1c0