Merge branch 'main' of https://github.com/UofT-DSI/applying_statistical_concepts
UofT-DSI · Sep 26, 2024 · c735240
2 parents d8133cd + 7e2a89b
Showing 1 changed file with 218 additions and 43 deletions.
diff --git a/02_activities/assignments/assignment_3.ipynb b/02_activities/assignments/assignment_3.ipynb
@@ -73,10 +73,167 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 14,
    "id": "a431d282-f9ca-4d5d-8912-71ffc9d8ea19",
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>sepal length (cm)</th>\n",
+       "      <th>sepal width (cm)</th>\n",
+       "      <th>petal length (cm)</th>\n",
+       "      <th>petal width (cm)</th>\n",
+       "      <th>species</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>5.1</td>\n",
+       "      <td>3.5</td>\n",
+       "      <td>1.4</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>4.9</td>\n",
+       "      <td>3.0</td>\n",
+       "      <td>1.4</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>4.7</td>\n",
+       "      <td>3.2</td>\n",
+       "      <td>1.3</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>4.6</td>\n",
+       "      <td>3.1</td>\n",
+       "      <td>1.5</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>5.0</td>\n",
+       "      <td>3.6</td>\n",
+       "      <td>1.4</td>\n",
+       "      <td>0.2</td>\n",
+       "      <td>0</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>...</th>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "      <td>...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>145</th>\n",
+       "      <td>6.7</td>\n",
+       "      <td>3.0</td>\n",
+       "      <td>5.2</td>\n",
+       "      <td>2.3</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>146</th>\n",
+       "      <td>6.3</td>\n",
+       "      <td>2.5</td>\n",
+       "      <td>5.0</td>\n",
+       "      <td>1.9</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>147</th>\n",
+       "      <td>6.5</td>\n",
+       "      <td>3.0</td>\n",
+       "      <td>5.2</td>\n",
+       "      <td>2.0</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>148</th>\n",
+       "      <td>6.2</td>\n",
+       "      <td>3.4</td>\n",
+       "      <td>5.4</td>\n",
+       "      <td>2.3</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>149</th>\n",
+       "      <td>5.9</td>\n",
+       "      <td>3.0</td>\n",
+       "      <td>5.1</td>\n",
+       "      <td>1.8</td>\n",
+       "      <td>2</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "<p>150 rows × 5 columns</p>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n",
+       "0                  5.1               3.5                1.4               0.2   \n",
+       "1                  4.9               3.0                1.4               0.2   \n",
+       "2                  4.7               3.2                1.3               0.2   \n",
+       "3                  4.6               3.1                1.5               0.2   \n",
+       "4                  5.0               3.6                1.4               0.2   \n",
+       "..                 ...               ...                ...               ...   \n",
+       "145                6.7               3.0                5.2               2.3   \n",
+       "146                6.3               2.5                5.0               1.9   \n",
+       "147                6.5               3.0                5.2               2.0   \n",
+       "148                6.2               3.4                5.4               2.3   \n",
+       "149                5.9               3.0                5.1               1.8   \n",
+       "\n",
+       "     species  \n",
+       "0          0  \n",
+       "1          0  \n",
+       "2          0  \n",
+       "3          0  \n",
+       "4          0  \n",
+       "..       ...  \n",
+       "145        2  \n",
+       "146        2  \n",
+       "147        2  \n",
+       "148        2  \n",
+       "149        2  \n",
+       "\n",
+       "[150 rows x 5 columns]"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "from sklearn.datasets import load_iris\n",
     "# Load the Iris dataset\n",
@@ -134,80 +291,100 @@
    ]
   },
   {
-   "cell_type": "markdown",
-   "id": "4604ee03",
+   "cell_type": "code",
+   "execution_count": 16,
+   "id": "b8971d89",
    "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n",
+      "0          -0.900681          1.019004          -1.340227         -1.315444   \n",
+      "1          -1.143017         -0.131979          -1.340227         -1.315444   \n",
+      "2          -1.385353          0.328414          -1.397064         -1.315444   \n",
+      "3          -1.506521          0.098217          -1.283389         -1.315444   \n",
+      "4          -1.021849          1.249201          -1.340227         -1.315444   \n",
+      "\n",
+      "   species  \n",
+      "0        0  \n",
+      "1        0  \n",
+      "2        0  \n",
+      "3        0  \n",
+      "4        0  \n"
+     ]
+    }
+   ],
    "source": [
-    "#### **Question 4:** \n",
-    "#### K-means clustering \n",
-    "Apply the K-Means clustering algorithm to the Iris dataset.\n",
-    "Choose the number of clusters (K=3, since there are three species) and fit the model.\n",
-    "Assign cluster labels to the original data and add them as a new column in the DataFrame."
+    "# Initialize the StandardScaler\n",
+    "scaler = StandardScaler()\n",
+    "\n",
+    "# Scale the features (excluding the species column)\n",
+    "scaled_features = scaler.fit_transform(iris_df.iloc[:, :-1])\n",
+    "\n",
+    "# Create a new DataFrame with scaled features\n",
+    "scaled_iris_df = pd.DataFrame(scaled_features, columns=iris_data.feature_names)\n",
+    "\n",
+    "# Add the species column back to the scaled DataFrame\n",
+    "scaled_iris_df['species'] = iris_df['species'].values\n",
+    "\n",
+    "# Display the first few rows of the scaled DataFrame\n",
+    "print(scaled_iris_df.head())"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "909df219",
+   "cell_type": "markdown",
+   "id": "b326e039",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "# Your code here ..."
+    "Why is it important to standardize the features of a dataset before applying clustering algorithms like K-Means? Discuss the implications of using unstandardized data in your analysis."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "0aefdee5",
+   "id": "fc34b4b7",
    "metadata": {},
    "source": [
-    "Discuss the results of the K-Means clustering. How well did the clusters match the true species?"
+    "> Your answer here ... "
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "7bcebc16",
+   "id": "4604ee03",
    "metadata": {},
    "source": [
-    "> Your answer here ..."
+    "#### **Question 4:** \n",
+    "#### K-means clustering \n",
+    "Apply the K-Means clustering algorithm to the Iris dataset.\n",
+    "Choose the number of clusters (K=3, since there are three species) and fit the model.\n",
+    "Assign cluster labels to the original data and add them as a new column in the DataFrame."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "cbca5c4b",
+   "id": "909df219",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Initialize the StandardScaler\n",
-    "scaler = StandardScaler()\n",
-    "\n",
-    "# Scale the features (excluding the species column)\n",
-    "scaled_features = scaler.fit_transform(iris_df.iloc[:, :-1])\n",
-    "\n",
-    "# Create a new DataFrame with scaled features\n",
-    "scaled_iris_df = pd.DataFrame(scaled_features, columns=iris_data.feature_names)\n",
-    "\n",
-    "# Add the species column back to the scaled DataFrame\n",
-    "scaled_iris_df['species'] = iris_df['species'].values\n",
-    "\n",
-    "# Display the first few rows of the scaled DataFrame\n",
-    "print(scaled_iris_df.head())"
+    "# Your code here ..."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "68f4231e",
+   "id": "0aefdee5",
    "metadata": {},
    "source": [
-    "Why is it important to standardize the features of a dataset before applying clustering algorithms like K-Means? Discuss the implications of using unstandardized data in your analysis."
+    "Discuss the results of the K-Means clustering. How well did the clusters match the true species?"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "057ec7e9",
+   "id": "7bcebc16",
    "metadata": {},
    "source": [
-    "> Your answer here ... "
+    "> Your answer here ..."
    ]
   },
   {
@@ -216,11 +393,9 @@
    "metadata": {},
    "source": [
     "#### **Question 5:** \n",
-    "#### Bootstrapping for Cluster Stability.\n",
+    "#### Bootstrapping \n",
     "\n",
-    "Implement bootstrapping to assess the stability of the clusters obtained from K-Means.\n",
-    "Generate 100 bootstrap samples from the original dataset.\n",
-    "For each bootstrap sample, fit the K-Means model and record the cluster labels."
+    " Implement bootstrapping on the mean of one of the sepal or petal measurement variables (e.g., Sepal Length, Petal Width) to assess the stability of the mean estimate. Generate 1000 bootstrap samples, calculate the mean for each sample, and compute a 95% confidence interval to evaluate the variability in the estimate."
    ]
   },
   {
@@ -238,7 +413,7 @@
    "id": "29096311",
    "metadata": {},
    "source": [
-    "Reflect on the stability of the clusters based on the bootstrapping results. Are there samples that consistently change clusters across bootstraps?"
+    "Reflect on the variability observed in the bootstrapped means and discuss whether the mean of the selected variable appears to be a stable and reliable estimate based on the confidence interval and the spread of the bootstrapped means."
    ]
   },
   {
@@ -262,7 +437,7 @@
     "| **Data Inspection**                                    | Data is thoroughly inspected for the number of variables, observations, and data types, and relevant insights are noted. | Data inspection is missing or lacks detail.         |\n",
     "| **Data Visualization**                                 | Visualizations (e.g., scatter plots) are well-constructed and correctly interpreted to explore relationships between features and species. | Visualizations are poorly constructed or not correctly interpreted. |\n",
     "| **Clustering Implementation**                           | K-Means clustering is correctly implemented, and cluster labels are appropriately assigned to the dataset.            | K-Means clustering is missing or incorrectly implemented. |\n",
-    "| **Bootstrapping Process**                              | Bootstrapping is correctly performed, and results are used to assess cluster stability. | Bootstrapping is missing or incorrectly performed. |"
+    "| **Bootstrapping Process**                              | Bootstrapping is correctly performed, and results are used to assess variable mean stability. | Bootstrapping is missing or incorrectly performed. |"
    ]
   },
   {