
Commit

Updated
MRKGITCODE committed Nov 19, 2024
1 parent cceee36 commit db52535
Showing 2 changed files with 47 additions and 1,159 deletions.
117 changes: 47 additions & 70 deletions 02_activities/assignments/assignment_2.ipynb
@@ -25,7 +25,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -43,7 +43,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -54,7 +54,7 @@
" 'native-country', 'income'\n",
"]\n",
"adult_dt = (pd.read_csv(r\"C:\\Users\\ibast\\Downloads\\adult\\adult.data\", header = None, names = columns)\n",
" .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))\n"
" .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))"
]
},
{
@@ -75,7 +75,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 16,
"metadata": {},
"outputs": [
{
@@ -90,7 +90,6 @@
}
],
"source": [
"\n",
"X = adult_dt.drop(columns=['income'])\n",
"Y = adult_dt['income']\n",
"\n",
@@ -100,7 +99,7 @@
"print(f'X_train shape: {X_train.shape}')\n",
"print(f'X_test shape: {X_test.shape}')\n",
"print(f'Y_train shape: {Y_train.shape}')\n",
"print(f'Y_test shape: {Y_test.shape}')\n"
"print(f'Y_test shape: {Y_test.shape}')"
]
},
{
@@ -119,18 +118,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"train_test_split, the random_state ensures that the random shuffling of data during splitting happens in a consistent way every time you run the code. This parameter can be set to any integer, and as long as the value remains the same, you will get the same split every time.\n",
"Useful for Reproducibility:\n",
"Consistency in Results: When working with machine learning models, it’s essential to obtain consistent results. By setting a fixed random_state, we ensure that any analysis or model we develop on the split data remains reproducible by others.\n",
"Comparison Across Models: If you compare multiple models or modify your approach, having the same train-test split allows you to attribute differences in results to model changes, not data variations.\n",
"Ease of Collaboration: For shared projects or published work, reproducibility ensures that collaborators or reviewers can replicate your results, strengthening the credibility and reliability of findings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Comment here.)*"
"train_test_split, the random_state ensures that the random shuffling of data during splitting happens in a consistent way every time you run the code. This parameter can be set to any integer, and as long as the value remains the same, you will get the same split every time. Useful for Reproducibility: Consistency in Results: When working with machine learning models, it’s essential to obtain consistent results. By setting a fixed random_state, we ensure that any analysis or model we develop on the split data remains reproducible by others. Comparison Across Models: If you compare multiple models or modify your approach, having the same train-test split allows you to attribute differences in results to model changes, not data variations. Ease of Collaboration: For shared projects or published work, reproducibility ensures that collaborators or reviewers can replicate your results, strengthening the credibility and reliability of findings."
]
},
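A minimal sketch of the reproducibility point above, using a hypothetical toy array rather than the assignment data: repeated calls with the same random_state return identical splits.

from sklearn.model_selection import train_test_split
import numpy as np

X_toy = np.arange(20).reshape(10, 2)   # hypothetical toy features
y_toy = np.array([0, 1] * 5)           # hypothetical toy labels

# Same random_state -> identical splits on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
assert (a_train == b_train).all() and (a_test == b_test).all()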
{
@@ -168,11 +156,10 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import KNNImputer, SimpleImputer\n",
"from sklearn.preprocessing import RobustScaler, OneHotEncoder\n",
@@ -198,7 +185,7 @@
" ('num', numerical_transformer, numerical_features),\n",
" ('cat', categorical_transformer, categorical_features)\n",
" ]\n",
")\n"
")"
]
},
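The definitions of numerical_transformer and categorical_transformer are collapsed in this diff view. Given the imports above (KNNImputer, SimpleImputer, RobustScaler, OneHotEncoder), a plausible sketch of what they look like, assumed rather than taken from the commit:

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Assumed definitions; the committed versions are hidden by the collapsed diff
numerical_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),               # impute numeric gaps from nearest neighbours
    ('scaler', RobustScaler())                            # scale with median/IQR, robust to outliers
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # fill categorical gaps with the mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # one-hot encode, skip unseen categories
])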
{
@@ -219,7 +206,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -720,14 +707,12 @@
" ('classifier', RandomForestClassifier())])"
]
},
"execution_count": 19,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"\n",
"preprocessing = ColumnTransformer(\n",
" transformers=[\n",
" ('num', Pipeline([\n",
@@ -747,7 +732,8 @@
" ('classifier', RandomForestClassifier())\n",
"])\n",
"\n",
"pipe.fit(X_train, y_train)\n"
"# pipeline fitting\n",
"pipe.fit(X_train, Y_train) \n"
]
},
{
@@ -765,18 +751,15 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"# model pipeline\n",
"model_pipeline = Pipeline(steps=[\n",
" ('preprocessing', preprocessor), # Add the ColumnTransformer for preprocessing\n",
" ('classifier', RandomForestClassifier(random_state=42)) # Add the Random Forest classifier\n",
"])\n",
"\n"
"])"
]
},
{
@@ -788,19 +771,26 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Assuming `X` is the features DataFrame and `Y` is the target DataFrame\n",
"X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)\n"
"X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the mean of each metric. "
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -841,36 +831,36 @@
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>11.685792</td>\n",
" <td>0.115046</td>\n",
" <td>12.708740</td>\n",
" <td>0.113908</td>\n",
" <td>-0.356791</td>\n",
" <td>-0.082511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11.893299</td>\n",
" <td>0.141734</td>\n",
" <td>12.483332</td>\n",
" <td>0.128106</td>\n",
" <td>-0.357675</td>\n",
" <td>-0.081189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12.135140</td>\n",
" <td>0.110448</td>\n",
" <td>15.979565</td>\n",
" <td>0.129162</td>\n",
" <td>-0.369239</td>\n",
" <td>-0.081516</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12.090408</td>\n",
" <td>0.109544</td>\n",
" <td>13.308411</td>\n",
" <td>0.123250</td>\n",
" <td>-0.375988</td>\n",
" <td>-0.081469</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>11.872093</td>\n",
" <td>0.122813</td>\n",
" <td>11.884736</td>\n",
" <td>0.111188</td>\n",
" <td>-0.380379</td>\n",
" <td>-0.081368</td>\n",
" </tr>\n",
@@ -880,14 +870,14 @@
],
"text/plain": [
" fit_time score_time test_neg_log_loss train_neg_log_loss\n",
"3 11.685792 0.115046 -0.356791 -0.082511\n",
"0 11.893299 0.141734 -0.357675 -0.081189\n",
"1 12.135140 0.110448 -0.369239 -0.081516\n",
"2 12.090408 0.109544 -0.375988 -0.081469\n",
"4 11.872093 0.122813 -0.380379 -0.081368"
"3 12.708740 0.113908 -0.356791 -0.082511\n",
"0 12.483332 0.128106 -0.357675 -0.081189\n",
"1 15.979565 0.129162 -0.369239 -0.081516\n",
"2 13.308411 0.123250 -0.375988 -0.081469\n",
"4 11.884736 0.111188 -0.380379 -0.081368"
]
},
"execution_count": 13,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -905,19 +895,12 @@
"cv_df_sorted = cv_df.sort_values(by='test_neg_log_loss', ascending=False)\n",
"\n",
"# results\n",
"cv_df_sorted\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the mean of each metric. "
"cv_df_sorted"
]
},
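The cross_validate call that produces cv_results is collapsed above. Judging from the output columns (fit_time, score_time, test_neg_log_loss, train_neg_log_loss), it was along these lines; this is a sketch under those assumptions, and the scoring list may have included further metrics, given the test_ regex filter used later:

from sklearn.model_selection import cross_validate
import pandas as pd

# Assumed call; passing scoring as a list names the columns test_neg_log_loss, etc.
cv_results = cross_validate(
    model_pipeline, X_train, y_train,
    cv=5,
    scoring=['neg_log_loss'],
    return_train_score=True          # adds the train_neg_log_loss column
)
cv_df = pd.DataFrame(cv_results)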
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 20,
"metadata": {},
"outputs": [
{
@@ -931,15 +914,14 @@
}
],
"source": [
"\n",
"cv_df = pd.DataFrame(cv_results)\n",
"\n",
"test_metrics = cv_df.filter(regex='test_')\n",
"\n",
"mean_metrics = test_metrics.mean()\n",
"\n",
"print(\"Mean of cross-validation folds:\")\n",
"print(mean_metrics)\n"
"print(mean_metrics)"
]
},
{
@@ -953,15 +935,15 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Performance metrics on the testing data:\n",
"{'negative_log_loss': -0.378541864258079, 'roc_auc': np.float64(0.9016127012724575), 'accuracy': 0.8548469648889344, 'balanced_accuracy': np.float64(0.7752599723955951)}\n"
"{'negative_log_loss': -0.37940273142901965, 'roc_auc': np.float64(0.8994978803967568), 'accuracy': 0.8540280479066434, 'balanced_accuracy': np.float64(0.7732333499701753)}\n"
]
}
],
@@ -983,7 +965,7 @@
"}\n",
"\n",
"print(\"Performance metrics on the testing data:\")\n",
"print(performance_metrics)\n"
"print(performance_metrics)"
]
},
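The computation of these test-set metrics is collapsed above. A sketch consistent with the printed dictionary keys, assuming the fitted pipe and the X_test/y_test split from earlier cells:

from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, balanced_accuracy_score

y_pred = pipe.predict(X_test)                 # hard class predictions
y_proba = pipe.predict_proba(X_test)[:, 1]    # probability of the positive class (>50K)

performance_metrics = {
    'negative_log_loss': -log_loss(y_test, y_proba),
    'roc_auc': roc_auc_score(y_test, y_proba),
    'accuracy': accuracy_score(y_test, y_pred),
    'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
}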
{
@@ -1013,12 +995,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Binary Classification Compatibility:\n",
"Most machine learning models, including the RandomForestClassifier, expect numerical labels for classification tasks. By converting the income column into binary values (0 and 1), where 0 represents an income of <=50K and 1 represents >50K, you make the target variable suitable for classification.\n",
"Simplification:\n",
"Instead of dealing with string labels like \">50K\" and \"<50K\", the target variable is converted into numeric form directly within the data loading process. This simplifies data handling and makes it easier to apply statistical methods.\n",
"Improved Model Performance:\n",
"Some machine learning algorithms, such as gradient boosting and logistic regression, perform better with numerical labels for binary classification, as they can compute loss functions more efficiently. By recoding the target, these algorithms can better compute the decision boundary.\n"
"Binary Classification Compatibility: Most machine learning models, including the RandomForestClassifier, expect numerical labels for classification tasks. By converting the income column into binary values (0 and 1), where 0 represents an income of <=50K and 1 represents >50K, you make the target variable suitable for classification. Simplification: Instead of dealing with string labels like \">50K\" and \"<50K\", the target variable is converted into numeric form directly within the data loading process. This simplifies data handling and makes it easier to apply statistical methods. Improved Model Performance: Some machine learning algorithms, such as gradient boosting and logistic regression, perform better with numerical labels for binary classification, as they can compute loss functions more efficiently. By recoding the target, these algorithms can better compute the decision boundary."
]
},
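The recoding itself is the one-liner from the loading cell. In isolation, on a toy Series of hypothetical values, it behaves like this:

import pandas as pd

incomes = pd.Series([' <=50K', ' >50K', ' <=50K'])      # toy raw labels with stray whitespace
binary = (incomes.str.strip() == '>50K') * 1            # -> 0, 1, 0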
{
Expand Down
Loading

0 comments on commit db52535

Please sign in to comment.