
Commit

Updated
MRKGITCODE committed Nov 19, 2024
1 parent cceee36 commit db52535
Showing 2 changed files with 47 additions and 1,159 deletions.
117 changes: 47 additions & 70 deletions 02_activities/assignments/assignment_2.ipynb
@@ -25,7 +25,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
@@ -43,7 +43,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
@@ -54,7 +54,7 @@
" 'native-country', 'income'\n",
"]\n",
"adult_dt = (pd.read_csv(r\"C:\\Users\\ibast\\Downloads\\adult\\adult.data\", header = None, names = columns)\n",
" .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))\n"
" .assign(income = lambda x: (x.income.str.strip() == '>50K')*1))"
]
},
{
@@ -75,7 +75,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 16,
"metadata": {},
"outputs": [
{
@@ -90,7 +90,6 @@
}
],
"source": [
"\n",
"X = adult_dt.drop(columns=['income'])\n",
"Y = adult_dt['income']\n",
"\n",
@@ -100,7 +99,7 @@
"print(f'X_train shape: {X_train.shape}')\n",
"print(f'X_test shape: {X_test.shape}')\n",
"print(f'Y_train shape: {Y_train.shape}')\n",
"print(f'Y_test shape: {Y_test.shape}')\n"
"print(f'Y_test shape: {Y_test.shape}')"
]
},
{
@@ -119,18 +118,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"train_test_split, the random_state ensures that the random shuffling of data during splitting happens in a consistent way every time you run the code. This parameter can be set to any integer, and as long as the value remains the same, you will get the same split every time.\n",
"Useful for Reproducibility:\n",
"Consistency in Results: When working with machine learning models, it’s essential to obtain consistent results. By setting a fixed random_state, we ensure that any analysis or model we develop on the split data remains reproducible by others.\n",
"Comparison Across Models: If you compare multiple models or modify your approach, having the same train-test split allows you to attribute differences in results to model changes, not data variations.\n",
"Ease of Collaboration: For shared projects or published work, reproducibility ensures that collaborators or reviewers can replicate your results, strengthening the credibility and reliability of findings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(Comment here.)*"
"train_test_split, the random_state ensures that the random shuffling of data during splitting happens in a consistent way every time you run the code. This parameter can be set to any integer, and as long as the value remains the same, you will get the same split every time. Useful for Reproducibility: Consistency in Results: When working with machine learning models, it’s essential to obtain consistent results. By setting a fixed random_state, we ensure that any analysis or model we develop on the split data remains reproducible by others. Comparison Across Models: If you compare multiple models or modify your approach, having the same train-test split allows you to attribute differences in results to model changes, not data variations. Ease of Collaboration: For shared projects or published work, reproducibility ensures that collaborators or reviewers can replicate your results, strengthening the credibility and reliability of findings."
]
},
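A minimal sketch of the reproducibility point above, using a hypothetical toy array rather than the assignment data: repeated calls with the same random_state return identical splits.

from sklearn.model_selection import train_test_split
import numpy as np

X_toy = np.arange(20).reshape(10, 2)   # hypothetical toy features
y_toy = np.array([0, 1] * 5)           # hypothetical toy labels

# Same random_state -> identical splits on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
assert (a_train == b_train).all() and (a_test == b_test).all()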
{
@@ -168,11 +156,10 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.impute import KNNImputer, SimpleImputer\n",
"from sklearn.preprocessing import RobustScaler, OneHotEncoder\n",
@@ -198,7 +185,7 @@
" ('num', numerical_transformer, numerical_features),\n",
" ('cat', categorical_transformer, categorical_features)\n",
" ]\n",
")\n"
")"
]
},
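The definitions of numerical_transformer and categorical_transformer are collapsed in this diff view. Given the imports above (KNNImputer, SimpleImputer, RobustScaler, OneHotEncoder), a plausible sketch of what they look like, assumed rather than taken from the commit:

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder

# Assumed definitions; the committed versions are hidden by the collapsed diff
numerical_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),               # impute numeric gaps from nearest neighbours
    ('scaler', RobustScaler())                            # scale with median/IQR, robust to outliers
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), # fill categorical gaps with the mode
    ('onehot', OneHotEncoder(handle_unknown='ignore'))    # one-hot encode, skip unseen categories
])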
{
@@ -219,7 +206,7 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -720,14 +707,12 @@
" ('classifier', RandomForestClassifier())])"
]
},
"execution_count": 19,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"\n",
"preprocessing = ColumnTransformer(\n",
" transformers=[\n",
" ('num', Pipeline([\n",
@@ -747,7 +732,8 @@
" ('classifier', RandomForestClassifier())\n",
"])\n",
"\n",
"pipe.fit(X_train, y_train)\n"
"# pipeline fitting\n",
"pipe.fit(X_train, Y_train) \n"
]
},
{
@@ -765,18 +751,15 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"\n",
"\n",
"# model pipeline\n",
"model_pipeline = Pipeline(steps=[\n",
" ('preprocessing', preprocessor), # Add the ColumnTransformer for preprocessing\n",
" ('classifier', RandomForestClassifier(random_state=42)) # Add the Random Forest classifier\n",
"])\n",
"\n"
"])"
]
},
{
@@ -788,19 +771,26 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Assuming `X` is the features DataFrame and `Y` is the target DataFrame\n",
"X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)\n"
"X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the mean of each metric. "
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 19,
"metadata": {},
"outputs": [
{
@@ -841,36 +831,36 @@
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>11.685792</td>\n",
" <td>0.115046</td>\n",
" <td>12.708740</td>\n",
" <td>0.113908</td>\n",
" <td>-0.356791</td>\n",
" <td>-0.082511</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11.893299</td>\n",
" <td>0.141734</td>\n",
" <td>12.483332</td>\n",
" <td>0.128106</td>\n",
" <td>-0.357675</td>\n",
" <td>-0.081189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>12.135140</td>\n",
" <td>0.110448</td>\n",
" <td>15.979565</td>\n",
" <td>0.129162</td>\n",
" <td>-0.369239</td>\n",
" <td>-0.081516</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>12.090408</td>\n",
" <td>0.109544</td>\n",
" <td>13.308411</td>\n",
" <td>0.123250</td>\n",
" <td>-0.375988</td>\n",
" <td>-0.081469</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>11.872093</td>\n",
" <td>0.122813</td>\n",
" <td>11.884736</td>\n",
" <td>0.111188</td>\n",
" <td>-0.380379</td>\n",
" <td>-0.081368</td>\n",
" </tr>\n",
@@ -880,14 +870,14 @@
],
"text/plain": [
" fit_time score_time test_neg_log_loss train_neg_log_loss\n",
"3 11.685792 0.115046 -0.356791 -0.082511\n",
"0 11.893299 0.141734 -0.357675 -0.081189\n",
"1 12.135140 0.110448 -0.369239 -0.081516\n",
"2 12.090408 0.109544 -0.375988 -0.081469\n",
"4 11.872093 0.122813 -0.380379 -0.081368"
"3 12.708740 0.113908 -0.356791 -0.082511\n",
"0 12.483332 0.128106 -0.357675 -0.081189\n",
"1 15.979565 0.129162 -0.369239 -0.081516\n",
"2 13.308411 0.123250 -0.375988 -0.081469\n",
"4 11.884736 0.111188 -0.380379 -0.081368"
]
},
"execution_count": 13,
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
@@ -905,19 +895,12 @@
"cv_df_sorted = cv_df.sort_values(by='test_neg_log_loss', ascending=False)\n",
"\n",
"# results\n",
"cv_df_sorted\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calculate the mean of each metric. "
"cv_df_sorted"
]
},
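The cross_validate call that produces cv_results is collapsed above. Judging from the output columns (fit_time, score_time, test_neg_log_loss, train_neg_log_loss), it was along these lines; this is a sketch under those assumptions, and the scoring list may have included further metrics, given the test_ regex filter used later:

from sklearn.model_selection import cross_validate
import pandas as pd

# Assumed call; passing scoring as a list names the columns test_neg_log_loss, etc.
cv_results = cross_validate(
    model_pipeline, X_train, y_train,
    cv=5,
    scoring=['neg_log_loss'],
    return_train_score=True          # adds the train_neg_log_loss column
)
cv_df = pd.DataFrame(cv_results)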
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 20,
"metadata": {},
"outputs": [
{
@@ -931,15 +914,14 @@
}
],
"source": [
"\n",
"cv_df = pd.DataFrame(cv_results)\n",
"\n",
"test_metrics = cv_df.filter(regex='test_')\n",
"\n",
"mean_metrics = test_metrics.mean()\n",
"\n",
"print(\"Mean of cross-validation folds:\")\n",
"print(mean_metrics)\n"
"print(mean_metrics)"
]
},
{
@@ -953,15 +935,15 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Performance metrics on the testing data:\n",
"{'negative_log_loss': -0.378541864258079, 'roc_auc': np.float64(0.9016127012724575), 'accuracy': 0.8548469648889344, 'balanced_accuracy': np.float64(0.7752599723955951)}\n"
"{'negative_log_loss': -0.37940273142901965, 'roc_auc': np.float64(0.8994978803967568), 'accuracy': 0.8540280479066434, 'balanced_accuracy': np.float64(0.7732333499701753)}\n"
]
}
],
@@ -983,7 +965,7 @@
"}\n",
"\n",
"print(\"Performance metrics on the testing data:\")\n",
"print(performance_metrics)\n"
"print(performance_metrics)"
]
},
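The computation of these test-set metrics is collapsed above. A sketch consistent with the printed dictionary keys, assuming the fitted pipe and the X_test/y_test split from earlier cells:

from sklearn.metrics import log_loss, roc_auc_score, accuracy_score, balanced_accuracy_score

y_pred = pipe.predict(X_test)                 # hard class predictions
y_proba = pipe.predict_proba(X_test)[:, 1]    # probability of the positive class (>50K)

performance_metrics = {
    'negative_log_loss': -log_loss(y_test, y_proba),
    'roc_auc': roc_auc_score(y_test, y_proba),
    'accuracy': accuracy_score(y_test, y_pred),
    'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
}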
{
@@ -1013,12 +995,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Binary Classification Compatibility:\n",
"Most machine learning models, including the RandomForestClassifier, expect numerical labels for classification tasks. By converting the income column into binary values (0 and 1), where 0 represents an income of <=50K and 1 represents >50K, you make the target variable suitable for classification.\n",
"Simplification:\n",
"Instead of dealing with string labels like \">50K\" and \"<50K\", the target variable is converted into numeric form directly within the data loading process. This simplifies data handling and makes it easier to apply statistical methods.\n",
"Improved Model Performance:\n",
"Some machine learning algorithms, such as gradient boosting and logistic regression, perform better with numerical labels for binary classification, as they can compute loss functions more efficiently. By recoding the target, these algorithms can better compute the decision boundary.\n"
"Binary Classification Compatibility: Most machine learning models, including the RandomForestClassifier, expect numerical labels for classification tasks. By converting the income column into binary values (0 and 1), where 0 represents an income of <=50K and 1 represents >50K, you make the target variable suitable for classification. Simplification: Instead of dealing with string labels like \">50K\" and \"<50K\", the target variable is converted into numeric form directly within the data loading process. This simplifies data handling and makes it easier to apply statistical methods. Improved Model Performance: Some machine learning algorithms, such as gradient boosting and logistic regression, perform better with numerical labels for binary classification, as they can compute loss functions more efficiently. By recoding the target, these algorithms can better compute the decision boundary."
]
},
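The recoding itself is the one-liner from the loading cell. In isolation, on a toy Series of hypothetical values, it behaves like this:

import pandas as pd

incomes = pd.Series([' <=50K', ' >50K', ' <=50K'])      # toy raw labels with stray whitespace
binary = (incomes.str.strip() == '>50K') * 1            # -> 0, 1, 0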
{
Expand Down
Loading

0 comments on commit db52535

Please sign in to comment.