Completed assignment #1

UofT-DSI · Sep 30, 2024 · 2731dbe
1 parent 5476b71
commit 2731dbe
Showing 1 changed file with 52 additions and 14 deletions.
diff --git a/02_activities/assignments/assignment_1.ipynb b/02_activities/assignments/assignment_1.ipynb
@@ -34,7 +34,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 23,
    "id": "4a3485d6-ba58-4660-a983-5680821c5719",
    "metadata": {},
    "outputs": [],
@@ -96,7 +96,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your answer here"
+    "# Your answer here\n",
+    "wine_df.shape[0]"
    ]
   },
   {
@@ -114,7 +115,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your answer here"
+    "# Your answer here\n",
+    "wine_df.shape[1]"
    ]
   },
   {
@@ -132,7 +134,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your answer here"
+    "# Your answer here\n",
+    "print(f\"'class' type is: {wine_df['class'].dtypes}\")\n",
+    "print(f\"'levels' of 'class': {set(wine_df['class'])}\")"
    ]
   },
   {
@@ -151,7 +155,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your answer here"
+    "# Your answer here\n",
+    "print(f\"number of predictor variables: {wine_df.shape[1]-1}\")"
    ]
   },
   {
@@ -204,7 +209,8 @@
    "id": "403ef0bb",
    "metadata": {},
    "source": [
-    "> Your answer here..."
+    "> Your answer here...\n",
+    "Predictor variables need to be standardized because variables on different scales affect the model fit differently: KNN relies on distances, so a variable with a larger numeric range dominates the distance calculation. Standardizing all predictor variables gives each one a comparable influence on the model."
    ]
   },
   {
@@ -217,10 +223,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fdee5a15",
+   "id": "7628bd1b",
    "metadata": {},
    "source": [
-    "> Your answer here..."
+    "> Your answer here...\n",
+    "'class' is the outcome being predicted, not a predictor, so there is no need to standardize it."
    ]
   },
   {
@@ -233,10 +240,11 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f0676c21",
+   "id": "ae8447fe",
    "metadata": {},
    "source": [
-    "> Your answer here..."
+    "> Your answer here...\n",
+    "Setting a seed makes the random train/test split reproducible. The particular seed value is not important because all we care about is getting the same split every time the code is run."
    ]
   },
   {
@@ -251,15 +259,17 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 42,
    "id": "72c101f2",
    "metadata": {},
    "outputs": [],
    "source": [
     "# Do not touch\n",
     "np.random.seed(123)\n",
     "# Create a random vector of True and False values to split the data\n",
-    "split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])"
+    "split = np.random.choice([True, False], size=len(predictors_standardized), replace=True, p=[0.75, 0.25])\n",
+    "\n",
+    "X_train, X_test, y_train, y_test = train_test_split(predictors_standardized, wine_df['class'], test_size=0.25)"
    ]
   },
   {
@@ -287,7 +297,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here..."
+    "# Your code here...\n",
+    "knn = KNeighborsClassifier()\n",
+    "\n",
+    "parameter_grid = {\n",
+    "    \"n_neighbors\": range(1, 51) # n_neighbors from 1 to 50\n",
+    "}\n",
+    "tune_grid = GridSearchCV(\n",
+    "    estimator=knn,             # knn\n",
+    "    param_grid=parameter_grid, # see above\n",
+    "    cv=10                      # 10-fold cross-validation\n",
+    ")\n",
+    "\n",
+    "# Grid search using training data\n",
+    "tune_grid.fit(X_train, y_train)\n",
+    "print(f\"Best value for n_neighbors is {tune_grid.best_params_['n_neighbors']}\")"
    ]
   },
   {
@@ -308,7 +332,21 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here..."
+    "# Your code here...\n",
+    "\n",
+    "# Initialize KNN with best N-value from the previous grid search\n",
+    "knn = KNeighborsClassifier(n_neighbors=tune_grid.best_params_['n_neighbors'])\n",
+    "\n",
+    "# Train the model\n",
+    "knn.fit(X_train, y_train)\n",
+    "\n",
+    "# Predict with test data\n",
+    "prediction = knn.predict(X_test)\n",
+    "\n",
+    "# Get the accuracy score\n",
+    "accuracy = accuracy_score(y_test, prediction)\n",
+    "\n",
+    "print(f\"Accuracy of the model on the test data set is: {accuracy}\")\n"
    ]
   },
   {