diff --git a/02_activities/assignments/assignment_3.ipynb b/02_activities/assignments/assignment_3.ipynb
index 2de1febca..b54c5060e 100644
--- a/02_activities/assignments/assignment_3.ipynb
+++ b/02_activities/assignments/assignment_3.ipynb
@@ -27,24 +27,18 @@
    "source": [
     "### Clustering and Resampling\n",
     "\n",
-    "Let's set up our workspace and use the **Iris dataset** from `scikit-learn`. This dataset is a classic dataset in machine learning and statistics, widely used for clustering tasks. It consists of 150 samples of iris flowers, each belonging to one of three species: Iris setosa, Iris versicolor, and Iris virginica. Here are the key features and characteristics of the dataset:\n",
+    "Let's set up our workspace and use the **Iris dataset** from `scikit-learn`. This dataset is a classic dataset in machine learning and statistics, widely used for clustering tasks. It consists of many samples of iris flowers. Here are the key features and characteristics of the dataset:\n",
     "\n",
     "##### Features:\n",
     "1. **Sepal Length**: The length of the sepal in centimeters.\n",
     "2. **Sepal Width**: The width of the sepal in centimeters.\n",
     "3. **Petal Length**: The length of the petal in centimeters.\n",
-    "4. **Petal Width**: The width of the petal in centimeters.\n",
-    "\n",
-    "##### Target Variable:\n",
-    "- **Species**: The species of the iris flower, which can take one of the following values:\n",
-    "  - 0: Iris setosa\n",
-    "  - 1: Iris versicolor\n",
-    "  - 2: Iris virginica"
+    "4. **Petal Width**: The width of the petal in centimeters."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 1,
    "id": "4a3485d6-ba58-4660-a983-5680821c5719",
    "metadata": {},
    "outputs": [],
@@ -73,167 +67,10 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": null,
    "id": "a431d282-f9ca-4d5d-8912-71ffc9d8ea19",
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "text/html": [
-       [HTML rendering of the iris_df table: columns sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), species; rows 0-4 and 145-149; 150 rows × 5 columns]
-      ],
-      "text/plain": [
-       "     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n",
-       "0                  5.1               3.5                1.4               0.2   \n",
-       "1                  4.9               3.0                1.4               0.2   \n",
-       "2                  4.7               3.2                1.3               0.2   \n",
-       "3                  4.6               3.1                1.5               0.2   \n",
-       "4                  5.0               3.6                1.4               0.2   \n",
-       "..                 ...               ...                ...               ...   \n",
-       "145                6.7               3.0                5.2               2.3   \n",
-       "146                6.3               2.5                5.0               1.9   \n",
-       "147                6.5               3.0                5.2               2.0   \n",
-       "148                6.2               3.4                5.4               2.3   \n",
-       "149                5.9               3.0                5.1               1.8   \n",
-       "\n",
-       "     species  \n",
-       "0          0  \n",
-       "1          0  \n",
-       "2          0  \n",
-       "3          0  \n",
-       "4          0  \n",
-       "..       ...  \n",
-       "145        2  \n",
-       "146        2  \n",
-       "147        2  \n",
-       "148        2  \n",
-       "149        2  \n",
-       "\n",
-       "[150 rows x 5 columns]"
-      ]
-     },
-     "execution_count": 14,
-     "metadata": {},
-     "output_type": "execute_result"
-    }
-   ],
+   "outputs": [],
    "source": [
     "from sklearn.datasets import load_iris\n",
     "# Load the Iris dataset\n",
@@ -242,22 +79,10 @@
     "# Convert to DataFrame\n",
     "iris_df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)\n",
     "\n",
-    "# Bind the disease progression (diabetes target) to the DataFrame\n",
-    "iris_df['species'] = iris_data.target\n",
-    "\n",
-    "\n",
     "# Display the DataFrame\n",
-    "iris_df\n",
+    "print(iris_df)\n",
     "\n",
-    "#Your code here ... "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "99b725c5",
-   "metadata": {},
-   "source": [
-    "> Your answer here ..."
+    "# Your code here..."
    ]
   },
   {
@@ -268,7 +93,7 @@
     "#### **Question 2:** \n",
     "#### Data-visualization\n",
     "\n",
-    "Create plots to visualize the relationships between the features (sepal length, sepal width, petal length, petal width).\n"
+    "Let's create plots to visualize the relationships between the features (sepal length, sepal width, petal length, petal width).\n"
    ]
   },
   {
@@ -278,7 +103,69 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here ..."
+    "def plot_feature_pairs(data, feature_names, color_labels=None, title_prefix=''):\n",
+    "    \"\"\"\n",
+    "    Helper function to create scatter plots for all possible pairs of features.\n",
+    "    \n",
+    "    Parameters:\n",
+    "    - data: DataFrame containing the features to be plotted.\n",
+    "    - feature_names: List of feature names to be used in plotting.\n",
+    "    - color_labels: Optional. Cluster or class labels to color the scatter plots.\n",
+    "    - title_prefix: Optional. Prefix for plot titles to distinguish between different sets of plots.\n",
+    "    \"\"\"\n",
+    "    # Create a figure for the scatter plots\n",
+    "    plt.figure(figsize=(12, 10))\n",
+    "    \n",
+    "    # Counter for subplot index\n",
+    "    plot_number = 1\n",
+    "    \n",
+    "    # Loop through each pair of features\n",
+    "    for i in range(len(feature_names)):\n",
+    "        for j in range(i + 1, len(feature_names)):\n",
+    "            plt.subplot(len(feature_names)-1, len(feature_names)-1, plot_number)\n",
+    "            \n",
+    "            # Scatter plot colored by labels if provided\n",
+    "            if color_labels is not None:\n",
+    "                plt.scatter(data[feature_names[i]], data[feature_names[j]], \n",
+    "                            c=color_labels, cmap='viridis', alpha=0.7)\n",
+    "            else:\n",
+    "                plt.scatter(data[feature_names[i]], data[feature_names[j]], alpha=0.7)\n",
+    "            \n",
+    "            plt.xlabel(feature_names[i])\n",
+    "            plt.ylabel(feature_names[j])\n",
+    "            plt.title(f'{title_prefix}{feature_names[i]} vs {feature_names[j]}')\n",
+    "            \n",
+    "            # Increment the plot number\n",
+    "            plot_number += 1\n",
+    "\n",
+    "    # Adjust layout to prevent overlap\n",
+    "    plt.tight_layout()\n",
+    "\n",
+    "    # Show the plot\n",
+    "    plt.show()\n",
+    "\n",
+    "# Get feature names\n",
+    "feature_names = iris_df.columns\n",
+    "\n",
+    "# Use the helper function to plot scatter plots without coloring by cluster labels\n",
+    "plot_feature_pairs(iris_df, feature_names, title_prefix='Original Data: ')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a9701cd4",
+   "metadata": {},
+   "source": [
+    "**Question:**\n",
+    "- Do you notice any patterns or relationships between the different features? How might these patterns help in distinguishing between different species?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "35308e2c",
+   "metadata": {},
+   "source": [
+    "> Your answer..."
    ]
   },
   {
@@ -292,50 +179,27 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 16,
-   "id": "b8971d89",
+   "execution_count": null,
+   "id": "b2cfec72",
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \\\n",
-      "0          -0.900681          1.019004          -1.340227         -1.315444   \n",
-      "1          -1.143017         -0.131979          -1.340227         -1.315444   \n",
-      "2          -1.385353          0.328414          -1.397064         -1.315444   \n",
-      "3          -1.506521          0.098217          -1.283389         -1.315444   \n",
-      "4          -1.021849          1.249201          -1.340227         -1.315444   \n",
-      "\n",
-      "   species  \n",
-      "0        0  \n",
-      "1        0  \n",
-      "2        0  \n",
-      "3        0  \n",
-      "4        0  \n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "# Initialize the StandardScaler\n",
     "scaler = StandardScaler()\n",
     "\n",
-    "# Scale the features (excluding the species column)\n",
-    "scaled_features = scaler.fit_transform(iris_df.iloc[:, :-1])\n",
+    "# Scale all the features in the dataset\n",
+    "scaled_features = scaler.fit_transform(iris_df)\n",
     "\n",
     "# Create a new DataFrame with scaled features\n",
     "scaled_iris_df = pd.DataFrame(scaled_features, columns=iris_data.feature_names)\n",
     "\n",
-    "# Add the species column back to the scaled DataFrame\n",
-    "scaled_iris_df['species'] = iris_df['species'].values\n",
-    "\n",
     "# Display the first few rows of the scaled DataFrame\n",
     "print(scaled_iris_df.head())"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "b326e039",
+   "id": "035fa019",
    "metadata": {},
    "source": [
     "Why is it important to standardize the features of a dataset before applying clustering algorithms like K-Means? Discuss the implications of using unstandardized data in your analysis."
@@ -343,7 +207,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fc34b4b7",
+   "id": "53d77d5c",
    "metadata": {},
    "source": [
     "> Your answer here ... "
@@ -356,9 +220,8 @@
    "source": [
     "#### **Question 4:** \n",
     "#### K-means clustering \n",
-    "Apply the K-Means clustering algorithm to the Iris dataset.\n",
-    "Choose the number of clusters (K=3, since there are three species) and fit the model.\n",
-    "Assign cluster labels to the original data and add them as a new column in the DataFrame."
+    "\n",
+    "Apply the K-Means clustering algorithm to the Iris dataset. Choose the value 3 for the number of clusters (`k=3`) and fit the model. Assign cluster labels to the original data and add them as a new column in the DataFrame."
    ]
   },
   {
@@ -368,44 +231,97 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here ..."
+    "# Your answer...\n",
+    "\n",
+    "clustered_iris_data = 🤷‍♂️\n",
+    "\n",
+    "\n",
+    "# Use the helper function to plot scatter plots, colored by cluster labels\n",
+    "plot_feature_pairs(clustered_iris_data, feature_names, color_labels=clustered_iris_data['Cluster'], title_prefix='Clustered Data: ')"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "0aefdee5",
+   "id": "46914737",
    "metadata": {},
    "source": [
-    "Discuss the results of the K-Means clustering. How well did the clusters match the true species?"
+    "We chose `k=3` for the number of clusters arbitrarily. However, in a real-world scenario, it is important to determine the optimal number of clusters using appropriate methods.\n",
+    "\n",
+    "**Question**: What is one method commonly used to determine the optimal number of clusters in K-means clustering, and why is this method helpful?"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "7bcebc16",
+   "id": "83349688",
    "metadata": {},
    "source": [
-    "> Your answer here ..."
+    "> Your answer here..."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "3f76bf62",
+   "id": "a6bc2f4f",
    "metadata": {},
    "source": [
     "#### **Question 5:** \n",
     "#### Bootstrapping \n",
     "\n",
-    " Implement bootstrapping on the mean of one of the sepal or petal measurement variables (e.g., Sepal Length, Petal Width) to assess the stability of the mean estimate. Generate 1000 bootstrap samples, calculate the mean for each sample, and compute a 95% confidence interval to evaluate the variability in the estimate."
+    "Implement bootstrapping on the mean of Petal Width. Generate 10000 bootstrap samples, calculate the mean for each sample, and compute a 90% confidence interval."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
-   "id": "ffefa9f2",
+   "execution_count": null,
+   "id": "be4c4011",
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Your code here ...\n"
+    "# Your answer here...\n",
+    "\n",
+    "mean_petal_width = 🤷‍♂️\n",
+    "\n",
+    "np.random.seed(123)\n",
+    "\n",
+    "lower_bound = 🤷‍♂️\n",
+    "upper_bound = 🤷‍♂️\n",
+    "\n",
+    "# Display the result\n",
+    "print(f\"Mean of Petal Width: {mean_petal_width}\")\n",
+    "print(f\"90% Confidence Interval of Mean Petal Width: ({lower_bound}, {upper_bound})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9f73843",
+   "metadata": {},
+   "source": [
+    "**Question:**\n",
+    "- Why do we use bootstrapping in this context? What does it help us understand about the mean?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "16a6e104",
+   "metadata": {},
+   "source": [
+    "> Your answer..."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0741b2ca",
+   "metadata": {},
+   "source": [
+    "**Question:**\n",
+    "- What is the purpose of calculating the confidence interval from the bootstrap samples? How does it help us interpret the variability of the estimate?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5be82ec",
+   "metadata": {},
+   "source": [
+    "> Your answer..."
    ]
   },
   {
@@ -413,7 +329,9 @@
    "id": "29096311",
    "metadata": {},
    "source": [
-    "Reflect on the variability observed in the bootstrapped means and discuss whether the mean of the selected variable appears to be a stable and reliable estimate based on the confidence interval and the spread of the bootstrapped means."
+    "**Question:**\n",
+    "\n",
+    "- Reflect on the variability observed in the bootstrapped means and discuss whether the mean of the Petal Width appears to be a stable and reliable estimate based on the confidence interval and the spread of the bootstrapped means."
    ]
   },
   {
@@ -421,7 +339,7 @@
    "id": "0a7e6778",
    "metadata": {},
    "source": [
-    "> Your answer here ..."
+    "> Your answer here..."
    ]
   },
   {
@@ -455,7 +373,7 @@
     "\n",
     "### Submission Parameters:\n",
     "* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`\n",
-    "* The branch name for your repo should be: `assignment-1`\n",
+    "* The branch name for your repo should be: `assignment-3`\n",
     "* What to submit for this assignment:\n",
     "    * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.\n",
     "* What the pull request link should look like for this assignment: `https://github.com//applying_statistical_concepts/pull/`\n",
@@ -487,7 +405,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.10"
+   "version": "3.9.19"
   },
   "vscode": {
    "interpreter": {