update pandas_groupby

stanfordjournalism · Feb 7, 2024 · c6da75d
1 parent 1d96730
commit c6da75d
Showing 1 changed file with 103 additions and 18 deletions.
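
The new "Aggregating options" section in this commit contrasts `count` and `sum` on data that has already been rolled up. Here is a minimal, self-contained sketch of that pitfall. The records are hypothetical stand-ins, since the notebook's actual toy data is not part of this diff, and `.size()` is just one assumed way to build a `state_city` frame with a `record_count` column.

```python
import pandas as pd

# Hypothetical toy records; the notebook's real data is not shown in this diff.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal", "Dee"],
    "city": ["Los Angeles", "San Francisco", "San Francisco", "Austin"],
    "state": ["CA", "CA", "CA", "TX"],
})

# Roll up to one row per (state, city) pair, with the number of records in each.
state_city = (
    df.groupby(["state", "city"])
      .size()
      .reset_index(name="record_count")
)

# count() tallies rows of the rolled-up frame, i.e. cities per state.
cities_per_state = state_city.groupby("state").record_count.count()

# sum() adds up the stored counts, i.e. people per state.
people_per_state = state_city.groupby("state").record_count.sum()

# A fresh groupby on the original data yields the same people-per-state totals.
fresh = df.groupby("state")["name"].count()

print(cities_per_state)
print(people_per_state)
print(fresh)
```

With these made-up records, California has two rows in `state_city` but three people, which is why `count` and `sum` disagree on the rolled-up frame.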
diff --git a/content/pandas_groupby.ipynb b/content/pandas_groupby.ipynb
@@ -11,11 +11,11 @@
     "\n",
     "Doing so allows us to apply a variety of calculations to each group. \n",
     "\n",
-    "For example, we could group all the students in 2nd grade by their teacher, and then calculate the median test scores for each group of students to determine if a certain class is falling behind, on track, or excelling.\n",
+    "For example, we could group all the students in 2nd grade by their teacher, and then calculate the median test scores for each group of students. That would help us figure out if a certain class is falling behind, on track, or excelling.\n",
     "\n",
-    "Grouping data is not peculiar to Python and pandas. It has correllaries in a wide variety of tools such as spreadsheet (Excel Pivot Tables) and databases (SQL GROUP BY statements).\n",
+    "Grouping data is not peculiar to Python and `pandas`. A wide variety of tools provide the ability to group data and apply functions to each group, most notably spreadsheets (Excel Pivot Tables) and databases (SQL GROUP BY statements).\n",
     "\n",
-    "So how do we use it in pandas? \n",
+    "So how do we group data using `pandas`?\n",
     "\n",
     "Let's start with a simple, toy set of data to ensure we wrestle with the key concepts."
    ]
@@ -46,7 +46,7 @@
    "id": "86a05726-9083-42bf-9a7c-c42eb252d478",
    "metadata": {},
    "source": [
-    "First, we'll create a DataFrame from using this data."
+    "First, we'll create a DataFrame using this data."
    ]
   },
   {
@@ -79,9 +79,9 @@
     "\n",
     "Let's say we wanted to count the number of people from each state.\n",
     "\n",
-    "If we think through this logically, the very first step would be to group our data by the `state` column. \n",
+    "If we think through this logically, the very first step would be to organize, or group, our data by the `state` column. \n",
     "\n",
-    "You could do this manually or with basic Python as below:"
+    "You can do this with plain old Python, as shown below:"
    ]
   },
   {
@@ -129,7 +129,7 @@
    "id": "b8f5468a-fd4e-4be2-85e0-c3326806fb77",
    "metadata": {},
    "source": [
-    "If you examine this dictionary you can see that each person has been grouped into the appropriate state."
+    "If you examine this dictionary you can see that the record for each person has been placed into the appropriate state."
    ]
   },
   {
@@ -147,7 +147,7 @@
    "id": "f6adb7de-9739-40b1-bc3b-b5c7f253fbf1",
    "metadata": {},
    "source": [
-    "And you can of course now determine the number of people in each state by counting the length of each state's list."
+    "And now you can determine the number of people in each state by counting the length of each state's list."
    ]
   },
   {
@@ -158,10 +158,7 @@
    "outputs": [],
    "source": [
     "for state, people in states.items():\n",
-    "    num = len(people)\n",
-    "    # Use some a one-line condition to determine singular/plural\n",
-    "    person_or_people = 'person' if num == 1 else 'people'\n",
-    "    print(f\"{state} has {num} {person_or_people}\")"
+    "    print(state, len(people))"
    ]
   },
   {
@@ -195,7 +192,7 @@
    "source": [
     "When we run the above code, we see that we get a `DataFrameGroupBy` object. And that type of object happens to have a [groups](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.groups.html) attribute that lets us examine which rows ended up in each group. \n",
     "\n",
-    "Let's peek under the hood to see how it did."
+    "Let's peek under the hood to better understand the grouping operation."
    ]
   },
   {
@@ -244,7 +241,7 @@
     "\n",
     "That's all well and good, but generally we want to do *something* with our groups. \n",
     "\n",
-    "For example, we could count number of people from each state. Notice that we get the same counts as the more lengthy method using Python dictionaries.\n",
+    "For example, we could count the number of people in each state. Notice that we get the same counts as the more lengthy method using Python dictionaries.\n",
     "\n",
     "> The output is a bit wonky, at least compared to a SQL Group By query, since it includes all the columns in the data.\n",
     "\n"
@@ -267,7 +264,7 @@
    "source": [
     "You can also perform calculations on individual columns, which often makes more sense than applying an aggregate function to all columns in each group.\n",
     "\n",
-    "Here's how we'd count the cities by state using the `name` field."
+    "Here's how we'd count the rows in each group using the `state` field."
    ]
   },
   {
@@ -277,7 +274,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "group_obj.name.count()"
+    "group_obj.state.count()"
    ]
   },
   {
@@ -287,7 +284,7 @@
    "source": [
     "And how we'd find the max and median salary for each state.\n",
     "\n",
-    "> Note we're using `.reset_index()` now to restore the output to a proper DataFrame. We're also using `.rename` to be more explicit about the nature of the calculation."
+    "> Note we're using `.reset_index()` below to restore the output to a proper DataFrame. We're also using `.rename` to be more explicit about the nature of the calculation."
    ]
   },
   {
@@ -360,7 +357,7 @@
     "\n",
     "Our new output includes the state, and then as a form of nested data, the cities within those states. \n",
     "\n",
-    "Finally, let's sort the output from largest to small count."
+    "Finally, let's sort the output from highest to lowest `record_count`."
    ]
   },
   {
@@ -372,6 +369,94 @@
    "source": [
     "state_city.sort_values('record_count', ascending=False)"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9455eaf3-f71f-4501-a5af-844ff88f9bfd",
+   "metadata": {},
+   "source": [
+    "## Aggregating options\n",
+    "\n",
+    "You may have noticed that we used `.reset_index()` in prior steps to ensure we're always working with a proper DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2bdaab3e-93d5-4ce4-b904-1962d0dd44a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "type(state_city)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "836c7057-1ae9-4a04-98fa-dac2e1586667",
+   "metadata": {},
+   "source": [
+    "Since `state_city` is still a DataFrame, we have the option of performing further `groupby` operations on this already \"rolled up\" or aggregated data.\n",
+    "\n",
+    "But we can't use `count` this time since it will produce the wrong answer.\n",
+    "\n",
+    "Can you figure out why?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "946df783-8e2f-4ac5-b57a-26131b194d24",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "state_city.groupby('state').record_count.count()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2aca49f-d59c-42f8-98c4-d8bb4c678677",
+   "metadata": {},
+   "source": [
+    "Remember, `state_city` has rows containing the count for each combination of city and state. \n",
+    "\n",
+    "In this case, there are two California rows: one for people in LA and a second for people in San Francisco. Our remaining data only has one city per state, so only one row for each state.\n",
+    "\n",
+    "If we now group `state_city` by `state` and then `count` the rows by group, we're getting a count of the number of cities in each group rather than the number of people stored in `record_count`.\n",
+    "\n",
+    "Since this data is in a \"rolled up\" state (i.e., already aggregated from more granular data), we need to **add** the numbers from `state_city` to produce a correct count of people in each state."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0c0fb8c4-97b3-438e-807e-ca3cdf97d006",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "state_city.groupby('state').record_count.sum()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "815ae1ea-5772-437b-9b23-fe1ec4f288ec",
+   "metadata": {},
+   "source": [
+    "If you find this confusing, you wouldn't be alone. It can definitely get tricky to work with data that has already been aggregated from more granular data.\n",
+    "\n",
+    "But never fear. You always have the option of performing a fresh `groupby` on the original data.\n",
+    "\n",
+    "Often that's the simplest and wisest course of action."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7b0f480d-d64e-472e-8362-2bbd21fb6299",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df.groupby('state').state.count()"
+   ]
   }
  ],
  "metadata": {