-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwrangle_summarize.qmd
125 lines (86 loc) · 4.35 KB
/
wrangle_summarize.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
---
title: "Summarizing Data"
---
## Library & Data Loading
Begin by loading any needed libraries and reading in an external data file for use in downstream examples.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
Load the `tidyverse` meta-package as well as our vertebrate data.
```{r r-start, warning = F, message = F}
# Load needed library
library(tidyverse)
# Load data
vert_r <- utils::read.csv(file = file.path("data", "verts.csv"))
# Check out first few rows
head(vert_r, n = 5)
```
## [{{< fa brands python >}} Python]{.py}
Load the `pandas` and `os` libraries as well as our vertebrate data.
```{python py-libs}
# Load needed libraries
import os
import pandas as pd
# Load data
vert_py = pd.read_csv(os.path.join("data", "verts.csv"))
# Check out first few rows
vert_py.head()
```
:::
## Defining Groups
Now that we have data, we can use it to demonstrate groupwise summarization! Both [{{< fa brands python >}} Python]{.py} and [{{< fa brands r-project >}} R]{.r} support defining the grouping structure in one step and then doing the actual summarization in subsequent steps.
Let's suppose that we want to calculate average weight and length within each species and year.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
In [{{< fa brands r-project >}} R]{.r}, we use the `group_by` function (from `dplyr`) to define our grouping variables. Note that this changes the class from `data.frame` to `tibble` (a special dataframe-like class defined by the tidyverse in part due to this operation).
`group_by`--like other functions in the tidyverse--allows for 'tidy select' column names where column names are provided without quotes.
```{r r-group}
# Define grouping variables
vert_r_grp <- dplyr::group_by(.data = vert_r, year, species)
# Check class of resulting object
class(vert_r_grp)
```
## [{{< fa brands python >}} Python]{.py}
In [{{< fa brands python >}} Python]{.py}, we can use the `groupby` method (available to all variables of type `DataFrame`). This method does accept multiple column labels but they must be provided as a list (i.e., wrapped in square brackets) and the labels themselves must be wrapped in quotes.
```{python py-group}
# Define grouping variables
vert_py_grp = vert_py.groupby(["year", "species"])
# Check type of resulting variable
type(vert_py_grp)
```
:::
## Groupwise Summarization
Once we've established the columns for which we want to calculate summary statistics we can move on to the tools that actually allow summarization steps to take place. Note that _summarization_ implicitly means that we should lose rows because we should only have one row per combination of grouping column content combination. Adding columns is not necessarily summarization if it doesn't shed rows.
:::panel-tabset
## [{{< fa brands r-project >}} R]{.r}
[{{< fa brands r-project >}} R]{.r} allows for multiple summary metrics to be calculate simultaneously within `dplyr`'s `summarize` function.
```{r silence-dplyr}
#| echo: false
# Silence dplyr summarization message
options(dplyr.summarise.inform = F)
```
```{r r-summary}
# Calculate average weight and length
vert_r_summary <- dplyr::summarize(.data = vert_r_grp,
weight_g = mean(weight_g, na.rm = T),
length_1_mm = mean(length_1_mm, na.rm = T),
length_2_mm = mean(length_2_mm, na.rm = T))
# Check out the first bit of that
head(vert_r_summary, n = 3)
```
## [{{< fa brands python >}} Python]{.py}
When summarizing data in [{{< fa brands python >}} Python]{.py} we need to calculate each metric separately and then combine them. Apologies that this uses the `concat` function in `pandas` that we have not previously covered but you can see a greater discussion of these functions/methods in the "Join" section.
Note the use of two modes of indexing a particular column.
```{python py-summary}
# Calculate average weight and length
mean_wt = vert_py_grp["weight_g"].mean()
mean_ln1 = vert_py_grp.length_1_mm.mean()
mean_ln2 = vert_py_grp["length_2_mm"].mean()
# Combine them
vert_py_summary = pd.concat([mean_wt, mean_ln1, mean_ln2], axis = 1)
# Do some small index reformatting steps
vert_py_summary = vert_py_summary.reset_index().rename_axis(mapper = None, axis = 1)
# Check out that
vert_py_summary.head(3)
```
The index re-setting is needed for [{{< fa brands python >}} Python]{.py} just to make sure the columns and rows are indexed as they should be.
:::