Intro_to_Base_R: Switch iris -> penguins dataset #403

Merged
merged 11 commits into from
Mar 4, 2021
1 change: 1 addition & 0 deletions components/dictionary.txt
@@ -1,3 +1,4 @@
Adelie
aes
al
Alboukadel
53 changes: 30 additions & 23 deletions intro-to-R-tidyverse/01-intro_to_base_R.Rmd
@@ -383,11 +383,18 @@ question_values %in% values_1_to_20

## Data frames

_Data frames are the most fundamental unit of data analysis in R._
_Data frames are one of the most useful tools for data analysis in R._
They are tables which consist of rows and columns, much like a _spreadsheet_.
Each column is a variable which behaves as a _vector_, and each row is an observation.
We will begin our exploration with the old trusted dataset `iris`, which comes with R.
Learn about this dataset using the standard help approach of `?iris`.
We will begin our exploration with a dataset of measurements from three penguin species, which we can find in the [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/).
We'll talk more about packages soon!
To use this dataset, we will load it from the `palmerpenguins` package using a `::` (more on this later) and assign it to a variable named `penguins` in our current environment.

```{r penguin-library}
penguins <- palmerpenguins::penguins
```
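
To make the "columns are vectors, rows are observations" idea concrete, here is a minimal sketch that builds a tiny data frame by hand. The name `toy_penguins` and the values in it are made up for illustration; they are not taken from the `palmerpenguins` data.

```r
# A toy data frame with hypothetical values (not real palmerpenguins rows)
toy_penguins <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  bill_length_mm = c(39.1, 46.1, 46.5)
)

# Each column is a vector...
is.vector(toy_penguins$bill_length_mm)

# ...and each row is one observation
nrow(toy_penguins)
ncol(toy_penguins)
```

Running `str(toy_penguins)` on this small example should show the same kind of output you will see for the full `penguins` data frame, just with fewer rows and columns.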

![drawings of penguin species](diagrams/lter_penguins.png) Artwork by @allison_horst

### Exploring data frames

@@ -407,54 +414,54 @@ We can additionally explore _overall properties_ of the data frame with two diff

This provides summary statistics for each column:

```{r iris-summary}
summary(iris)
```{r penguins-summary}
summary(penguins)
```

This provides a short view of the **str**ucture and contents of the data frame.

```{r iris-str}
str(iris)
```{r penguins-str}
str(penguins)
```

You'll notice that the column `Species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels".
We have learned here that there are three levels in the `Species` column: setosa, versicolor, and virginica.
You'll notice that the column `species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels".
We have learned here that there are three levels in the `species` column: Adelie, Chinstrap, and Gentoo.
We might want to explore individual columns of the data frame more in-depth.
We can examine individual columns using the dollar sign `$` to select one by name:

```{r iris-subset}
# Extract Sepal.Length as a vector
iris$Sepal.Length
```{r penguins-subset}
# Extract bill_length_mm as a vector
penguins$bill_length_mm

# indexing operators can be used too
iris$Sepal.Width[1:10]
penguins$bill_depth_mm[1:10]
```

We can perform our regular vector operations on columns directly.

```{r iris-col-mean, live = TRUE}
# calculate the mean of the Sepal.Length column
mean(iris$Sepal.Length)
```{r penguins-col-mean, live = TRUE}
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm)
```
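
One thing to watch for: if a column contains missing values (`NA`), `mean()` returns `NA` unless told to skip them. The sketch below uses a short made-up vector (`bill_lengths` is a hypothetical stand-in, not the real column) to show the effect of the `na.rm` argument.

```r
# Hypothetical measurements with one missing value
bill_lengths <- c(39.1, NA, 40.3)

# With a missing value present, the mean is also missing
mean(bill_lengths)

# na.rm = TRUE drops missing values before averaging
mean(bill_lengths, na.rm = TRUE)
```

The same argument works on a data frame column, for example `mean(penguins$bill_length_mm, na.rm = TRUE)`.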

We can also calculate the full summary statistics for a single column directly.

```{r iris-col-summary, live = TRUE}
# show a summary of the Sepal.Length column
summary(iris$Sepal.Length)
```{r penguins-col-summary, live = TRUE}
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
```

Extract `species` as a vector and subset it to see a preview.

```{r iris-col-subset, live = TRUE}
```{r penguins-col-subset, live = TRUE}
# get the first 10 values of the species column
iris$Species[1:10]
penguins$species[1:10]
```

And view its _levels_ with the `levels()` function.

```{r}
levels(iris$Species)
```{r penguin-levels}
levels(penguins$species)
```
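
As a small sketch of how factors and levels relate, the example below builds a factor from a made-up character vector (`toy_species` is a hypothetical name used only here). By default, R sorts the levels alphabetically and stores each value as an integer code pointing at its level.

```r
# Build a factor from hypothetical species labels
toy_species <- factor(c("Gentoo", "Adelie", "Chinstrap", "Adelie"))

# Levels are the distinct categories, sorted alphabetically by default
levels(toy_species)

# Under the hood, each value is an integer index into the levels
as.integer(toy_species)

# Count how many observations fall in each level
table(toy_species)
```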

## Files and directories