Intro_to_Base_R: Switch iris -> penguins dataset #403

Merged
merged 11 commits into from
Mar 4, 2021
1 change: 1 addition & 0 deletions components/dictionary.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
Adelie
aes
al
Alboukadel
51 changes: 29 additions & 22 deletions intro-to-R-tidyverse/01-intro_to_base_R.Rmd
Original file line number Diff line number Diff line change
@@ -383,12 +383,19 @@ question_values %in% values_1_to_20

## Data frames

_Data frames are the most fundamental unit of data analysis in R._
_Data frames are one of the most useful tools for data analysis in R._
They are tables which consist of rows and columns, much like a _spreadsheet_.
Each column is a variable which behaves as a _vector_, and each row is an observation.
We will begin our exploration with the old trusted dataset `iris`, which comes with R.
Learn about this dataset using the standard help approach of `?iris`.
We will begin our exploration with a dataset about penguins from the [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/).
To use this dataset, we will need to extract it from the `palmerpenguins` using a `::` (more on this later).
Member commented:

I'm using the word "environment" here but I realized that this notebook uses the word "environment" a couple of different ways, to refer both to the overall interface, and to the variables that are present in the "workspace", shown in the Environment pane.

Suggested change
To use this dataset, we will need to extract it from the `palmerpenguins` using a `::` (more on this later).
To use this dataset, we will load it from the `palmerpenguins` package using a `::` (more on this later) and assign it to a variable named `penguins` in our current environment.

We should try to standardize on a single meaning for each word, if we can. Unfortunately, RStudio uses "workspace" (load/save/clear workspace) and "environment" (the name of the pane) in its interface to mean mostly the same thing, but it does mean that both of those words are problematic to describe the whole shebang, as we do on line 47 (and in the objectives).
Maybe we should call it the "RStudio Interface" which I don't love, but don't have anything better right now.

Contributor Author replied:

Yes. This is a good thing to streamline, but probably outside the scope of this PR. I'll make an issue for it.


```{r penguin-library}
penguins <- palmerpenguins::penguins
```
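
As a rough illustration of what `::` does, the sketch below uses the `stats` package, which ships with R, in place of `palmerpenguins` so that it runs without installing anything. The vector `bill_lengths` is made up for the example.

```r
# Made-up values, for illustration only
bill_lengths <- c(39.1, 39.5, 40.3, 36.7)

# package::name looks up `name` inside `package` without calling library() first
median_qualified <- stats::median(bill_lengths)

# stats is attached by default, so the unqualified call finds the same function
median_plain <- median(bill_lengths)

# Both calls give the same result
identical(median_qualified, median_plain)
```

`palmerpenguins::penguins` follows the same pattern: the package has to be installed, but it does not need to be attached with `library()`.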

`penguins` is a data frame with measurements and information on penguins of three different species.

![](diagrams/lter_penguins.png)
### Exploring data frames

The first step to using any data is to look at it!!!
@@ -407,54 +414,54 @@ We can additionally explore _overall properties_ of the data frame with two diff

This provides summary statistics for each column:

```{r iris-summary}
summary(iris)
```{r penguins-summary}
summary(penguins)
```

This provides a short view of the **str**ucture and contents of the data frame.

```{r iris-str}
str(iris)
```{r penguins-str}
str(penguins)
```

You'll notice that the column `Species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels".
We have learned here that there are three levels in the `Species` column: setosa, versicolor, and virginica.
You'll notice that the column `species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels".
We have learned here that there are three levels in the `species` column: Adelie, Chinstrap, and Gentoo.
We might want to explore individual columns of the data frame more in-depth.
We can examine individual columns using the dollar sign `$` to select one by name:

```{r iris-subset}
# Extract Sepal.Length as a vector
iris$Sepal.Length
```{r penguins-subset}
# Extract bill_length_mm as a vector
penguins$bill_length_mm

# indexing operators can be used too
iris$Sepal.Width[1:10]
penguins$bill_depth_mm[1:10]
```
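
For readers who want to try `$` without the package installed, here is a minimal sketch on a small made-up data frame. The name `toy_penguins` and its values are hypothetical stand-ins, not rows from the real dataset.

```r
# A small made-up data frame standing in for `penguins`
toy_penguins <- data.frame(
  species = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie")),
  bill_length_mm = c(39.1, 46.1, 46.5, NA)
)

# `$` returns the column as a plain vector
bill_lengths <- toy_penguins$bill_length_mm
is.numeric(bill_lengths)

# `[[ ]]` with the column name extracts the same vector
identical(bill_lengths, toy_penguins[["bill_length_mm"]])

# Indexing the extracted vector works as it does for any other vector
bill_lengths[1:2]
```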

We can perform our regular vector operations on columns directly.

```{r iris-col-mean, live = TRUE}
# calculate the mean of the Sepal.Length column
mean(iris$Sepal.Length)
```{r penguins-col-mean, live = TRUE}
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm)
```
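
One caveat worth checking: as distributed in `palmerpenguins`, the `bill_length_mm` column contains missing values, and `mean()` returns `NA` whenever its input contains one. A minimal sketch with a made-up vector (not the real data) shows that behavior and the `na.rm = TRUE` argument, which skips missing values.

```r
# Made-up bill lengths with one missing value
bill_lengths <- c(39.1, 39.5, NA, 36.7)

# Any NA in the input makes the result NA
mean_with_na <- mean(bill_lengths)
is.na(mean_with_na)

# na.rm = TRUE drops missing values before averaging
mean_without_na <- mean(bill_lengths, na.rm = TRUE)
mean_without_na
```

Whether to drop missing values silently is a judgment call. Counting them first with `sum(is.na(bill_lengths))` makes clear how many observations are being left out.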

We can also calculate the full summary statistics for a single column directly.

```{r iris-col-summary, live = TRUE}
# show a summary of the Sepal.Length column
summary(iris$Sepal.Length)
```{r penguins-col-summary, live = TRUE}
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
```

Extract `Species` as a vector and subset it to see a preview.

```{r iris-col-subset, live = TRUE}
```{r penguins-col-subset, live = TRUE}
# get the first 10 values of the Species column
iris$Species[1:10]
penguins$species[1:10]
```

And view its _levels_ with the `levels()` function.

```{r}
levels(iris$Species)
levels(penguins$species)
```
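
To make the idea of levels more concrete, the sketch below builds a small factor by hand. The labels are made up for illustration. The notebook describes factors as a special kind of character variable. Under the hood, R stores each element as an integer code that points into the vector of levels.

```r
# Made-up species labels, for illustration only
species <- factor(c("Gentoo", "Adelie", "Chinstrap", "Adelie"))

# Levels are the distinct categories, sorted alphabetically by default
levels(species)

# Each element is stored as an integer code pointing into the levels
as.integer(species)

# table() counts how many observations fall in each level
table(species)
```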

## Files and directories