-
Notifications
You must be signed in to change notification settings - Fork 3
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
adding hands-on without the code for teaching
- Loading branch information
Showing
1 changed file
with
268 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
--- | ||
title: "Hands-on DuckDB & dplyr" | ||
execute: | ||
warning: false | ||
--- | ||
|
||
## The dataset | ||
|
||
ARCTIC SHOREBIRD DEMOGRAPHICS NETWORK [https://doi.org/10.18739/A2222R68W](https://doi.org/10.18739/A2222R68W){target="_blank"} | ||
|
||
Data set hosted by the NSF Arctic Data Center (<https://arcticdata.io>) | ||
|
||
Field data on shorebird ecology and environmental conditions were collected from 1993-2014 at 16 field sites in Alaska, Canada, and Russia. | ||
|
||
![Shorebird, copyright NYT](https://static01.nyt.com/images/2017/09/10/nyregion/10NATURE1/10NATURE1-superJumbo.jpg?quality=75&auto=webp){width=60%} | ||
|
||
Data were not collected in every year at all sites. Studies of the population ecology of these birds included nest-monitoring to determine timing of reproduction and reproductive success; live capture of birds to collect blood samples, feathers, and fecal samples for investigations of population structure and pathogens; banding of birds to determine annual survival rates; resighting of color-banded birds to determine space use and site fidelity; and use of light-sensitive geolocators to investigate migratory movements. Data on climatic conditions, prey abundance, and predators were also collected. Environmental data included weather stations that recorded daily climatic conditions, surveys of seasonal snowmelt, weekly sampling of terrestrial and aquatic invertebrates that are prey of shorebirds, live trapping of small mammals (alternate prey for shorebird predators), and daily counts of potential predators (jaegers, falcons, foxes). Detailed field methods for each year are available in the ASDN_protocol_201X.pdf files. All research was conducted under permits from relevant federal, state and university authorities. | ||
|
||
See `01_ASDN_Readme.txt` provided in the `data` folder for full metadata information about this data set. | ||
|
||
|
||
## Analyzing the bird dataset using csv files (raw data) | ||
|
||
Loading the necessary packages. DuckDB has its own R package that is mostly a wrapper around dbplyr and DBI. | ||
|
||
```{r} | ||
#| message: false | ||
library(tidyverse) | ||
library(dbplyr) # to query databases in a tidyverse style manner | ||
library(DBI) # to connect to databases | ||
# install.packages("duckdb") # install this package to get duckDB API | ||
library(duckdb) # Specific to duckDB | ||
``` | ||
|
||
Import the csv files with the bird species information: | ||
|
||
```{r} | ||
# Import the species | ||
``` | ||
|
||
Let's explore what is in the `Relevance` attribute/column: | ||
|
||
```{r} | ||
``` | ||
|
||
We are interested in the `Study species` because according to the metadata, they are the species that are included in the data sets for banding, resighting, and/or nest monitoring. | ||
|
||
Let us extract the species and sort them in alphabetical order: | ||
|
||
```{r} | ||
# list of the bird species included in the study | ||
``` | ||
|
||
### Average egg volume | ||
|
||
:::{.callout-tip} | ||
## Analysis | ||
***We would like to know what is the average egg size for each of those bird species. How would we do that?*** | ||
::: | ||
|
||
We will need more information that what we have in our species table. Actually we will need to also retrieve information from the nests and eggs monitoring table. | ||
|
||
An egg is in a nest, and a nest is associated with a species | ||
|
||
```{r} | ||
``` | ||
|
||
How do we join those tables? | ||
|
||
```{r} | ||
``` | ||
|
||
`Nest_Id` seems like promising as a foreign key!! | ||
|
||
```{r} | ||
``` | ||
|
||
`Species` is probably the field we will use to join nest to the species | ||
|
||
OK let's do it: | ||
|
||
First, we need to compute the average of the volume of an egg. We can use the following formula: | ||
|
||
$Volume=\frac{\Pi}6W^2L$ | ||
|
||
Where W is the width and L the length of the egg | ||
|
||
We can use mutate to do so: | ||
|
||
```{r} | ||
``` | ||
|
||
Now let's join this information to the nest table, and average by species | ||
|
||
```{r} | ||
``` | ||
|
||
Ideally we would like the scientific names... | ||
|
||
```{r} | ||
``` | ||
|
||
|
||
## Let's connect to our first database | ||
|
||
### Load the bird database | ||
|
||
This database has been built from the csv files we just analyzed, so the data should be very similar - note we did not say identical more on this in the last section: | ||
|
||
```{r} | ||
conn <- dbConnect(duckdb::duckdb(), dbdir = "./data/bird_database.duckdb", read_only = FALSE) | ||
``` | ||
|
||
List all the tables present in the database: | ||
|
||
```{r} | ||
dbListTables(conn) | ||
``` | ||
|
||
Let's have a look at the Species table | ||
|
||
```{r} | ||
``` | ||
|
||
You can filter the data and select columns: | ||
|
||
```{r} | ||
``` | ||
|
||
:::{.callout-note} | ||
## Note | ||
_Note that those are **not** data frames but tables. What `dbplyr` is actually doing behind the scenes is translating all those dplyr operations into SQL, sending the SQL code to query the database, retrieving results, etc._ | ||
::: | ||
|
||
#### How can I get a "real" data frame? | ||
|
||
You add `collect()` to your query. | ||
|
||
```{r} | ||
``` | ||
|
||
Note it means the full query is going to be ran and save in your R environment. This might slow things down, so you generally want to collect on the smallest data frame you can. | ||
|
||
|
||
#### How can you see the SQL query? | ||
|
||
Adding `show_query()` at the end of your code block will let you see the SQL code that has been used to query the database. | ||
|
||
```{r} | ||
# Add show_query() to the end to see what SQL it is sending! | ||
``` | ||
|
||
This is a great way to start getting familiar with the SQL syntax, because although you can do a lot with `dbplyr` you can not do everything that SQL can do. So at some point you might want to start using SQL directly. | ||
|
||
Here is how you could run the query using the SQL code directly: | ||
|
||
```{r} | ||
# query the database using SQL | ||
``` | ||
|
||
You can do pretty much anything with these quasi-tables, including grouping, summarization, joins, etc. | ||
|
||
Let's count how many species there are per Relevance categories: | ||
|
||
```{r} | ||
``` | ||
|
||
Does that code looks familiar? But this time, here is really the query that was used to retrieve this information: | ||
|
||
```{r} | ||
``` | ||
|
||
|
||
### Average egg volume analysis | ||
|
||
Let's reproduce the egg volume analysis we just did. We can calculate the average bird eggs volume per species directly on the database: | ||
|
||
```{r} | ||
``` | ||
|
||
Compute the volume using the same code as previously!! Yes, you can use mutate to create new columns on the tables object | ||
|
||
```{r} | ||
# Compute the egg volume | ||
``` | ||
|
||
:::{.callout-caution} | ||
_Limitation: no way to add or update data in the database, `dbplyr` is view only. If you want to add or update data, you'll need to use the `DBI` package functions._ | ||
::: | ||
|
||
|
||
Now let's join this information to the nest table, and average by species | ||
|
||
```{r} | ||
# Join the egg and nest tables to compute average | ||
``` | ||
|
||
What does this SQL query looks like? | ||
|
||
```{r} | ||
``` | ||
|
||
:::{.callout-note} | ||
## Question | ||
***Why does the SQL query include the volume computation?*** | ||
::: | ||
|
||
### Disconnecting from the database | ||
|
||
Before we close our session, it is good practice to disconnect from the database first | ||
|
||
```{r} | ||
DBI::dbDisconnect(conn, shutdown = TRUE) | ||
``` | ||
|
||
|
||
## How did we create this database | ||
|
||
You might be wondering how we created this database from our csv files. Most databases provide functions to import data from csv and other types of files. It is also possible to load data into the database programmatically from within R, one row at a time, using insert statements, but it is more common to load data from csv files. Note that since there is little data modeling within a csv file (the data does not have to be normalized or tidy), and no data type or value constraints can be enforced, a lot things can go wrong. Putting data in a database is thus a great opportunity to implement QA/QC and help you keep your data clean and tidy moving forward as new data are collected. | ||
|
||
To look at one example, below is the SQL code that was used to create the `Bird_eggs` table: | ||
|
||
```{sql eval=FALSE} | ||
CREATE TABLE Bird_eggs ( | ||
Book_page VARCHAR, | ||
Year INTEGER NOT NULL CHECK (Year BETWEEN 1950 AND 2015), | ||
Site VARCHAR NOT NULL, | ||
FOREIGN KEY (Site) REFERENCES Site (Code), | ||
Nest_ID VARCHAR NOT NULL, | ||
FOREIGN KEY (Nest_ID) REFERENCES Bird_nests (Nest_ID), | ||
Egg_num INTEGER NOT NULL CHECK (Egg_num BETWEEN 1 AND 20), | ||
Length FLOAT NOT NULL CHECK (Length > 0 AND Length < 100), | ||
Width FLOAT NOT NULL CHECK (Width > 0 AND Width < 100), | ||
PRIMARY KEY (Nest_ID, Egg_num) | ||
); | ||
COPY Bird_eggs FROM 'ASDN_Bird_eggs.csv' (header TRUE); | ||
``` | ||
|
||
DuckDB's `COPY` SQL command reads a csv file into a database table. Had we not already created the table in the previous statement, DuckDB would have created it automatically and guessed at column names and data types. But by explicitly declaring the table, we are able to add more characterization to the data. Notable in the above: | ||
|
||
- `NOT NULL` indicates that missing values are not allowed. | ||
- Constraints (e.g., `Egg_num BETWEEN 1 and 20`) express our expectations about the data. | ||
- A `FOREIGN KEY` declares that a value must refer to an existing value in another table, i.e., it must be a reference. | ||
- A `PRIMARY KEY` identifies a quantity that should be unique within each row, and that serves as a row identifier. | ||
|
||
Understand that a table declaration serves as more than documentation; the database actually enforces constraints. |