06-r-and-sql.Rmd

---
title: "SQL databases and R"
author: Data Carpentry contributors
---


```{r, echo=FALSE, purl=FALSE, message = FALSE, results='hide'}
source("setup.R")
if (file.exists("portalR.db")) file.remove("portalR.db")
```

------------

> ## Learning Objectives
>
> * Students will be able to connect to an existing database using R
> * Query an existing database through R using queries constructed with
>   variables
> * Create a database and populate it from R
> * Understand that scripted database interactions enhance reproducibility of
>   analysis

------------

# Introduction

Interacting with databases through scripted languages can offer advantages over
querying databases via a GUI interface.  A GUI interface for your database is
easier to use and allows the quick viewing of adhoc queries.  Querying a
database with a programatic interface (in this case R, but it could be any
language) are slightly more complicated.  However the trade-off is that data
manipulations are preserved in the code.  Aggregations, summaries and other
database operations are preserved.  Therefore those pre-analysis data
manipulation steps are not lost and can be reproduced later by yourself or
others.

# Connecting R to sqlite databases

R can connect to databases through a number of packages.  In our case we we will
use [RSQLite](http://cran.r-project.org/web/packages/RSQLite/index.html) to
connect to existing SQLite3 databases.  However you should be able to connect on
almost any database in R via
JDBC[http://cran.r-project.org/web/packages/RJDBC/index.html](http://cran.r-project.org/web/packages/RJDBC/index.html)
or [ODBC](http://cran.r-project.org/web/packages/RODBC/index.html), or specific
database packages (such as we are doing, or
[MySQL](http://cran.r-project.org/web/packages/RMySQL/index.html) ).

To begin these exercises we'll connect to the mammal database you've just
created.

```{r connect, purl=FALSE}
library(RSQLite)
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
```

Now you're connected to your database, you can get information about the tables
in the database.  While SQLite manager will provide this for you, you'll need to
get it yourself with some basic SQL commands

# Basic queries

You can grab information about all tables like this:

```{r Get Table information, eval=FALSE}
dbGetQuery(conn, "SELECT type, tbl_name  FROM sqlite_master")
```

`RSQLite` has some functions that easily list basic information about your
database and tables. Here you can see the types and names of fields and get a
count of records.

```{r table details, purl=FALSE}
dbListTables(conn)
dbListFields(conn, "surveys")
dbGetQuery(conn, "SELECT count(*) FROM surveys")
```

Great!  That's all there is to it. You are now connected to your database and
know something about the tables in it.  From here you can manipulate it the
exact same way you can from SQLite Manager except in a scripted manner. Let's
try some basic queries from the previous lesson.  Querying simply takes a
connection to a database and query as inputs and returns a dataframe with the
results.

```{r basic queries, purl=FALSE}
q <- 'SELECT DISTINCT year, species_id FROM surveys'
result <-  dbGetQuery(conn, q)
head(result)
```


> ## Challenge
>
> Write a query that gets counts of genus by plot type. If this seems to
> advanced, then get counts of genus by plot_id (eliminates one of the >joins)

### Answer

```{r genus by type, purl=FALSE}
q <- "SELECT d.plot_type , c.genus, count(*)
FROM
(SELECT a.genus, b.plot
FROM species a
JOIN surveys b
ON a.species_id = b.species_id) c
JOIN plots d
ON c.plot = d.plot_id
GROUP BY d.plot_type,c.genus"

result <- dbGetQuery(conn,q)
head(result)

```

In the above answer you'll note that I use a common technique in SQL queries
called aliasing.  Aliasing is simply a way to reference a table or sub query,
usually within a join. In the above answer aliases allow an easy way to give a
shorthand name to a table, e.g. `plots d`. Sub queries also require an alias if
you want to use them in a join.


All we've done so far is execute the same sorts of queries that can easily be
made with a GUI.  Now let's try leveraging the power of scripted queries.
Imagine you want to know how many rodents were found every other year.  To get
this we'll get the range of dates from the database, sequence them by two and
make new queries.

```{r Get years of data, purl=FALSE}
yearRange <- dbGetQuery(conn,"SELECT min(year),max(year) FROM surveys")
years <- seq(yearRange[,1],yearRange[,2],by=2)
```

Next we'll build our query string using the `paste()` function.

```{r creating query, purl=FALSE}
q <- paste("
SELECT a.year,b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
AND a.year in (",
paste(years,collapse=",")
,")
GROUP BY a.year, b.taxa",
sep = "" )
rCount <- dbGetQuery(conn,q)
head(rCount)
```

With the nested paste commands we were able to construct a query
programatically, without having to type out all the years.  This could also be
done with a for loop, especially if the query to be constructed is more
complicated.

# Building your workflow


Up until now we been working with an existing data. However we can use R as a
way to build up databases from existing flat files.  We'll use the flat files
that make up the mammals database to recreate it.  First let's read in the
files.

```{r Generate a dataframe, purl=FALSE}
species <- read.csv("data/species.csv")
surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")
```

We have some data now. Next we'll need to create a database and add our tables.

```{r create database, purl=FALSE}
myDB <- "portalR.db"
myConn <- dbConnect(drv = SQLite(), dbname= myDB)
dbListTables(myConn)
```

By opening up a connection, you've created a new database.  If you list the
tables you'll see that the database is empty. Next we'll add the dataframes we
just read in to the database.

```{r add data, purl=FALSE}
dbWriteTable(myConn,"species",species)
dbListTables(myConn)
dbGetQuery(myConn,"SELECT * from species limit 10")
```

If you check the location of your database you'll see that data is automatically
being written to disk. Not only does R and RSQLite provide easy ways to query
existing databases, it also allows you to easily create your own databases from
flat files.

> ## Challenge
>
> Add the remaining tables to the existing database. Query your database from
> SQLite manager just to verify that you're adding data.

> ## Extra Challenge
>
> Run some of your queries from earlier in the lesson to verify that you have
> faithfully recreated the mammals database.