forked from datacarpentry/R-ecology-lesson
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path06-r-and-sql.Rmd
239 lines (191 loc) · 7.96 KB
/
06-r-and-sql.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "SQL databases and R"
author: Data Carpentry contributors
---
```{r, echo=FALSE, purl=FALSE, message = FALSE, results='hide'}
source("setup.R")
if (file.exists("portalR.db")) file.remove("portalR.db")
```
> ## Learning Objectives
>
> By the end of this lesson the learner will:
>
> * Know how to connect R to an existing database
> * Be able to use SQL syntax inside R to to query a database and return a subset of data
> * Be able to use R and SQL to create a new database from existing csv files.
> * Understand that scripted database interactions enhance the reproducibility of their analysis
------------
# Introduction
Interacting with databases through scripted languages can offer advantages over
querying databases via a GUI interface. A GUI interface for your database is
easier to use and allows the quick viewing of adhoc queries. Querying a
database with a programmatic interface (in this case R, but it could be any
language) are slightly more complicated. However the trade-off is that data
manipulations are preserved in the code. Aggregations, summaries and other
database operations are preserved. Therefore those pre-analysis data
manipulation steps are not lost and can be reproduced later by yourself or
others.
# Connecting R to sqlite databases
R can connect to databases through a number of packages. In our case we will
use [RSQLite](http://cran.r-project.org/web/packages/RSQLite/index.html) to
connect to existing SQLite3 databases. However you should be able to connect to
almost any database in R via
[JDBC](http://cran.r-project.org/web/packages/RJDBC/index.html)
or [ODBC](http://cran.r-project.org/web/packages/RODBC/index.html), or specific
database packages (such as we are doing, or
[MySQL](http://cran.r-project.org/web/packages/RMySQL/index.html) ).
To begin these exercises we'll connect to the mammal database you've just
created. We'll need the RSQLite package so be sure to install it
first. `install.packages('RSQLite')`.
```{r connect, purl=FALSE}
library(RSQLite)
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
```
Now we're connected to our database, let's explore the table structure in the
database. Remember we could see the list of tables in the SQLite Firefox
gui. In R, we'll need to use SQL commands.
# Running SQL queries from R
We can view information about the database structure which includes a list of
all tables like this:
```{r Get Table information, eval=FALSE}
dbGetQuery(conn, "SELECT type, tbl_name FROM sqlite_master")
```
The `RSQLite` package also has functions that can be used to list both tables
and fields within a table. Here you can see the types and names of fields and
get a count of records.
```{r table details, purl=FALSE}
dbListTables(conn)
#get list of fields in the surveys table
dbListFields(conn,"surveys")
dbGetQuery(conn,"SELECT count(*) FROM surveys")
```
It's good practice to always close a connection that you open in R. Let's do
that next. Note that once you've closed a connection, you will have to open
a new connection to query and import the data again.
```{r close Connection, purl=FALSE}
dbDisconnect(conn)
```
We've now covered the basics of connecting to a database and exploring its
basic structure. From here we can write queries to access subsets of the data
within our database, using the same methods that we used in SQLite.
Let's try some basic queries from the previous lesson. Querying simply takes
a connection to a database and query as inputs and returns a dataframe with the
results.
Before we can do this, we need to re-establish a database connection.
```{r reconnect, purl=FALSE}
## Set dbname and driver out of convenience
myDB <- "data/portal_mammals.sqlite"
conn <- dbConnect(drv = SQLite(), dbname= myDB)
```
Next, we build our query. We can use the `dbGetQuery` function to run the query
and access data returned in a `data.frame` object.
```{r basic queries, purl=FALSE}
q <- 'SELECT DISTINCT year, species_id FROM surveys'
result <- dbGetQuery(conn, q)
head(result)
```
> ## Challenge
>
> 1. Write a query that returns counts of genus by `plot_id`
> 2. You can join multiple tables together in SQL using the following syntax
> where foreign key refers to your unique id (e.g., `species_id`):
>
> SELECT table.col, table.col
> FROM table1 JOIN table2
> ON table1.key = table2.key
> JOIN table3 ON table2.key = table3.key
>
> Write a query that joins the species, plot, and survey tables together. The
> query should return counts of genus by plot type. Then create a bar plot of
> your results in R.
```{r genus by type, purl=FALSE}
q <- "SELECT d.plot_type , c.genus, count(*)
FROM
(SELECT a.genus, b.plot_id
FROM species a
JOIN surveys b
ON a.species_id = b.species_id) c
JOIN plots d
ON c.plot_id = d.plot_id
GROUP BY d.plot_type,c.genus"
result <- dbGetQuery(conn,q)
head(result)
```
All we've done so far is execute the same sorts of queries that can easily be
made with a GUI. Now let's try leveraging the power of scripted queries.
Imagine you want to know how many rodents were found every other year. To get
this we'll get the range of dates from the database, sequence them by two and
make new queries.
```{r Get years of data, purl=FALSE}
yearRange <- dbGetQuery(conn,"SELECT min(year),max(year) FROM surveys")
years <- seq(yearRange[,1],yearRange[,2],by=2)
```
Next we'll build our query string using the `paste()` function.
```{r creating query, purl=FALSE}
q <- paste("
SELECT a.year,b.taxa,count(*) as count
FROM surveys a
JOIN species b
ON a.species_id = b.species_id
AND b.taxa = 'Rodent'
AND a.year in (",
paste(years,collapse=",")
,")
GROUP BY a.year, b.taxa",
sep = "" )
rCount <- dbGetQuery(conn,q)
head(rCount)
```
With the nested paste commands we were able to construct a query
programmatically, without having to type out all the years. This could also be
done with a for loop, especially if the query to be constructed is more
complicated.
Creating a SQLite DB using R
------
We can use R to create a new database and associated structure (also known as
the schema) databases from existing `csv` files. Let's recreate the mammals
database that we've been working with, in R. First let's read in the `csv`
files.
Up until now we been working with an existing data. However we can use R as a
way to build up databases from existing flat files. We'll use the flat files
that make up the mammals database to recreate it. First let's read in the
files.
```{r Generate a dataframe, purl=FALSE}
species <- read.csv("data/species.csv")
surveys <- read.csv("data/surveys.csv")
plots <- read.csv("data/plots.csv")
```
Next we'll need to create a database and add our tables.
```{r create database, purl=FALSE}
myDB <- "portalR.db"
myConn <- dbConnect(drv = SQLite(), dbname= myDB)
dbListTables(myConn)
```
By opening up a connection, you've created a new database. If you list the
tables you'll see that the database is empty. Next we'll add the dataframes we
just read in to the database.
```{r add data, purl=FALSE}
dbWriteTable(myConn,"species",species)
dbListTables(myConn)
dbGetQuery(myConn,"SELECT * from species limit 10")
```
If you check the location of our database you'll see that data is automatically
being written to disk. Not only does R and RSQLite provide easy ways to query
existing databases, it also allows you to easily create your own databases from
flat files.
If you check the location of your database you'll see that data is automatically
being written to disk. Not only does R and RSQLite provide easy ways to query
existing databases, it also allows you to easily create your own databases from
flat files.
> ## Challenge
>
> Add the remaining tables to the existing database. Open the new database
> using the Firefox SQLite manager to verify that the database was built
> successfully.
> ## Extra Challenge
>
> Run some of your queries from earlier in the lesson to verify that you have
> faithfully recreated the mammals database.
Don't forget to close your database connection when you are done with it!