forked from bborgesr/wsds2017
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpresentation.Rmd
351 lines (236 loc) · 11.8 KB
/
presentation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
---
title: "Interacting with Databases from R and Shiny"
author: "Barbara Borges Ribeiro"
date: "WSDS 2017"
output:
slidy_presentation:
font_adjustment: +1
duration: 30
transition: 0
css: assets/style.css
footer: " // <span id = email_key>email</span>: <span id = email_value>[email protected]</span> // <span id = repo_key>slides and code</span>: <span id = repo_value>github.com/bborgesr/wsds2017</span>"
---
# Show of hands!
<!-- ABSTRACT
Being able to interact with an external relational database (like MySQL) is an increasingly important skill in data science. While this process may be a lot slower than dealing with in-memory data in R, it is infinitely more scalable. In particular, it's important to know how to establish a connection to a database, how to execute safe queries using SQL (goodbye SQL injections!) and how to close the connection. These skills allow you to read and write from a remote database. In this talk, I'll go over how to do these in R, using the DBI, dplyr and pool packages. I'll place special importance on these best practices when applied to the interactive context of a Shiny app. Interacting with databases through a Shiny app allows you to build an app that lets users modify data on a database, without knowing anything about SQL or R!
This talk is meant for a wide audience. Those in the audience already familiar with databases, R and Shiny will be able to see best practices in action. Those in the audience who are generally unfamiliar with the topic will be able to see how much is currently possible and have references to all the best practices through my publicly available slides.
-->
> - Never used R
> - Beginner R user
> - Intermediate to advanced R user
<hr/>
> - Never used Shiny
> - Beginner Shiny user
> - Intermediate to advanced Shiny user
# Overview
**PART I**
- DB best practices in *R*
- `DBI` standardizes how to interact with a database
- `odbc` is a DBI backend for any database with an ODBC driver
- use `dplyr` syntax to talk to a database
- use `pool` (my package) to connect to a database from Shiny
**PART II**
- Shiny app demo: a CRUD app using `DBI`, `RSQLite`, `dplyr`, and `pool`
- Connect to a *SQLite* database from Shiny
- **C**reate, **R**ead, **U**pdate and **D**elete data from database
- See updated information using `reactivePoll`
# Databases, and its many flavors
- **Relational databases** <-- *We'll focus on these*
- **NoSQL/object oriented databases**
# Databases, and its many flavors
**Relational**-ish **databases**
- RDBMSs (relational database management systems) store data in columns and rows, which in turn make up tables
- A table in RDBMS is like a spreadsheet.
- Use *SQL*
- MySQL, PostgreSQL, SQLite
- Apache Hive and Cloudera Impala for distributed systems (relational-like)
- *R* packages: `DBI`, `odbc`, `dplyr` (and `dbplyr`), `pool`
# Databases, and its many flavors
**NoSQL/object oriented databases**
- These do not follow the table/row/column approach of RDBMS
- Good for working with large amounts of data that do not require structure
- Less concerned with storing them in ordered tables than they are with simply making them available for fast access
- MongoDB, CouchDB, HBase, Cassandra
# DBI (theory)
- `DBI` defines the generic **D**ata**B**ase **I**nterface for R. The idea is to standardize how to interact with a database from *R* (connect, disconnect, read, write and mutate data safely from *R*)
- The connection to individual DBMS is provided by other packages that import `DBI` (`DBI`-compliant backends) and implement the methods for the generics defined in `DBI`
- Current goal: ensure maximum portability and exchangeability and reduce the effort for implementing a new DBI backend (through the `DBItest` package and the [DBI specification](https://cran.r-project.org/web/packages/DBI/vignettes/spec.html))
# DBI (practice)
```{r, eval=FALSE}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
DBI::dbGetQuery(con, "SELECT count() FROM iris")
#> count()
#> 1 150
DBI::dbDisconnect(con)
```
# DBI (practice) -- *SQL* injections edition!
```{r, eval=FALSE}
sql <- "SELECT * FROM X WHERE name = ?name"
DBI::sqlInterpolate(DBI::ANSI(), sql, name = "Hadley")
#> <SQL> SELECT * FROM X WHERE name = 'Hadley'
# This is safe because the single quote has been double escaped
DBI::sqlInterpolate(DBI::ANSI(), sql, name = "H'); DROP TABLE--;")
#> <SQL> SELECT * FROM X WHERE name = 'H''); DROP TABLE--;'
```
# DBI (practice) -- *SQL* injections edition!
_bobby-tables_ from **xkcd**:
<img src="assets/bobby-tables.png" width="800px">
<!-- ![bobby-tables from xkcd](assets/bobby-tables.png) -->
# odbc (theory)
- ODBC (Open Database Connectivity) is a specification for a database API. This API is independent of any one DBMS, operating system or programming language. The functions in the ODBC API are implemented by developers of DBMS-specific drivers. ([source](https://docs.microsoft.com/en-us/sql/odbc/reference/what-is-odbc))
- The `odbc` package provides a DBI compliant backend for any database with an ODBC driver (although anyone can write a driver, most of these tend to be paid, enterprise products).
- This allows for an efficient, easy to setup connection to any database with ODBC drivers available (RStudio Server Pro will soon bundle several of these drivers including Microsoft SQL Server, Oracle, MySQL, PostgreSQL, SQLite, Cloudera Impala, Apache Hive and others).
- Recognized in the brand-new RStudio IDE "Connections" pane (*demo!*)
# _Aside_: what is a database driver?
> In a computer system, an adaptor program is required for making a connection to another system of different type. Similar to connecting a printer to a computer by using a printer driver, a DBMS (database management system) needs a database driver that enables a database connection in other systems. ([source](http://www.jdatalab.com/information_system/2017/02/16/database-driver.html))
- `odbc` acts as "middleman" driver
- Why is this useful?
# odbc (practice)
```{r, eval=FALSE}
con <- DBI::dbConnect(odbc::odbc(),
Driver = "{postgresql}",
Server = "postgresdemo.cfd8mtk93q6a.us-west-2.rds.amazonaws.com",
Port = 5432,
Database = "postgresdemo",
UID = "guest",
PWD = "guest"
)
DBI::dbGetQuery(con, "SELECT * FROM city LIMIT 2;")
#> id name countrycode district population
#> 1 1 Kabul AFG Kabol 1780000
#> 2 2 Qandahar AFG Qandahar 237500
DBI::dbDisconnect(con)
```
# dplyr (theory)
- **Idea**: use `dplyr` syntax to talk to databases (no *SQL* involved for the end user).
- `dplyr` (and the brand-new `dbplyr`) wrap and extend a lot of `DBI` methods, so that you can use `dplyr` + *R* directly to interact with your database (instead of `DBI` + *SQL*, which is what `dplyr` does for you)
- With the recent revamp, you can a LOT with `dplyr` (reading and transforming data, writing tables, querying the database)
- You can combine `DBI` and `dplyr` as much as you want!
- **Bottom line**: Especially if you're already familiar with the `dplyr` verbs (mainly, `filter()`, `select()`, `mutate()`, `group_by()`, and `summarise()`), using `dplyr` to interact with databases is a great idea.
# dplyr (practice)
```{r, eval=FALSE}
library(dplyr)
con <- DBI::dbConnect(RMySQL::MySQL(),
dbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
con %>% tbl("City") %>% head(2)
#> # Source: lazy query [?? x 5]
#> # Database: mysql 5.5.5-10.0.17-MariaDB
#> # [guest@shiny-demo (...) amazonaws.com:/shinydemo]
#> ID Name CountryCode District Population
#> <dbl> <chr> <chr> <chr> <dbl>
#> 1 1 Kabul AFG Kabol 1780000
#> 2 2 Qandahar AFG Qandahar 237500
DBI::dbDisconnect(con)
```
# pool (theory)
**Problem**: how to interact with a database from Shiny?
- Per session, there is only a single R process and potentially multiple users
- Also, establishing connections takes time and they can go down at any time
- So, you don’t want a fresh connection every for every user action (because that’s slow), and you don’t want one connection per app (because that’s unreliable)...
<hr/>
- The `pool` package allows you to manage a shared pool of connections for your app, giving you both speed (good performance) and reliability (connection management).
# pool (theory)
- `pool` is mainly important when in a Shiny app (or another interactive app with an R backend), but it can be used in other situations with no problem.
- `pool` integrates seamlessly with both `DBI` and `dplyr` (the only noticeable differences are in the create/connect and close/disconnect functions).
- Is on CRAN, as of a month ago! (maintenance release coming soon)
# pool (practice)
```{r, eval=FALSE}
library(dplyr)
pool <- pool::dbPool(RMySQL::MySQL(),
dbname = "shinydemo",
host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
username = "guest",
password = "guest"
)
pool %>% tbl("City") %>% head(2)
#> # Source: lazy query [?? x 5]
#> # Database: mysql 5.5.5-10.0.17-MariaDB
#> # [guest@shiny-demo (...) amazonaws.com:/shinydemo]
#> ID Name CountryCode District Population
#> <dbl> <chr> <chr> <chr> <dbl>
#> 1 1 Kabul AFG Kabol 1780000
#> 2 2 Qandahar AFG Qandahar 237500
pool::poolClose(pool)
```
# Resources
- All packages mentioned here are open-source and available on Github
- `DBI`, `dplyr`, `odbc`, `pool` **AND** best practices, security, authentication, examples: https://db.rstudio.com/
- `pool` (and general `shiny`): http://shiny.rstudio.com/articles/ (Databases section)
<hr/>
<!-- A lot of these things are very new and there is a commitment by many people and organizations to improve the DB ecosystem in R. The RStudio blog is one way to stay up to date with most of these changes. -->
- https://blog.rstudio.org/
- These slides + app source code: https://github.com/bborgesr/wsds2017
<hr/>
**Getting help**
- https://community.rstudio.com/c/shiny
- use the databases in this talk to try stuff out!
# Shiny app
<img src="assets/app-snap.png" width="800px">
# Shiny app: skeleton
```{r, eval=FALSE}
library(shiny)
library(shinydashboard)
library(dplyr)
library(pool)
pool <- dbPool(RSQLite::SQLite(), dbname = "db.sqlite")
tbls <- reactiveFileReader(500, NULL, "db.sqlite",
function(x) db_list_tables(pool)
)
# ui
ui <- dashboardPage(...)
# server
server <- function(input, output, session) {...}
shinyApp(ui, server)
```
# Shiny app: create table (adapted!)
```{r, eval=FALSE}
# ui (snippet)
actionButton("create", "Create table"),
textInput("tableName", "Table name"),
numericInput("ncols", "Number of columns"),
uiOutput("cols")
# server (snippet)
output$cols <- renderUI({
input$tableName
cols <- vector("list", input$ncols)
for (i in seq_len(input$ncols)) {
textInput(paste0("colName", i), "Column name"),
selectInput(paste0("colType", i), "Column type",
c(Integer = "INT", Character = "VARCHAR"))
}
cols
})
observeEvent(input$create, {
# finalCols is a list. E.g: list(ID = "INT", item = "VARCHAR", count = "INT")
db_create_table(pool, input$tableName, finalCols)
})
```
# Shiny app: read table (adapted!)
```{r, eval=FALSE}
# ui (snippet)
selectInput("tableName", "Table name", NULL),
checkboxGroupInput("select", "Choose columns to read"),
selectInput("filter", "Choose column to filter on", NULL),
checkboxGroupInput("vals", "Choose values to include"),
tableOutput("res")
# server (snippet)
observeEvent(tbls(), {
updateSelectInput(session, "tableName", choices = tbls())
})
observe({
cols <- db_query_fields(pool, input$tableName)
updateCheckboxGroupInput(session, "select", choices = cols)
})
output$res <- renderTable({
pool %>%
tbl(input$tableName) %>%
select(input$select) %>%
filter(input$filter %in% input$vals)
})
```
# Shiny app: in action!