Merge branch 'gh-pages' of https://github.com/SISBID/Module1 into gh-pages
SISBID · Jul 17, 2019 · 66e7502
2 parents 386d86a + 08b1fa3
commit 66e7502
Showing 4 changed files with 124 additions and 17 deletions.
diff --git a/labs/databases-key.Rmd b/labs/databases-key.Rmd
@@ -0,0 +1,99 @@
+---
+title: "Databases lab"
+author: "Jeff Leek"
+date: "July 12, 2016"
+output: html_document
+---
+
+1. Download and load the nycflights data with the command `install.packages('nycflights13')` and `library(nycflights13)`.
+
+```{r}
+install.packages("nycflights13")
+library(nycflights13)
+```
+
+
+2. Use the `pryr` package to figure out the size of the `flights` object. 
+
+```{r}
+pryr::object_size(flights)
+```
+
+
+3. Create a sqlite database, then add a table "flights" with the flights data from this package.
+
+```{r}
+my_flights_db <- src_sqlite("my_flights_db.sqlite3", 
+create = TRUE)
+flights_sqlite <- copy_to(my_flights_db, 
+flights, temporary = FALSE,overwrite = TRUE)
+```
+
+
+4. Inspect the tables using the `src_tbls` command to make sure the copying happened correctly.
+
+```{r}
+src_tbls(my_flights_db)
+my_flights_db %>% tbl("flights")
+```
+
+
+5. Find the average delay time for American Airlines (hint: the abbreviation is AA).
+
+```{r}
+ave_delay = my_flights_db %>%
+  tbl("flights") %>%
+  filter(carrier == "AA") %>%
+  summarise(ave_delay = mean(dep_delay))
+```
+
+
+
+6. How long does it take to collect the results of your computation for 5? 
+
+```{r}
+system.time(my_flights_db %>%
+  tbl("flights") %>%
+  filter(carrier == "AA") %>%
+  summarise(ave_delay = mean(dep_delay)) %>% collect())
+```
+
+
+7. Can you figure out the average delay time for each airline? 
+
+```{r}
+ave_delay = my_flights_db %>%
+  tbl("flights") %>%
+  group_by(carrier) %>%
+  summarise(ave_delay = mean(dep_delay))
+```
+
+8. Can you add a variable for average delay by carrier to the database? 
+
+```{r}
+
+### Doesn't work!! No mutate 
+#ave_delay = my_flights_db %>%
+#  tbl("flights") %>%
+#  group_by(carrier) %>%
+#  mutate(ave_delay = mean(dep_delay))
+
+ave_delay = my_flights_db %>%
+  tbl("flights") %>%
+  group_by(carrier) %>%
+  summarise(ave_delay = mean(dep_delay)) %>%
+  collect()
+
+
+### Doesn't work! Have to copy it over
+
+# my_flights_db %>%
+#  tbl("flights") %>%
+#   left_join(ave_delay)
+
+
+ my_flights_db %>%
+  tbl("flights") %>%
+   left_join(ave_delay,copy=TRUE)
+  
+```
diff --git a/labs/databases-lab.Rmd b/labs/databases-lab.Rmd
@@ -15,8 +15,8 @@ output: html_document
 
 5. Find the average delay time for American Airlines (hint: the abbreviation is AA).
 
-6. Can you add a variable for delay time to the database? 
+6. How long does it take to collect the results of your computation for 5? 
 
-7. How long does it take to collect the results of your computation for 5? 
+7. Can you figure out the average delay time for each airline? 
 
-8. Can you figure out the average delay time for each airline? 
+8. Can you add a variable for delay time to the database? 
diff --git a/labs/dplyr-key.Rmd b/labs/dplyr-key.Rmd
@@ -19,7 +19,7 @@ library(janitor)
 # Have to skip one row because there is an extra header
 kg = read_excel("1000genomes.xlsx",sheet=4,skip=1)
 # subset to just low coverage
-kg = kg[,1:7]
+kg = kg %>% select(1:7)
 
 
 # make column names easier to handle
@@ -36,7 +36,7 @@ kg %>% group_by(platform_4) %>% summarize(sum(total_sequence_5))
 
 5. 
 ```{r}
-kg %>% group_by(center_3) %>% summarize(sum(total_sequence_5))
+kg %>% group_by(center_3) %>% summarize( center_total =  sum(total_sequence_5))
 ```
 
 6. 

diff --git a/labs/merging-lab-key.Rmd b/labs/merging-lab-key.Rmd
@@ -18,7 +18,7 @@ library(janitor)
 # Have to skip one row because there is an extra header
 kg_s4 = read_excel("1000genomes.xlsx",sheet=4,skip=1)
 # subset to just low coverage
-kg_s4 = kg_s4[,1:7]
+kg_s4 = kg_s4 %>% select(1:7)
 kg_s4 = kg_s4 %>% clean_names()
 dim(kg_s4)
 
@@ -40,6 +40,11 @@ sj = semi_join(kg_s4,kg_s1)
 ## left join
 lj = left_join(kg_s4,kg_s1)
 
+## How I actually write this
+
+lj = kg_s4 %>%
+  left_join(kg_s1)
+
 ## outer/full join
 oj = merge(kg_s4,kg_s1,all=TRUE)
 ```
@@ -52,22 +57,25 @@ dim(lj)
 dim(oj)
 ```
 
-7. They are the same 
+7. They are the same but different order
 
 ```{r}
 ## Check if names are the same
-sum(names(lj)==names(oj))
-## Check if values that aren't NA are the same
-sum(lj==oj,na.rm=T)
-## Check if NAs are the same
-sum(is.na(lj)==is.na(oj),na.rm=T)
+sum(names(lj) %in% names(oj))
+
 ```
 
-8. They are not the same because they have different dimensions. 
+
+8.
+
 ```{r}
-lj = left_join(kg_s4,kg_s1)
-dim(lj)
-lj2 = left_join(kg_s1,kg_s4)
-dim(lj2)
+
+lj = kg_s4 %>%
+  left_join(kg_s1)
+
+lj2 = kg_s1 %>%
+  left_join(kg_s4)
 ```
 
+
+