diff --git a/README.md b/README.md index f1e42c9..4069f6f 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,9 @@ Each session contains the relevant datasets for the session, as well as the R fi easily create dynamic documents, presentations and reports from R. It combines markdown (simple formatting syntax) and embedded R code chunks that are run and can perform calculations. The R Markdown files are identified with a .Rmd suffix. See [How to use the session material](#how_to) for details on how to use R Markdown. -Within each session, the Rmd files which end with `_incomplete` are a version of the session material that is missing some code. Students should work through this version, and refer to the complete solution for answers. Session 2 contains two versions of the Rmd files. One version uses the old dplyr gather() and spread(), the other uses pivot_(). +Within each session, the Rmd files whose names end with `_incomplete` are versions of the session material with some code missing. Students should work through this version and refer to the complete version for answers. Code examples and exercises are also provided in an R script (ending with `_working_script.R`) in each session folder, along with an R script containing the solutions (ending with `_solutions.R`). + +Session 2 contains two versions of the Rmd files. One version uses the older tidyr functions gather() and spread(); the other uses the newer pivot_ functions, such as pivot_longer(). The housekeeping folder contains material that was used to develop the course and does not form part of the training. 
diff --git a/session1/intro_to_r_session1.Rmd b/session1/intro_to_r_session1.Rmd index 011c31a..b2d14f3 100644 --- a/session1/intro_to_r_session1.Rmd +++ b/session1/intro_to_r_session1.Rmd @@ -720,7 +720,7 @@ sepal_length_average <- iris %>% -In the above code, R is taking the 'iris' dataset, grouping it by Species and then (note that the data is "piped" like water into the group_by command using the pipe symbol '%>%' instead of specifying the 'iris' dataset as the first argument within the group_by command), and then outputting the mean length by species. The new mean number of Sepal.Length variable we've decided to call 'ave'. The results are saved into a new dataset called 'sepal_length_average'. +In the above code, R is taking the 'iris' dataset, grouping it by Species (note that the data is "piped" like water into the group_by command using the pipe symbol '%>%' instead of specifying the 'iris' dataset as the first argument within the group_by command), and then outputting the mean length by species. The new mean number of Sepal.Length variable we've decided to call 'ave'. The results are saved into a new dataset called 'sepal_length_average'. 
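The claim in the paragraph above, that piping `iris` into `group_by` is the same as passing it as the first argument, can be checked directly. The following is a minimal sketch rather than part of the course material; it assumes only that dplyr is installed, and the object names `piped` and `nested` are illustrative.

```r
library(dplyr)

# Piped form: iris flows into group_by(), and the grouped data into summarise()
piped <- iris %>%
  group_by(Species) %>%
  summarise(ave = mean(Sepal.Length))

# Nested form: the same calls, with the data given as the first argument
nested <- summarise(group_by(iris, Species),
                    ave = mean(Sepal.Length))

# Both forms should give one row per species with the same means
isTRUE(all.equal(piped$ave, nested$ave))
```

If the two forms are equivalent, the final line returns `TRUE`. The piped version tends to be easier to read once more than two steps are chained together.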
@@ -730,8 +730,8 @@ The pipe operator simply passes through the object on the left hand side as the ```{r results="hide"} sepal_length_average <- - summarise(group_by(iris,Species), - ave=mean(Sepal.Length)) + summarise(group_by(iris, Species), + ave = mean(Sepal.Length)) @@ -988,7 +988,7 @@ staff_all <- full_join(staff_salaries, ``` -For more information about the different sorts of joins and other data transformation functions see the 'Data Transformation Cheat Sheet' at: https://www.rstudio.com/resources/cheatsheets/ +For more information about the different sorts of joins and other data transformation functions see the 'Data Transformation Cheat Sheet' at: https://rstudio.github.io/cheatsheets/html/data-transformation.html ### 4.2 Exporting data @@ -1001,7 +1001,7 @@ A command to export data into csv format is write_csv from the readr package (th ```{r results="hide"} -write_csv(iris_petals, path = "iris_petals.csv") +write_csv(iris_petals, file = "iris_petals.csv") ``` @@ -1039,10 +1039,9 @@ There are lots of resources that can help you develop your R knowledge, but belo + DataCamp is a website which hosts multiple online courses that teach coding. Their 'Introduction to R' course is free to complete and provides a broader overview in the basic concepts for coding in R. A link to the course can be found here: https://www.datacamp.com/courses/free-introduction-to-r. -+ Another good resource is the 'R for Data Science' online book: [r4ds.had.co.nz/](r4ds.had.co.nz/), written by Hadley Wickham, who is a data scientist at RStudio, who developed the tidyverse package that we introduced earlier. It gives a really good overview of R and how his package works with it. ++ Another good resource is the 'R for Data Science' online book: [https://r4ds.hadley.nz/](https://r4ds.hadley.nz/), written by Hadley Wickham, who is a data scientist at RStudio, who developed the tidyverse package that we introduced earlier. 
It gives a really good overview of R and how his package works with it. -+ RStudio has also developed a list of 'cheatsheets' which give quick overviews of the functions contained in different packages, which can be quickly referred to: https://www.rstudio.com/resources/cheatsheets/ Some can be accessed directly through the top menu help > Cheatsheets e.g. 'Data Transformation with dplyr'. ++ RStudio has also developed a list of 'cheatsheets' which give quick overviews of the functions contained in different packages, which can be quickly referred to: https://posit.co/resources/cheatsheets/ Some can be accessed directly through the top menu help > Cheatsheets e.g. 'Data Transformation with dplyr'. - -Further resources can be found on Saltire Analytical Professions pages (Analytical Professions > Analytical Tools > Analytical Software) \ No newline at end of file +Further resources can be found on Stats group sharepoint site: https://scotsconnect.sharepoint.com/sites/StatisticsGroup-Org-SG/SitePages/R-Resources.aspx \ No newline at end of file diff --git a/session1/intro_to_r_session1_incomplete.Rmd b/session1/intro_to_r_session1_incomplete.Rmd index 709c9eb..3482d14 100644 --- a/session1/intro_to_r_session1_incomplete.Rmd +++ b/session1/intro_to_r_session1_incomplete.Rmd @@ -229,15 +229,13 @@ Of course, you can also use Google, Stack Overflow, R-Yammer or the R-user group ### 1.7 Exercises +Open the script intro_to_r_session1_working_script.R and complete the following: +1. Create a new value called y which is equal to 17. -1. Create a new R script file in which you can store all commands you make during this exercise. Save it as 'Intro_R_Exercises.R'. - -2. Create a new value called y which is equal to 17. - -3. Now multiply y by 78. What answer do you get? +2. Now multiply y by 78. What answer do you get? - 4. What does the command "head" do? +3. What does the command "head" do? 
@@ -721,7 +719,7 @@ sepal_length_average <- iris %>% -In the above code, R is taking the 'iris' dataset, grouping it by Species and then (note that the data is "piped" like water into the group_by command using the pipe symbol '%>%' instead of specifying the 'iris' dataset as the first argument within the group_by command), and then outputting the mean length by species. The new mean number of Sepal.Length variable we've decided to call 'ave'. The results are saved into a new dataset called 'sepal_length_average'. +In the above code, R is taking the 'iris' dataset, grouping it by Species (note that the data is "piped" like water into the group_by command using the pipe symbol '%>%' instead of specifying the 'iris' dataset as the first argument within the group_by command), and then outputting the mean length by species. The new mean number of Sepal.Length variable we've decided to call 'ave'. The results are saved into a new dataset called 'sepal_length_average'. @@ -731,8 +729,8 @@ The pipe operator simply passes through the object on the left hand side as the ```{r results="hide"} sepal_length_average <- - summarise(group_by(iris,Species), - ave=mean(Sepal.Length)) + summarise(group_by(iris, Species), + ave = mean(Sepal.Length)) @@ -986,7 +984,7 @@ staff_all <- ``` -For more information about the different sorts of joins and other data transformation functions see the 'Data Transformation Cheat Sheet' at: https://www.rstudio.com/resources/cheatsheets/ +For more information about the different sorts of joins and other data transformation functions see the 'Data Transformation Cheat Sheet' at: https://rstudio.github.io/cheatsheets/html/data-transformation.html ### 4.2 Exporting data @@ -999,7 +997,7 @@ A command to export data into csv format is write_csv from the readr package (th ```{r results="hide"} -write_csv(iris_petals, path = "iris_petals.csv") +write_csv(iris_petals, file = "iris_petals.csv") ``` @@ -1035,10 +1033,9 @@ There are lots of resources that can 
help you develop your R knowledge, but belo + DataCamp is a website which hosts multiple online courses that teach coding. Their 'Introduction to R' course is free to complete and provides a broader overview in the basic concepts for coding in R. A link to the course can be found here: https://www.datacamp.com/courses/free-introduction-to-r. -+ Another good resource is the 'R for Data Science' online book: [r4ds.had.co.nz/](r4ds.had.co.nz/), written by Hadley Wickham, who is a data scientist at RStudio, who developed the tidyverse package that we introduced earlier. It gives a really good overview of R and how his package works with it. - -+ RStudio has also developed a list of 'cheatsheets' which give quick overviews of the functions contained in different packages, which can be quickly referred to: https://www.rstudio.com/resources/cheatsheets/ Some can be accessed directly through the top menu help > Cheatsheets e.g. 'Data Transformation with dplyr'. ++ Another good resource is the 'R for Data Science' online book: [https://r4ds.hadley.nz/](https://r4ds.hadley.nz/), written by Hadley Wickham, who is a data scientist at RStudio, who developed the tidyverse package that we introduced earlier. It gives a really good overview of R and how his package works with it. ++ RStudio has also developed a list of 'cheatsheets' which give quick overviews of the functions contained in different packages, which can be quickly referred to: https://posit.co/resources/cheatsheets/ Some can be accessed directly through the top menu help > Cheatsheets e.g. 'Data Transformation with dplyr'. 
-Further resources can be found on Saltire Analytical Professions pages (Analytical Professions > Analytical Tools > Analytical Software) \ No newline at end of file +Further resources can be found on Stats group sharepoint site: https://scotsconnect.sharepoint.com/sites/StatisticsGroup-Org-SG/SitePages/R-Resources.aspx \ No newline at end of file diff --git a/session1/intro_to_r_session1_solutions.R b/session1/intro_to_r_session1_solutions.R new file mode 100644 index 0000000..00b7323 --- /dev/null +++ b/session1/intro_to_r_session1_solutions.R @@ -0,0 +1,301 @@ + +# Session 1 --------------------------------------------------------------- + + +## Section 1: Introduction ------------------------------------------------ + + +### 1.4 Examples ---------------------------------------------------------- + +#1.4.1 +x <- 3 + +#1.4.2 +x + +#1.4.3 +x <- c(3, 2, 4) + +### 1.6 Examples ---------------------------------------------------------- + +#1.6.1 +?mean + +### 1.7 Exercises --------------------------------------------------------- + + +# 1. Create a new value called y which is equal to 17. + +y <- 17 + +# 2. Now multiply y by 78. What answer do you get? + +y * 78 #1326 + +# 3. What does the command "head" do? 
+ +?head + +## Section 2: Data Processing --------------------------------------------- + +### 2.1 Examples ---------------------------------------------------------- + +#2.1.1 +getwd() + +#2.1.2 +setwd("C:/Users/u446122/Documents/OFFLINE/Training/intro_to_r/intro_to_r") + + +### 2.2 Examples ---------------------------------------------------------- + +#2.2.1 +library("tidyverse") + +#2.2.2 +help(package=dplyr) + +### 2.3 Examples ---------------------------------------------------------- + +#2.3.1 +chick_weights <- read_csv("chickweights.csv") + +#2.3.2 +data(iris) +iris + +### 2.4 Examples ---------------------------------------------------------- + +#2.4.1 +View(iris) + +#2.4.2 +str(iris) + +#2.4.3 +summary(iris) + +#2.4.4 +iris[10,4] + +#2.4.5 +iris[c(10, 12),4] + +#2.4.6 +iris[10,1:3] + +#2.4.7 +iris[-10,1:3] + +#2.4.8 +iris$Species + +### 2.5 Examples ---------------------------------------------------------- + +#2.5.1 +class(iris$Sepal.Length) + +#2.5.2 +iris$Sepal.Length <- as.integer(iris$Sepal.Length) + +#2.5.3 +iris$Sepal.Length <- as.numeric(iris$Sepal.Length) + +#2.5.4 +data(iris) +iris + +#2.5.5 +levels(iris$Species) + +#2.5.6 +iris$Species <- relevel(iris$Species, "versicolor") + +#2.5.7 +iris$Species <- as.character(iris$Species) + +#2.5.8 +iris$Species <- as.factor(iris$Species) + + +### 2.7 Exercises --------------------------------------------------------- + +# reload the iris dataset + +data(iris) + +# 1. Find the mean and median of the Sepal.Length variable in the iris dataset. + +mean(iris$Sepal.Length) + +median(iris$Sepal.Length) + +# 2. Find the max and min for the Petal.Length variable in the iris dataset. 
+ +max(iris$Petal.Length) + +min(iris$Petal.Length) + +## Section 3: Data wrangling and 'group by' calculations ------------------ + +### 3.1 Examples ---------------------------------------------------------- + +#3.1.1 +setosa_sepal_leng <- filter(iris, Species == "setosa" & Petal.Length < 1.5) + +#3.1.2 +setosa_sepal_leng_av <- summarise(setosa_sepal_leng, ave = mean(Sepal.Length)) + +#3.1.3 +setosa_sepal_leng_av <- iris %>% + filter(Species == "setosa" & Petal.Length < 1.5) %>% + summarise(ave = mean(Sepal.Length)) + +#3.1.4 +setosa_sepal_leng_av <- summarise(filter(iris, Species == "setosa" & Petal.Length < 1.5), ave = mean(Sepal.Length)) + + +### 3.2 Examples ---------------------------------------------------------- + +#3.2.1 +?dplyr::summarise +?group_by + +#3.2.2 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length)) + +#3.2.3 +sepal_length_average <- + summarise(group_by(iris, Species), + ave = mean(Sepal.Length)) + +#3.2.4 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length), + count=n()) + +#3.2.5 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length), count=n()) %>% + ungroup() + +### 3.3 Examples ---------------------------------------------------------- + +#3.3.1 +iris_no_sepal_length <- iris %>% + select(-Sepal.Length) + +#3.3.2 +iris_petals <- iris %>% + select(-c(Sepal.Length, Sepal.Width)) + +### 3.4 Examples ---------------------------------------------------------- + +#3.4.1 +iris_petals <- iris_petals %>% + rename(P.Length = Petal.Length, + P.Width = Petal.Width) + +### 3.5 Examples ---------------------------------------------------------- + +#3.5.1 +?mutate + +#3.5.2 +iris_petals <- iris_petals %>% + mutate(P.Area = P.Length * P.Width) + +### 3.6 Examples ---------------------------------------------------------- + +#3.6.1 +iris <- iris %>% + mutate(small_p_length = if_else(Petal.Length<2,1,0)) + + +### 3.7 Exercises 
--------------------------------------------------------- + +# reload the iris dataset and load library + +library(tidyverse) +data(iris) + +# 1. Using group_by and summarise, calculate the average and max petal width for each species. + +iris_summary <- iris %>% + group_by(Species) %>% + summarise(ave_petal_width = mean(Petal.Width), + max_petal_width = max(Petal.Width)) %>% + ungroup() + +# 2. Using select and filter to produce a table of sepal length and widths for irises where the sepal width is greater than 3. + +iris_filtered <- iris %>% + select(Sepal.Length, Sepal.Width) %>% + filter(Sepal.Width > 3) + +## Section 4: Merging data, missing values and exporting ------------------ + +### 4.1 Examples ---------------------------------------------------------- + +#4.1.1 +staff_salaries <- read_csv("staff_salaries.csv") +staff_sickness <- read_csv("staff_sickness.csv") + +#4.1.2 +staff_sickness <- staff_sickness %>% + rename(staff_id = staff_id_no) + +#4.1.3 +staff_merge <- inner_join(staff_salaries, + staff_sickness, + by="staff_id") + +#4.1.4 +staff_salary_preserved_with_sickness_joined <- + left_join(staff_salaries, + staff_sickness, + by="staff_id") + +#4.1.5 +staff_sickness_preserved_with_sickness_joined <- + right_join(staff_salaries, + staff_sickness, + by="staff_id") + +#4.1.6 +staff_sickness_preserved_with_salaries_joined <- + left_join(staff_sickness, + staff_salaries, + by="staff_id") + +#4.1.7 +staff_all <- full_join(staff_salaries, + staff_sickness, + by="staff_id") + + +### 4.2 Examples ---------------------------------------------------------- + +#4.2.1 +write_csv(iris_petals, file = "iris_petals.csv") + + +### 4.3 Exercises --------------------------------------------------------- + +# reload the iris dataset, and load tidyverse + +library(tidyverse) +data(iris) + +# 1. Create a new dataset called iris_sepals which includes the species, sepal length and sepal width + +iris_sepals <- iris %>% + select(Species, Sepal.Length, Sepal.Width) + +# 2. 
Export the dataset iris_sepals to a csv file. + +write_csv(iris_sepals, file = "iris_sepals.csv") diff --git a/session1/intro_to_r_session1_working_script.R b/session1/intro_to_r_session1_working_script.R new file mode 100644 index 0000000..015c80c --- /dev/null +++ b/session1/intro_to_r_session1_working_script.R @@ -0,0 +1,306 @@ + +# Session 1 --------------------------------------------------------------- + + +## Section 1: Introduction ------------------------------------------------ + + +### 1.4 Examples ---------------------------------------------------------- + +#1.4.1 +x <- 3 + +#1.4.2 +x + +#1.4.3 +x <- c(3, 2, 4) + +### 1.6 Examples ---------------------------------------------------------- + +#1.6.1 +?mean + +### 1.7 Exercises --------------------------------------------------------- + + +# 1. Create a new value called y which is equal to 17. + + + + + + +# 2. Now multiply y by 78. What answer do you get? + + + + + + +# 3. What does the command "head" do? + + + + + + +## Section 2: Data Processing --------------------------------------------- + +### 2.1 Examples ---------------------------------------------------------- + +#2.1.1 +getwd() + +#2.1.2 +setwd("C:/Users/u446122/Documents/OFFLINE/Training/intro_to_r/intro_to_r") + + +### 2.2 Examples ---------------------------------------------------------- + +#2.2.1 +library("tidyverse") + +#2.2.2 +help(package=dplyr) + +### 2.3 Examples ---------------------------------------------------------- + +#2.3.1 +chick_weights <- read_csv("chickweights.csv") + +#2.3.2 +data(iris) +iris + +### 2.4 Examples ---------------------------------------------------------- + +#2.4.1 +View(iris) + +#2.4.2 +str(iris) + +#2.4.3 +summary(iris) + +#2.4.4 +iris[10,4] + +#2.4.5 +iris[c(10, 12),4] + +#2.4.6 +iris[10,1:3] + +#2.4.7 +iris[-10,1:3] + +#2.4.8 +iris$Species + +### 2.5 Examples ---------------------------------------------------------- + +#2.5.1 +class(iris$Sepal.Length) + +#2.5.2 +iris$Sepal.Length <- 
as.integer(iris$Sepal.Length) + +#2.5.3 +iris$Sepal.Length <- as.numeric(iris$Sepal.Length) + +#2.5.4 +data(iris) +iris + +#2.5.5 +levels(iris$Species) + +#2.5.6 +iris$Species <- relevel(iris$Species, "versicolor") + +#2.5.7 +iris$Species <- as.character(iris$Species) + +#2.5.8 +iris$Species <- as.factor(iris$Species) + + +### 2.7 Exercises --------------------------------------------------------- + +# reload the iris dataset + +data(iris) + +# 1. Find the mean and median of the Sepal.Length variable in the iris dataset. + + + + + + +# 2. Find the max and min for the Petal.Length variable in the iris dataset. + + + + + + +## Section 3: Data wrangling and 'group by' calculations ------------------ + +### 3.1 Examples ---------------------------------------------------------- + +#3.1.1 +setosa_sepal_leng <- filter(iris, Species == "setosa" & Petal.Length < 1.5) + +#3.1.2 +setosa_sepal_leng_av <- summarise(setosa_sepal_leng, ave = mean(Sepal.Length)) + +#3.1.3 +setosa_sepal_leng_av <- iris %>% + filter(Species == "setosa" & Petal.Length < 1.5) %>% + summarise(ave = mean(Sepal.Length)) + +#3.1.4 +setosa_sepal_leng_av <- summarise(filter(iris, Species == "setosa" & Petal.Length < 1.5), ave = mean(Sepal.Length)) + + +### 3.2 Examples ---------------------------------------------------------- + +#3.2.1 +?dplyr::summarise +?group_by + +#3.2.2 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length)) + +#3.2.3 +sepal_length_average <- + summarise(group_by(iris, Species), + ave = mean(Sepal.Length)) + +#3.2.4 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length), + count=n()) + +#3.2.5 +sepal_length_average <- iris %>% + group_by(Species) %>% + summarise(ave = mean(Sepal.Length), count=n()) %>% + ungroup() + + +### 3.3 Examples (incomplete) ---------------------------------------------- + +#3.3.1 +iris_no_sepal_length <- iris %>% + select() + +#3.3.2 +iris_petals <- iris %>% + select() + + +### 
3.4 Examples (incomplete) ---------------------------------------------- + +#3.4.1 +iris_petals <- iris_petals %>% + rename( + + ) + +### 3.5 Examples (incomplete) ---------------------------------------------- + +#3.5.1 +?mutate + +#3.5.2 +iris_petals <- iris_petals %>% + mutate() + + +### 3.6 Examples (incomplete) ---------------------------------------------- + +#3.6.1 +iris <- iris %>% + mutate(small_p_length = if_else()) + + +### 3.7 Exercises --------------------------------------------------------- + +# reload the iris dataset + +data(iris) + +# 1. Using group_by and summarise, calculate the average and max petal width for each species. + +iris_summary <- + + + + +# 2. Using select and filter to produce a table of sepal length and widths for irises where the sepal width is greater than 3. + +iris_filtered <- + + + + +## Section 4: Merging data, missing values and exporting ------------------ + +### 4.1 Examples (incomplete) ---------------------------------------------- + +#4.1.1 +staff_salaries <- read_csv("staff_salaries.csv") +staff_sickness <- read_csv("staff_sickness.csv") + +#4.1.2 +staff_sickness <- staff_sickness %>% + rename() + +#4.1.3 +staff_merge <- inner_join(staff_salaries, + staff_sickness, + by="staff_id") + +#4.1.4 +staff_salary_preserved_with_sickness_joined <- + left_join( + + ) + +#4.1.5 +staff_sickness_preserved_with_sickness_joined <- + +#4.1.6 +staff_sickness_preserved_with_salaries_joined <- + + +#4.1.7 +staff_all <- + +### 4.2 Examples ---------------------------------------------------------- + +#4.2.1 +write_csv(iris_petals, file = "iris_petals.csv") + + +### 4.3 Exercises --------------------------------------------------------- + +# reload the iris dataset, and load tidyverse + +library(tidyverse) +data(iris) + +# 1. Create a new dataset called iris_sepals which includes the species, sepal length and sepal width + +iris_sepals <- + +# 2. Export the dataset iris_sepals to a csv file. 
+ +write_csv() \ No newline at end of file diff --git a/session2/intro_to_R_session2.Rmd b/session2/intro_to_R_session2.Rmd index b4746d0..16ece83 100644 --- a/session2/intro_to_R_session2.Rmd +++ b/session2/intro_to_R_session2.Rmd @@ -9,6 +9,8 @@ output: html_document knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE) ``` +## 1. Introduction + In this session, we will analyse a mock dataset. Social Security Scotland administers four benefits: Benefit A - Benefit D. You have collected data summarising the monthly applications for each of these benefits between January 2020 and January 2025. @@ -24,10 +26,13 @@ You need to: #Read in libraries library(tidyverse) +library(lubridate) ``` -# Import and explore the data +## 2. Exploring the data + +### 2.1 Import the data The data is saved in the session 2 folder, in a file "benefits.csv". We import the data to create a tibble called benefits. We view the first few lines to see what the data is that is included @@ -43,7 +48,7 @@ View(benefits) ``` -## Data exploration +### 2.2 Data exploration First, we look at the structure of benefits. Recall the function str() @@ -83,7 +88,7 @@ sum(is.na(benefits)) If you want to return the index of datapoints with missing data, you can use the function which() -## Exploratory plots +### 2.3 Exploratory plots Now we can create quick exploratory plots of the data. Look at the help function for plot() @@ -98,9 +103,7 @@ plot(benefits$date, benefits$D) ``` -# Wrangling and plotting the applications numbers over time - - +## 3. Data Wrangling R has powerful (and pretty plotting) functionality in the library ggplot2, which is part of the tidyverse environment. All of tidyverse uses long data. In long format, each row corresponds to data from one measurement, and repeated measurements are on different rows. 
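To make the idea of long format concrete, here is a small sketch that reshapes a toy wide table with `pivot_longer()`. The data frame `wide` and its values are made up for illustration and are not the course's benefits.csv; the column names `benefit` and `apps` simply mirror the ones used later in the session.

```r
library(tidyr)

# Hypothetical wide data: one row per month, one column per benefit
wide <- data.frame(
  date = as.Date(c("2020-01-01", "2020-02-01")),
  A = c(10, 12),
  B = c(20, 18)
)

# Long format: one row per (date, benefit) measurement
long <- pivot_longer(wide,
                     cols = -date,
                     names_to = "benefit",
                     values_to = "apps")

long
```

The two rows and three columns of `wide` become four rows and three columns in `long`: each measurement now sits on its own row, with the benefit name stored as a value rather than as a column header.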
@@ -109,7 +112,7 @@ pivot_longer() transforms a dataset so that it is has more rows and fewer column (Note, pivot_longer and a related function, pivot_wider, became available in a relatively recent version of R. Older code may use gather() and spread()) -## Data wrangling +### 3.1 Tidying the Data Our dataset is currently in wide format, with the number of applications for each benefit in different columns. @@ -179,7 +182,10 @@ benefit_long <- bind_rows(benefit_long, benefit_total) ``` -## Plotting +## 4. Plotting with ggplot + +### 4.1 ggplot syntax + The syntax for ggplot has the following elements: 1. create a ggplot object which is defined by the dataset, and the aesthetic (including the variables). @@ -257,6 +263,8 @@ time_series_plot5 ``` +### 4.2 Exercise + Finally, we could perform the wrangling and plotting in a concise chunk ```{r AllAnalysis} @@ -306,11 +314,11 @@ time_series_plot ``` -# Plotting the mean applications per year as a bar graph +### 4.3 Plotting the mean applications per year as a bar graph To take the average by year, we need to create a factor variable for the year which we can use for grouping. The lubridate package which comes with tidyverse has a convenient function year(). We will need to mutate benefit_long to append this factor variable -## Wrangling +#### Wrangling ```{r CreateYearIndicator} @@ -340,7 +348,7 @@ head(benefit_by_year) ``` -## Plotting bar graph +#### Plotting bar graph Now we create a ggplot object. The x-axis will show the year, and will use colour for the benefit. @@ -374,7 +382,7 @@ bar_graph_plot ``` -# Further training +## 5. Further training We have only touched on the capabilities of ggplot2. 
A good set of tutorials on ggplot can be found at http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html diff --git a/session2/intro_to_R_session2_incomplete.Rmd b/session2/intro_to_R_session2_incomplete.Rmd index 2b92b95..e4bf19b 100644 --- a/session2/intro_to_R_session2_incomplete.Rmd +++ b/session2/intro_to_R_session2_incomplete.Rmd @@ -6,9 +6,11 @@ output: html_document --- ```{r setup, include=FALSE} -knitr::opts_chunk$set(echo = TRUE, eval = TRUE, warning = FALSE, message = FALSE) +knitr::opts_chunk$set(echo = TRUE, eval = FALSE, warning = FALSE, message = FALSE) ``` +## 1. Introduction + In this session, we will analyse a mock dataset. Social Security Scotland administers four benefits: Benefit A - Benefit D. You have collected data summarising the monthly applications for each of these benefits between January 2020 and January 2025. @@ -26,7 +28,9 @@ library(tidyverse) ``` -# Import and explore the data +## 2. Exploring the data + +### 2.1 Import the data The data is saved in the session 2 folder, in a file "benefits.csv". We import the data to create a tibble called benefits. We view the first few lines to see what the data is that is included @@ -42,7 +46,7 @@ benefits <- ``` -## Data exploration +### 2.2 Data exploration First, we look at the structure of benefits. Recall the function str() @@ -82,7 +86,7 @@ How can we get the total number of missing values in the dataset? If you want to return the index of datapoints with missing data, you can use the function which() -## Exploratory plots +### 2.3 Exploratory plots Now we can create quick exploratory plots of the data. Look at the help function for plot() @@ -97,9 +101,7 @@ plot() ``` -# Wrangling and plotting the applications numbers over time - - +## 3. Data Wrangling R has powerful (and pretty plotting) functionality in the library ggplot2, which is part of the tidyverse environment. All of tidyverse uses long data. 
In long format, each row corresponds to data from one measurement, and repeated measurements are on different rows. @@ -108,7 +110,7 @@ pivot_longer() transforms a dataset so that it is has more rows and fewer column (Note, pivot_longer and a related function, pivot_wider, became available in a relatively recent version of R. Older code may use gather() and spread()) -## Data wrangling +### 3.1 Tidying the Data Our dataset is currently in wide format, with the number of applications for each benefit in different columns. @@ -180,7 +182,10 @@ benefit_long <- ``` -## Plotting +## 4. Plotting with ggplot + +### 4.1 ggplot syntax + The syntax for ggplot has the following elements: 1. create a ggplot object which is defined by the dataset, and the aesthetic (including the variables). 2. Specify geometries (i.e. plot types). Geometries such as plotting points, lines and bars are available. The specification of geometries enables one to plot various plot types on the same set of axes. See help(package = ggplot2) under the letter g for the various geometries available. @@ -190,8 +195,8 @@ Here, we will plot the data with a point geometry ```{r TimeSeriesPlot1} time_series_plot <- ggplot(data = benefit_long, - aes(x=date, y = apps)) + - #Add the geometry + aes(x = date, y = apps)) + + #Add the geometry geom_point() #call the object to show the plot @@ -259,7 +264,7 @@ time_series_plot5 Finally, we could perform the wrangling and plotting in a concise chunk -Exercise: +### 4.2 Exercise See if you can accomplish the same thing as the chunks above to produce the time_series_plot5 object in one chunk @@ -298,11 +303,11 @@ time_series_plot ``` -# Plotting the mean applications per year as a bar graph +### 4.3 Plotting the mean applications per year as a bar graph -To take the average by year, we need to create a factor variable for the year which we can use for grouping. The lubridate package which comes with tidyverse has a convenient function year(). 
We will need to mutate benefit_long to append this factor variable +#### Wrangling -## Wrangling +To take the average by year, we need to create a factor variable for the year which we can use for grouping. The lubridate package which comes with tidyverse has a convenient function year(). We will need to mutate benefit_long to append this factor variable ```{r CreateYearIndicator} @@ -331,7 +336,7 @@ head(benefit_by_year) ``` -## Plotting bar graph +#### Plotting bar graph Now we create a ggplot object. The x-axis will show the year, and will use colour for the benefit. @@ -363,7 +368,7 @@ bar_graph_plot ``` -# Further training +## 5. Further training We have only touched on the capabilities of ggplot2. A good set of tutorials on ggplot can be found at http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html diff --git a/session2/intro_to_R_session2_solutions.R b/session2/intro_to_R_session2_solutions.R new file mode 100644 index 0000000..fc2d0be --- /dev/null +++ b/session2/intro_to_R_session2_solutions.R @@ -0,0 +1,296 @@ +# Session 2 --------------------------------------------------------------- + +#Read in libraries +library(tidyverse) +library(lubridate) + +## Section 2: Exploring the Data ------------------------------------------ + +### 2.1 Examples ---------------------------------------------------------- + +#2.1.1 import data +benefits <- read_csv("./benefits.csv") + +#2.1.2 view the first few lines of the tibble +head(benefits) + +#2.1.3 view the entire tibble in a new tab +View(benefits) + + +### 2.2 Examples ---------------------------------------------------------- + +#2.2.1 data structure +str(benefits) + +#2.2.2 summary statistics +summary(benefits) + +#2.2.3 check for missing values +is.na(benefits) + +#2.2.4 total missing values +sum(is.na(benefits)) + + +### 2.3 Examples ---------------------------------------------------------- + +#2.3.1 quick plots +plot(benefits$date, benefits$A) +plot(benefits$date, benefits$B) 
+plot(benefits$date, benefits$C) +plot(benefits$date, benefits$D) + + +## Section 3: Data Wrangling ---------------------------------------------- + +### 3.1 Examples ---------------------------------------------------------- + +#3.1.1 pivot_longer documentation +?pivot_longer + +#3.1.2 wide to long +benefit_long <- pivot_longer(benefits, + cols = -date, + names_to = "benefit", + values_to = "apps") +head(benefit_long) + +#3.1.3 alternate way of specifying cols +benefit_long <- pivot_longer(benefits, + cols = 2:5, + names_to = "benefit", + values_to = "apps") +head(benefit_long) + +#3.1.4 add total +benefit_total <- benefit_long %>% + group_by(date) %>% + summarise(benefit = "Total", + apps = sum(apps)) %>% + ungroup() + +head(benefit_total) + +#3.1.5 append benefit_total to benefit_long +benefit_long <- bind_rows(benefit_long, benefit_total) + + +## Section 4: Plotting with ggplot ---------------------------------------- + +### 4.1 Examples ---------------------------------------------------------- + +#4.1.1 Time series plot +time_series_plot <- ggplot(data = benefit_long, + aes(x = date, y = apps)) + + #Add the geometry + geom_point() + +#call the object to show the plot +time_series_plot + + +#4.1.2 separate series with different colours +time_series_plot2 <- ggplot(data = benefit_long, + aes(x = date, + y = apps, + group = benefit, + colour = benefit)) + + geom_point() + +time_series_plot2 + +#4.1.3 split into panels +time_series_plot3 <- time_series_plot2 + + facet_grid(rows = vars(benefit)) + +time_series_plot3 + +#4.1.4 best fit line +time_series_plot4 <- time_series_plot3 + + geom_line() + + geom_smooth(method = "lm") #Uses a linear model for fit and confidence interval + +time_series_plot4 + +#4.1.5 apply a theme +time_series_plot5 <- time_series_plot4 + + xlab("Date") + + ylab("Application number") + + labs(colour = "Benefit") + + theme_bw() + +time_series_plot5 + +### 4.2 Exercises --------------------------------------------------------- + +#1. 
create the time_series_plot5 in a concise bit of code + +#Import +benefits <- read_csv("./benefits.csv") + +#Wide to long +benefit_long <- pivot_longer(benefits, + cols = -date, + names_to = "benefit", + values_to = "apps") +#Get total +benefit_total <- benefit_long %>% + group_by(date) %>% + summarise(apps = sum(apps), + benefit = "Total") %>% + ungroup() + +#Append benefit_total to benefit_long +benefit_long <- benefit_long %>% + bind_rows(benefit_total) + +#Plotting +time_series_plot <- ggplot(data = benefit_long, + aes(x=date, + y = apps, + group = benefit, + colour = benefit)) + + geom_point() + + facet_grid(rows = vars(benefit)) + + geom_line() + + geom_smooth(method = "lm") + + xlab("Date") + + ylab("Application number") + + labs(colour = "Benefit") + + theme_bw() + +time_series_plot + + +### 4.3 Examples ---------------------------------------------------------- + +#4.3.1 add a variable to the benefit_long tibble containing the year, and convert the variable to a factor +benefit_long <- benefit_long %>% + mutate(year = year(date)) %>% + mutate(year = as.factor(year)) #These two lines could be combined: mutate(year = as.factor(year(date))) + +#4.3.2 summarise average number of benefits by year, and add error bar max/min values +benefit_by_year <- benefit_long %>% + filter(benefit != "Total") %>% + group_by(year, benefit) %>% + summarise(average_apps = mean(apps), + error_bar_min = average_apps - sd(apps), + error_bar_max = average_apps + sd(apps) + ) + +head(benefit_by_year) + +#4.3.3 plot as a bar chart +bar_graph_plot <- ggplot(data = benefit_by_year, + aes(x = year, + y = average_apps, + colour = benefit, + group = benefit + )) + +dodge <- position_dodge(width=0.9) + +bar_graph_plot <- bar_graph_plot+ + geom_col(aes(fill = benefit), + position = dodge) + + geom_errorbar(aes(ymin = error_bar_min, + ymax = error_bar_max), + position = dodge, + width = 1 + ) + + xlab("") + + ylab("Average yearly applications") + +bar_graph_plot + +## Additional Exercises 
--------------------------------------------------- + +#1. Read in "UKgas.csv" from the `./additional_exercises/data` folder and inspect the data. +# (The data has been created from one of R datasets https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/UKgas) +UKgas <- read_csv("./additional_exercises/data/UKgas.csv") +head(UKgas) + +#2. Create a new tibble of the data in long format with a column to specify the quarter. +UKgas_l <- UKgas %>% + pivot_longer(cols = -year, + names_to = "quarter", + values_to = "gas_consumption") + +#3. Compute the mean quarterly UKgas consumption across years (Your new tibble will have four rows and 2 columns) +UKgas_by_quarter <- UKgas_l %>% + group_by(quarter) %>% + summarise(mean_quarterly_gas = mean(gas_consumption, + na.rm = TRUE)) + +#4. Compute the mean gas consumption for each year (Your tibble will have 27 rows and 2 columns) +UKgas_by_year <- UKgas_l %>% + group_by(year) %>% + summarise(mean_annual_gas = mean(gas_consumption, + na.rm = TRUE)) + +#5. **Bonus:** Convert your long tibble back to wide. This should be the same as the UKgas data. +# Compute the mean gas consumption by year. +# Hint: Have a look at https://stackoverflow.com/questions/50352735/calculate-the-mean-of-some-columns-using-dplyrmutate +# (Your tibble will have 27 rows and 6 columns). As you can see, working with long data is simpler in R. +UKgas <- UKgas %>% + mutate(mean_annual_gas = rowMeans(select(., Qtr1, + Qtr2, + Qtr3, + Qtr4))) + +#6. Plot the UKgas consumption by year as a line graph, with quarters shown in different colours. +# Change the axes labels to something of your choice and add a title. +g <- ggplot(UKgas_l, aes(x = year, + y = gas_consumption, + group = quarter, + colour = quarter)) + + geom_line() + + xlab("Year") + ## Alternatively, all labels can be changed in a single line using labs(title = "", x = "", y = "") + ylab("Gas consumption (mTherms)") + +g + + +#7. 
Plot the same as above, but include a line for the mean gas consumption across quarters. +# You will first need to append the UKgas_by_year to your data +UKgas_l_with_mean <- UKgas_l %>% + bind_rows(UKgas_by_year %>% + # Specify the value that should appear in the "quarter" column + mutate(quarter = "Mean") %>% + # ensure that column names match + rename(gas_consumption = mean_annual_gas)) + +g2 <- ggplot(UKgas_l_with_mean, aes(x = year, + y = gas_consumption, + group = quarter, + colour = quarter)) + + geom_line() + + xlab("Year") + + ylab("Gas consumption (mTherms)") + +g2 + + +#8. Create the same plot as above (including the mean), but use thin lines for quarter, and a thick line for the mean. +# You will need to add a new numeric variable to the data used in the previous exercise that specifies a value for line thickness. +# See the examples in `?geom_line` for details around specifying aesthetics for the line graph and how to do this by group. +# You will also need to look at `?scale_linewidth` +UKgas_l_with_mean <- UKgas_l_with_mean %>% + # Add a linewidth value. 
The numbers are not crucial as these will be rescaled in the plot + mutate(linewidth = if_else(quarter == "Mean", 2, 1)) + +g3 <- ggplot(UKgas_l_with_mean, aes(x = year, + y = gas_consumption, + group = quarter, + colour = quarter)) + + geom_line(aes(linewidth = linewidth)) + + #specify the range that the linewidths should span, and disable the linewidth legend + scale_linewidth(range = c(0.1, 2), guide = "none") + + xlab("Year") + + ylab("Gas consumption (mTherms)") + +g3 + + + diff --git a/session2/intro_to_r_session2_working_script.R b/session2/intro_to_r_session2_working_script.R new file mode 100644 index 0000000..aec0f06 --- /dev/null +++ b/session2/intro_to_r_session2_working_script.R @@ -0,0 +1,227 @@ +# Session 2 --------------------------------------------------------------- + +#Read in libraries +library(tidyverse) +library(lubridate) + +## Section 2: Exploring the Data ------------------------------------------ + +### 2.1 Examples (incomplete) --------------------------------------------- + +#2.1.1 import data +benefits <- + +#2.1.2 view the first few lines of the tibble + + +#2.1.3 view the entire tibble in a new tab + + +### 2.2 Examples (incomplete) --------------------------------------------- + +#2.2.1 data structure + + +#2.2.2 summary statistics + + +#2.2.3 check for missing values + + +#2.2.4 total missing values + + +### 2.3 Examples (incomplete) --------------------------------------------- + +#2.3.1 quick plots +plot() +plot() +plot() +plot() + + +## Section 3: Data Wrangling ---------------------------------------------- + +### 3.1 Examples (incomplete) --------------------------------------------- + +#3.1.1 pivot_longer documentation +?pivot_longer + +#3.1.2 wide to long +benefit_long <- pivot_longer( + + + +) +head(benefit_long) + +#3.1.3 alternate way of specifying cols +benefit_long <- pivot_longer( + + + +) +head(benefit_long) + +#3.1.4 add total +benefit_total <- benefit_long %>% + group_by() %>% + summarise(benefit = , + apps = + ) 
%>% + ungroup() + +head(benefit_total) + +#3.1.5 append benefit_total to benefit_long +benefit_long <- + + +## Section 4: Plotting with ggplot ---------------------------------------- + +### 4.1 Examples ---------------------------------------------------------- + +#4.1.1 Time series plot +time_series_plot <- ggplot(data = benefit_long, + aes(x = date, y = apps)) + + #Add the geometry + geom_point() + +#call the object to show the plot +time_series_plot + + +#4.1.2 separate series with different colours +time_series_plot2 <- ggplot(data = benefit_long, + aes(x = date, + y = apps, + group = benefit, + colour = benefit)) + + geom_point() + +time_series_plot2 + +#4.1.3 split into panels +time_series_plot3 <- time_series_plot2 + + facet_grid(rows = vars(benefit)) + +time_series_plot3 + +#4.1.4 best fit line +time_series_plot4 <- time_series_plot3 + + geom_line() + + geom_smooth(method = "lm") #Uses a linear model for fit and confidence interval + +time_series_plot4 + +#4.1.5 apply a theme +time_series_plot5 <- time_series_plot4 + + xlab("Date") + + ylab("Application number") + + labs(colour = "Benefit") + + theme_bw() + +time_series_plot5 + +### 4.2 Exercises --------------------------------------------------------- + +#1. 
create the time_series_plot5 in a concise bit of code + +#Import +benefits <- + +#Wide to long +benefit_long <- + +#Get total +benefit_total <- + +#Append benefit_total to benefit_long +benefit_long <- + + +#Plotting +time_series_plot <- + + + + + +time_series_plot + + +### 4.3 Examples (incomplete) --------------------------------------------- + +#4.3.1 add a variable to the benefit_long tibble containing the year, and convert the variable to a factor +benefit_long <- benefit_long %>% + mutate() %>% + mutate() + +#4.3.2 summarise average number of benefits by year, and add error bar max/min values +benefit_by_year <- benefit_long %>% + filter() %>% + group_by() %>% + summarise(average_apps = , + error_bar_min = , + error_bar_max = + ) + +head(benefit_by_year) + +#4.3.3 plot as a bar chart +bar_graph_plot <- ggplot( + + + + + + +) + +dodge <- position_dodge(width=0.9) + +bar_graph_plot <- bar_graph_plot + + geom_col() + + geom_errorbar( + ) + + xlab() + + ylab() + +bar_graph_plot + +## Additional Exercises --------------------------------------------------- + +#1. Read in "UKgas.csv" from the `./additional_exercises/data` folder and inspect the data. +# (The data has been created from one of R datasets https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/UKgas) +UKgas <- + +#2. Create a new tibble of the data in long format with a column to specify the quarter. +UKgas_l <- + +#3. Compute the mean quarterly UKgas consumption across years (Your new tibble will have four rows and 2 columns) + +UKgas_by_quarter <- + +#4. Compute the mean gas consumption for each year (Your tibble will have 27 rows and 2 columns) +UKgas_by_year <- + +#5. **Bonus:** Convert your long tibble back to wide. This should be the same as the UKgas data. +# Compute the mean gas consumption by year. +# Hint: Have a look at https://stackoverflow.com/questions/50352735/calculate-the-mean-of-some-columns-using-dplyrmutate +# (Your tibble will have 27 rows and 6 columns). 
As you can see, working with long data is simpler in R. +UKgas <- + +#6. Plot the UKgas consumption by year as a line graph, with quarters shown in different colours. +# Change the axes labels to something of your choice and add a title. + + +#7. Plot the same as above, but include a line for the mean gas consumption across quarters. +# You will first need to append the UKgas_by_year to your data + + +#8. Create the same plot as above (including the mean), but use thin lines for quarter, and a thick line for the mean. +# You will need to add a new numeric variable to the data used in the previous exercise that specifies a value for line thickness. +# See the examples in `?geom_line` for details around specifying aesthetics for the line graph and how to do this by group. +# You will also need to look at `?scale_linewidth` + +
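
The wide-to-long reshaping and "Total" row pattern from Section 3 can be tried without `benefits.csv`. The following is a minimal sketch, assuming a small made-up tibble: `demo_wide`, `demo_long` and `demo_total` are hypothetical names and the values are invented for illustration only. They are not part of the course data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for benefits.csv: one date column plus one column per benefit
demo_wide <- tibble(
  date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  A = c(10, 11, 12),
  B = c(20, 22, 24)
)

# Wide to long: every column except date becomes a (benefit, apps) pair
demo_long <- pivot_longer(demo_wide,
                          cols = -date,
                          names_to = "benefit",
                          values_to = "apps")

# One "Total" row per date, summing applications across benefits
demo_total <- demo_long %>%
  group_by(date) %>%
  summarise(benefit = "Total",
            apps = sum(apps)) %>%
  ungroup()

# Append the totals so they can be plotted alongside the individual benefits
demo_long <- bind_rows(demo_long, demo_total)

head(demo_long)
```

The same three steps (pivot, summarise, bind) are what the 3.1 examples perform on the real data, so the sketch can be used to check the shape of each intermediate tibble before running the session scripts.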