diff --git a/01-intro-to-r.html b/01-intro-to-r.html new file mode 100644 index 00000000..41d074da --- /dev/null +++ b/01-intro-to-r.html @@ -0,0 +1,849 @@ + +Geospatial Data Carpentry for Urbanism: Introduction to R and RStudio +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Introduction to R and RStudio

+

Last updated on 2024-01-29

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How to find your way around RStudio?
  • +
  • How to manage projects in R?
  • +
  • How to install packages?
  • +
  • How to interact with R?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Create self-contained projects in RStudio
  • +
  • Install additional packages using R code.
  • +
  • Manage packages
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Call functions
  • +
+
+
+
+
+
+

Project management in RStudio

+

RStudio is an integrated development environment (IDE), which means it provides a (much prettier) interface for the R software. For RStudio to work, you need to have R installed on your computer. But because R is integrated into RStudio, you never actually have to open the R software itself.

+

RStudio provides a useful feature: projects. A project is a self-contained working space (i.e. working directory) that R refers to when looking for and saving files. You can create a project in an existing directory (folder) or in a new one.

+
+

Creating RStudio Project

+

We’re going to create a project in RStudio in a new directory. To +create a project, go to:

+
  • File
  • +
  • New Project
  • +
  • New directory
  • +
  • Place the project somewhere you can easily find it on your laptop and name the project data-carpentry
  • +
  • Create project
  • +
+
+

Organising working directory

+

Creating an RStudio project is a good first step towards good project management. However, most of the time it is a good idea to organise the working space further. Here is one suggestion of what your R project can look like. Let’s go ahead and create the other folders:

+
  • +data/ - should be where your raw data is. READ +ONLY +
  • +
  • +data_output/ - should be where your data output is +saved READ AND WRITE +
  • +
  • +documents/ - all the documentation associated with the +project (e.g. cookbook)
  • +
  • +fig_output/ - your figure outputs go here WRITE +ONLY +
  • +
  • +scripts/ - all your code goes here READ AND +WRITE +
  • +
RStudio  project logo with five lines, each leading from the logo towards  one of the five boxes with texts: 'data/', 'data_output/',  'documents/', 'fig_output/', 'scripts/'
R project organization

You can create these folders as you would any other folders on your +laptop, but R and RStudio offer handy ways to do it directly in your +RStudio session.

+

You can use the RStudio interface to create a folder in your project by going to the lower right pane, opening the Files tab, and clicking on the Folder icon. A dialog box will appear, allowing you to type the name of the folder you want to create.

+

An alternative solution is to create the folders using the R command dir.create(). In the console, type:

+
+

R +

+
+dir.create('data')
+dir.create('data_output')
+dir.create('documents')
+dir.create('fig_output')
+dir.create('scripts')
+
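The same folders can also be created in one go. The following is a minimal sketch, assuming the project root is the current working directory; the dir.exists() check simply avoids the warning that dir.create() gives when a folder already exists.

```r
# Folder names taken from the suggested project layout
folders <- c("data", "data_output", "documents", "fig_output", "scripts")

for (folder in folders) {
  # Create the folder only if it is not there yet
  if (!dir.exists(folder)) {
    dir.create(folder)
  }
}

# Check which of the folders now exist
dir.exists(folders)
```

Running this a second time changes nothing, so it is safe to keep in a set-up script.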
+ +
+
+

Two main ways to interact with R

+

There are two main ways to interact with R through RStudio:

+
  • test and play environment within the interactive R +console +
  • +
  • write and save an R script (.R +file) +
  • +
+
+ +
+
+

Callout +

+
+

When you open RStudio or create an RStudio project, you will see the Console window on the left by default. Once you create an R script, it is placed in the upper left pane and the Console moves to the bottom left pane.

+
+
+
+

Each of the modes of interaction has its advantages and drawbacks.

+ + + + + + + + + +
       Console                            R script
Pros   Immediate results                  Complete record of your work
Cons   Work lost once you close RStudio   Messy if you just want to print things out
+
+

Creating a script

+

During the workshop we will mostly use an .R script to keep a full record of what has been written. This way we will also be able to reproduce the results. Let’s create one now and save it in the scripts directory.

+
  • File
  • +
  • New File
  • +
  • R Script
  • +
  • A new Untitled script will appear in the source +pane.
  • +
  • Save it using the floppy disk icon.
  • +
  • Select the scripts/ folder as the file location
  • +
  • Name the script intro-to-r.R +
  • +
+
+

Running the code

+

Note that all code written in the script can also be executed on the spot in the interactive console. We will now learn how to run the code both in the console and the script.

+
  • In the Console you run the code by hitting Enter at the +end of the line
  • +
  • In the R script there are two ways to execute the code:
    • You can use the Run button on the top right of the +script window.
    • +
    • Alternatively, you can use a keyboard shortcut: Ctrl + Enter, or Command + Return for Mac users.
    • +
  • +

In both cases, the active line (the line where your cursor is placed) +or a highlighted snippet of code will be executed. A common source of +error in scripts, such as a previously created object not found, is code +that has not been executed in previous lines: make sure that all code +has been executed as described above. To run all lines before the active +line, you can use the keyboard shortcut Ctrl + Alt ++ B on Windows/Linux or Command + +option + B on Mac.

+
+
+ +
+
+

Escaping +

+
+

The console shows that it is ready for new commands with the > sign. It shows the + sign if it still requires input before the command can be executed.

+

Sometimes you don’t know what is missing, you change your mind and want to run something else, or your code is running much too long and you just want it to stop. In all of these cases, press Esc.

+
+
+
+
+
+

Packages

+

A great power of R lies in packages: add-on sets of functions that are built by the community. Once they go through a quality process, they are available to download from a repository called CRAN. They need to be explicitly activated. Now, we will be using the tidyverse package, which is actually a collection of useful packages. Another package that we will use is here.

+

You were asked to install the tidyverse package in preparation for the workshop. You need to install a package only once, so you won’t have to do it again. We will, however, need to install the here package. To do so, please go to your script and type:

+
+

R +

+
+install.packages('here')
+
+
+
+ +
+
+

Callout +

+
+

If you are not sure whether you have the tidyverse package installed, you can check it in the Packages tab in the bottom right pane. In the search box start typing ‘tidyverse’ and see if it appears in the list of installed packages. If not, you will need to install it by writing in the script:

+
+

R +

+
+install.packages('tidyverse')
+
+
+
+
+
+
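Whether a package is installed can also be checked from code instead of the Packages tab. The sketch below uses a small helper, is_installed(), which is a hypothetical name introduced here and not part of the lesson; it relies on base R's requireNamespace(), which returns TRUE or FALSE without attaching the package.

```r
# Hypothetical helper (not part of the lesson): TRUE if a package is installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # TRUE: stats ships with every R installation

# Report the status of the two packages used in this workshop
for (pkg in c("tidyverse", "here")) {
  status <- if (is_installed(pkg)) "installed" else "not installed"
  message(pkg, ": ", status)
}
```

If a package is reported as not installed, run install.packages() for it once, as described above.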
+ +
+
+

Commenting your code +

+
+

Now we have a bit of an issue with our script. As mentioned, the +packages need to be installed only once, but now, they will be installed +each time we run the script, which can take a lot of time if we’re +installing a large package like tidyverse.

+

To keep a record of the package installation without executing it, you can use a comment. In R, anything that is written after a hash sign # is ignored in execution. Thanks to this feature, you can annotate your code. Let’s adapt our script by changing the first lines into comments:

+
+

R +

+
+# install.packages('here')
+# install.packages('tidyverse')
+
+
+
+
+

Installing packages is not sufficient to work with them. You will need to load them each time you want to use them. To do that you use the library() command:

+
+

R +

+
+# Load packages
+library(tidyverse)
+library(here)
+
+
+
+

Handling paths

+

You have created a project which is your working directory, and a few sub-folders that will help you organise your project better. But now, each time you save or retrieve a file from those folders, you will need to specify the path from the folder you are in (most likely the scripts/ folder) to those files.

+

That can become complicated and might cause a reproducibility +problem, if the person using your code (including future you) is working +in a different sub-folder.

+

We will use the here package to tackle this issue. This package converts relative paths from the root (main folder) of your project to absolute paths (the exact location on your computer). For instance, instead of writing out the full path like “C:/Users/YourName/Documents/r-geospatial-urban/data/file.csv” or “~/Documents/r-geospatial-urban/data/file.csv”, you can use the here() function to create a path relative to your project’s main directory. This makes your code more portable and reproducible, as it doesn’t depend on a specific location of your project on your computer.

+

It might be confusing, so let’s see how it works. We will use the +here() function from the here package. In the +console, we write:

+
+

R +

+
+here()
+here('data')
+
+

You all probably have something different printed out. And this is +fine, because here adapts to your computer’s specific +situation.
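To make the idea concrete, here is a rough base-R sketch of what here() does conceptually: it joins folder and file names onto the project root. The sketch assumes that the project root is the current working directory; the here package itself finds the root differently, by looking for markers such as an .Rproj file, so this is an illustration rather than a replacement.

```r
# Assumption for this sketch: the project root is the current working directory
project_root <- getwd()

# Join the root with a sub-folder and a file name, roughly what
# here('data', 'gapminder_data.csv') returns inside a project
data_path <- file.path(project_root, "data", "gapminder_data.csv")
data_path
```

The printed path starts with a different root on every computer, while the part after the root stays the same.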

+
+
+

Download files

+

We still need to download data for the first part of the workshop. +You can do it with the function download.file(). We will +save it in the data/ folder, where the raw +data should go. In the script, we will write:

+
+

R +

+
+# Download the data
+download.file('https://bit.ly/geospatial_data', 
+              here('data', 'gapminder_data.csv'))
+
+
+
+ +
+
+

Importing data into R +

+
+

Three of the most common ways of importing data in R are:

+
  • loading a package with pre-installed data;
  • +
  • downloading data from a URL;
  • +
  • reading a file from your computer.
  • +

For larger datasets, database connections or API requests are also +possible. We will not cover these in the workshop.

+
+
+
+
+
+
+

Introduction to R

+

You can use R as a calculator. For example, you can write:

+
+

R +

+
+1+100
+1*100
+1/100
+
+
+

Variables and assignment

+

However, what’s more useful is that in R we can store values and use them whenever we need to. We do this using the assignment operator <-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we’ve stored it for later in something called a variable. The variable x now contains the value 0.025:

+
+

R +

+
+x
+
+

Look for the Environment tab in the upper right pane of +RStudio. You will see that x and its value have appeared in +the list of Values. Our variable x can be used in place of +a number in any calculation that expects a number, e.g. when calculating +a square root:

+
+

R +

+
+sqrt(x)
+
+

Variables can also be reassigned. This means that we can assign a new value to variable x:

+
+

R +

+
+x <- 100
+x
+
+

You can use one variable to create a new one:

+
+

R +

+
+y <- sqrt(x) # you can use value stored in object x to create y
+y
+
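One detail worth checking for yourself: a variable created from another one stores the computed value, not a link to the original. A short sketch using the same x and y:

```r
x <- 100
y <- sqrt(x)  # y is computed from the current value of x

x <- 25       # reassigning x ...
y             # ... leaves y unchanged: still 10

y <- sqrt(x)  # re-run the assignment to update y
y             # now 5
```

This is why running the lines of a script in order matters: y only changes when its own assignment is executed again.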
+
+
+ +
+
+

Key Points +

+
+
  • Use RStudio to write and run R programs.
  • +
  • Use install.packages() to install packages.
  • +
  • Use library() to load packages.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/02-data-structures.html b/02-data-structures.html new file mode 100644 index 00000000..53264a57 --- /dev/null +++ b/02-data-structures.html @@ -0,0 +1,896 @@ + +Geospatial Data Carpentry for Urbanism: Data Structures +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Data Structures

+

Last updated on 2024-01-29

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Be aware of the different types of data.
  • +
  • Begin exploring data frames and understand how they are related to vectors, factors and lists.
  • +
  • Ask R questions about the type, class, and structure of an object.
  • +
+
+
+
+
+

Vectors +

+

So far we’ve looked at individual values. Now we will move to a data structure called a vector. Vectors are arrays of values of the same data type.

+
+
+ +
+
+

Data types +

+
+

Data type refers to a type of information that is stored by a value. +It can be:

+
  • +numeric (a number)
  • +
  • +integer (a number without information about decimal +points)
  • +
  • +logical (a boolean - are values TRUE or FALSE?)
  • +
  • +character (a text/ string of characters)
  • +
  • +complex (a complex number)
  • +
  • +raw (raw bytes)
  • +

We won’t discuss complex or raw data type +in the workshop.
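You can ask R which data type a value has with the class() function. A short sketch covering the four types used in this workshop:

```r
class(2.5)     # "numeric"
class(2L)      # "integer" (the L suffix marks a whole number as an integer)
class(TRUE)    # "logical"
class("text")  # "character"
```

Note that a number typed without the L suffix, such as 2, is stored as numeric, not integer.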

+
+
+
+
+
+ +
+
+

Data structures +

+
+

Vectors are the most common and basic data structure in R but you +will come across other data structures such as data frames, lists and +matrices as well. In short:

+
  • data frames are two-dimensional data structures in which columns are vectors of the same length that can have different data types. We will use this data structure in this lesson.
  • +
  • lists can have an arbitrary structure and can mix data types;
  • +
  • matrices are two-dimensional data structures containing elements of +the same data type.
  • +

For a more detailed description, see Data +Types and Structures.

+

Note that vector data in the geospatial context is different from +vector data types. More about vector data in a later lesson!

+
+
+
+

You can create a vector with the c() function.

+
+

R +

+
+numeric_vector <- c(2, 6, 3) # vector of numbers - numeric data type.
+numeric_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+
+

R +

+
+character_vector <- c('banana', 'apple', 'orange') # vector of words - or strings of characters- character data type
+character_vector
+
+
+

OUTPUT +

+
[1] "banana" "apple"  "orange"
+
+
+

R +

+
+logical_vector <- c(TRUE, FALSE, TRUE) # vector of logical values (is something true or false?)- logical data type.
+logical_vector
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
+

Combining vectors

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+abcd_vector <- c(ab_vector, 'c', 'd')
+abcd_vector
+
+
+

OUTPUT +

+
[1] "a" "b" "c" "d"
+
+
+
+

Missing values

+
+
+ +
+
+

Exercise +

+
+

Combine the abcd_vector with the +numeric_vector in R. What is the data type of this new +vector and why?

+
+
+
+
+
+ +
+
+
combined_vector <- c(abcd_vector, numeric_vector)
+combined_vector
+

The combined vector is a character vector. Because vectors can only +hold one data type and abcd_vector cannot be interpreted as +numbers, the numbers in numeric_vector are coerced +into characters.
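The coercion can be verified with class(). The sketch below recreates both vectors so that it runs on its own, and also shows what happens in the opposite direction with as.numeric():

```r
abcd_vector <- c('a', 'b', 'c', 'd')
numeric_vector <- c(2, 6, 3)

combined_vector <- c(abcd_vector, numeric_vector)
class(combined_vector)  # "character"
combined_vector         # the numbers are now the strings "2", "6" and "3"

# Converting back: text that is not a number becomes NA (R prints a warning)
as.numeric(c('2', '6', 'three'))
```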

+
+
+
+
+

A common operation you will want to perform is removing all the missing values (denoted in R as NA). Let’s have a look at how to do it:

+
+

R +

+
+with_na <- c(1, 2, 1, 1, NA, 3, NA ) # vector including missing value
+
+

First, let’s try to calculate the mean of the values in this vector:

+
+

R +

+
+mean(with_na) # mean() function cannot interpret the missing values
+
+
+

OUTPUT +

+
[1] NA
+
+
+

R +

+
+mean(with_na, na.rm = TRUE) # Add the argument na.rm = TRUE to calculate the result while ignoring the missing values.
+
+
+

OUTPUT +

+
[1] 1.6
+
+

However, sometimes you would like to have the NA values permanently removed from your vector. For this you need to identify which elements of the vector hold missing values with the is.na() function.

+
+

R +

+
+is.na(with_na) #  This will produce a vector of logical values, stating if a statement 'This element of the vector is a missing value' is true or not
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
+
+
+

R +

+
+!is.na(with_na) # The ! operator means negation, i.e. not is.na(with_na)
+
+
+

OUTPUT +

+
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
+
+

We know which elements in the vector are NA. Now we need to retrieve the subset of the with_na vector that is not NA. Sub-setting in R is done with square brackets [ ].

+
+

R +

+
+without_na <- with_na[ !is.na(with_na) ] # this notation will return only the elements that have TRUE on their respective positions
+
+without_na
+
+
+

OUTPUT +

+
[1] 1 2 1 1 3
+
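Before removing missing values it is often useful to know how many there are and where they sit. A small sketch with the same vector, relying on the fact that R counts TRUE as 1 when summing logical values:

```r
with_na <- c(1, 2, 1, 1, NA, 3, NA)

sum(is.na(with_na))    # number of missing values: 2
which(is.na(with_na))  # positions of the missing values: 5 7
```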
+
+

Factors +

+

Another important data structure is called a factor. +Factors look like character data, but are used to represent categorical +information.

+

Factors create a structured relation between the different levels +(values) of a categorical variable, such as days of the week or +responses to a question in a survey. While factors look (and often +behave) like character vectors, they are actually treated as numbers by +R, which is useful for computing summary statistics about +their distribution, running regression analysis, etc. So you need to be +very careful when treating them as strings.

+
+

Create factors

+

Once created, factors can only contain a pre-defined set of values, +known as levels.

+
+

R +

+
+nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
+nordic_str # regular character vectors printed out
+
+
+

OUTPUT +

+
[1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden" 
+
+
+

R +

+
+nordic_cat <- factor(nordic_str) # factor() function converts a vector to factor data type
+nordic_cat # With factors, R prints out additional information - 'Levels'
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Denmark Norway Sweden
+
+
+
+

Inspect factors

+

R will treat each unique value from a factor vector as a +level and (silently) assign numerical values to it. +This can come in handy when performing statistical analysis. You can +inspect and adapt levels of the factor.

+
+

R +

+
+levels(nordic_cat) # returns all levels of a factor vector.  
+
+
+

OUTPUT +

+
[1] "Denmark" "Norway"  "Sweden" 
+
+
+

R +

+
+nlevels(nordic_cat) # returns number of levels in a vector
+
+
+

OUTPUT +

+
[1] 3
+
+
+
+

Reorder levels

+

Note that R sorts the levels in alphabetical order, not in the order of occurrence in the vector. R assigns the value of:

+
  • 1 to level ‘Denmark’,
  • +
  • 2 to ‘Norway’
  • +
  • 3 to ‘Sweden’.
  • +

This is important as it can affect e.g. the order in which categories +are displayed in a plot or which category is taken as a baseline in a +statistical model.
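The numbers that R assigns can be made visible with as.integer(). The sketch recreates the factor with its default (alphabetical) levels so that it runs on its own:

```r
nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
nordic_cat <- factor(nordic_str)

levels(nordic_cat)      # "Denmark" "Norway" "Sweden"
as.integer(nordic_cat)  # 2 3 2 1 3
```

Each number is the position of the element's level in the levels() vector, which is why changing the order of the levels changes these numbers as well.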

+

You can reorder the categories using the factor() function. This can be useful, for instance, to select a reference category (first level) in a regression model or for ordering legend items in a plot, rather than using the default category systematically (i.e. based on alphabetical order).

+
+

R +

+
+nordic_cat <- factor(nordic_cat, levels = c('Norway' , 'Denmark', 'Sweden'))  # now Norway should be the first category, Denmark second and Sweden third
+
+nordic_cat
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Norway Denmark Sweden
+
+
+
+ +
+
+

Callout +

+
+

There is more than one way to reorder factors. Later in the lesson, we will use the fct_relevel() function from the forcats package to do the reordering.

+
+

R +

+
+# nordic_cat <- fct_relevel(nordic_cat, 'Norway' , 'Denmark', 'Sweden') # now Norway should be the first category, Denmark second and Sweden third
+
+nordic_cat
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Norway Denmark Sweden
+
+
+
+
+

You can also inspect vectors with the str() function. In factor vectors, it shows the underlying values of each category. You can also see the structure in the Environment tab of RStudio.

+
+

R +

+
+str(nordic_cat) 
+
+
+

OUTPUT +

+
 Factor w/ 3 levels "Norway","Denmark",..: 1 3 1 2 3
+
+
+
+ +
+
+

Note of caution +

+
+

Remember that once created, factors can only contain a pre-defined set of values, known as levels. It means that whenever you try to add something to the factor outside of this set, it will become an unknown/missing value, denoted by R as NA.

+
+

R +

+
+nordic_str
+
+
+

OUTPUT +

+
[1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden" 
+
+
+

R +

+
+nordic_cat2 <- factor(nordic_str, levels = c('Norway', 'Denmark'))
+nordic_cat2 # since we have not included Sweden in the list of factor levels, it has become NA.
+
+
+

OUTPUT +

+
[1] Norway  <NA>    Norway  Denmark <NA>   
+Levels: Norway Denmark
+
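One way to guard against this is to compare the values with the intended levels before creating the factor. In the sketch below, allowed_levels is a hypothetical helper variable introduced for illustration:

```r
nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
allowed_levels <- c('Norway', 'Denmark')  # hypothetical name, for illustration

# Values present in the data but missing from the intended levels
setdiff(nordic_str, allowed_levels)  # "Sweden"

# After conversion, count how many elements were turned into NA
nordic_cat2 <- factor(nordic_str, levels = allowed_levels)
sum(is.na(nordic_cat2))              # 2
```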
+
+
+
+
+
+ +
+
+

Key Points +

+
+
  • The most commonly used basic data types in R are numeric, integer, logical, and character
  • +
  • Use factors to represent categories in R.
  • +
+
+
+ +
+
+
+ + +
+
+
+ +
+
+ + diff --git a/03-explore-data.html b/03-explore-data.html new file mode 100644 index 00000000..f353ce9a --- /dev/null +++ b/03-explore-data.html @@ -0,0 +1,943 @@ + +Geospatial Data Carpentry for Urbanism: Exploring Data Frames & Data frame Manipulation with dplyr +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Exploring Data Frames & Data frame Manipulation with dplyr

+

Last updated on 2024-01-29

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What is a data frame?
  • +
  • How can I read data in R?
  • +
  • How can I get basic summary information about my data set?
  • +
  • How can I select specific rows and/or columns from a data +frame?
  • +
  • How can I combine multiple commands into a single command?
  • +
  • How can I create new columns or remove existing columns from a data +frame?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Describe what a data frame is.
  • +
  • Load external data from a .csv file into a data frame.
  • +
  • Summarize the contents of a data frame.
  • +
  • Select certain columns in a data frame with the dplyr function +select.
  • +
  • Select certain rows in a data frame according to filtering +conditions with the dplyr function filter.
  • +
  • Link the output of one dplyr function to the input of another +function with the ‘pipe’ operator %>%.
  • +
  • Add new columns to a data frame that are functions of existing +columns with mutate.
  • +
  • Use the split-apply-combine concept for data analysis.
  • +
  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistic to each group, and then combine the results.
  • +
+
+
+
+
+
+

Exploring Data Frames

+

Now we turn to the bread-and-butter of working with R: working with tabular data. In R, tabular data are stored in a data structure called a data frame.

+

A data frame is a representation of data in the format of a +table where the columns are vectors +that all have the same length.

+

Because columns are vectors, each column must contain a +single type of data (e.g., characters, numeric, +factors). For example, here is a figure depicting a data frame +comprising a numeric, a character, and a logical vector.

+


Source: Data Carpentry R for Social Scientists

+
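A data frame like the one described above can be built by hand from vectors of equal length. In this minimal sketch the column names and values are made up for illustration:

```r
# Three vectors of the same length become three columns
example_df <- data.frame(
  amount = c(2, 6, 3),                     # numeric column
  fruit = c('banana', 'apple', 'orange'),  # character column
  in_stock = c(TRUE, FALSE, TRUE),         # logical column
  stringsAsFactors = FALSE                 # keep text as character, also in older R versions
)

str(example_df)
```

str() lists each column with its data type, which confirms that every column holds a single type while the data frame as a whole mixes several.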
+

Reading data

+

read_csv() is a function from the readr package (part of tidyverse) used to read comma-separated data files (.csv format). There are other functions for files separated with other delimiters. We’re going to read in the gapminder data set with information about countries’ population size, GDP per capita and average life expectancy in different years.

+
+

R +

+
+gapminder <- read_csv("data/gapminder_data.csv")
+
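If you want to see what reading a .csv file involves without the workshop data, the sketch below writes a tiny file to a temporary location and reads it back. It uses base R's read.csv() so that it runs without extra packages; read_csv() from readr is called in a similar way but returns a tibble. The two rows are rounded values for Afghanistan from the gapminder data.

```r
# Write a small comma-separated file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "Afghanistan,1952,28.8",
             "Afghanistan,1957,30.3"),
           csv_file)

# Read it back: the first line becomes the column names
small_df <- read.csv(csv_file)
str(small_df)
```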
+
+
+

Exploring dataset

+

Let’s investigate the gapminder data frame a bit; the +first thing we should always do is check out what the data looks +like.

+

It is important to see if all the variables (columns) have the data +type that we require. For instance, a column might have numbers stored +as characters, which would not allow us to make calculations with those +numbers.

+
+

R +

+
+str(gapminder) 
+
+
+

OUTPUT +

+
spc_tbl_ [1,704 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : num [1:1704] 1952 1957 1962 1967 1972 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "spec")=
+  .. cols(
+  ..   country = col_character(),
+  ..   year = col_double(),
+  ..   pop = col_double(),
+  ..   continent = col_character(),
+  ..   lifeExp = col_double(),
+  ..   gdpPercap = col_double()
+  .. )
+ - attr(*, "problems")=<externalptr> 
+
+

We can see that the gapminder object is a data.frame +with 1704 observations (rows) and 6 variables (columns).

+

In each line after a $ sign, we see the name of each +column, its type and first few values.

+
+

First look at the dataset

+

There are multiple ways to explore a data set. Here are just a few +examples:

+
+

R +

+
+head(gapminder) # see first 6  rows of the data set
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country      year      pop continent lifeExp gdpPercap
+  <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>
+1 Afghanistan  1952  8425333 Asia         28.8      779.
+2 Afghanistan  1957  9240934 Asia         30.3      821.
+3 Afghanistan  1962 10267083 Asia         32.0      853.
+4 Afghanistan  1967 11537966 Asia         34.0      836.
+5 Afghanistan  1972 13079460 Asia         36.1      740.
+6 Afghanistan  1977 14880372 Asia         38.4      786.
+
+
+

R +

+
+summary(gapminder) # gives basic statistical information about each column. Information format differs by data type.
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+
+

R +

+
+nrow(gapminder) # returns number of rows in a dataset
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder) # returns number of columns in a dataset
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+

Dollar sign ($)

+

When you’re analyzing a data set, you often need to access its +specific columns.

+

One handy way to access a column is using its name and a dollar sign $:

+
+

R +

+
+country_vec <- gapminder$country  # Notation means: From dataset gapminder, give me column country. You can see that the column accessed in this way is just a vector of characters. 
+
+head(country_vec)
+
+
+

OUTPUT +

+
[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
+[6] "Afghanistan"
+
+

Note that calling a column with a $ sign returns a vector; it is not a data frame anymore.

+
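The difference between getting a vector and getting a data frame can be checked directly. The sketch uses a small plain data frame with illustrative values; the same three forms of access also work on the gapminder data.

```r
toy <- data.frame(
  country = c('Norway', 'Sweden'),
  year = c(1952, 1957),
  stringsAsFactors = FALSE
)

country_vec <- toy$country
is.vector(country_vec)                    # TRUE: $ returns the column as a vector
identical(toy$country, toy[['country']])  # TRUE: [[ ]] does the same as $
is.data.frame(toy['country'])             # TRUE: single brackets keep a data frame
```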
+
+
+
+

Data Frame Manipulation with dplyr

+
+

Select

+

Let’s start manipulating the data.

+

First, we will adapt our data set, by keeping only the columns we’re +interested in, using the select() function from the +dplyr package:

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap) 
+
+head(year_country_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  1952 Afghanistan      779.
+2  1957 Afghanistan      821.
+3  1962 Afghanistan      853.
+4  1967 Afghanistan      836.
+5  1972 Afghanistan      740.
+6  1977 Afghanistan      786.
+
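select() can also be used the other way round, to drop columns instead of keeping them, by putting a minus sign in front of the column name. A minimal sketch on a small data frame (the first two gapminder rows for Afghanistan), assuming the dplyr package is installed:

```r
library(dplyr)

toy <- data.frame(
  year = c(1952, 1957),
  country = c('Afghanistan', 'Afghanistan'),
  pop = c(8425333, 9240934)
)

# Keep everything except the pop column
toy_no_pop <- select(toy, -pop)
names(toy_no_pop)  # "year" "country"
```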
+
+
+

Pipe

+

Now, this is not the most common notation when working with the dplyr package. dplyr offers an operator %>% called a pipe, which allows you to build up very complicated commands in a readable way.

+

In newer installations of R (version 4.1 and later) you can also find the notation |>, called the native pipe. This pipe works in a similar way. The main difference is that you don’t need to load any packages to have it available.

+

The select() statement with a pipe would look like this:

+
+

R +

+
+year_country_gdp <- gapminder %>% 
+  select(year,country,gdpPercap)
+
+head(year_country_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  1952 Afghanistan      779.
+2  1957 Afghanistan      821.
+3  1962 Afghanistan      853.
+4  1967 Afghanistan      836.
+5  1972 Afghanistan      740.
+6  1977 Afghanistan      786.
+
+

First we define the data set, then, using the pipe, we pass it on to the select() function. This way we can chain multiple functions together, which we will be doing now.

+
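To see what a pipe does in isolation, compare a nested call with its piped equivalent. The sketch uses the native pipe |> so that it runs without loading any package (this requires R 4.1 or later); with dplyr loaded, %>% can be used in the same positions.

```r
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right
piped <- values |> sqrt() |> sum()

piped                     # 6
identical(nested, piped)  # TRUE
```

The pipe passes the result on its left as the first argument of the function on its right, which is exactly how gapminder is passed to select().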
+
+

Filter

+

We already know how to select only the needed columns. But now, we also want to filter the rows of our data set according to certain conditions with the filter() function. Instead of doing it in separate steps, we can do it all together.

+

In the gapminder data set, we want to see the results +from outside of Europe for the 21st century.

+
+

R +

+
+year_country_gdp_non_europe <- gapminder %>% 
+  filter(continent != "Europe" & year >= 2000) %>% # & operator (AND) - both conditions must be met
+  select(year, country, gdpPercap)
+
+head(year_country_gdp_non_europe)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  2002 Afghanistan      727.
+2  2007 Afghanistan      975.
+3  2002 Algeria         5288.
+4  2007 Algeria         6223.
+5  2002 Angola          2773.
+6  2007 Angola          4797.
+
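The conditions given to filter() are ordinary logical expressions that are evaluated for every row; rows for which the result is TRUE are kept. A base-R sketch with three illustrative rows shows how the AND and OR operators combine conditions:

```r
continent <- c('Europe', 'Asia', 'Asia')
year <- c(2002, 1997, 2007)

# AND: TRUE only where both conditions hold
keep_and <- continent != 'Europe' & year >= 2000
keep_and  # FALSE FALSE TRUE

# OR: TRUE where at least one condition holds
keep_or <- continent == 'Europe' | continent == 'Asia'
keep_or   # TRUE TRUE TRUE

# %in% is a compact alternative to several == conditions joined by |
continent %in% c('Europe', 'Asia')
```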
+
+

Exercise 1

+
+

Challenge Write a single command (which can span +multiple lines and includes pipes) that will produce a data frame that +has the values for life expectancy, country and year, only for Eurasia. +How many rows does your data frame have and why?

+

Solution

+
+
+

R +

+
+year_country_lifeExp_eurasia <- gapminder %>% 
+  filter(continent == "Europe" | continent == "Asia") %>% # | operator (OR) - at least one of the conditions must be met
+  select(year, country, lifeExp)
+
+nrow(year_country_lifeExp_eurasia)
+
+
+

OUTPUT +

+
[1] 756
+
+
+
+
+

Group and summarize

+

So far, we have provided summary statistics on the whole dataset, +selected columns, and filtered the observations. But often instead of +doing that, we would like to know statistics about all of the +continents, presented by group.

+
+

R +

+
+gapminder %>% # select the dataset
+  group_by(continent) %>% # group by continent
+  summarize(avg_gdpPercap = mean(gdpPercap)) # summarize function creates statistics for the data set 
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent avg_gdpPercap
+  <chr>             <dbl>
+1 Africa            2194.
+2 Americas          7136.
+3 Asia              7902.
+4 Europe           14469.
+5 Oceania          18622.
+
+
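The split-apply-combine idea is easier to follow on a data frame small enough to check by hand. The following sketch assumes the dplyr package is installed; the three rows are made-up numbers, not gapminder values.

```r
library(dplyr)

toy <- data.frame(
  continent = c('Asia', 'Asia', 'Europe'),
  gdpPercap = c(100, 300, 500)  # made-up values for illustration
)

toy_summary <- toy %>%
  group_by(continent) %>%                     # split: one group per continent
  summarize(avg_gdpPercap = mean(gdpPercap))  # apply mean() per group, combine into one table

toy_summary  # Asia: (100 + 300) / 2 = 200, Europe: 500
```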
+

Exercise 2

+
+

Challenge Calculate the average life expectancy per +country. Which country has the longest average life expectancy and which +has the shortest average life expectancy?

+

Hint Use max() and min() functions to find the maximum and minimum.

+

Solution

+
+
+

R +

+
+gapminder %>%
+   group_by(country) %>%
+   summarize(avg_lifeExp=mean(lifeExp)) %>%
+   filter(avg_lifeExp == min(avg_lifeExp) | avg_lifeExp == max(avg_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      avg_lifeExp
+  <chr>              <dbl>
+1 Iceland             76.5
+2 Sierra Leone        36.8
+
+
+
+

Multiple groups and summary variables

+

You can also group by multiple columns:

+
+

R +

+
+gapminder %>%
+  group_by(continent, year) %>%
+  summarize(avg_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
# A tibble: 60 × 3
+# Groups:   continent [5]
+   continent  year avg_gdpPercap
+   <chr>     <dbl>         <dbl>
+ 1 Africa     1952         1253.
+ 2 Africa     1957         1385.
+ 3 Africa     1962         1598.
+ 4 Africa     1967         2050.
+ 5 Africa     1972         2340.
+ 6 Africa     1977         2586.
+ 7 Africa     1982         2482.
+ 8 Africa     1987         2283.
+ 9 Africa     1992         2282.
+10 Africa     1997         2379.
+# ℹ 50 more rows
+
+
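The line "# Groups: continent [5]" in the output shows that the result is still grouped by continent: by default, summarize() drops only the last grouping variable. If you prefer an ungrouped result, one option is the .groups argument of summarize(). Here is a minimal sketch on a made-up tibble (toy is not part of the lesson data):

```r
library(dplyr)

# Toy data, values invented for illustration
toy <- tibble(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(2002, 2002, 2007, 2002, 2007, 2007),
  gdpPercap = c(1000, 3000, 4000, 20000, 25000, 35000)
)

by_continent_year <- toy %>%
  group_by(continent, year) %>%
  summarize(avg_gdpPercap = mean(gdpPercap), .groups = "drop") # drop all grouping

by_continent_year
is_grouped_df(by_continent_year) # FALSE: the result is a plain tibble
```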

On top of this, you can also make multiple summaries of those +groups:

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+  group_by(continent,year) %>%
+  summarize(
+    avg_gdpPercap = mean(gdpPercap),
+    sd_gdpPercap = sd(gdpPercap),
+    avg_pop = mean(pop),
+    sd_pop = sd(pop),
+    n_obs = n()
+    )
+
+
+
+
+

Frequencies

+

If you only need the number of observations per group, you can use the count() function:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    count()
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+# Groups:   continent [5]
+  continent     n
+  <chr>     <int>
+1 Africa      624
+2 Americas    300
+3 Asia        396
+4 Europe      360
+5 Oceania      24
+
+
+
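count() can also take the grouping column directly, which saves the separate group_by() step and returns an ungrouped result. A small sketch on made-up data (toy is a hypothetical stand-in for gapminder):

```r
library(dplyr)

# Toy data: one row per observation
toy <- tibble(continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Africa"))

counts <- toy %>%
  count(continent)               # one row per continent, column n holds the frequency

counts_sorted <- toy %>%
  count(continent, sort = TRUE)  # same, with the largest group first

counts
counts_sorted
```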
+

Mutate

+

Frequently you’ll want to create new columns based on the values in existing columns. For example, instead of only having the GDP per capita, we might want to compute the total GDP (GDP per capita multiplied by the population) and express it in billions. For this, we’ll use mutate().

+
+

R +

+
+gapminder_gdp <- gapminder %>%
+  mutate(gdpBillion = gdpPercap*pop/10^9)
+
+head(gapminder_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 7
+  country      year      pop continent lifeExp gdpPercap gdpBillion
+  <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>      <dbl>
+1 Afghanistan  1952  8425333 Asia         28.8      779.       6.57
+2 Afghanistan  1957  9240934 Asia         30.3      821.       7.59
+3 Afghanistan  1962 10267083 Asia         32.0      853.       8.76
+4 Afghanistan  1967 11537966 Asia         34.0      836.       9.65
+5 Afghanistan  1972 13079460 Asia         36.1      740.       9.68
+6 Afghanistan  1977 14880372 Asia         38.4      786.      11.7 
+
+
+
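To see what mutate() computed, we can redo the arithmetic for the first row by hand. The sketch below copies the values as printed in the table. Because gdpPercap is shown rounded (779.), the result is only approximately the 6.57 in the gdpBillion column.

```r
# Values copied from the first printed row (Afghanistan, 1952)
gdp_percap <- 779       # rounded as printed, so the result is approximate
pop        <- 8425333

gdp_total   <- gdp_percap * pop  # total GDP in the original units
gdp_billion <- gdp_total / 10^9  # the same value expressed in billions

gdp_billion # about 6.56, close to the 6.57 computed from the unrounded data
```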
+ +
+
+

Key Points +

+
+
  • We can use the select() and filter() functions to select certain columns in a data frame and to subset it based on specific conditions.
  • +
  • With mutate(), we can create new columns in a data +frame with values based on existing columns.
  • +
  • By combining group_by() and summarize() in +a pipe (%>%) chain, we can generate summary statistics +for each group in a data frame.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/04-intro-to-visualisation.html b/04-intro-to-visualisation.html new file mode 100644 index 00000000..ababb29d --- /dev/null +++ b/04-intro-to-visualisation.html @@ -0,0 +1,698 @@ + +Geospatial Data Carpentry for Urbanism: Introduction to visualisation +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Introduction to visualisation

+

Last updated on 2024-01-29

+ + + +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I create a basic plot in R?
  • +
  • How can I add features to a plot?
  • +
  • How can I get basic summary information about my data set?
  • +
  • How can I include additional information via a colour palette?
  • +
  • How can I find more information about a function and its +arguments?
  • +
  • How can I create new columns or remove existing columns from a data +frame?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Generate plots to visualise data with ggplot2.
  • +
  • Add plot layers to incrementally build a more complex plot.
  • +
  • Use the fill argument for colouring surfaces, and modify colours with the viridis palettes or scale_fill_manual().
  • +
  • Explore the help documentation.
  • +
  • Save and format your plot via the ggsave() +function.
  • +
+
+
+
+
+
+

Introduction +to Visualisation

+

The package ggplot2 is a powerful plotting system. We +will start with an introduction of key features of ggplot2. +In the following parts of this workshop, you will use this package to +visualize geospatial data. gg stands for grammar of +graphics, the idea that three components are needed to create a +graph:

+
  • data,
  • +
  • aesthetics - a coordinate system on which we map the data (what is +represented on x axis, what on y axis), and
  • +
  • geometries - visual representation of the data (points, bars, +etc.)
  • +

The fun part about ggplot2 is that you can add layers to the plot to provide more information and to make it more beautiful.

+

First, let’s plot the distribution of life expectancy in the gapminder dataset:

+
+

R +

+
+  ggplot(data = gapminder,  aes(x = lifeExp) ) + # aesthetics layer 
+  geom_histogram() # geometry layer
+
+
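When called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting that you pick a better value. A hedged sketch of setting the bin width explicitly, using a small made-up data frame (toy) instead of gapminder:

```r
library(ggplot2)

# Toy data: invented life expectancy values
toy <- data.frame(lifeExp = c(43, 51, 58, 62, 67, 70, 72, 75, 78, 81))

p_hist <- ggplot(data = toy, aes(x = lifeExp)) + # aesthetics layer
  geom_histogram(binwidth = 5)                   # geometry layer, bins 5 years wide

p_hist
```

Choosing binwidth (or bins) yourself makes the histogram reproducible and silences the message.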

You can see that in ggplot2 you use + to add layers, much like a pipe chains operations. After the ggplot() call, + is the only operator that will add layers to the plot. It is still possible to chain operations on a data set with a pipe that we have already learned, %>% (or |>), and then follow them with ggplot syntax.

+

Let’s create another plot, this time only on a subset of +observations:

+
+

R +

+
+gapminder %>%  # we select a data set
+  filter(year == 2007 & 
+         continent == 'Americas') %>%  # and filter it to keep only one year and one continent
+  ggplot(aes(x = country, y = gdpPercap)) +  # the x and y axes represent values of columns
+  geom_col()  # we select a column graph as a geometry
+
+

Now, you can iteratively improve how the plot looks. For example, you might want to flip it, to better display the labels.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  ggplot(aes(x = country, y = gdpPercap)) + 
+  geom_col()+ 
+  coord_flip()  # flip axes
+
+

One thing you might want to change here is the order in which countries are displayed. It would be easier to compare GDP per capita if they were shown in order. To do that, we need to reorder factor levels (as you may remember, we’ve already done this before).

+

Now the order of the levels will depend on another variable - GDP per +capita.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%  # reorder factor levels
+  ggplot(aes(x = country , y = gdpPercap)) + 
+  geom_col() +
+  coord_flip()
+
+
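To make the effect of fct_reorder() visible outside of a plot, here is a small sketch that inspects the factor levels before and after reordering. The GDP values are made up for illustration.

```r
library(forcats)

country <- factor(c("Brazil", "Haiti", "Canada"))
gdp     <- c(9000, 1200, 36000) # invented values

levels(country) # default: alphabetical order

reordered <- fct_reorder(country, gdp) # levels ordered by increasing gdp
levels(reordered)

reordered_desc <- fct_reorder(country, gdp, .desc = TRUE) # largest value first
levels(reordered_desc)
```

Only the order of the levels changes; the values stored in the vector stay the same.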

Let’s make things more colourful - let’s represent the life expectancy of each country by colour:

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp  )) +  # fill argument for colouring surfaces, colour for points and lines
+  geom_col()+ 
+  coord_flip()
+
+

We can also adapt the colour scale. A common choice, used for its readability and colourblind-friendliness, is the set of palettes available in the viridis package.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp   )) + 
+  geom_col()+ 
+  coord_flip()+
+  scale_fill_viridis_c()  # _c stands for continuous scale
+
+

Maybe we don’t need that much information about the life expectancy. +We only want to know if it’s below or above average. We will make use of +the if_else() function inside mutate() to +create a new column lifeExpCat with the value +high if life expectancy is above average and +low otherwise. Note the usage of the if_else() +function: +if_else(<condition>, <value if TRUE>, <value if FALSE>).

+
+

R +

+
+p <-  # this time let's save the plot in an object
+  gapminder %>%  
+  filter(year == 2007 & 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap ),
+         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low')) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExpCat)) + 
+  geom_col()+ 
+  coord_flip()+
+  scale_fill_manual(values = c('light blue', 'orange'))  # customize the colours of the fill aesthetic
+
+
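Two details of the code above are easy to miss: how if_else() assigns the categories, and which colour ends up with which category. With an unnamed vector, scale_fill_manual() assigns colours in the order of the categories (alphabetical for character data, so 'high' comes before 'low'). A sketch on made-up data that names the colours explicitly instead:

```r
library(dplyr)
library(ggplot2)

# Toy data, values invented for illustration
toy <- tibble(
  country = c("A", "B", "C"),
  lifeExp = c(60, 80, 70)
) %>%
  mutate(lifeExpCat = if_else(lifeExp >= mean(lifeExp), "high", "low"))

toy$lifeExpCat # the mean is 70, so 60 is "low" while 80 and 70 are "high"

p_named <- ggplot(toy, aes(x = country, y = lifeExp, fill = lifeExpCat)) +
  geom_col() +
  # a named vector ties each colour to a category, whatever the level order
  scale_fill_manual(values = c(high = "light blue", low = "orange"))
```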

Since we saved a plot as an object, nothing has been printed out. +Just like with any other object in R, if you want to see +it, you need to call it.

+
+

R +

+
+p
+
+

Now we can make use of the saved object and add things to it.

+

Let’s also give it a title and name the axes:

+
+

R +

+
+p <- 
+  p +
+  ggtitle('GDP per capita in Americas', subtitle = 'Year 2007') +
+  xlab('Country')+
+  ylab('GDP per capita')
+
+p
+
+
+
+

Writing +data

+
+

Saving the plot

+

Once we are happy with our plot we can save it in a format of our +choice. Remember to save it in the dedicated folder.

+
+

R +

+
+ggsave(plot = p, 
+       filename = here('fig_output','plot_americas_2007.pdf'))  # By default, ggsave() saves the last displayed plot, but you can also explicitly name the plot you want to save
+
+
+

ERROR +

+
Error in grDevices::pdf(file = filename, ..., version = version): cannot open file '/home/runner/work/r-geospatial-urban/r-geospatial-urban/site/built/fig_output/plot_americas_2007.pdf'
+
+
+

Using help documentation

+

The saved plot may not be very readable. We can see why by exploring the help documentation. We can do that by writing directly in the console:

+
+

R +

+
+?ggsave
+
+

We can read that it uses the “size of the current graphics device”. That would explain why our saved plots look slightly different. Feel free to explore the documentation to see how to adapt the size, e.g. by adapting the width, height and units parameters.

+
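As a sketch of how these parameters can be used, the example below saves a toy plot at a fixed size. The error shown earlier most likely means that the fig_output folder did not exist where the code was run, so the sketch also creates the output folder first. The folder here lives in a temporary directory and the sizes are arbitrary assumptions; in your project you would use here('fig_output') instead.

```r
library(ggplot2)

# Toy plot, values invented for illustration
toy   <- data.frame(country = c("A", "B"), gdpPercap = c(1000, 2000))
p_toy <- ggplot(toy, aes(x = country, y = gdpPercap)) +
  geom_col()

# Hypothetical output folder (stand-in for here('fig_output'))
out_dir <- file.path(tempdir(), "fig_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE) # make sure the folder exists

out_file <- file.path(out_dir, "plot_toy.pdf")
ggsave(filename = out_file, plot = p_toy,
       width = 20, height = 12, units = "cm") # explicit size, independent of the device

file.exists(out_file)
```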
+
+
+

Saving the data

+

Another output of your work that you will want to save is a cleaned data set. In your analysis, you can then load that data set directly. Let’s say we want to save the data only for the Americas in 2007:

+
+

R +

+
+gapminder_amr_2007 <- gapminder %>%
+  filter(year == 2007 & continent == 'Americas') %>%
+  mutate(country_reordered = fct_reorder(country, gdpPercap ), 
+         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low'))
+
+write.csv(gapminder_amr_2007, here('data_output', 'gapminder_americas_2007.csv'), row.names=FALSE)
+
+
+

ERROR +

+
Error in file(file, ifelse(append, "a", "w")): cannot open the connection
+
+
+
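The error "cannot open the connection" most likely means that the data_output folder did not exist where the code was run. A minimal sketch of a safer pattern: create the folder if needed, write the file, and read it back to check the result. The temporary directory and the toy data frame are assumptions for illustration; in your project the folder would be here('data_output').

```r
# Hypothetical output folder (stand-in for here('data_output'))
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE) # no warning if it already exists

# Toy data, values invented for illustration
toy <- data.frame(country    = c("Chile", "Haiti"),
                  lifeExpCat = c("high", "low"))

out_file <- file.path(out_dir, "toy_americas.csv")
write.csv(toy, out_file, row.names = FALSE)

toy_back <- read.csv(out_file) # read the file back to verify what was saved
toy_back
```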
+ +
+
+

Key Points +

+
+
  • With ggplot2, we use the + operator to +combine plot layers and incrementally build a more complex plot.
  • +
  • In the aesthetics (aes()), we can assign variables to +the x and y axes and use the fill argument for colouring +surfaces.
  • +
  • With scale_fill_viridis_c() and +scale_fill_manual() we can assign new colours to our +plot.
  • +
  • To open the help documentation for a function, we run the name of +the function preceded by the ? sign.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/09-open-and-plot-vector-layers.html b/09-open-and-plot-vector-layers.html index 566ab33c..93aa67c8 100644 --- a/09-open-and-plot-vector-layers.html +++ b/09-open-and-plot-vector-layers.html @@ -1,5 +1,5 @@ -Geospatial Data Carpentry with R for Urbanists: Open and Plot Vector Layers +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Introduction to R and RStudio

+

Last updated on 2024-01-29

+ + + +

Estimated time: 50 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How to find your way around RStudio?
  • +
  • How to manage projects in R?
  • +
  • How to install packages?
  • +
  • How to interact with R?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Create self-contained projects in RStudio
  • +
  • Install additional packages using R code.
  • +
  • Manage packages
  • +
  • Define a variable
  • +
  • Assign data to a variable
  • +
  • Call functions
  • +
+
+
+
+
+
+

Project management in RStudio

+

RStudio is an integrated development environment (IDE), which means +it provides a (much prettier) interface for the R software. For RStudio +to work, you need to have R installed on your computer. But R is +integrated into RStudio, so you never actually have to open R +software.

+

RStudio provides a useful feature: creating projects. A project is a self-contained working space (i.e. working directory) that R will refer to when looking for and saving files. You can create projects in existing directories (folders) or create a new one.

+
+

Creating RStudio Project

+

We’re going to create a project in RStudio in a new directory. To +create a project, go to:

+
  • File
  • +
  • New Project
  • +
  • New directory
  • +
  • Place the project somewhere you will easily find it on your laptop and name the project data-carpentry
  • +
  • Create project
  • +
+
+

Organising working directory

+

Creating an RStudio project is a good first step towards good project management. However, most of the time it is a good idea to organize the working space further. This is one suggestion of what your R project can look like. Let’s go ahead and create the other folders:

+
  • +data/ - should be where your raw data is. READ +ONLY +
  • +
  • +data_output/ - should be where your data output is +saved READ AND WRITE +
  • +
  • +documents/ - all the documentation associated with the +project (e.g. cookbook)
  • +
  • +fig_output/ - your figure outputs go here WRITE +ONLY +
  • +
  • +scripts/ - all your code goes here READ AND +WRITE +
  • +
RStudio  project logo with five lines, each leading from the logo towards  one of the five boxes with texts: 'data/', 'data_output/',  'documents/', 'fig_output/', 'scripts/'
R project organization

You can create these folders as you would any other folders on your +laptop, but R and RStudio offer handy ways to do it directly in your +RStudio session.

+

You can use the RStudio interface to create a folder in your project by going to the bottom right pane, Files tab, and clicking on the Folder icon. A dialog box will appear, allowing you to type the name of the folder you want to create.

+

An alternative solution is to create the folders using R command +dir.create(). In the console type:

+
+

R +

+
+dir.create('data')
+dir.create('data_output')
+dir.create('documents')
+dir.create('fig_output')
+dir.create('scripts')
+
+
+
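Running dir.create() a second time on an existing folder gives a warning. A small sketch of a loop that only creates the folders that are missing is shown below. To keep the example harmless it works inside a temporary directory; project_root is a hypothetical stand-in for your own project folder.

```r
folders <- c("data", "data_output", "documents", "fig_output", "scripts")

# Hypothetical project location used only for this demonstration
project_root <- file.path(tempdir(), "data-carpentry-demo")
dir.create(project_root, showWarnings = FALSE)

for (folder in folders) {
  path <- file.path(project_root, folder)
  if (!dir.exists(path)) { # create the folder only if it is not there yet
    dir.create(path)
  }
}

list.files(project_root)
```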
+ +
+
+

In the interest of time, focus on one way of creating the folders. You can showcase an alternative method with just one example.

+

Once you have finished, ask the participants if they have managed to create an R project and get the same folder structure. To do this, use green and red stickers.

+

This will become important, as we use relative paths together with the here package to read and write objects.

+
+
+
+
+
+
+

Two main ways to interact with R

+

There are two main ways to interact with R through RStudio:

+
  • test and play environment within the interactive R +console +
  • +
  • write and save an R script (.R +file) +
  • +
+
+ +
+
+

Callout +

+
+

When you open RStudio or create an RStudio project, you will see the Console window on the left by default. Once you create an R script, it is placed in the upper left pane and the Console is moved to the bottom left pane.

+
+
+
+

Each of the modes of interaction has its advantages and drawbacks.

+ + + + + + + + + +
      Console                            R script
Pros  Immediate results                  Complete record of your work
Cons  Work lost once you close RStudio   Messy if you just want to print things out
+
+

Creating a script

+

During the workshop we will mostly use an .R script to +have a full documentation of what has been written. This way we will +also be able to reproduce the results. Let’s create one now and save it +in the scripts directory.

+
  • File
  • +
  • New File
  • +
  • R Script
  • +
  • A new Untitled script will appear in the source +pane.
  • +
  • Save it using floppy disc icon.
  • +
  • Select the scripts/ folder as the file location
  • +
  • Name the script intro-to-r.R +
  • +
+
+

Running the code

+

Note that all code written in the script can also be executed on the spot in the interactive console. We will now learn how to run the code both in the console and the script.

+
  • In the Console you run the code by hitting Enter at the +end of the line
  • +
  • In the R script there are two ways to execute the code: +
    • You can use the Run button on the top right of the +script window.
    • +
    • Alternatively, you can use a keyboard shortcut: Ctrl + Enter, or Command + Return for Mac users.
    • +
  • +

In both cases, the active line (the line where your cursor is placed) or a highlighted snippet of code will be executed. A common source of errors in scripts, such as a previously created object not being found, is code in previous lines that has not been executed: make sure that all code has been executed as described above. To run all lines before the active line, you can use the keyboard shortcut Ctrl + Alt + B on Windows/Linux or Command + Option + B on Mac.

+
+
+ +
+
+

Escaping +

+
+

The console shows it’s ready to get new commands with +> sign. It will show + sign if it still +requires input for the command to be executed.

+

Sometimes you don’t know what is missing, you change your mind and want to run something else, or your code is running for much too long and you just want it to stop. The way to do it is to press Esc.

+
+
+
+
+
+

Packages

+

A great power of R lies in packages: add-on sets of functions that are built by the community. Once they have gone through a quality process, they are available to download from a repository called CRAN. Packages need to be explicitly activated (loaded) before use. Now, we will be using the tidyverse package, which is actually a collection of useful packages. Another package that we will use is here.

+

You were asked to install tidyverse package in the +preparation for the workshop. You need to install a package only once, +so you won’t have to do it again. We will however need to install the +here package. To do so, please go to your script and +type:

+
+

R +

+
+install.packages('here')
+
+
+
+ +
+
+

Callout +

+
+

If you are not sure if you have the tidyverse package installed, you can check it in the Packages tab in the bottom right pane. In the search box start typing ‘tidyverse’ and see if it appears in the list of installed packages. If not, you will need to install it by writing in the script:

+
+

R +

+
+install.packages('tidyverse')
+
+
+
+
+
+
+ +
+
+

Commenting your code +

+
+

Now we have a bit of an issue with our script. As mentioned, the +packages need to be installed only once, but now, they will be installed +each time we run the script, which can take a lot of time if we’re +installing a large package like tidyverse.

+

To keep a record of the packages you installed, without executing the installation again, you can use a comment. In R, anything that is written after a hash sign # is ignored in execution. Thanks to this feature, you can annotate your code. Let’s adapt our script by changing the first lines into comments:

+
+

R +

+
+# install.packages('here')
+# install.packages('tidyverse')
+
+
+
+
+

Installing packages is not sufficient to work with them. You will need to load them each time you want to use them. To do that, use the library() command:

+
+

R +

+
+# Load packages
+library(tidyverse)
+library(here)
+
+
+
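If you want a script to tell you which packages are still missing, instead of commenting the install lines in and out, one possible approach is sketched below. The helper missing_packages() is a hypothetical name introduced here for illustration; it is not part of the lesson or of any package.

```r
packages_needed <- c("tidyverse", "here")

# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(packages_needed)
to_install # an empty character vector if everything is already installed

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```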
+

Handling paths

+

You have created a project which is your working directory, and a few +sub-folders, that will help you organise your project better. But now, +each time you will save or retrieve a file from those folders, you will +need to specify the path from the folder you are in (most likely the +scripts/ folder) to those files.

+

That can become complicated and might cause a reproducibility +problem, if the person using your code (including future you) is working +in a different sub-folder.

+

We will use the here package to tackle this issue. This package converts relative paths from the root (main folder) of your project to absolute paths (the exact location on your computer). For instance, instead of writing out the full path like “C:/Users/YourName/Documents/r-geospatial-urban/data/file.csv” or “~/Documents/r-geospatial-urban/data/file.csv”, you can use the here() function to create a path relative to your project’s main directory. This makes your code more portable and reproducible, as it doesn’t depend on a specific location of your project on your computer.

+

It might be confusing, so let’s see how it works. We will use the +here() function from the here package. In the +console, we write:

+
+

R +

+
+here()
+here('data')
+
+

You all probably have something different printed out. And this is +fine, because here adapts to your computer’s specific +situation.

+
+
+

Download files

+

We still need to download data for the first part of the workshop. +You can do it with the function download.file(). We will +save it in the data/ folder, where the raw +data should go. In the script, we will write:

+
+

R +

+
+# Download the data
+download.file('https://bit.ly/geospatial_data', 
+              here('data', 'gapminder_data.csv'))
+
+
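Since download.file() fetches the file again every time the script runs, you may want to skip the download when the file is already there. Below is a minimal sketch of such a guard. The function download_if_missing() is a hypothetical helper written for this example, and the call at the end is left as a comment so that nothing is downloaded by accident.

```r
# Hypothetical helper: download a file only if it is not on disk yet
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("File already exists, skipping download: ", destfile)
    return(invisible(FALSE))
  }
  # make sure the target folder exists before downloading
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile, mode = "wb")
  invisible(TRUE)
}

# Possible usage in the lesson project (requires the here package):
# download_if_missing('https://bit.ly/geospatial_data',
#                     here('data', 'gapminder_data.csv'))
```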
+
+ +
+
+

Importing data into R +

+
+

Three of the most common ways of importing data in R are:

+
  • loading a package with pre-installed data;
  • +
  • downloading data from a URL;
  • +
  • reading a file from your computer.
  • +

For larger datasets, database connections or API requests are also +possible. We will not cover these in the workshop.

+
+
+
+
+
+
+

Introduction to R

+

You can use R as a calculator. For example, you can write:

+
+

R +

+
+1+100
+1*100
+1/100
+
+
+

Variables and assignment

+

However, what’s more useful is that in R we can store values and use them whenever we need to. We do this using the assignment operator <-, like this:

+
+

R +

+
+x <- 1/40
+
+

Notice that assignment does not print a value. Instead, we’ve stored it for later in something called a variable. The variable x now contains the value 0.025:

+
+

R +

+
+x
+
+

Look for the Environment tab in the upper right pane of +RStudio. You will see that x and its value have appeared in +the list of Values. Our variable x can be used in place of +a number in any calculation that expects a number, e.g. when calculating +a square root:

+
+

R +

+
+sqrt(x)
+
+

Variables can also be reassigned. This means that we can assign a new value to variable x:

+
+

R +

+
+x <- 100
+x
+
+

You can use one variable to create a new one:

+
+

R +

+
+y <- sqrt(x) # you can use value stored in object x to create y
+y
+
+
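One thing that often surprises newcomers: y stores the result of the calculation, not a link to x. The short sketch below shows that reassigning x afterwards does not change y.

```r
x <- 100
y <- sqrt(x) # y is computed once, from the current value of x
y

x <- 4       # reassigning x ...
y            # ... leaves y unchanged

y <- sqrt(x) # to update y, run the assignment again
y
```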
+
+ +
+
+

Key Points +

+
+
  • Use RStudio to write and run R programs.
  • +
  • Use install.packages() to install packages.
  • +
  • Use library() to load packages.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/instructor/02-data-structures.html b/instructor/02-data-structures.html new file mode 100644 index 00000000..16b1b657 --- /dev/null +++ b/instructor/02-data-structures.html @@ -0,0 +1,898 @@ + +Geospatial Data Carpentry for Urbanism: Data Structures +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Data Structures

+

Last updated on 2024-01-29

+ + + +

Estimated time: 12 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What are the basic data types in R?
  • +
  • How do I represent categorical information in R?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Be aware of the different types of data.
  • +
  • Begin exploring data frames, and understand how they are related to vectors, factors and lists.
  • +
  • Ask R questions about the type, class, and structure of an object.
  • +
+
+
+
+
+

Vectors +

+

So far we’ve looked at individual values. Now we will move to a data structure called vectors. Vectors are arrays of values of the same data type.

+
+
+ +
+
+

Data types +

+
+

Data type refers to a type of information that is stored by a value. +It can be:

+
  • +numeric (a number)
  • +
  • +integer (a whole number, stored without a decimal part)
  • +
  • +logical (a boolean - are values TRUE or FALSE?)
  • +
  • +character (a text/ string of characters)
  • +
  • +complex (a complex number)
  • +
  • +raw (raw bytes)
  • +

We won’t discuss complex or raw data type +in the workshop.

+
+
+
+
+
+ +
+
+

Data structures +

+
+

Vectors are the most common and basic data structure in R but you +will come across other data structures such as data frames, lists and +matrices as well. In short:

+
  • data frames are two-dimensional data structures in which columns are vectors of the same length that can have different data types. We will use this data structure in this lesson.
  • +
  • lists can have an arbitrary structure and can mix data types;
  • +
  • matrices are two-dimensional data structures containing elements of +the same data type.
  • +

For a more detailed description, see Data +Types and Structures.

+

Note that vector data in the geospatial context is different from +vector data types. More about vector data in a later lesson!

+
+
+
+

You can create a vector with the c() function.

+
+

R +

+
+numeric_vector <- c(2, 6, 3) # vector of numbers - numeric data type.
+numeric_vector
+
+
+

OUTPUT +

+
[1] 2 6 3
+
+
+

R +

+
+character_vector <- c('banana', 'apple', 'orange') # vector of words - or strings of characters- character data type
+character_vector
+
+
+

OUTPUT +

+
[1] "banana" "apple"  "orange"
+
+
+

R +

+
+logical_vector <- c(TRUE, FALSE, TRUE) # vector of logical values (is something true or false?)- logical data type.
+logical_vector
+
+
+

OUTPUT +

+
[1]  TRUE FALSE  TRUE
+
+
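To ask R about the type of a vector, you can use functions such as class(), typeof() and str(). A short sketch using the vectors created above, plus an integer vector (integer_vector is an extra example added here; the L suffix marks a whole number as integer):

```r
numeric_vector   <- c(2, 6, 3)
integer_vector   <- c(2L, 6L, 3L) # the L suffix creates integers
character_vector <- c('banana', 'apple', 'orange')
logical_vector   <- c(TRUE, FALSE, TRUE)

class(numeric_vector)    # "numeric"
typeof(numeric_vector)   # "double": how numeric values are stored internally
class(integer_vector)    # "integer"
class(character_vector)  # "character"
class(logical_vector)    # "logical"

str(logical_vector)      # compact summary: type, length and first values
```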
+

Combining vectors

+

The combine function, c(), will also append things to an +existing vector:

+
+

R +

+
+ab_vector <- c('a', 'b')
+ab_vector
+
+
+

OUTPUT +

+
[1] "a" "b"
+
+
+

R +

+
+abcd_vector <- c(ab_vector, 'c', 'd')
+abcd_vector
+
+
+

OUTPUT +

+
[1] "a" "b" "c" "d"
+
+
+
+

Missing values

+
+
+ +
+
+

Exercise +

+
+

Combine the abcd_vector with the +numeric_vector in R. What is the data type of this new +vector and why?

+
+
+
+
+
+ +
+
+
combined_vector <- c(abcd_vector, numeric_vector)
+combined_vector
+

The combined vector is a character vector. Because vectors can only +hold one data type and abcd_vector cannot be interpreted as +numbers, the numbers in numeric_vector are coerced +into characters.

+
+
+
+
+
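The same coercion happens whenever data types are mixed in c(): R converts everything to the most flexible type present (logical, then integer, then numeric, then character). A few small examples:

```r
c(TRUE, 2)          # logical becomes numeric: 1 2
c(2, "a")           # numeric becomes character: "2" "a"
c(TRUE, "a")        # logical becomes character: "TRUE" "a"

class(c(TRUE, 2L))  # "integer"
class(c(2L, 2.5))   # "numeric"
class(c(2.5, "a"))  # "character"
```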

A common operation you will want to perform is to remove all the missing values (in R denoted as NA). Let’s have a look at how to do it:

+
+

R +

+
+with_na <- c(1, 2, 1, 1, NA, 3, NA ) # vector including missing value
+
+

First, let’s try to calculate the mean of the values in this vector:

+
+

R +

+
+mean(with_na) # mean() function cannot interpret the missing values
+
+
+

OUTPUT +

+
[1] NA
+
+
+

R +

+
+mean(with_na, na.rm = TRUE) # You can add the argument na.rm=TRUE to calculate the result while ignoring the missing values.
+
+
+

OUTPUT +

+
[1] 1.6
+
+

However, sometimes, you would like to have the NA +permanently removed from your vector. For this you need to identify +which elements of the vector hold missing values with +is.na() function.

+
+

R +

+
+is.na(with_na) #  This will produce a vector of logical values, stating if a statement 'This element of the vector is a missing value' is true or not
+
+
+

OUTPUT +

+
[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
+
+
+

R +

+
+!is.na(with_na) # The ! operator means negation, i.e. not is.na(with_na)
+
+
+

OUTPUT +

+
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
+
+

We know which elements in the vector are NA. Now we need to retrieve the subset of the with_na vector that is not NA. Sub-setting in R is done with square brackets [ ].

+
+

R +

+
+without_na <- with_na[ !is.na(with_na) ] # this notation will return only the elements that have TRUE on their respective positions
+
+without_na
+
+
+

OUTPUT +

+
[1] 1 2 1 1 3
+
+
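The logical vector returned by is.na() is also handy for counting and locating missing values. A short sketch with the same vector:

```r
with_na <- c(1, 2, 1, 1, NA, 3, NA)

sum(is.na(with_na))    # number of missing values (TRUE counts as 1): 2
which(is.na(with_na))  # positions of the missing values: 5 7

without_na <- with_na[!is.na(with_na)]
mean(without_na)       # 1.6, the same as mean(with_na, na.rm = TRUE)
```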
+

Factors +

+

Another important data structure is called a factor. +Factors look like character data, but are used to represent categorical +information.

+

Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. While factors look (and often behave) like character vectors, they are actually stored as integers by R, which is useful for computing summary statistics about their distribution, running regression analysis, etc. So you need to be very careful when treating them as strings.

+
+

Create factors

+

Once created, factors can only contain a pre-defined set of values, +known as levels.

+
+

R +

+
+nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
+nordic_str # regular character vectors printed out
+
+
+

OUTPUT +

+
[1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden" 
+
+
+

R +

+
+nordic_cat <- factor(nordic_str) # factor() function converts a vector to factor data type
+nordic_cat # With factors, R prints out additional information - 'Levels'
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Denmark Norway Sweden
+
+
+
+

Inspect factors

+

R will treat each unique value from a factor vector as a +level and (silently) assign numerical values to it. +This can come in handy when performing statistical analysis. You can +inspect and adapt levels of the factor.

+
+

R +

+
+levels(nordic_cat) # returns all levels of a factor vector.  
+
+
+

OUTPUT +

+
[1] "Denmark" "Norway"  "Sweden" 
+
+
+

R +

+
+nlevels(nordic_cat) # returns number of levels in a vector
+
+
+

OUTPUT +

+
[1] 3
+
+
+
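To see the numerical values that R silently assigns to the levels, you can convert the factor with as.integer(). The table() function then gives a quick count per level. A small sketch using the same Nordic example:

```r
nordic_str <- c('Norway', 'Sweden', 'Norway', 'Denmark', 'Sweden')
nordic_cat <- factor(nordic_str)

levels(nordic_cat)      # "Denmark" "Norway" "Sweden" (alphabetical)
as.integer(nordic_cat)  # 2 3 2 1 3: the position of each value in the levels
table(nordic_cat)       # number of observations per level
```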
+

Reorder levels

+

Note that R sorts the levels in alphabetical order, not in the order of occurrence in the vector. R assigns the value of:

+
  • 1 to level ‘Denmark’,
  • +
  • 2 to ‘Norway’
  • +
  • 3 to ‘Sweden’.
  • +

This is important as it can affect e.g. the order in which categories +are displayed in a plot or which category is taken as a baseline in a +statistical model.

+

You can reorder the categories using factor() function. +This can be useful, for instance, to select a reference category (first +level) in a regression model or for ordering legend items in a plot, +rather than using the default category systematically (i.e. based on +alphabetical order).

+
+

R +

+
+nordic_cat <- factor(nordic_cat, levels = c('Norway' , 'Denmark', 'Sweden'))  # now Norway should be the first category, Denmark second and Sweden third
+
+nordic_cat
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Norway Denmark Sweden
+
+
+
+ +
+
+

Callout +

+
+

There is more than one way to reorder factors. Later in the lesson, we will use the fct_relevel() function from the forcats package to do the reordering.

+
+

R +

+
+# nordic_cat <- fct_relevel(nordic_cat, 'Norway' , 'Denmark', 'Sweden') # now Norway should be the first category, Denmark second and Sweden third
+
+nordic_cat
+
+
+

OUTPUT +

+
[1] Norway  Sweden  Norway  Denmark Sweden 
+Levels: Norway Denmark Sweden
+
+
+
+
+

You can also inspect vectors with the str() function. For factor vectors, it shows the underlying integer value of each element. You can also see the structure in the Environment tab of RStudio.

+
+

R +

+
+str(nordic_cat) 
+
+
+

OUTPUT +

+
 Factor w/ 3 levels "Norway","Denmark",..: 1 3 1 2 3
+
+
+
+ +
+
+

Note of caution +

+
+

Remember that once created, factors can only contain a pre-defined set of values, known as levels. It means that whenever you try to add something to the factor outside of this set, it will become an unknown/missing value denoted by R as NA.

+
+

R +

+
+nordic_str
+
+
+

OUTPUT +

+
[1] "Norway"  "Sweden"  "Norway"  "Denmark" "Sweden" 
+
+
+

R +

+
+nordic_cat2 <- factor(nordic_str, levels = c('Norway', 'Denmark'))
+nordic_cat2 # since we have not included Sweden in the list of factor levels, it has become NA.
+
+
+

OUTPUT +

+
[1] Norway  <NA>    Norway  Denmark <NA>   
+Levels: Norway Denmark
+
+
+
+
+
+
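One possible way to deal with this, sketched below, is to count the missing values and to extend the levels before assigning a new category. The sketch assumes we know that the missing entries were originally Sweden; n_missing is a name made up for this example.

```r
nordic_str <- c("Norway", "Sweden", "Norway", "Denmark", "Sweden")
nordic_cat2 <- factor(nordic_str, levels = c("Norway", "Denmark"))

# Count the values that fell outside the allowed levels
n_missing <- sum(is.na(nordic_cat2))
n_missing

# Extend the set of levels first, then assign the new category
levels(nordic_cat2) <- c(levels(nordic_cat2), "Sweden")
nordic_cat2[is.na(nordic_cat2)] <- "Sweden"  # assumption: every NA was Sweden
nordic_cat2
```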
+ +
+
+

Key Points +

+
+
  • The most commonly used basic data types in R are numeric, integer, logical, and character
  • +
  • Use factors to represent categories in R.
  • +
+
+
+ +
+
+
+ + +
+
+
+ +
+
+ + diff --git a/instructor/03-explore-data.html b/instructor/03-explore-data.html new file mode 100644 index 00000000..ddd40b5c --- /dev/null +++ b/instructor/03-explore-data.html @@ -0,0 +1,945 @@ + +Geospatial Data Carpentry for Urbanism: Exploring Data Frames & Data frame Manipulation with dplyr +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Exploring Data Frames & Data frame Manipulation with dplyr

+

Last updated on 2024-01-29

+ + + +

Estimated time: 12 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • What is a data frame?
  • +
  • How can I read data in R?
  • +
  • How can I get basic summary information about my data set?
  • +
  • How can I select specific rows and/or columns from a data +frame?
  • +
  • How can I combine multiple commands into a single command?
  • +
  • How can I create new columns or remove existing columns from a data +frame?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Describe what a data frame is.
  • +
  • Load external data from a .csv file into a data frame.
  • +
  • Summarize the contents of a data frame.
  • +
  • Select certain columns in a data frame with the dplyr function +select.
  • +
  • Select certain rows in a data frame according to filtering +conditions with the dplyr function filter.
  • +
  • Link the output of one dplyr function to the input of another +function with the ‘pipe’ operator %>%.
  • +
  • Add new columns to a data frame that are functions of existing +columns with mutate.
  • +
  • Use the split-apply-combine concept for data analysis.
  • +
  • Use summarize, group_by, and count to split a data frame into groups of observations, apply a summary statistic to each group, and then combine the results.
  • +
+
+
+
+
+
+

Exploring +Data frames

+

Now we turn to the bread and butter of working with R: tabular data. In R, tabular data are stored in a data structure called a data frame.

+

A data frame is a representation of data in the format of a +table where the columns are vectors +that all have the same length.

+

Because columns are vectors, each column must contain a +single type of data (e.g., characters, numeric, +factors). For example, here is a figure depicting a data frame +comprising a numeric, a character, and a logical vector.

+


Source: Data Carpentry R for Social Scientists

+
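A data frame like the one described above can be sketched in a few lines of base R. The column names and values here are invented purely for illustration.

```r
# Three vectors of equal length, each with a single data type
toy_df <- data.frame(
  id      = c(1, 2, 3),                     # numeric
  city    = c("city_a", "city_b", "city_c"),  # character
  visited = c(TRUE, FALSE, TRUE),           # logical
  stringsAsFactors = FALSE                  # keep text as character
)

# Each column reports exactly one class
sapply(toy_df, class)

# All columns share the same length: the number of rows
nrow(toy_df)
ncol(toy_df)
```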
+

Reading data

+

read_csv() from the readr package (part of the tidyverse) reads comma-separated data files (.csv format) into a data frame; base R provides the similar read.csv(). There are other functions for files that use other delimiters. We are going to read in the gapminder data set, which contains information about countries' population size, GDP and average life expectancy in different years.

+
+

R +

+
+gapminder <- read_csv("data/gapminder_data.csv")
+
+
+
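If you want to try reading files without the lesson data at hand, the following sketch writes two tiny files to a temporary location and reads them back with base R functions. The file contents and object names are assumptions made for this example; they are not part of the gapminder data.

```r
# Write a small comma-separated file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "A,2002,100",
             "B,2007,250"), csv_file)

toy_csv <- read.csv(csv_file)
str(toy_csv)

# A file with another delimiter: state the separator explicitly
semicolon_file <- tempfile(fileext = ".txt")
writeLines(c("country;year;pop",
             "A;2002;100"), semicolon_file)

toy_semicolon <- read.table(semicolon_file, header = TRUE, sep = ";")
str(toy_semicolon)
```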
+

Exploring dataset

+

Let’s investigate the gapminder data frame a bit; the +first thing we should always do is check out what the data looks +like.

+

It is important to see if all the variables (columns) have the data +type that we require. For instance, a column might have numbers stored +as characters, which would not allow us to make calculations with those +numbers.

+
+

R +

+
+str(gapminder) 
+
+
+

OUTPUT +

+
spc_tbl_ [1,704 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
+ $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
+ $ year     : num [1:1704] 1952 1957 1962 1967 1972 ...
+ $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
+ $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
+ $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
+ $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
+ - attr(*, "spec")=
+  .. cols(
+  ..   country = col_character(),
+  ..   year = col_double(),
+  ..   pop = col_double(),
+  ..   continent = col_character(),
+  ..   lifeExp = col_double(),
+  ..   gdpPercap = col_double()
+  .. )
+ - attr(*, "problems")=<externalptr> 
+
+

We can see that the gapminder object is a data.frame +with 1704 observations (rows) and 6 variables (columns).

+

In each line after a $ sign, we see the name of each +column, its type and first few values.
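The situation mentioned above, numbers stored as characters, can be sketched with a toy data frame. The data are invented; the point is only to show how str() reveals the problem and how as.numeric() repairs it.

```r
# The pop column was (wrongly) stored as text
toy <- data.frame(city = c("a", "b", "c"),
                  pop  = c("100", "250", "75"),
                  stringsAsFactors = FALSE)
str(toy)   # pop is listed as chr, not num

# Calculations on text do not work as intended, so convert first
toy$pop <- as.numeric(toy$pop)
str(toy)   # pop is now num
mean(toy$pop)
```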

+
+

First look at the dataset

+

There are multiple ways to explore a data set. Here are just a few +examples:

+
+

R +

+
+head(gapminder) # see first 6  rows of the data set
+
+
+

OUTPUT +

+
# A tibble: 6 × 6
+  country      year      pop continent lifeExp gdpPercap
+  <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>
+1 Afghanistan  1952  8425333 Asia         28.8      779.
+2 Afghanistan  1957  9240934 Asia         30.3      821.
+3 Afghanistan  1962 10267083 Asia         32.0      853.
+4 Afghanistan  1967 11537966 Asia         34.0      836.
+5 Afghanistan  1972 13079460 Asia         36.1      740.
+6 Afghanistan  1977 14880372 Asia         38.4      786.
+
+
+

R +

+
+summary(gapminder) # gives basic statistical information about each column; the format differs by data type
+
+
+

OUTPUT +

+
   country               year           pop             continent        
+ Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
+ Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
+ Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
+                    Mean   :1980   Mean   :2.960e+07                     
+                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
+                    Max.   :2007   Max.   :1.319e+09                     
+    lifeExp        gdpPercap       
+ Min.   :23.60   Min.   :   241.2  
+ 1st Qu.:48.20   1st Qu.:  1202.1  
+ Median :60.71   Median :  3531.8  
+ Mean   :59.47   Mean   :  7215.3  
+ 3rd Qu.:70.85   3rd Qu.:  9325.5  
+ Max.   :82.60   Max.   :113523.1  
+
+
+

R +

+
+nrow(gapminder) # returns number of rows in a dataset
+
+
+

OUTPUT +

+
[1] 1704
+
+
+

R +

+
+ncol(gapminder) # returns number of columns in a dataset
+
+
+

OUTPUT +

+
[1] 6
+
+
+
+

Dollar sign ($)

+

When you’re analyzing a data set, you often need to access its +specific columns.

+

One handy way to access a column is using its name and a dollar sign $:

+
+

R +

+
+country_vec <- gapminder$country  # Notation means: From dataset gapminder, give me column country. You can see that the column accessed in this way is just a vector of characters. 
+
+head(country_vec)
+
+
+

OUTPUT +

+
[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
+[6] "Afghanistan"
+
+

Note that accessing a column with the $ sign returns a vector; the result is no longer a data frame.

+
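The difference between getting a vector and keeping a data frame can be checked directly. This base R sketch uses an invented two-row data frame and compares the $ notation with the double and single square brackets.

```r
toy <- data.frame(country = c("A", "B"),
                  year    = c(2002, 2007),
                  stringsAsFactors = FALSE)

country_vec  <- toy$country        # a character vector
country_vec2 <- toy[["country"]]   # same result, column name given as text
country_df   <- toy["country"]     # still a data frame, with one column

is.data.frame(country_vec)
is.data.frame(country_df)
```

Single square brackets are therefore the option to reach for when the next step expects a data frame rather than a vector.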
+
+
+
+

Data +frame Manipulation with dplyr

+
+

Select

+

Let’s start manipulating the data.

+

First, we will adapt our data set, by keeping only the columns we’re +interested in, using the select() function from the +dplyr package:

+
+

R +

+
+year_country_gdp <- select(gapminder, year, country, gdpPercap) 
+
+head(year_country_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  1952 Afghanistan      779.
+2  1957 Afghanistan      821.
+3  1962 Afghanistan      853.
+4  1967 Afghanistan      836.
+5  1972 Afghanistan      740.
+6  1977 Afghanistan      786.
+
+
+
+

Pipe

+

This is not, however, the most common notation when working with the dplyr package. dplyr offers an operator %>% called a pipe, which allows you to build up complicated commands in a readable way.

+

In newer installations of R (version 4.1.0 and later) you can also find the notation |>. This native pipe works in a similar way. The main difference is that you don't need to load any packages to have it available.

+

The select() statement with a pipe would look like this:

+
+

R +

+
+year_country_gdp <- gapminder %>% 
+  select(year,country,gdpPercap)
+
+head(year_country_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  1952 Afghanistan      779.
+2  1957 Afghanistan      821.
+3  1962 Afghanistan      853.
+4  1967 Afghanistan      836.
+5  1972 Afghanistan      740.
+6  1977 Afghanistan      786.
+
+

First we specify the data set, then we use the pipe to pass it on to the select() function. This way we can chain multiple functions together, which we will be doing now.

+
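To see what a pipe changes, it can help to compare a nested call with its piped equivalent. The sketch below uses only base R functions and the native pipe |>, so it assumes R 4.1.0 or later; the numbers are arbitrary.

```r
x <- c(1, 4, 9)

# Nested notation: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped notation: read from left to right, one step at a time
piped <- x |>
  sqrt() |>
  sum() |>
  round(1)

identical(nested, piped)
```

Both versions compute the same value; the piped one lists the steps in the order in which they happen.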
+
+

Filter

+

We already know how to select only the needed columns. Now we also want to filter the rows of our data set according to certain conditions with the filter() function. Instead of doing it in separate steps, we can do it all together.

+

In the gapminder data set, we want to see the results +from outside of Europe for the 21st century.

+
+

R +

+
+year_country_gdp_non_europe <- gapminder %>% 
+  filter(continent != "Europe" & year >= 2000) %>% # & operator (AND) - both conditions must be met
+  select(year, country, gdpPercap)
+
+head(year_country_gdp_non_europe)
+
+
+

OUTPUT +

+
# A tibble: 6 × 3
+   year country     gdpPercap
+  <dbl> <chr>           <dbl>
+1  2002 Afghanistan      727.
+2  2007 Afghanistan      975.
+3  2002 Algeria         5288.
+4  2007 Algeria         6223.
+5  2002 Angola          2773.
+6  2007 Angola          4797.
+
+
+
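As a small aside, filter() also accepts several conditions separated by commas, which are combined with AND. The sketch below checks this on an invented four-row data frame; it assumes the dplyr package is installed.

```r
library(dplyr)

toy <- data.frame(country   = c("A", "B", "C", "D"),
                  continent = c("Europe", "Asia", "Asia", "Africa"),
                  year      = c(2002, 1997, 2007, 2007),
                  stringsAsFactors = FALSE)

# Both conditions joined with & ...
with_ampersand <- toy %>%
  filter(continent != "Europe" & year >= 2000)

# ... or given as separate arguments: the result is the same
with_comma <- toy %>%
  filter(continent != "Europe", year >= 2000)

with_ampersand
with_comma
```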

Exercise 1

+
+

Challenge Write a single command (which can span +multiple lines and includes pipes) that will produce a data frame that +has the values for life expectancy, country and year, only for Eurasia. +How many rows does your data frame have and why?

+

Solution

+
+
+

R +

+
+year_country_lifeExp_eurasia <- gapminder %>% 
+  filter(continent == "Europe" | continent == "Asia") %>% # | operator (OR) - at least one of the conditions must be met
+  select(year, country, lifeExp) # the challenge asks for life expectancy, country and year
+
+nrow(year_country_lifeExp_eurasia)
+
+
+

OUTPUT +

+
[1] 756
+
+
+
+
+

Group and summarize

+

So far, we have provided summary statistics on the whole dataset, selected columns, and filtered the observations. Often, however, we would rather compute statistics separately for each group, for example for each continent.

+
+

R +

+
+gapminder %>% # select the dataset
+  group_by(continent) %>% # group by continent
+  summarize(avg_gdpPercap = mean(gdpPercap)) # summarize function creates statistics for the data set 
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+  continent avg_gdpPercap
+  <chr>             <dbl>
+1 Africa            2194.
+2 Americas          7136.
+3 Asia              7902.
+4 Europe           14469.
+5 Oceania          18622.
+
+
+
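The idea behind group_by() and summarize() is often called split-apply-combine. The sketch below spells out the three steps by hand in base R and compares the result with the dplyr version. The toy data are invented, and the base R route is shown only for illustration.

```r
library(dplyr)

toy <- data.frame(group = c("a", "a", "b", "b", "b"),
                  value = c(1, 3, 10, 20, 30),
                  stringsAsFactors = FALSE)

# By hand: split the values by group, apply mean() to each piece,
# and combine the results into one named vector
by_hand <- sapply(split(toy$value, toy$group), mean)
by_hand

# The same idea with dplyr
with_dplyr <- toy %>%
  group_by(group) %>%
  summarize(avg_value = mean(value))
with_dplyr
```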

Exercise 2

+
+

Challenge Calculate the average life expectancy per +country. Which country has the longest average life expectancy and which +has the shortest average life expectancy?

+

Hint Use the max() and min() functions to find the maximum and minimum.

+

Solution

+
+
+

R +

+
+gapminder %>%
+   group_by(country) %>%
+   summarize(avg_lifeExp=mean(lifeExp)) %>%
+   filter(avg_lifeExp == min(avg_lifeExp) | avg_lifeExp == max(avg_lifeExp))
+
+
+

OUTPUT +

+
# A tibble: 2 × 2
+  country      avg_lifeExp
+  <chr>              <dbl>
+1 Iceland             76.5
+2 Sierra Leone        36.8
+
+
+
+

Multiple groups and summary variables

+

You can also group by multiple columns:

+
+

R +

+
+gapminder %>%
+  group_by(continent, year) %>%
+  summarize(avg_gdpPercap = mean(gdpPercap))
+
+
+

OUTPUT +

+
# A tibble: 60 × 3
+# Groups:   continent [5]
+   continent  year avg_gdpPercap
+   <chr>     <dbl>         <dbl>
+ 1 Africa     1952         1253.
+ 2 Africa     1957         1385.
+ 3 Africa     1962         1598.
+ 4 Africa     1967         2050.
+ 5 Africa     1972         2340.
+ 6 Africa     1977         2586.
+ 7 Africa     1982         2482.
+ 8 Africa     1987         2283.
+ 9 Africa     1992         2282.
+10 Africa     1997         2379.
+# ℹ 50 more rows
+
+

On top of this, you can also make multiple summaries of those +groups:

+
+

R +

+
+gdp_pop_bycontinents_byyear <- gapminder %>%
+  group_by(continent,year) %>%
+  summarize(
+    avg_gdpPercap = mean(gdpPercap),
+    sd_gdpPercap = sd(gdpPercap),
+    avg_pop = mean(pop),
+    sd_pop = sd(pop),
+    n_obs = n()
+    )
+
+
+
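When grouping by several columns, the result of summarize() stays grouped by all but the last grouping column, as the "# Groups: continent [5]" line in the output above indicates. A sketch of how to return an ungrouped table instead, using the .groups argument of summarize() on invented data:

```r
library(dplyr)

toy <- data.frame(continent = c("X", "X", "X", "Y"),
                  year      = c(2002, 2002, 2007, 2007),
                  gdp       = c(1, 3, 5, 7),
                  stringsAsFactors = FALSE)

# Default: the result is still grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(avg_gdp = mean(gdp))
group_vars(still_grouped)

# .groups = "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(avg_gdp = mean(gdp), .groups = "drop")
group_vars(ungrouped)
```

Dropping the groups avoids surprises when later steps, such as mutate() or filter(), would otherwise be applied group by group.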
+
+

Frequencies

+

If you only need the number of observations per group, you can use the count() function:

+
+

R +

+
+gapminder %>%
+    group_by(continent) %>%
+    count()
+
+
+

OUTPUT +

+
# A tibble: 5 × 2
+# Groups:   continent [5]
+  continent     n
+  <chr>     <int>
+1 Africa      624
+2 Americas    300
+3 Asia        396
+4 Europe      360
+5 Oceania      24
+
+
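count() can also take the grouping column as an argument, which saves the separate group_by() step. A minimal comparison on invented data:

```r
library(dplyr)

toy <- data.frame(continent = c("X", "X", "Y"),
                  stringsAsFactors = FALSE)

# Two steps, as in the example above
two_steps <- toy %>%
  group_by(continent) %>%
  count()

# One step: name the column inside count()
one_step <- toy %>%
  count(continent)

two_steps
one_step
```

The counts are the same; the one-step version returns an ungrouped table.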
+
+

Mutate

+

Frequently you'll want to create new columns based on the values in existing columns. For example, instead of only having the GDP per capita, we might want to compute the total GDP (GDP per capita multiplied by population) and express it in billions. For this, we'll use mutate().

+
+

R +

+
+gapminder_gdp <- gapminder %>%
+  mutate(gdpBillion = gdpPercap*pop/10^9)
+
+head(gapminder_gdp)
+
+
+

OUTPUT +

+
# A tibble: 6 × 7
+  country      year      pop continent lifeExp gdpPercap gdpBillion
+  <chr>       <dbl>    <dbl> <chr>       <dbl>     <dbl>      <dbl>
+1 Afghanistan  1952  8425333 Asia         28.8      779.       6.57
+2 Afghanistan  1957  9240934 Asia         30.3      821.       7.59
+3 Afghanistan  1962 10267083 Asia         32.0      853.       8.76
+4 Afghanistan  1967 11537966 Asia         34.0      836.       9.65
+5 Afghanistan  1972 13079460 Asia         36.1      740.       9.68
+6 Afghanistan  1977 14880372 Asia         38.4      786.      11.7 
+
+
+
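Within a single mutate() call, a column created earlier can be used right away by the next expression. The sketch below splits the calculation above into two steps on invented numbers; the intermediate column gdp is an assumption of this example.

```r
library(dplyr)

toy <- data.frame(country   = c("A", "B"),
                  pop       = c(2e6, 5e7),
                  gdpPercap = c(1000, 20000),
                  stringsAsFactors = FALSE)

toy_gdp <- toy %>%
  mutate(gdp        = gdpPercap * pop,  # total GDP
         gdpBillion = gdp / 10^9)       # uses the column created just above

toy_gdp
```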
+ +
+
+

Key Points +

+
+
  • We can use the select() and filter() functions to select certain columns in a data frame and to subset it based on specific conditions.
  • +
  • With mutate(), we can create new columns in a data +frame with values based on existing columns.
  • +
  • By combining group_by() and summarize() in +a pipe (%>%) chain, we can generate summary statistics +for each group in a data frame.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/instructor/04-intro-to-visualisation.html b/instructor/04-intro-to-visualisation.html new file mode 100644 index 00000000..c4d21b54 --- /dev/null +++ b/instructor/04-intro-to-visualisation.html @@ -0,0 +1,700 @@ + +Geospatial Data Carpentry for Urbanism: Introduction to visualisation +
+ Geospatial Data Carpentry for Urbanism +
+ +
+
+ + + + + +
+
+

Introduction to visualisation

+

Last updated on 2024-01-29

+ + + +

Estimated time: 12 minutes

+ +
+ +
+ + + +
+

Overview

+
+
+
+
+

Questions

+
  • How can I create a basic plot in R?
  • +
  • How can I add features to a plot?
  • +
  • How can I get basic summary information about my data set?
  • +
  • How can I include additional information via a colour palette?
  • +
  • How can I find more information about a function and its +arguments?
  • +
  • How can I create new columns or remove existing columns from a data +frame?
  • +
+
+
+
+
+
+

Objectives

+

After completing this episode, participants should be able to…

+
  • Generate plots to visualise data with ggplot2.
  • +
  • Add plot layers to incrementally build a more complex plot.
  • +
  • Use the fill argument for colouring surfaces, and modify colours with the viridis palettes or the scale_fill_manual() function.
  • +
  • Explore the help documentation.
  • +
  • Save and format your plot via the ggsave() +function.
  • +
+
+
+
+
+
+

Introduction +to Visualisation

+

The package ggplot2 is a powerful plotting system. We +will start with an introduction of key features of ggplot2. +In the following parts of this workshop, you will use this package to +visualize geospatial data. gg stands for grammar of +graphics, the idea that three components are needed to create a +graph:

+
  • data,
  • +
  • aesthetics - a coordinate system on which we map the data (what is +represented on x axis, what on y axis), and
  • +
  • geometries - visual representation of the data (points, bars, +etc.)
  • +

The fun part about ggplot2 is that you can add layers to the plot to provide more information and to make it more beautiful.

+

First, let's plot the distribution of life expectancy in the gapminder dataset:

+
+

R +

+
+  ggplot(data = gapminder,  aes(x = lifeExp) ) + # aesthetics layer 
+  geom_histogram() # geometry layer
+
+

You can see that in ggplot2 you use + to add layers, much as a pipe chains functions. Between the layers of a plot, + is the only operator that will work. It is possible, however, to chain operations on a data set with a pipe that we have already learned, %>% (or |>), and to follow them with ggplot syntax.

+

Let’s create another plot, this time only on a subset of +observations:

+
+

R +

+
+gapminder %>%  # we select a data set
+  filter(year == 2007 & 
+         continent == 'Americas') %>%  # and filter it to keep only one year and one continent
+  ggplot(aes(x = country, y = gdpPercap)) +  # the x and y axes represent values of columns
+  geom_col()  # we select a column graph as a geometry
+
+

Now you can iteratively improve how the plot looks. For example, you might want to flip it to better display the labels.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  ggplot(aes(x = country, y = gdpPercap)) + 
+  geom_col()+ 
+  coord_flip()  # flip axes
+
+

One thing you might want to change here is the order in which countries are displayed. It would be easier to compare GDP per capita if they were shown in order. To do that, we need to reorder the factor levels (as you may remember, we have done this before).

+

Now the order of the levels will depend on another variable - GDP per +capita.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%  # reorder factor levels
+  ggplot(aes(x = country , y = gdpPercap)) + 
+  geom_col() +
+  coord_flip()
+
+
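What fct_reorder() does to the levels can be inspected without drawing a plot. This sketch uses three invented countries and values, and assumes the forcats package is installed.

```r
library(forcats)

country <- factor(c("A", "B", "C"))
gdp     <- c(300, 100, 200)

levels(country)                # alphabetical: "A" "B" "C"

# Levels sorted by the values of the second argument, smallest first
ascending <- fct_reorder(country, gdp)
levels(ascending)              # "B" "C" "A"

# .desc = TRUE reverses the order
descending <- fct_reorder(country, gdp, .desc = TRUE)
levels(descending)             # "A" "C" "B"
```

ggplot2 draws the categories of a factor in level order, which is why reordering the levels changes the order of the bars.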

Let's make things more colourful: let's represent the life expectancy of each country by colour.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp  )) +  # fill argument for colouring surfaces, colour for points and lines
+  geom_col()+ 
+  coord_flip()
+
+

We can also adapt the colour scale. A common choice, valued for its readability and colour-blind friendliness, is the set of palettes available in the viridis package.

+
+

R +

+
+gapminder %>%  
+  filter(year == 2007, 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap )) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExp   )) + 
+  geom_col()+ 
+  coord_flip()+
+  scale_fill_viridis_c()  # _c stands for continuous scale
+
+

Maybe we don’t need that much information about the life expectancy. +We only want to know if it’s below or above average. We will make use of +the if_else() function inside mutate() to +create a new column lifeExpCat with the value +high if life expectancy is above average and +low otherwise. Note the usage of the if_else() +function: +if_else(<condition>, <value if TRUE>, <value if FALSE>).

+
+

R +

+
+p <-  # this time let's save the plot in an object
+  gapminder %>%  
+  filter(year == 2007 & 
+         continent == 'Americas') %>% 
+  mutate(country = fct_reorder(country, gdpPercap ),
+         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low')) %>%
+  ggplot(aes(x = country, y = gdpPercap, fill = lifeExpCat)) + 
+  geom_col()+ 
+  coord_flip()+
+  scale_fill_manual(values = c('light blue', 'orange'))  # customize the colours of the fill aesthetic
+
+
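To see if_else() on its own, outside of a plot, here is a sketch on a short invented vector. It also shows the missing argument of if_else(), which sets the value returned when the condition is NA; the label 'unknown' is an assumption of this example.

```r
library(dplyr)

lifeExp <- c(70, 80, 75, 65)
mean(lifeExp)   # 72.5

# One label per element, depending on the condition
lifeExpCat <- if_else(lifeExp >= mean(lifeExp), "high", "low")
lifeExpCat

# When the condition is NA, the result is NA unless `missing` is set
with_gap <- c(70, NA, 75)
if_else(with_gap >= 72, "high", "low", missing = "unknown")
```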

Since we saved a plot as an object, nothing has been printed out. +Just like with any other object in R, if you want to see +it, you need to call it.

+
+

R +

+
+p
+
+

Now we can make use of the saved object and add things to it.

+

Let’s also give it a title and name the axes:

+
+

R +

+
+p <- 
+  p +
+  ggtitle('GDP per capita in Americas', subtitle = 'Year 2007') +
+  xlab('Country')+
+  ylab('GDP per capita')
+
+p
+
+
+
+

Writing +data

+
+

Saving the plot

+

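Once we are happy with our plot we can save it in a format of our choice. Remember to save it in the dedicated folder. If that folder (here fig_output) does not exist yet, saving can fail with an error like the one shown below.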

+
+

R +

+
+ggsave(plot = p, 
+       filename = here('fig_output','plot_americas_2007.pdf'))  # By default, ggsave() saves the last displayed plot, but you can also explicitly name the plot you want to save
+
+
+

ERROR +

+
Error in grDevices::pdf(file = filename, ..., version = version): cannot open file '/home/runner/work/r-geospatial-urban/r-geospatial-urban/site/built/fig_output/plot_americas_2007.pdf'
+
+
+

Using help documentation

+

The saved plot may not be very readable. We can find out why by exploring the help documentation. We can open it by typing directly in the console:

+
+

R +

+
+?ggsave
+
+

We can read that it uses the “size of the current graphics device”. That would explain why our saved plots look slightly different. Feel free to explore the documentation to see how to adapt the size, for example with the width, height and units parameters.

+
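As a sketch of how the size could be fixed, the example below saves a small plot with explicit width, height and units, so that the result no longer depends on the current graphics device. The plot, the dimensions and the temporary file location are all arbitrary choices made for this illustration.

```r
library(ggplot2)

toy <- data.frame(x = c("a", "b"), y = c(1, 2))

p_toy <- ggplot(toy, aes(x = x, y = y)) +
  geom_col()

# Save to a temporary folder with an explicit size
out_file <- file.path(tempdir(), "toy_plot.pdf")
ggsave(filename = out_file, plot = p_toy,
       width = 16, height = 10, units = "cm")

file.exists(out_file)
```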
+
+
+

Saving the data

+

Another output of your work that you may want to save is a cleaned data set. In later analyses, you can then load that data set directly. Let's say we want to save the data only for the Americas in 2007:

+
+

R +

+
+gapminder_amr_2007 <- gapminder %>%
+  filter(year == 2007 & continent == 'Americas') %>%
+  mutate(country_reordered = fct_reorder(country, gdpPercap ), 
+         lifeExpCat = if_else(lifeExp >= mean(lifeExp), 'high', 'low'))
+
+write.csv(gapminder_amr_2007, here('data_output', 'gapminder_americas_2007.csv'), row.names=FALSE)
+
+
+

ERROR +

+
Error in file(file, ifelse(append, "a", "w")): cannot open the connection
+
+
+
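A "cannot open the connection" error like the one above typically means that the target folder does not exist. One possible safeguard, sketched here with base R and a temporary location instead of the project folders, is to create the folder before writing. The data frame and file name are invented for this example.

```r
# Create the output folder if it is not there yet
out_dir <- file.path(tempdir(), "data_output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

toy <- data.frame(country   = c("A", "B"),
                  gdpPercap = c(1000, 2000),
                  stringsAsFactors = FALSE)

out_file <- file.path(out_dir, "toy.csv")
write.csv(toy, out_file, row.names = FALSE)

# Read the file back to check what was written
read_back <- read.csv(out_file)
read_back
```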
+ +
+
+

Key Points +

+
+
  • With ggplot2, we use the + operator to +combine plot layers and incrementally build a more complex plot.
  • +
  • In the aesthetics (aes()), we can assign variables to +the x and y axes and use the fill argument for colouring +surfaces.
  • +
  • With scale_fill_viridis_c() and +scale_fill_manual() we can assign new colours to our +plot.
  • +
  • To open the help documentation for a function, we run the name of +the function preceded by the ? sign.
  • +
+
+
+ +
+
+ + + +
+
+ + +
+
+
+ +
+
+ + diff --git a/instructor/09-open-and-plot-vector-layers.html b/instructor/09-open-and-plot-vector-layers.html index addfb4fe..e8277b20 100644 --- a/instructor/09-open-and-plot-vector-layers.html +++ b/instructor/09-open-and-plot-vector-layers.html @@ -1,5 +1,5 @@ -Geospatial Data Carpentry with R for Urbanists: Open and Plot Vector Layers