paceR
is a collection of functions that make it easy to get data from the University of Calgary's PACE Database into R for further analysis.
To use the tools, you must have access to the PACE Database. If you don't know how to do this, you can ask Fernando, Urs, or John.
To use this package, you first need to install devtools
with:
install.packages("devtools")
Then, you can install the latest development version of paceR
from github:
library(devtools)
devtools::install_github("camposfa/paceR")
After you have installed the package once, you can simply load it in the future using:
library(paceR)
This package makes heavy use of the data manipulation packages stringr, lubridate, tidyr, and dplyr. If not already installed, paceR
will install these packages automatically. It also provides a convenient function to load them all into your R session, if you want to use them for other tasks:
load_pace_packages()
To get data from PACE, you must have a username and password for the PACE database, and you must create an SSH tunnel to the database server.
If you have set up an SSH key (Fernando can help you do this), you can create the SSH tunnel by running this command in R:
system('ssh -f [email protected] -L 3307:localhost:3306 -N')
If you have not set up an SSH key but you do have a username and password for the PACE database, then you can still create the tunnel by opening Terminal and typing the following in Terminal, not in R!:
ssh -f camposf@pacelab.ucalgary.ca -L 3307:localhost:3306 -N
Replace the "camposf" part with your username. You will then be prompted to enter your password. Once this is done, the tunnel has been created and you can return to R.
When you made an account, you should have received instructions from John about how to create the SSH tunnel using Putty/Plink. Just run this and have it open in the background.
With the tunnel created, we can now connect to the database(s) in PACE. The primary database is called monkey
. There is also a secondary database called paceR
that has a variety of convenient views that are designed to be used with this package. We'll create connections to both.
# Connect to monkey database
pace_db <- src_mysql(group = "PACE", user = "camposf", dbname = "monkey", password = NULL)
# Connect to paceR database
paceR_db <- src_mysql(group = "PACE", user = "camposf", dbname = "paceR", password = NULL)
It should go without saying, but the above lines are just an example--if copied verbatim, they will not work for anyone but me (Fernando). You must first set up the SSH key and then modify these lines for your particular account.
Once you get the connection worked out, you now can pull data from the database. Whenever you download data, you must pass the name of the database connection that you're using. To do this correctly, it is crucial that you understand a major design decision that affects how the functions can be used.
The best way to get data from the database is to use the convenient saved "views". These are stored in the paceR
database, and they should be called using the functions that begin with getv_
# Get a Individuals data
(i2 <- getv_Individual(paceR_db))
#> Source: local data frame [2,645 x 18]
#>
#> IndividualID Project PrimateSpecies NameOf CodeName DateOfBirth
#> (int) (chr) (chr) (chr) (chr) (date)
#> 1 1 SR CCAP 2Tufts 2TUF 1988-01-01
#> 2 2 SR CCAP A-1 A-1- <NA>
#> 3 3 SR CCAP Abu ABU- 2005-04-25
#> 4 6 SR CCAP Al AL-- 1985-01-01
#> 5 7 SR CCAP Alfredo ALFR 1997-01-01
#> 6 8 SR CCAP Alien ALIE 1996-01-02
#> 7 9 SR CCAP Amy AMY- 1989-01-01
#> 8 10 SR CCAP Amy96 AM96 1996-01-01
#> 9 11 SR CCAP Babaganouj BABA 1992-01-01
#> 10 12 SR CCAP BabyFace BABY 1991-01-01
#> .. ... ... ... ... ... ...
#> Variables not shown: Sex (chr), BirthdateSource (chr), Mother (chr),
#> GroupAtBirthName (chr), GroupAtBirthCode (chr), DateOfFirstSighting
#> (date), DayDifference (int), AgeClassAtFirstSighting (chr),
#> GroupAtFirstSightingName (chr), GroupAtFirstSightingCode (chr),
#> Phenotype (chr), ProjectID (int)
# Get a condensed version
(i2 <- getv_Individual(paceR_db, full = FALSE))
#> Source: local data frame [2,645 x 5]
#>
#> IndividualID NameOf Project DateOfBirth Sex
#> (int) (chr) (chr) (date) (chr)
#> 1 1 2Tufts SR 1988-01-01 Female
#> 2 2 A-1 SR <NA> Male
#> 3 3 Abu SR 2005-04-25 Female
#> 4 6 Al SR 1985-01-01 Male
#> 5 7 Alfredo SR 1997-01-01 Male
#> 6 8 Alien SR 1996-01-02 Male
#> 7 9 Amy SR 1989-01-01 Female
#> 8 10 Amy96 SR 1996-01-01 Unknown
#> 9 11 Babaganouj SR 1992-01-01 Male
#> 10 12 BabyFace SR 1991-01-01 Male
#> .. ... ... ... ... ...
Currently, the following functions are available:
getv_CensusMonthly()
getv_Individual()
getv_Phenology()
If you want to download raw database tables, then you should use function get_pace_tbl()
. All tables are stored in the monkey
database, and so when you use this function, you must pass the connection to the monkey
database in addition to the name of the table that you want to download. For example:
# Get the raw individuals table
(i <- get_pace_tbl(pace_db, "tblIndividual"))
#> Source: local data frame [2,645 x 20]
#>
#> ID ProjectID PrimateSpeciesID NameOf CodeName DateOfBirth
#> (int) (int) (int) (chr) (chr) (chr)
#> 1 1 1 1 2Tufts 2TUF 1988-01-01
#> 2 2 1 1 A-1 A-1- NA
#> 3 3 1 1 Abu ABU- 2005-04-25
#> 4 6 1 1 Al AL-- 1985-01-01
#> 5 7 1 1 Alfredo ALFR 1997-01-01
#> 6 8 1 1 Alien ALIE 1996-01-02
#> 7 9 1 1 Amy AMY- 1989-01-01
#> 8 10 1 1 Amy96 AM96 1996-01-01
#> 9 11 1 1 Babaganouj BABA 1992-01-01
#> 10 12 1 1 BabyFace BABY 1991-01-01
#> .. ... ... ... ... ... ...
#> Variables not shown: BirthdateSource (chr), SexID (int), MotherID (int),
#> MatrilineID (int), GroupAtBirthID (int), DateOfFirstSighting (chr),
#> DayDifference (int), AgeClassAtFirstSightingID (int),
#> GroupAtFirstSightingID (int), VisionPhenotypeID (int), Comments (chr),
#> Comments_2 (chr), CommentsJFA (chr), CommentsGenetics (chr)
# Get the raw deaths table
(d <- get_pace_tbl(pace_db, "tblIndividualDeath"))
#> Source: local data frame [1,055 x 7]
#>
#> ID IndividualID DateOfDeath CauseOfDeathID SourceOfInformation
#> (int) (int) (chr) (int) (chr)
#> 1 0 7152 1984-05-29 157 DODKnown
#> 2 1 1 1997-03-15 5 NA
#> 3 2 2 2004-02-13 5 NA
#> 4 3 6 1997-02-28 5 NA
#> 5 4 8 2004-02-20 5 NA
#> 6 5 9 1997-02-24 5 NA
#> 7 6 10 1997-01-28 5 NA
#> 8 7 12 1997-03-15 5 NA
#> 9 8 15 2000-05-23 5 NA
#> 10 9 16 1997-03-15 5 NA
#> .. ... ... ... ... ...
#> Variables not shown: DateOfDeathFromCensus (chr), Comments (chr)
Note that the "foreign keys" (i.e., the columns that end with "ID") are just uninformative numbers! To make use of the data, you might need to join the tables by their ID relevant fields.
# Join the individuals and deaths tables
id <- left_join(i, d, by = c("ID" = "IndividualID"))
You can see that there are many more ID fields that would need to be joined. It can very inconvenient to work with the data this way!