Writing download functions
After writing a few of these download functions (in the downloads.R file), I've compiled a list of some notes and helpful tips to make the experience all the more pleasant. Any additional tips or tricks are welcome!
-
After opening up R, there are a few things you need to load in before running any of the downloads.R code:
library(reshape2)
library(devtools)
install_github("willpearse/fulltext")
library(fulltext)
source('/path/to/nacdb/R/utility.R')
-
Once you get settled, find a paper to get data from and download its data to your computer. Then you can open the file in R to have a look at it:
data <- read.delim("~/Desktop/PanTHERIA_1-0_WR05_Aug2008.txt")
-
Look through the metadata and begin to figure out which columns are useful, what the units are, etc.
-
You can use names(data) to pull out just the names for each column, which can make it easier to extract just the ones you'd like to keep.
-
Make sure that any meaningless info is removed, and that NAs are in place where data is absent.
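The inspection and clean-up steps above might look something like the following sketch. It uses a small made-up data frame rather than a real download, so the column names (species, mass_g, notes) and the -999 missing-value code are assumptions for illustration only; check the metadata of your own dataset to see how it marks absent values.

```r
# Toy stand-in for a downloaded table; every column name here is hypothetical
data <- data.frame(
  species = c("Aus bus", "Aus cus", "Bus dus"),
  mass_g  = c(12.5, -999, 30.1),   # -999 is assumed to mean "no data"
  notes   = c("", "", ""),         # a column carrying no useful information
  stringsAsFactors = FALSE
)

# List the column names, then keep only the useful columns
names(data)
keep <- c("species", "mass_g")
clean <- data[, keep]

# Replace the placeholder value with NA so R treats it as missing
clean$mass_g[clean$mass_g == -999] <- NA

str(clean)
```

Doing the replacement column by column keeps it explicit which placeholder applies where, which helps when different columns use different missing-value codes.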
There are two helper functions for getting your data into shape. The first is .matrix.melt, and the second is .df.melt. You can use either; what matters is that you use the one that is simplest for the kind of data you have when you download it from the website.
You cannot write a download function without using one of these options.
.df.melt turns your downloaded data into a format that nacdb can work with. .df.melt takes five arguments, only three of which are required:
-
species - a vector giving the species for each observation in the study
-
sites - a vector giving the site for each observation in the study
-
value - the abundances or presence/absence information for all the observations in the study
-
species.metadata - (optional, but recommended!) a data.frame containing the meta-data for all the species in the study
-
site.metadata - (optional, but recommended!) a data.frame containing the meta-data for all the sites in the study
An example:
.adler.2007 <- function(...){
    # Download the supplementary data file from the ESA Archives
    data <- read.csv(ft_get_si("E088-161", "allrecords.csv", from = "esa_archives"))
    # plotyear combines site and year; split it (as.character guards against factors)
    parts <- strsplit(as.character(data$plotyear), "-")
    site <- sapply(parts, function(x) x[1])
    year <- sapply(parts, function(x) x[2])
    return(.df.melt(data$species, site, data$area, site.metadata=data.frame(year=year)))
}
Here we've written a function that downloads data from a paper whose first author was Adler, published in 2007. We grab the data from the ESA Archives paper associated with it (whose ID is E088-161), and then we split out the site IDs from the year in which each plot was surveyed. We then give this information to .df.melt, making sure that our site.metadata is stored in a data.frame. The tricky part about this is finding the data and figuring out the format it's in: using .df.melt is, depressingly, the (comparatively) easy part.
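To see what the strsplit step produces, here is a minimal sketch on made-up values. The plotyear strings below are invented and simply assume a "site-year" layout like the one described above; nothing here touches the real dataset or calls .df.melt.

```r
# Hypothetical plot-year labels in the assumed "site-year" layout
plotyear <- c("A1-1932", "A1-1933", "B2-1932")

# strsplit needs character input; as.character guards against factors
parts <- strsplit(as.character(plotyear), "-")

# First piece is the site ID, second piece is the survey year
site <- sapply(parts, function(x) x[1])
year <- as.integer(sapply(parts, function(x) x[2]))

# One row per observation, ready to be passed on as metadata
site.metadata <- data.frame(year = year)
```

If the site IDs in a real dataset could themselves contain a hyphen, taking only the first piece would cut them short, so it is worth checking a few of the split values by eye.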
Sometimes, your data will come in a different format: a matrix where sites are rows, and species are columns. In that case, you can use .matrix.melt, which takes three arguments, only one of which is needed:
-
x - a matrix where species are in columns and sites are in rows, and the elements of the matrix are the abundances or presence/absences of species at each site
-
site.metadata - (optional, but recommended!) a data.frame containing the meta-data for all the sites in the study. Note that this should have as many rows as there are sites in the dataset
-
species.metadata - (optional, but recommended!) a data.frame containing the meta-data for all the species in the study. Note that this should have as many rows as there are species in the dataset
An example:
.adler.2007 <- function(...){
    data <- read.csv(ft_get_si("E088-161", "allrecords.csv", from = "esa_archives"))
    # Sites (plot-years) go in rows and species in columns, as .matrix.melt expects
    comm <- with(data, tapply(area, list(plotyear, species), sum, na.rm=TRUE))
    return(.matrix.melt(comm))
}
This is a slightly contrived example, because I wanted to use the same dataset as we had above, but you can see the general pattern. We grab the data and then, in order to showcase .matrix.melt, we turn it into a matrix, though of course we wouldn't do this for a real study. We can then use .matrix.melt to convert this matrix and return the result to the user.
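The tapply step can be tried on a toy table to check which way round the matrix comes out. The data below are invented; the sketch only shows that the first grouping factor becomes the rows (sites) and the second the columns (species), which is the orientation .matrix.melt is described as expecting. Filling empty cells with zero is an assumption about what absence should mean, not something nacdb is known to require.

```r
# Invented long-format records: one row per observation
obs <- data.frame(
  site    = c("A1", "A1", "B2", "B2", "B2"),
  species = c("Aus bus", "Bus dus", "Aus bus", "Aus bus", "Cus eus"),
  area    = c(1.5, 2.0, 0.5, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# Sum the values for each site-by-species combination
comm <- with(obs, tapply(area, list(site, species), sum, na.rm = TRUE))

# Combinations that never occur come back as NA; treat them as zero here
comm[is.na(comm)] <- 0

dim(comm)        # sites in rows, species in columns
rownames(comm)   # site IDs
colnames(comm)   # species names
```

Checking rownames and colnames before handing the matrix on is a cheap way to catch a transposed matrix, and it also tells you how many rows any site or species metadata would need.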