040_data.Rmd

<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")

# 4. Basic data operators

[gapminder]: http://factbased.blogspot.fr/2012/07/convenient-access-to-gapminder-datasets.html
[ckan]: http://ckan.org/ "CKAN"
[cb]: http://colorbrewer2.org/ "ColorBrewer 2.0 (Cynthia Brewer)"
[ct]: http://crantastic.org/ "CRANtastic"
[ct-onlinedata]: http://crantastic.org/tags/onlineData "CRANtastic: onlineData"
[dm]: http://datamarket.com/ "DataMarket"
[dm-api]: https://github.com/DataMarket/rdatamarket#readme
[ggplot2]: http://docs.ggplot2.org/current/
[ggplot2-scale]: http://docs.ggplot2.org/current/scale_continuous.html
[ggplot2-theme]: http://docs.ggplot2.org/current/theme.html
[quandl]: http://www.quandl.com/ "Quandl"
[quandl-r]: http://www.quandl.com/help/packages/r "Quandl: R package"
[us-oa]: http://blogstats.wordpress.com/2013/05/10/managing-government-information-as-an-asset/
[faostat]: http://cran.r-project.org/web/packages/FAOSTAT/ "FAOSTAT (CRAN)"

Today's session will show how to open a variety of datasets from different online sources, and to make sure that you know how to convert and import data to prepare it for analysis. The next code block introduces a standard way to check that you have the right packages installed for this page's code: you will find identical blocks on every other course page.

```{r packages, message=FALSE, warning=FALSE}
# Load packages.
packages <- c("ggplot2", "WDI")
packages <- lapply(packages, FUN = function(x) {
  if(!require(x, character.only = TRUE)) {
    install.packages(x)
    library(x, character.only = TRUE)
  }
})
```

## Getting data from within R

A few years ago, the [Gapminder][gapminder] initiative used elegant motion charts to call for the liberation of UN data. Such calls for open data have been met by limited but tangible initiatives to put data online, with specific attention to [open access][us-oa] formats and programming facilities (API).

Today, there is a growing ecology of online data repositories and [data APIs for R][so-datR]: have a look, for instance, at [CKAN][ckan], at [Data Market][dm] and its [API][dm-api], or at [Quandl][quandl] and its [R package][quandl-r], at [FAOSTAT][faostat], at the [onlineData][ct-onlinedata] tag at [CRANtastic][ct]…

As a means of introduction, let's take a look at the [World Bank Indicators][wdi], using the [dedicated `WDI` package][gh-wdi]. The package comes with good documentation and the `WDIsearch()` function, which you can use to look for indicators straight from R. The example below will download [central government debt][wdi-cgd] in percent of gross domestic product for a few high income countries.

[so-datR]: http://stats.stackexchange.com/questions/12670/data-apis-feeds-available-as-packages-in-r
[wdi]: http://data.worldbank.org/about/data-overview/methodologies
[wdi-r]: https://github.com/vincentarelbundock/WDI
[wdi-cgd]: http://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS

```{r WDI-data}
# Get WDI data.
wdi <- WDI(country = c("US", "GB", "DE","FR", "GR"), 
           indicator = "GC.DOD.TOTL.GD.ZS", start = 2005, end = 2011, 
           extra = TRUE, cache = NULL)
# Check result.
str(wdi)
```

The data can be used as such to draw a plot of central government debt in a few countries over the past few years. The `ggplot2` package provides a wealth of options to build, color and annotate such plots: the example below shows how to plot smoothed trends of governmental central debt over time for each country in the data, using a custom [ColorBrewer][cb] scheme to color the lines.

```{r WDI-plot-1-auto, fig.width = 10, fig.height = 6, tidy = FALSE, warning = FALSE, message = FALSE}
# Smoothed time series plot.
g = qplot(data = wdi, x = year, y = GC.DOD.TOTL.GD.ZS,
          colour = country, se = FALSE, geom = c("smooth", "point")) +
  scale_colour_brewer("Country", palette = "Set1") +
  labs(title = "Central government debt, total (% GDP)\n", y = NULL, x = NULL)
# View result.
g
```

The [`ggplot2`][ggplot2] syntax used in the example above is easily adaptable to create other plots, as you can just add new graphical elements to it. The next code block picks up the `g` graph object and adds countries as text labels at the end of the time series, rather than as a separate legend. More tweaking is needed in the [scales][ggplot2-scale] and [theme][ggplot2-theme] options.

```{r wdi-plot-2-auto, fig.width = 10, fig.height = 6, tidy = FALSE, warning = FALSE, message = FALSE}
g + geom_text(data = subset(wdi, year == 2011),
              aes(x = 2011.25, y = GC.DOD.TOTL.GD.ZS, label = country), 
              hjust = 0) +
  scale_x_continuous(lim = c(2005, 2012.5)) +
  theme(legend.position = "none", panel.grid.minor = element_blank())
```

## Reading and saving plain text

To replicate these figures, you will need to save its data into a data table. The standard, universally readable format is comma-separated values (CSV), which can be saved with a plain text (TXT) file extension. Our steps to downloading data will often involve saving both the original data source and a "local" plain text copy in this format, to minimize reliance on proprietary formats.

```{r wdi-csv-1}
# Target file location.
file = "data/wdi.govdebt.0511.csv"
# Export CSV file.
write.csv(wdi, file, row.names = FALSE)
# Read CSV file again.
wdi <- read.csv(file)
```

Comma-separated values will be our standard, so that we can always use the `read.csv()` function to read our data. Note that the `row.names = FALSE` option will avoid saving the generally useless row numbers into the first column of the file, and that we could also use the `read.table()` function with the `sep = ","` and `header = TRUE` options to import CSV data with variable names on top:

```{r wdi-csv-2}
# Alternative CSV import.
wdi <- read.table(file, sep = ",", header = TRUE)
# Check result.
str(wdi)
```

Note, finally, that by default, R will apply factors to columns with character data on import, which means that we will often use `read.csv()` with the `stringsAsFactors = FALSE` option when we want to import raw text data. There are many more options to explore, like UTF-8 encoding or using quotes to enclose the values, but our data I/O routine handles these by default.

In the case of the WDI data that we fetched here, saving was pretty straightforward because the data came through a specific API. In real life, importing, converting and preparing data for analysis can be much more messy, so we will now take the time to see many different ways to get data in and out of R, from plain text tables to Microsoft Excel spreadsheets and other common tabular formats.

> Next: [Import/Export](041_dataio.html).