## (C) (cc by-sa) Wouter van Atteveldt, file generated June 06 2014

Note on the data used in this howto: This data can be downloaded from http://piketty.pse.ens.fr/files/capital21c/en/xls/, but the Excel format is a bit difficult to parse, as it is meant to be human readable, with multiple header rows etc. For that reason, I've extracted CSV files for some interesting tables and uploaded them to http://vanatteveldt.com/uploads/rcourse/data

Playing with data in R

To demonstrate R, we will use the data from Piketty's 'Capital in the 21st Century':

download.file("http://vanatteveldt.com/wp-content/uploads/rcourse/data/income_topdecile.csv", 
    destfile = "income_topdecile.csv")
income = read.csv("income_topdecile.csv")

We've downloaded a CSV file and read it into a new variable, income, which should appear in your environment list. You can click on the variable there to inspect it visually, but we can also use the head command to print the first rows:

head(income, n = 10)

##    Year U.S. U.K. Germany France Sweden Europe
## 1  1900 0.41 0.47    0.45   0.46   0.46   0.46
## 2  1901   NA   NA      NA     NA     NA     NA
## 3  1902   NA   NA      NA     NA     NA     NA
## 4  1903   NA   NA      NA     NA     NA     NA
## 5  1904   NA   NA      NA     NA     NA     NA
## 6  1905   NA   NA      NA     NA     NA     NA
## 7  1906   NA   NA      NA     NA     NA     NA
## 8  1907   NA   NA      NA     NA     NA     NA
## 9  1908   NA   NA      NA     NA     NA     NA
## 10 1909   NA   NA      NA     NA     NA     NA

As you can see, the values are NA (missing) for most rows, especially in the earlier period. Let's throw out all data containing missing values using the na.omit function:

income = na.omit(income)
head(income)

##    Year U.S. U.K. Germany France Sweden Europe
## 1  1900 0.41 0.47    0.45   0.46   0.46   0.46
## 11 1910 0.41 0.47    0.44   0.47   0.46   0.46
## 21 1920 0.45 0.41    0.39   0.42   0.36   0.39
## 31 1930 0.45 0.39    0.42   0.43   0.38   0.40
## 41 1940 0.36 0.34    0.34   0.33   0.33   0.34
## 51 1950 0.34 0.30    0.33   0.34   0.29   0.32
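To see what na.omit does on a small scale, here is a minimal sketch using a hand-built data frame. The three rows (1900, 1901, 1910) are copied from the output above with only two of the country columns; the names `toy`, `cleaned` and `keep` are only for illustration.

```r
# Toy data frame: rows for 1900, 1901 and 1910 copied from the output above
toy <- data.frame(Year   = c(1900, 1901, 1910),
                  U.S.   = c(0.41, NA, 0.41),
                  Europe = c(0.46, NA, 0.46))

# na.omit drops every row that contains at least one NA
cleaned <- na.omit(toy)
nrow(cleaned)

# complete.cases marks the rows without any NA;
# indexing with it keeps the same rows as na.omit
keep <- complete.cases(toy)
toy[keep, ]

# The surviving rows keep their original row names
rownames(cleaned)
```

Because the original row names are kept, the output of head above is numbered 1, 11, 21 and so on rather than 1, 2, 3.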

Much better. Now, we can list the variables in the file using names, and get the number of rows and columns with nrow and ncol, respectively:

names(income)

## [1] "Year"    "U.S."    "U.K."    "Germany" "France"  "Sweden"  "Europe"

nrow(income)

## [1] 12

ncol(income)

## [1] 7
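As a small addition, dim returns both counts at once and str gives a compact overview of the columns. The sketch below assumes a hand-built data frame `toy` with three rows copied from the output above, rather than the downloaded file.

```r
# Toy data frame with three of the rows shown above (illustration only)
toy <- data.frame(Year   = c(1900, 1910, 1920),
                  U.S.   = c(0.41, 0.41, 0.45),
                  Europe = c(0.46, 0.46, 0.39))

dim(toy)   # number of rows and number of columns in one call
str(toy)   # column names, types and the first few values
```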

We can also ask for a summary of each of the variables in the file using the summary command:

summary(income)

##       Year           U.S.            U.K.          Germany     
##  Min.   :1900   Min.   :0.330   Min.   :0.280   Min.   :0.310  
##  1st Qu.:1928   1st Qu.:0.355   1st Qu.:0.323   1st Qu.:0.328  
##  Median :1955   Median :0.410   Median :0.385   Median :0.350  
##  Mean   :1955   Mean   :0.403   Mean   :0.373   Mean   :0.364  
##  3rd Qu.:1982   3rd Qu.:0.450   3rd Qu.:0.412   3rd Qu.:0.398  
##  Max.   :2010   Max.   :0.480   Max.   :0.470   Max.   :0.450  
##      France          Sweden          Europe     
##  Min.   :0.310   Min.   :0.220   Min.   :0.290  
##  1st Qu.:0.330   1st Qu.:0.268   1st Qu.:0.320  
##  Median :0.335   Median :0.295   Median :0.340  
##  Mean   :0.369   Mean   :0.322   Mean   :0.357  
##  3rd Qu.:0.422   3rd Qu.:0.365   3rd Qu.:0.393  
##  Max.   :0.470   Max.   :0.460   Max.   :0.460
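The quartiles in such a summary can also be requested directly with quantile. A minimal sketch, assuming the twelve U.S. values are typed in by hand (they are the values of the U.S. column after na.omit); the name `us` is only for illustration.

```r
# U.S. top-decile income shares for the twelve remaining years
us <- c(0.41, 0.41, 0.45, 0.45, 0.36, 0.34, 0.34, 0.33, 0.37, 0.42, 0.47, 0.48)

# First quartile, median and third quartile, as in the U.S. column of the summary
quantile(us, probs = c(0.25, 0.5, 0.75))

# summary also works on a single vector
summary(us)
```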

This lists the range, mean, etc. for each variable. We can select any column from a data frame using variable$column:

income$U.S.

##  [1] 0.41 0.41 0.45 0.45 0.36 0.34 0.34 0.33 0.37 0.42 0.47 0.48

This gives a vector of numbers representing the different cells in that column. We can use various functions such as mean, sum, and length to get information about a vector.
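A few more vector operations, sketched on the U.S. values printed above and typed in by hand. The names `us`, `toy` and `with_gap` are only for illustration, and the extra NA is added artificially to show how missing values behave.

```r
# U.S. top-decile income shares, copied from the output above
us  <- c(0.41, 0.41, 0.45, 0.45, 0.36, 0.34, 0.34, 0.33, 0.37, 0.42, 0.47, 0.48)
toy <- data.frame(U.S. = us)

# $ and [[ ]] return the same vector
identical(toy$U.S., toy[["U.S."]])

sum(us)     # total of all values
range(us)   # smallest and largest value

# With a missing value, mean returns NA unless na.rm = TRUE is set
with_gap <- c(us, NA)
mean(with_gap)
mean(with_gap, na.rm = TRUE)
```

Setting na.rm = TRUE is an alternative to removing incomplete rows with na.omit when only a single column is of interest.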

length(income$U.S.)

## [1] 12

mean(income$U.S.)

## [1] 0.4025

mean(income$Europe)

## [1] 0.3575

As perhaps expected, the mean income inequality in Europe is lower than in the U.S. Let's do a t-test to see if the difference is significant:

t.test(income$U.S., income$Europe, paired = TRUE)

## 
## 	Paired t-test
## 
## data:  income$U.S. and income$Europe
## t = 2.615, df = 11, p-value = 0.02406
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.007119 0.082881
## sample estimates:
## mean of the differences 
##                   0.045
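A paired t-test is equivalent to a one-sample t-test on the pairwise differences. The sketch below illustrates this on the six rows (1900 to 1950) shown in the head output earlier, so its numbers will differ from the twelve-row result above; the variable names are only for illustration.

```r
# First six complete rows (1900, 1910, ..., 1950), copied from the head output
us     <- c(0.41, 0.41, 0.45, 0.45, 0.36, 0.34)
europe <- c(0.46, 0.46, 0.39, 0.40, 0.34, 0.32)

# Paired test on the two vectors
paired <- t.test(us, europe, paired = TRUE)

# One-sample test on the differences gives the same t and p
d          <- us - europe
one_sample <- t.test(d, mu = 0)

# The t statistic by hand: mean difference divided by its standard error
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

c(paired     = unname(paired$statistic),
  one_sample = unname(one_sample$statistic),
  manual     = t_manual)
```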

So, with p = .024 (below the conventional .05 threshold), we can conclude that the income distribution in the U.S. is, on average over this period, more unequal than in Europe. Let's make a simple plot of the income inequality in the U.S. and Europe (reproducing fig 9.8 on page 324):

plot(x = income$Year, y = income$U.S., type = "l", ylab = "Top decile income share", 
    xlab = "Year", ylim = c(0, 0.5))
lines(x = income$Year, y = income$Europe, col = "red")
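With two lines in one plot, a legend helps to tell them apart. The following sketch adds one with legend and writes the plot to a PDF file. It assumes only the six rows from the head output shown earlier, and `out_file` is a hypothetical temporary path chosen for illustration.

```r
# First six complete rows (1900 to 1950), copied from the head output
years  <- c(1900, 1910, 1920, 1930, 1940, 1950)
us     <- c(0.41, 0.41, 0.45, 0.45, 0.36, 0.34)
europe <- c(0.46, 0.46, 0.39, 0.40, 0.34, 0.32)

# Hypothetical output location; any writable path would do
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

plot(x = years, y = us, type = "l", ylim = c(0, 0.5),
     xlab = "Year", ylab = "Top decile income share")
lines(x = years, y = europe, col = "red")

# One legend entry per line, using the same colours as the lines
legend("bottomright", legend = c("U.S.", "Europe"),
       col = c("black", "red"), lty = 1)

invisible(dev.off())   # close the device so the file is written
```

In an interactive session the pdf and dev.off calls can be left out, and the plot appears in the plotting window instead.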

As you can see, the income distribution in pre-WWI Europe is actually more unequal than in the U.S., but this is reversed during the 1910s, and inequality diverges after the 1970s. Still, the lines are probably correlated:

cor.test(income$U.S., income$Europe)

## 
## 	Pearson's product-moment correlation
## 
## data:  income$U.S. and income$Europe
## t = 1.492, df = 10, p-value = 0.1666
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1950  0.8038
## sample estimates:
##    cor 
## 0.4267

So, although the correlation is moderate at 0.43, it is not significant (p = .17), probably because there are only 12 data points.
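The link between the size of the correlation and its significance can be made explicit: for a Pearson correlation r based on n observations, the test statistic is t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. A minimal sketch that plugs in the rounded values reported above (the variable names are only for illustration):

```r
r <- 0.4267   # correlation reported by cor.test above (rounded)
n <- 12       # number of rows left after na.omit, so df = n - 2 = 10

# t statistic for the null hypothesis of zero correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

round(t_stat, 3)
round(p_val, 3)
```

Since t grows with sqrt(n - 2), the same correlation of 0.43 would reach significance with a larger number of observations.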
