This repository uses R to analyse data through multiple regression and graphical analysis, using the 'diamonds' dataset provided by ggplot2. The dataset contains the prices and characteristics of approximately 54,000 individual diamonds.
- Andrew Burnie
- Thomas Bradley
To run the scripts, install R from The R Project and then install RStudio. Usage instructions for each file follow below.
Download this repository using either git clone or by unpacking the zip into its own folder.
diamonds_graphics.R
- Demonstrates how to create scatter plots and bar plots in R. Additionally, the differences between the default plot functions and the abilities of the R package, ggplot2, are demonstrated.
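
As a rough illustration of the comparison described above, the sketch below draws a scatter plot and a bar plot first with base R graphics and then with ggplot2. It is not taken from diamonds_graphics.R; the choice of variables (`carat`, `price`, `cut`) and the object names `cut_counts`, `scatter` and `bars` are assumptions made for this example.

```r
# Assumes ggplot2 is already installed: install.packages("ggplot2")
library(ggplot2)  # also provides the diamonds dataset

# Base R scatter plot of price against carat.
plot(diamonds$carat, diamonds$price,
     xlab = "Carat", ylab = "Price (US dollars)",
     main = "Base R scatter plot")

# Base R bar plot: counts must be tabulated by hand first.
cut_counts <- table(diamonds$cut)
barplot(cut_counts, xlab = "Cut", ylab = "Count",
        main = "Base R bar plot")

# ggplot2 scatter plot: colour is mapped to cut and a legend is added automatically.
scatter <- ggplot(diamonds, aes(x = carat, y = price, colour = cut)) +
  geom_point(alpha = 0.3) +
  labs(x = "Carat", y = "Price (US dollars)", colour = "Cut")

# ggplot2 bar plot: geom_bar() counts the rows in each cut category itself.
bars <- ggplot(diamonds, aes(x = cut)) +
  geom_bar() +
  labs(x = "Cut", y = "Count")

print(scatter)
print(bars)
```

In the ggplot2 versions the plot is built up in layers joined with `+`, so an extra aesthetic such as colour can be added without restructuring the call.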
diamonds_stats.R
- Uses R to apply multiple regression analysis to a cross-sectional dataset.
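
A minimal sketch of a multiple regression on the diamonds data is shown below. The specification (price regressed on `carat`, `depth` and `table`) is a hypothetical one chosen for illustration and is not necessarily the model fitted in diamonds_stats.R.

```r
# Assumes ggplot2 is already installed; it is loaded only for the diamonds dataset.
library(ggplot2)

# Hypothetical specification: price explained by three numeric characteristics.
model <- lm(price ~ carat + depth + table, data = diamonds)

# Coefficient estimates, standard errors, t-statistics and R-squared.
print(summary(model))
```

Each coefficient is read as the expected change in price for a one-unit change in that variable, holding the other regressors fixed.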
- Open the file of interest in RStudio.
- Run the setup code at the beginning of the file.
- Read through the file to find code of interest.
- Select this block of code and click Run to see the output.
## Model Assumptions
An Ordinary Least Squares approach is used. This means that the classical linear model assumptions (the Gauss-Markov assumptions plus normally distributed errors) have been made:
- Strict Exogeneity: the errors in the regression have a mean of zero and are uncorrelated with the independent variables included.
- Errors follow a normal distribution.
- Homoskedasticity: the error term has the same variance across observations.
- Sufficient Variation in data.
- No Perfect Multicollinearity.
- The model is linear in its parameters.
- The sample is random.
The validity of both the assumptions of homoskedasticity and no perfect multicollinearity will be statistically assessed.
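
One possible way to carry out these two checks is sketched below. It assumes a Breusch-Pagan test (`bptest` from lmtest) for homoskedasticity and a correlation matrix (`rcorr` from Hmisc) as a screen for collinearity among the regressors. The regression specification is a hypothetical one used only for illustration, and the tests used in diamonds_stats.R may differ.

```r
# Assumes ggplot2, lmtest and Hmisc are already installed.
library(ggplot2)  # diamonds dataset
library(lmtest)   # bptest()
library(Hmisc)    # rcorr()

# Hypothetical specification used only for this example.
model <- lm(price ~ carat + depth + table, data = diamonds)

# Breusch-Pagan test. Null hypothesis: the error variance is constant.
# A small p-value is evidence against homoskedasticity.
bp <- bptest(model)
print(bp)

# Pairwise correlations between the regressors. Values close to 1 or -1
# indicate near-collinearity; rcorr() needs a numeric matrix.
predictors <- as.matrix(diamonds[, c("carat", "depth", "table")])
corr <- rcorr(predictors)
print(round(corr$r, 2))  # correlation coefficients
print(corr$P)            # p-values for each pair
```

A correlation matrix only reveals pairwise relationships, so it is a screen rather than a complete test for multicollinearity.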
## Further Useful Resources
A major strength of R is the variety of packages available. The below were used:
- Documentation for ggplot2, which has more useful graphical defaults and an easier syntax than base R graphics. It also provides the 'diamonds' dataset. See the website for more details: ggplot2.org.
- Documentation for lmtest, which provides diagnostic tests for linear regression, e.g. heteroskedasticity testing.
- Documentation for sandwich, which provides a variety of robust standard error types.
- Documentation for Hmisc, which provides a useful function to produce a matrix of correlation coefficients.
Note that these packages must first be installed, e.g. `install.packages("sandwich")`, and then loaded, e.g. `require("sandwich")`, in order to be used.
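
Once loaded, lmtest and sandwich can be combined to report heteroskedasticity-robust standard errors. The sketch below is an illustration under assumptions: the regression specification is hypothetical, and the HC3 variant is chosen here because it is the `vcovHC` default and the estimator discussed by Long and Ervin (2000), not because the repository is known to use it.

```r
# One-off installation (uncomment if the packages are missing):
# install.packages(c("ggplot2", "lmtest", "sandwich"))
library(ggplot2)   # diamonds dataset
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

# Hypothetical specification used only for this example.
model <- lm(price ~ carat + depth + table, data = diamonds)

# Heteroskedasticity-consistent covariance matrix of the coefficients (HC3).
robust_vcov <- vcovHC(model, type = "HC3")

# Coefficient table recomputed with the robust covariance matrix.
robust_table <- coeftest(model, vcov. = robust_vcov)
print(robust_table)
```

The coefficient estimates are identical to those from `summary(model)`; only the standard errors, t-statistics and p-values change.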
## References
Long J. S., Ervin L. H. (2000), Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model. The American Statistician, 54, 217–224.
Drew Dimmery provides a useful guide on how to implement robust standard errors here.