trbradley/applying-R-to-data


Applying R To A Cross-Sectional Dataset

Using R to analyse data through multiple regression analysis and graphical analysis, using the 'diamonds' dataset provided by ggplot2. The dataset contains the prices and characteristics of approximately 54,000 individual diamonds.

Authors

  • Andrew Burnie
  • Thomas Bradley

Installation

To run the scripts, install R from The R Project, then install RStudio. Usage instructions for each file follow below.

Download this repository using either git clone or by unpacking the zip into its own folder.

Purpose Of Each File

diamonds_graphics.R

  • Demonstrates how to create scatter plots and bar plots in R, and compares the default plot functions with those of the ggplot2 package.
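
As a rough illustration of the comparison described above (a minimal sketch, not the contents of diamonds_graphics.R), the snippet below draws the same scatter plot and bar plot with base graphics and with ggplot2. The choice of carat, price and cut as variables, and the object names p_scatter and p_bar, are assumptions made for this example.

```r
library(ggplot2)  # also provides the 'diamonds' dataset

# Base graphics: scatter plot of price against carat
plot(diamonds$carat, diamonds$price,
     xlab = "Carat", ylab = "Price (US dollars)",
     main = "Base graphics: price vs carat")

# Base graphics: bar plot of the number of diamonds in each cut category
barplot(table(diamonds$cut), xlab = "Cut", ylab = "Count")

# ggplot2: the same scatter plot; alpha makes points translucent
# to reduce overplotting with this many observations
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  labs(x = "Carat", y = "Price (US dollars)")

# ggplot2: the same bar plot; geom_bar() counts rows per category
p_bar <- ggplot(diamonds, aes(x = cut)) +
  geom_bar() +
  labs(x = "Cut", y = "Count")

print(p_scatter)
print(p_bar)
```

In ggplot2 the plot is an object built up in layers, so it can be stored, modified and printed later, whereas the base functions draw directly to the graphics device.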

diamonds_stats.R

  • Uses R to apply multiple regression analysis to a cross-sectional dataset.
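
A multiple regression of this kind can be sketched with lm() as below. The specification (price regressed on carat, depth and table) is an assumption chosen for illustration and is not necessarily the model used in diamonds_stats.R.

```r
library(ggplot2)  # provides the 'diamonds' dataset

# Ordinary Least Squares fit of price on three numeric characteristics.
# The choice of regressors is illustrative only.
fit <- lm(price ~ carat + depth + table, data = diamonds)

# Coefficient estimates, standard errors, t statistics and R-squared
print(summary(fit))

# Coefficients alone, as a named numeric vector
print(coef(fit))
```

Numeric regressors are used here because cut, color and clarity are ordered factors, which lm() encodes with polynomial contrasts by default and which are harder to read at a glance.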

Using This Repository

  1. Open the file of interest in RStudio.
  2. Run the setup code at the beginning of the file.
  3. Read through the file to find code of interest.
  4. Select this block of code and click Run to see the output.

Model Assumptions

An Ordinary Least Squares approach is used. The following assumptions are made. All but the second are Gauss-Markov assumptions; normality of the errors is an additional assumption that justifies exact t and F inference:

  1. Strict Exogeneity: the errors have a mean of zero conditional on the independent variables included, and so are uncorrelated with them.
  2. Errors follow a normal distribution.
  3. Homoskedasticity: the error term has the same variance across observations.
  4. Sufficient variation in the data.
  5. No perfect multicollinearity.
  6. The model is linear in its parameters.
  7. The sample is random.

The assumptions of homoskedasticity and no perfect multicollinearity are assessed statistically.
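
One way such checks might look is sketched below, assuming the lmtest and Hmisc packages are installed. The Breusch-Pagan test is one of several heteroskedasticity tests in lmtest and is used here as an assumption; the repository's scripts may use a different test. The regression specification is likewise illustrative.

```r
library(ggplot2)  # 'diamonds' dataset
library(lmtest)   # bptest()
library(Hmisc)    # rcorr()

# Illustrative model; the regressors are an assumption for this example
fit <- lm(price ~ carat + depth + table, data = diamonds)

# Breusch-Pagan test. Null hypothesis: the errors are homoskedastic.
# A small p-value is evidence against homoskedasticity.
bp <- bptest(fit)
print(bp)

# Matrix of pairwise Pearson correlations between the regressors.
# rcorr() expects a numeric matrix and returns the coefficients in $r
# and the corresponding p-values in $P.
predictors <- as.matrix(diamonds[, c("carat", "depth", "table")])
corr <- rcorr(predictors)
print(round(corr$r, 2))
```

A correlation of exactly 1 or -1 between two regressors would indicate perfect multicollinearity. Large but imperfect correlations do not violate the assumption, although they can inflate the standard errors.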

Further Useful Resources

A major strength of R is the variety of packages available. The following were used:

  • Documentation for ggplot2, which has more useful graphical defaults and an easier syntax than R's default plot functions. It also provides the 'diamonds' dataset. See the website for more details: ggplot2.org.
  • Documentation for lmtest, which provides diagnostic tests for linear regression, e.g. heteroskedasticity testing.
  • Documentation for sandwich, which provides a variety of robust standard error types.
  • Documentation for Hmisc, which provides a useful function to produce a matrix of correlation coefficients.

Note that these packages must first be installed, e.g. install.packages("sandwich"), and then loaded, e.g. require("sandwich"), before they can be used.
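
The install-then-load step can be sketched as follows. The vector name pkgs and the helper logic are assumptions for illustration. The final lines show one possible use of sandwich together with lmtest once both are loaded: heteroskedasticity-robust standard errors of type HC3, on a model whose specification is again only illustrative.

```r
# Packages named in this README
pkgs <- c("ggplot2", "lmtest", "sandwich", "Hmisc")

# Install only those that are not already available
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]
if (length(to_install) > 0) install.packages(to_install)

# Load all of them
invisible(lapply(pkgs, library, character.only = TRUE))

# Example use: OLS coefficients with HC3 robust standard errors.
# The regressors are an assumption chosen for this sketch.
fit <- lm(price ~ carat + depth + table, data = diamonds)
robust <- coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))
print(robust)
```

The coefficient estimates are the same as in summary(fit); only the standard errors, t statistics and p-values change.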

References

Long J. S., Ervin L. H. (2000), Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model. The American Statistician, 54, 217–224.

Drew Dimmery provides a useful guide on how to implement robust standard errors.
