Materials for the short course on Statistical Data Cleaning for Business Statistics at the European Establishment Statistics Workshop 2019
Slot 1
Topic | Time (min) |
---|---|
Introduction | 20 |
Reading dirty data | 30 |
Approximate matching | 50 |
Data validation | 50 |
Slot 2
Topic | Time (min) |
---|---|
Error localization | 20 |
Imputation | 50 |
Adjusting | 20 |
Monitoring | 30 |
Wrap-up | 10 |
The course is highly hands-on. Each topic starts with a session of approximately 10 to 15 minutes in which you run and adapt some R code. Next, I provide background and details on what you just did, followed by a more in-depth assignment. Depending on time and topic, we then discuss the topic in more depth.
Bring a laptop
Participants are expected to have basic knowledge of R/RStudio, specifically:
- Work with the R command line and R scripts
- Read/write CSV data
- Some basic data manipulations and plots
- I highly recommend working with RStudio projects.
- R, see https://r-project.org
- (Recommended) RStudio
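As a rough self-check of the prerequisites listed above, the following base R sketch writes and reads a CSV file, does a basic manipulation, and draws a simple plot. The data frame `firms` and its columns are made up for illustration and are not part of the course materials. If you can follow and adapt this code, your R knowledge is probably at about the expected level.

```r
# Hypothetical toy data, made up for this self-check only
firms <- data.frame(
  id       = 1:4,
  sector   = c("retail", "retail", "industry", "industry"),
  turnover = c(120, 80, 300, 260)
)

# Write the data to a temporary CSV file and read it back
csv_file <- tempfile(fileext = ".csv")
write.csv(firms, csv_file, row.names = FALSE)
firms2 <- read.csv(csv_file)

# Basic manipulation: derive a column and aggregate by group
firms2$log_turnover <- log(firms2$turnover)
by_sector <- aggregate(turnover ~ sector, data = firms2, FUN = mean)
print(by_sector)

# Basic plot: mean turnover per sector
barplot(by_sector$turnover, names.arg = by_sector$sector,
        ylab = "Mean turnover")
```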
Execute the following R code to install the necessary packages.
# Install the packages used in the course, together with their dependencies
install.packages(
  c("validate", "errorlocate", "simputation", "rspa", "daff",
    "jsonlite", "XML", "readr", "stringr", "lumberjack"),
  dependencies = TRUE
)
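After installation, it may be worth checking that every package can actually be found before the course starts. The snippet below is a minimal sketch of such a check and is not part of the original materials. The package names are copied from the installation command above; the variable names `pkgs`, `installed`, and `not_installed` are chosen here for illustration.

```r
# Package names copied from the install.packages() call above
pkgs <- c("validate", "errorlocate", "simputation", "rspa", "daff",
          "jsonlite", "XML", "readr", "stringr", "lumberjack")

# requireNamespace() returns FALSE instead of raising an error
# when a package is not available
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Report any package that could not be found
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All course packages are installed.")
}
```

If any package is reported as not installed, rerunning `install.packages()` for just that package and reading its error output is usually the quickest way to find the cause.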
This work is licensed under a Creative Commons Attribution 4.0 International License.