Materials for the useR!2021 tutorial on data validation
- Make sure you have a recent version of R (>= 4.1.0) installed.
- During the tutorial we will use the RStudio IDE, but this is not mandatory for participants. Please note that you need RStudio version 1.4.1717 or higher to work with R >= 4.1.0.
Install the following packages by copying and running the code below.
install.packages(c("validate", "validatetools", "validatedb", "RSQLite", "lumberjack"))
install.packages("simputation", dependencies = TRUE)
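To confirm the setup before the tutorial, a quick check along the following lines may help. It is a minimal sketch using only base R; the variable names (`r_ok`, `required`, `missing_pkgs`) are illustrative and not part of the tutorial materials.

```r
# Check that the running R version meets the tutorial requirement.
r_ok <- getRversion() >= "4.1.0"
if (!r_ok) message("Please upgrade R to version 4.1.0 or higher.")

# Packages listed in the installation instructions above.
required <- c("validate", "validatetools", "validatedb",
              "RSQLite", "lumberjack", "simputation")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package cannot be loaded.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) == 0) {
  message("All tutorial packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any packages are reported as missing, rerunning the corresponding `install.packages()` call should resolve it.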
We expect participants to have some basic knowledge of (base) R. No knowledge
of particular packages is required. You should be familiar with data frames,
reading from and writing to CSV files, selecting columns and rows, and working
with R scripts.
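As a rough self-check of the skills listed above, the following base R sketch covers each of them once. The data frame and its column names are made up for illustration only; if every line looks familiar, you are well prepared.

```r
# A small data frame to work with (hypothetical example data).
df <- data.frame(id   = 1:4,
                 age  = c(23, 35, 41, 19),
                 city = c("Utrecht", "Leiden", "Utrecht", "Delft"))

# Write to and read from a CSV file. A temporary file is used
# so nothing is left behind in your working directory.
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
df2 <- read.csv(path)

# Select columns by name and rows by a logical condition.
ages    <- df2[, c("id", "age")]
over_30 <- df2[df2$age > 30, ]

# Combine both: the cities of records with age above 30.
cities_over_30 <- df2$city[df2$age > 30]
print(cities_over_30)
```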
If you want to brush up on your R knowledge, you can follow the excellent free online tutorial by Norm Matloff.
If you work with RStudio, we strongly advise you to work in an RStudio Project so data and scripts are found within the local project path.
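The benefit of a project is that file paths can be written relative to the project folder. A minimal sketch, assuming a hypothetical `data/input.csv` inside the project (neither name comes from the tutorial materials):

```r
# When an RStudio Project is open, the working directory is the project folder.
getwd()

# Build paths relative to that folder instead of hard-coding absolute paths.
# "data" and "input.csv" are hypothetical names used only for illustration.
input_path <- file.path("data", "input.csv")

# Read the file only if it exists, so the snippet runs in any directory.
if (file.exists(input_path)) {
  dat <- read.csv(input_path)
} else {
  message("No file found at ", input_path)
}
```

Because the path is relative, the same script keeps working when the project folder is moved or shared.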
- Opening and hands-on introduction to the 'validate' workflow (20 min)
- Presentation: theory of data validation and Q&A (20 min)
- Breakout assignment & discussions in groups (20 min)
- Feedback on results of the breakout groups (10 min)
- Focus on different validation tasks: small programming assignments, from simple to complex (40 min)
- Hands-on introduction to lumberjack (15 min)
- Presentation: monitoring data in R and Q&A (15 min)
- Hands-on: lumberjack and validate (20 min)
- Hands-on introduction to the {validatetools} package (20 min)
- Presentation and Q&A: validation rule management (20 min)
- Closing and Q&A (10 min)
- Data validation infrastructure for R, van der Loo and de Jonge (JSS, 2021)
- Monitoring data in R with the lumberjack package, van der Loo (JSS, 2021)
- Data Validation, van der Loo and de Jonge (Wiley StatsRef Online, 2020)
- Statistical Data Cleaning with Applications in R, van der Loo and de Jonge (Wiley, 2018)