Error localization

Find errors in data given a set of validation rules. The errorlocate package helps to identify obvious errors in raw datasets.

It works in tandem with the package validate. With validate you formulate data validation rules with which the data must comply.

For example:

“Age cannot be negative”: age >= 0.
“If a person is married, they must be older than 16 years”: if (married == TRUE) age > 16.
“Profit is turnover minus cost”: profit == turnover - cost.
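
As a minimal sketch, the three rules above could be written with validate's validator() and checked against a data set with confront(). The rule names and the persons data frame are made up for illustration; they are not part of the package.

```r
library(validate)

# The three example rules; the names on the left are hypothetical labels.
rules <- validator(
  age_nonneg  = age >= 0,
  married_age = if (married == TRUE) age > 16,
  profit_def  = profit == turnover - cost
)

# Made-up records: the first is valid, the other two break at least one rule.
persons <- data.frame(
  age      = c(35, -2, 12),
  married  = c(TRUE, FALSE, TRUE),
  profit   = c(100, 50, 30),
  turnover = c(300, 80, 90),
  cost     = c(200, 30, 50)
)

# confront() evaluates every rule on every record.
cf <- confront(persons, rules)
summary(cf)

# One row per record, one column per rule.
rule_results <- values(cf)
print(rule_results)
```

The result says which rules each record breaks, not which fields are wrong.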

While validate can check whether a record is valid, it does not identify which variables are responsible for the violation. This may seem a simple task, but it is actually quite tricky: a set of validation rules forms a web of dependent variables, so changing a value to make a record satisfy rule 1 may cause it to violate rule 2.
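
The dependency between rules can be illustrated with a small sketch. It takes a record that violates the profit rule and applies a naive repair (recomputing turnover), which then breaks a second rule. The rule names r1 and r2 and the repair strategy are assumptions chosen for the illustration.

```r
library(validate)

rules <- validator(
  r1 = profit == turnover - cost,
  r2 = cost >= 0.6 * turnover
)

record <- data.frame(profit = 750, cost = 125, turnover = 200)

# The original record violates r1 but satisfies r2.
before <- values(confront(record, rules))
print(before)

# Naive repair: force r1 to hold by recomputing turnover from profit and cost.
naive <- record
naive$turnover <- naive$profit + naive$cost

# r1 now holds, but the new turnover makes the record violate r2.
after <- values(confront(naive, rules))
print(after)
```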

errorlocate provides a small framework for record-based error detection and implements the Fellegi-Holt algorithm. This algorithm assumes that no information is available other than the values of a record and a set of validation rules. It minimizes the (weighted) number of values that need to be adjusted to make the record satisfy all rules.
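
The effect of the weights can be sketched with locate_errors(), assuming its weight argument accepts a named numeric vector with one entry per variable. The weights below are arbitrary values chosen for the illustration: a high weight on profit marks it as trusted, so the smallest weighted change should involve the other variables instead.

```r
library(errorlocate)

rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0
)

data <- data.frame(profit = 750, cost = 125, turnover = 200)

# Equal weights (the default): changing a single value is the cheapest fix.
le_default <- locate_errors(data, rules)
print(values(le_default))

# Hypothetical weights: profit is considered far more reliable than the rest,
# so changing two cheap variables costs less than changing profit.
le_weighted <- locate_errors(
  data, rules,
  weight = c(profit = 10, cost = 1, turnover = 1)
)
print(values(le_weighted))
```

values() returns a logical matrix with one column per variable, where TRUE marks a value that has to change for the record to satisfy all rules.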

Installation

errorlocate can be installed from CRAN:

install.packages("errorlocate")

Beta versions can be installed with drat:

drat::addRepo("data-cleaning")
install.packages("errorlocate")

The latest development version of errorlocate can be installed from GitHub with devtools:

devtools::install_github("data-cleaning/errorlocate")

Usage

library(errorlocate)
#> Loading required package: validate
rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # is implied
)

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <- replace_errors(data, rules)

# faulty data was replaced with NA
print(data_no_error)
#>   profit cost turnover
#> 1     NA  125      200

er <- errors_removed(data_no_error)

print(er)
#> call:  locate_errors(data, x, ref, ..., cl = cl) 
#> located  1  error(s).
#> located  0  missing value(s).
#> Use 'summary', 'values', '$errors' or '$weight', to explore and retrieve the errors.

summary(er)
#> Variable:
#>       name errors missing
#> 1   profit      1       0
#> 2     cost      0       0
#> 3 turnover      0       0
#> Errors per record:
#>   errors records
#> 1      1       1

er$errors
#>      profit  cost turnover
#> [1,]   TRUE FALSE    FALSE
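
For larger data sets it can help to turn the logical error matrix into a list of flagged fields. The following is a sketch in base R, not a function provided by errorlocate; the name located is made up.

```r
library(errorlocate)

rules <- validator(
  profit == turnover - cost,
  cost >= 0.6 * turnover,
  turnover >= 0
)

data <- data.frame(profit = 750, cost = 125, turnover = 200)

le <- locate_errors(data, rules)
errors <- values(le)

# Row and column index of every TRUE cell in the error matrix.
idx <- which(errors, arr.ind = TRUE)

# One line per flagged field: record number and variable name.
located <- data.frame(
  record   = idx[, "row"],
  variable = colnames(errors)[idx[, "col"]]
)
print(located)
```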
