In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult because large, dirty data sets are expensive to process and clean. To speed up query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all and in fact exacerbates answer-quality problems by introducing sampling error. We explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers.
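To make the idea concrete, here is a minimal Python sketch of answering a mean query from a cleaned sample. It is an illustration under stated assumptions, not the code in this repository: the function names, the `clean_fn` callback, and the normal-approximation confidence interval are all hypothetical choices made for the example. The first function estimates the mean directly from the cleaned sample; the second uses the sample to estimate how far dirty records are off and subtracts that from the mean of the full dirty data.

```python
import math
import random
from statistics import mean, stdev


def estimate_clean_mean(dirty_values, clean_fn, sample_size, seed=0, z=1.96):
    """Estimate the mean of the cleaned data from a cleaned random sample.

    Hypothetical helper: clean_fn maps one dirty value to its cleaned value.
    Returns (estimate, half_width) of an approximate 95% confidence interval.
    """
    rng = random.Random(seed)
    # Uniform sample without replacement; only these records get cleaned.
    sample = rng.sample(dirty_values, sample_size)
    cleaned = [clean_fn(v) for v in sample]
    estimate = mean(cleaned)
    # Normal approximation for the sampling error of a mean.
    half_width = z * stdev(cleaned) / math.sqrt(sample_size)
    return estimate, half_width


def estimate_corrected_mean(dirty_values, clean_fn, sample_size, seed=0, z=1.96):
    """Correct the mean of the full dirty data using a cleaned sample.

    The sample is used only to estimate the average error (dirty - clean),
    which is then subtracted from the mean computed over all dirty values.
    """
    rng = random.Random(seed)
    sample = rng.sample(dirty_values, sample_size)
    diffs = [v - clean_fn(v) for v in sample]
    estimate = mean(dirty_values) - mean(diffs)
    half_width = z * stdev(diffs) / math.sqrt(sample_size)
    return estimate, half_width
```

In this sketch the direct estimate never touches uncleaned records, so its error comes only from sampling, while the corrected estimate tends to have a tighter interval when most records are already clean and the per-record corrections vary little.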
forked from sjyk/sampleclean
SampleClean+BlinkDB
thisisdhaas/sampleclean
Languages
- Scala 87.0%
- Python 4.8%
- Shell 4.2%
- Java 3.9%
- CSS 0.1%