A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data

In emerging Big Data scenarios, obtaining timely, high-quality answers to aggregate queries is difficult because large, dirty data sets are expensive both to process and to clean. To speed up query processing, there has been a resurgence of interest in sampling-based approximate query processing (SAQP). In its usual formulation, however, SAQP does not address data cleaning at all and in fact exacerbates answer quality problems by introducing sampling error. We explore the use of sampling to actually improve answer quality. We introduce the Sample-and-Clean framework, which applies data cleaning to a relatively small subset of the data and uses the results of the cleaning process to lessen the impact of dirty data on aggregate query answers.
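
As a rough illustration of the idea, the Python sketch below cleans only a uniform random sample and then uses it in two ways: estimating the mean directly from the cleaned sample, and correcting the mean of the full dirty data by the average error observed in the sample. It is not code from this repository. The names `clean_reading` and `sample_and_clean_mean`, the unit-error cleaning rule, and the toy data are all assumptions made for illustration.

```python
import random
import statistics


def clean_reading(value):
    """Hypothetical cleaning rule: values above 1000 were entered in the
    wrong unit and are scaled back down by a factor of 1000."""
    return value / 1000 if value > 1000 else value


def sample_and_clean_mean(dirty, sample_size, clean_fn, seed=0):
    """Estimate the mean of the clean data while cleaning only a sample.

    Returns a pair (direct, corrected):
      direct    - mean of the cleaned sample
      corrected - mean of all dirty values minus the average
                  (dirty - clean) difference seen in the sample
    """
    rng = random.Random(seed)
    sample = rng.sample(dirty, sample_size)      # uniform, without replacement
    cleaned = [clean_fn(v) for v in sample]      # only these rows are cleaned

    direct = statistics.fmean(cleaned)

    errors = [d - c for d, c in zip(sample, cleaned)]
    corrected = statistics.fmean(dirty) - statistics.fmean(errors)
    return direct, corrected


# Toy data (assumed): 900 plausible readings plus 100 unit errors.
data_rng = random.Random(42)
readings = [data_rng.uniform(5, 15) for _ in range(900)]
readings += [data_rng.uniform(5, 15) * 1000 for _ in range(100)]

direct, corrected = sample_and_clean_mean(readings, 100, clean_reading)
print("mean of dirty data:     ", statistics.fmean(readings))
print("direct sample estimate: ", direct)
print("corrected full estimate:", corrected)
```

Both estimates require cleaning only `sample_size` records. Which of the two is more accurate depends on how often errors occur and how large they are, so this sketch reports both rather than choosing one.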

Repository contents (1,283 commits):
bin
conf
data/files
hive_blinkdb @ 8b0d2ca
lib
project
sbt
src
.gitignore
.gitmodules
LICENSE
README.md
run
scalastyle-config.xml

sjyk/sampleclean
