-
Notifications
You must be signed in to change notification settings - Fork 4
SampleClean User Guide
SCINITIALIZE creates a SampleClean view, this view is maintained in memory and its schema is recorded in an in-memory catalog. We currently support uniform samples from the entire base table.
SCINITIALIZE sampleName(arg1, arg2, ..., argk) FROM table SAMPLEWITH ratio;
Internally, we maintain two views: _clean and _dirty, which also contain a hash and a duplicate count. Before running any data cleaning techniques, these two views are identical. Example Creation Query:
sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) FROM cities SAMPLEWITH 0.1;
Currently, we only support samples of entire tables. A work around is to create a temporary table with a CREATE TABLE ... AS SELECT
statement. Furthermore, we currently do not support typing of data and represent all attributes as JAVA Strings; which are automatically processed as numbers if you run a numerical query.
If we are unhappy with our data cleaning results, we can reset our clean sample back to its original state with:
screset sampleName;
To delete a sample:
sampleclean> drop table sampleName_clean;
sampleclean> drop table sampleName_dirty;
We provide a set of text format cleaning primitives.
This forces an attribute to be a "number", by gathering only the digit characters.
SCFORMAT sampleName attr NUMBER;
This removes leading and trailing spaces.
SCFORMAT sampleName attr TRIM;
Replace a pattern with a new string.
SCFORMAT sampleName attr REPLACE pattern newString;
We also provide some commonly used outlier detection techniques.
We can remove data that deviates z standard deviations away from the mean. Common choices for z are 2 (95% under normal approximation) and 2.6 (99% under normal approximation)
SCOUTLIER sampleClean attr PARAMETRIC z
We can remove data that deviates z median absolute deviations away from the mean. A common choice is 6 (used in robust LOESS smoothing)
SCOUTLIER sampleClean attr NONPARAMETRIC z
Proximity based outlier removal relies on the fact that outliers will be far away on average from other points.
SCOUTLIER sampleClean attr PROXIMITY z
You can write custom filters for SampleClean views. These filters KEEP records that match the filter. You can run any hiveql predicate function.
SCFILTER sampleName <HiveQL Predicate>
As quick aliases, we provide two commands sccount which returns the number of tuples in the clean sample, and scshow which prints all of the records in the clean sample.
SCCOUNT sampleName;
SCSHOW sampleName;
SampleClean estimates two forms of results: answers (RawSC) and corrections (NormalizedSC). An answer is an approximate aggregation query run directly on the cleaned sample, a correction is an estimated correction to correct a query on the entire dataset. We currently support only single attribute aggregations and group by statements.
SELECTRAWSC SUM(attr) FROM sampleName WHERE predicate GROUP BY gbkey;
SELECTNSC SUM(attr) FROM sampleName WHERE predicate GROUP BY gbkey;