Skip to content

SampleClean User Guide

Sanjay Krishnan edited this page May 15, 2014 · 6 revisions

Creating and Manipulating Samples

Creating A Sample

SCINITIALIZE creates a SampleClean view, this view is maintained in memory and its schema is recorded in an in-memory catalog. We currently support uniform samples from the entire base table.

    SCINITIALIZE sampleName(arg1, arg2, ..., argk) FROM table SAMPLEWITH ratio;

Internally, we maintain two views: _clean and _dirty, which also contain a hash and a duplicate count. Before running any data cleaning techniques, these two views are identical. Example Creation Query:

    sampleclean> SCINITIALIZE cities_sample (city, country, population, area, density) FROM cities SAMPLEWITH 0.1;

Currently, we only support samples of entire tables. A work around is to create a temporary table with a CREATE TABLE ... AS SELECT statement. Furthermore, we currently do not support typing of data and represent all attributes as JAVA Strings; which are automatically processed as numbers if you run a numerical query.

Resetting A Sample

If we are unhappy with our data cleaning results, we can reset our clean sample back to its original state with:

    screset sampleName;

Deleting A Sample

To delete a sample:

    sampleclean> drop table sampleName_clean;
    sampleclean> drop table sampleName_dirty;

Data Cleaning

Text Formatting

We provide a set of text format cleaning primitives.

SCFORMAT NUMBER

This forces an attribute to be a "number", by gathering only the digit characters.

    SCFORMAT sampleName attr NUMBER;

SCFORMAT TRIM

This removes leading and trailing spaces.

    SCFORMAT sampleName attr TRIM;

SCFORMAT REPLACE

Replace a pattern with a new string.

    SCFORMAT sampleName attr REPLACE pattern newString;

Outlier Removal

We also provide some commonly used outlier detection techniques.

SCOUTLIER PARAMETRIC

We can remove data that deviates z standard deviations away from the mean. Common choices for z are 2 (95% under normal approximation) and 2.6 (99% under normal approximation)

    SCOUTLIER sampleClean attr PARAMETRIC z

SCOUTLIER NONPARAMETRIC

We can remove data that deviates z median absolute deviations away from the mean. A common choice is 6 (used in robust LOESS smoothing)

    SCOUTLIER sampleClean attr NONPARAMETRIC z

SCOUTLIER PROXIMITIY

Proximity based outlier removal relies on the fact that outliers will be far away on average from other points.

    SCOUTLIER sampleClean attr PROXIMITY z

Rule-based Filtering

You can write custom filters for SampleClean views. These filters KEEP records that match the filter. You can run any hiveql predicate function.

    SCFILTER sampleName <HiveQL Predicate>

Querying SampleClean Views

SCCount and SCShow

As quick aliases, we provide two commands sccount which returns the number of tuples in the clean sample, and scshow which prints all of the records in the clean sample.

    SCCOUNT sampleName;
    SCSHOW sampleName;

RawSC and NormalizedSC

SampleClean estimates two forms of results: answers (RawSC) and corrections (NormalizedSC). An answer is an approximate aggregation query run directly on the cleaned sample, a correction is an estimated correction to correct a query on the entire dataset. We currently support only single attribute aggregations and group by statements.

    SELECTRAWSC SUM(attr) FROM sampleName WHERE predicate GROUP BY gbkey;
    SELECTNSC SUM(attr) FROM sampleName WHERE predicate GROUP BY gbkey;