Skip to content
Shanfang edited this page May 2, 2018 · 7 revisions

Introduction

This repo contains implementation of sketching algorithms for size of join estimation. The update performance of sketches can be significantly improved if only a sample of the data is sketched, without significant degradation in the accuracy. In this repo, Bernoulli sampling is used. For details of the sampling algorithms and sketching techniques, please checkout the references page.

Prerequisite

Install GSL

If you are using Mac, follow these steps:

  1. launch the terminal
  2. run
    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null
  3. run
    brew install gsl

For other systems, please checkout the documentation on GSL

How to run the code

  1. run make
  2. run ./sketch_bernoulli_sampling.out followed by the following parameters:
    dom_size
    tuples_no
    buckets_no
    rows_no
    DIST_PARAM
    DIST_SHUFF
    SAMP_PROB
    num_runs
  3. run make clean to remove all intermediate files.

Notes about input arguments

  1. dom_size defines the size of domain for I
  2. tuples_no defines the number of tuples in a relation
  3. buckets_no defines the number of buckets when generating sketch
  4. row_no defines number of rows for the generated sketch
  5. DIST_PARAM determines the shape of zipf’s distribution.
  6. DIST_SHUFF is the decor_param. decor_param = 0 corresponds to complete randomness; decor_param = 1 corresponds to complete positive correlation or identical relations
  7. bernoulli_p set the probability p for Bernoulli sampling
  8. runs_no defines number of runs for test
Clone this wiki locally