Sampling
$F'$ and $G'$ are samples from relations $F$ and $G$, respectively.
Let $f'_i$ and $g'_i$ be the frequencies of value $i$ in $F'$ and $G'$, respectively.
The size of the join of the sample relations is $\sum_i f'_i g'_i$.
Note: $f'_i$ and $g'_i$ are random variables whose distributions depend on the type of sampling and on the parameters of the sampling process. Most of the characterization of sampling can be carried out without specifying the type of sampling.
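The join size of two samples is the sum, over every value $i$, of the product of that value's frequencies in the two samples. A minimal sketch of this computation follows, assuming each sample is given as a list of join-attribute values; the function name `sample_join_size` is illustrative and does not come from the source.

```python
from collections import Counter


def sample_join_size(sample_f, sample_g):
    """Return sum_i f'_i * g'_i for two samples of join-attribute values."""
    freq_f = Counter(sample_f)  # f'_i: frequency of value i in F'
    freq_g = Counter(sample_g)  # g'_i: frequency of value i in G'
    # A Counter returns 0 for a missing value, so values that appear in
    # only one of the samples contribute nothing to the sum.
    return sum(count * freq_g[value] for value, count in freq_f.items())
```

Passing the same list for both arguments gives the self-join size of that sample, that is, the sum of the squared frequencies.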
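The join size of the samples is generally a biased estimate of the join size of the original relations, but a constant correction that scales for the difference in size between the samples and the original relations yields an unbiased estimator. The estimator is $X = C \sum_i f'_i g'_i$, where the scaling constant $C$ depends on the sampling scheme.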
There are two cases: 1) $F$ and $G$ are different relations and the samples are obtained independently; 2) $F$ and $G$ are identical, and the quantity to estimate is the size of the self-join.
Estimator:
Expectation:
Variance:
Estimator:
Expectation:
Variance:
Each tuple in $F$ and $G$ is selected independently into the samples $F'$ and $G'$ with probability $p$ and $q$, respectively, where $0 \le p \le 1$ and $0 \le q \le 1$.
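Expectation: $E\left[\sum_i f'_i g'_i\right] = pq \sum_i f_i g_i$,
where $f'_i$ and $g'_i$ are independent binomial random variables, $f'_i \sim \mathrm{Binomial}(f_i, p)$ and $g'_i \sim \mathrm{Binomial}(g_i, q)$, and $f_i$ and $g_i$ are the frequencies of value $i$ in $F$ and $G$. In this case, the scaling factor is $1/(pq)$.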
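The sampling process and the $1/(pq)$ scaling can be sketched in a few lines. This is an illustration under assumptions rather than the source's implementation: relations are modelled as lists of join-attribute values, and the names `bernoulli_sample` and `estimate_join_size` are hypothetical. The sketch requires $p > 0$ and $q > 0$, since the scaling factor is undefined otherwise.

```python
import random
from collections import Counter


def bernoulli_sample(relation, prob, rng):
    """Keep each tuple independently with probability prob."""
    return [value for value in relation if rng.random() < prob]


def estimate_join_size(rel_f, rel_g, p, q, rng):
    """Estimate sum_i f_i * g_i from independent Bernoulli samples."""
    if not (0 < p <= 1 and 0 < q <= 1):
        raise ValueError("p and q must lie in (0, 1] for the 1/(pq) scaling")
    freq_f = Counter(bernoulli_sample(rel_f, p, rng))  # f'_i
    freq_g = Counter(bernoulli_sample(rel_g, q, rng))  # g'_i
    sample_join = sum(count * freq_g[value] for value, count in freq_f.items())
    # Scale the sample join size by 1/(pq) to correct for the sampling rates.
    return sample_join / (p * q)
```

Averaging `estimate_join_size` over many independent runs (for instance with `rng = random.Random(seed)` and `p = q = 0.5`) should approach the exact join size, which is what unbiasedness predicts; a single run can be far off when $p$ and $q$ are small.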
Estimator:
Expectation:
Variance:
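For Bernoulli sampling of two different relations, these three quantities can be worked out from the binomial moments $E[f'_i] = p f_i$ and $E[f_i'^2] = p(1-p) f_i + p^2 f_i^2$, and likewise for $g'_i$ with $q$. The derivation below is a sketch worked out here rather than copied from the source. It assumes that the two samples are independent and that the counts for distinct values $i$ are independent, which holds because tuples are selected independently. $X$ is a label introduced here for the estimator.

```latex
\begin{align*}
X &= \frac{1}{pq} \sum_i f'_i g'_i \\
E[X] &= \frac{1}{pq} \sum_i E[f'_i]\, E[g'_i] = \sum_i f_i g_i \\
\mathrm{Var}[X] &= \frac{1}{p^2 q^2} \sum_i
  \left( E[f_i'^2]\, E[g_i'^2] - p^2 q^2 f_i^2 g_i^2 \right) \\
&= \sum_i \left( \frac{1-p}{p}\, f_i g_i^2 + \frac{1-q}{q}\, f_i^2 g_i
  + \frac{(1-p)(1-q)}{pq}\, f_i g_i \right)
\end{align*}
```

The variance vanishes when $p = q = 1$, where the samples are the full relations, and grows as either sampling probability shrinks.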
Estimator:
Expectation:
Variance: