
Implement the table tolerance function #12

Open · 4 tasks

cugni opened this issue May 21, 2021 · 2 comments
@cugni (Member) commented May 21, 2021

What is Tolerance?

Tolerance is the error a user allows in an aggregation, within a confidence interval. That means that, given a CI of 95% for example, in 95 out of 100 runs of the same query the answer would have a relative error within $[0.0, \text{tolerance}]$.
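In symbols (using $\hat{x}$ for the sampled estimate and $x$ for the exact answer, notation introduced here for clarity), that guarantee reads:

$$P\left(\frac{|\hat{x} - x|}{x} \le \text{tolerance}\right) \ge 0.95$$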

How do we calculate Tolerance?

The idea is that, with a user-provided tolerance value, we can estimate the sample size required to answer a query that computes the mean with a predefined level of certainty.

Given a confidence level of, say, 95%, we want to determine a confidence interval for which 95% of all our mean estimations will fall within the target range. That is, given a sample of size $n$ drawn from a population (with $\mu$ and $\sigma^2$ as population mean and variance, respectively), determine the confidence interval of the sample mean so it has a 95% chance of containing $\mu$.

In other words, for some margin of error $\epsilon$: $P(\bar{x} - \epsilon \le \mu \le \bar{x} + \epsilon) = 0.95$.

Here, the Central Limit Theorem is taken into account:

Regardless of the distribution of the population (as long as $\mu$ and $\sigma^2$ are finite), the distribution of the sample means is approximately normal: $\bar{x} \sim N(\mu, \sigma^2 / n)$.

As well as the notion of Standard Error of the Mean:

Given a single sample of size $n$, how can we determine how far its mean $\bar{x}$ is from the population mean $\mu$? The answer, $SEM = \sigma / \sqrt{n}$, reflects the standard deviation of the sample means and can be estimated as $SEM \approx s / \sqrt{n}$, with $s$ being the standard deviation of the sample.
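As a minimal sketch of this estimate with plain Spark (the table path, column name, and sampling fraction are illustrative, not part of the issue):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, stddev_samp}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Draw a 1% sample and estimate the Standard Error of the Mean
// as s / sqrt(n), with s the sample standard deviation.
val sample = spark.read.parquet("/tmp/table").sample(0.01)
val row = sample
  .agg(stddev_samp("value").as("s"), count("value").as("n"))
  .first()
val sem = row.getDouble(0) / math.sqrt(row.getLong(1).toDouble)
```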

Tolerance is the Relative Standard Error (RSE) of the distribution of the sample means. The formula of the RSE can be expressed in terms of the Standard Error (SE) and the Estimated Mean ($\bar{x}$).

Consequently, the RSE can be estimated from the Standard Error ($SEM$) of the Sample Mean and the Estimated Mean ($\bar{x}$) with the formula $RSE = SEM / \bar{x}$.

Another way to put it is: "we want the error of the mean to be less than the tolerance applied to the estimated mean" ($z \cdot SEM \le tolerance \cdot \bar{x}$, with $z$ the z-score of the confidence level).

Both ways lead to the same equation, which allows determining the sample size as follows: $n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$

The Standard Error of the Mean, $\sigma / \sqrt{n}$, can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample. This can be done because of the assumption of normality.

The deviation of the sample mean from the population mean is the SEM, and we want the percentage of error with respect to the mean, which should have tolerance as an upper bound (the ratio of the error given by the SEM: $\frac{z \cdot SEM}{\bar{x}} \le tolerance$). Substituting $SEM \approx s / \sqrt{n}$ and solving for $n$ gives the equation above.
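As a small sketch of the resulting computation (the function name and default are illustrative; $z = 1.96$ corresponds to a 95% confidence level, matching the hardcoded zScore mentioned in the comments below):

```scala
// Smallest n such that zScore * (s / sqrt(n)) / mean <= tolerance,
// i.e. n >= (zScore * s / (tolerance * mean))^2.
def requiredSampleSize(
    s: Double,            // sample standard deviation
    mean: Double,         // estimated mean (problematic near 0, see comments below)
    tolerance: Double,    // admissible relative error, in [0, 1]
    zScore: Double = 1.96 // z-score for a 95% confidence level
): Long =
  math.ceil(math.pow(zScore * s / (tolerance * mean), 2)).toLong
```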

The scope of this issue is to collect all the information about the table tolerance and to guide future development.
Missing steps:

  • Formally define the algorithm for the table tolerance.
  • Implement the algorithm inside the qbeast-spark library.
  • Finish the .tolerance shortcut in io.qbeast.spark.implicits (see the sketch after this list).
  • Set up comprehensive confidence testing.
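For reference, one possible shape of that shortcut (hypothetical; the actual signature under io.qbeast.spark.implicits may differ, and this stub only validates the input and runs the full query):

```scala
import org.apache.spark.sql.{DataFrame, Row}

object ToleranceImplicits {
  implicit class ToleranceOps(val df: DataFrame) extends AnyVal {
    /** Evaluate the aggregation once its estimated relative standard
      * error is within `tolerance`. A real implementation would sample
      * iteratively using the sample-size formula above. */
    def tolerance(tolerance: Double): Row = {
      require(tolerance > 0.0 && tolerance <= 1.0,
        "tolerance must be in (0, 1]")
      df.first()
    }
  }
}
```

Importing ToleranceImplicits._ would then make calls like df.agg(avg("value")).tolerance(0.01) compile.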
@osopardo1 osopardo1 self-assigned this Jun 7, 2021
@alexeiakimov (Contributor) commented Sep 22, 2021

Here are some issues found while working on the tolerance feature.

  1. Right now, tolerance is defined only for the mean (or avg) function. A similar concept for other aggregate functions like min, max, etc. may need a different name and a different range of admissible values (for tolerance it is [0, 1]).
  2. Tolerance is defined as sampleDeviation * zScore / mean / sqrt(sampleSize). It is not clear whether tolerance is still meaningful if the mean is 0 or close to 0.
  3. The current implementation just extracts the column with the avg function and calculates the mean using samples of the whole table. Suppose the user specifies val df = spark.read.format("qbeast").load("...").where("value > 100").agg(avg("value")).tolerance(0.01). The sampling should apply the specified where condition, otherwise the returned average can be wrong (see the sketch after this list).
  4. zScore is hardcoded; should it be a parameter specified by the user?
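A sketch of the fix for point 3, in plain Spark (the table path, predicate, and sampling fraction are illustrative): the user's where clause must be applied to the sampled data before aggregating.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.format("qbeast").load("/tmp/qbeast-table")
// Keep the user's predicate on the sampled data so the estimated
// average is computed over the same rows the full query would see.
val estimate = df
  .sample(0.01)
  .where("value > 100")
  .agg(avg("value"))
```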

@alexeiakimov (Contributor) commented:

We should also define more precisely what kinds of queries we want to support, so the user has a clear understanding of whether a given query is supported or not. Can we define it in terms of an SQL syntax tree or something similar, maybe a bit informally?

@cugni cugni transferred this issue from another repository Sep 23, 2021
@osopardo1 osopardo1 added enhancement New feature or request help wanted Extra attention is needed labels Sep 28, 2021
@eavilaes eavilaes added this to To Do in Features Nov 2, 2021
@osopardo1 osopardo1 assigned cugni and unassigned osopardo1 Feb 6, 2023
@osopardo1 osopardo1 added documentation Improvements or additions to documentation and removed help wanted Extra attention is needed documentation Improvements or additions to documentation labels Mar 20, 2023