
Implement the table tolerance function #12

Open · 4 tasks

cugni opened this issue May 21, 2021 · 2 comments
@cugni (Member) commented May 21, 2021

What is Tolerance?

Tolerance is the error a user allows in an aggregation, within a confidence interval. That means that, given a CI of 95% for example, in 95 out of 100 runs of the same query the answer would have a relative error within $[0.0, \text{tolerance}]$.
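In symbols (using $\hat{x}$ for the sampled estimate and $x$ for the exact answer, notation introduced here for clarity), that guarantee reads:

$$P\left(\frac{|\hat{x} - x|}{x} \le \text{tolerance}\right) \ge 0.95$$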

How do we calculate Tolerance?

The idea is that, with a user-provided tolerance value, we can estimate the sample size required to answer a query that computes the mean with a predefined level of certainty.

Given a confidence level of, say, 95%, we want to determine a confidence interval for which 95% of all our mean estimations will fall within the target range. That is, given a sample of size $n$ drawn from a population (with $\mu$ and $\sigma^2$ as population mean and variance, respectively), determine the confidence interval of the sample mean so it has a 95% chance of containing $\mu$.

In other words, for some margin of error $\epsilon$: $P(\bar{x} - \epsilon \le \mu \le \bar{x} + \epsilon) = 0.95$.

Here, the Central Limit Theorem is taken into account:

Regardless of the distribution of the population (as long as $\mu$ and $\sigma^2$ are finite), the distribution of the sample means is approximately normal: $\bar{x} \sim N(\mu, \sigma^2 / n)$.

As well as the notion of Standard Error of the Mean:

Given a single sample of size $n$, how can we determine how far its mean $\bar{x}$ is from the population mean $\mu$? The answer, $SEM = \sigma / \sqrt{n}$, reflects the standard deviation of the sample means and can be estimated as $SEM \approx s / \sqrt{n}$, with $s$ being the standard deviation of the sample.
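As a minimal sketch of this estimate with plain Spark (the table path, column name, and sampling fraction are illustrative, not part of the issue):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, stddev_samp}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Draw a 1% sample and estimate the Standard Error of the Mean
// as s / sqrt(n), with s the sample standard deviation.
val sample = spark.read.parquet("/tmp/table").sample(0.01)
val row = sample
  .agg(stddev_samp("value").as("s"), count("value").as("n"))
  .first()
val sem = row.getDouble(0) / math.sqrt(row.getLong(1).toDouble)
```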

Tolerance is the Relative Standard Error (RSE) of the distribution of the sample means. The formula of the RSE can be expressed in terms of the Standard Error (SE) and the Estimated Mean ($\bar{x}$).

Consequently, the RSE can be estimated from the Standard Error ($SEM$) of the Sample Mean and the Estimated Mean ($\bar{x}$) with the formula $RSE = SEM / \bar{x}$.

Another way to put it is: "we want the error of the mean to be less than the tolerance applied to the estimated mean" ($z \cdot SEM \le tolerance \cdot \bar{x}$, with $z$ the z-score of the confidence level).

Both ways lead to the same equation, which allows determining the sample size as follows: $n \ge \left(\frac{z \cdot s}{tolerance \cdot \bar{x}}\right)^2$

The Standard Error of the Mean, $\sigma / \sqrt{n}$, can be estimated as $s / \sqrt{n}$, with $s$ being the standard deviation of the sample. This can be done because of the assumption of normality.

The deviation of the sample mean from the population mean is the SEM, and we want the percentage of error with respect to the mean, which should have tolerance as an upper bound (the ratio of the error given by the SEM: $\frac{z \cdot SEM}{\bar{x}} \le tolerance$). Substituting $SEM \approx s / \sqrt{n}$ and solving for $n$ gives the equation above.
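As a small sketch of the resulting computation (the function name and default are illustrative; $z = 1.96$ corresponds to a 95% confidence level, matching the hardcoded zScore mentioned in the comments below):

```scala
// Smallest n such that zScore * (s / sqrt(n)) / mean <= tolerance,
// i.e. n >= (zScore * s / (tolerance * mean))^2.
def requiredSampleSize(
    s: Double,            // sample standard deviation
    mean: Double,         // estimated mean (problematic near 0, see comments below)
    tolerance: Double,    // admissible relative error, in [0, 1]
    zScore: Double = 1.96 // z-score for a 95% confidence level
): Long =
  math.ceil(math.pow(zScore * s / (tolerance * mean), 2)).toLong
```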

The scope of this issue is to collect all the information about the table tolerance and to guide future development.
Missing steps:

  • Formally define the algorithm for the table tolerance.
  • Implement the algorithm inside the qbeast-spark library.
  • Finish the .tolerance shortcut in io.qbeast.spark.implicits (see the sketch after this list).
  • Set up comprehensive confidence testing.
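For reference, one possible shape of that shortcut (hypothetical; the actual signature under io.qbeast.spark.implicits may differ, and this stub only validates the input and runs the full query):

```scala
import org.apache.spark.sql.{DataFrame, Row}

object ToleranceImplicits {
  implicit class ToleranceOps(val df: DataFrame) extends AnyVal {
    /** Evaluate the aggregation once its estimated relative standard
      * error is within `tolerance`. A real implementation would sample
      * iteratively using the sample-size formula above. */
    def tolerance(tolerance: Double): Row = {
      require(tolerance > 0.0 && tolerance <= 1.0,
        "tolerance must be in (0, 1]")
      df.first()
    }
  }
}
```

Importing ToleranceImplicits._ would then make calls like df.agg(avg("value")).tolerance(0.01) compile.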
@osopardo1 osopardo1 self-assigned this Jun 7, 2021
@alexeiakimov (Contributor) commented Sep 22, 2021

Here are some issues found while working on the tolerance feature.

  1. Right now, tolerance is defined only for the mean (or avg) function. A similar concept for other aggregate functions like min, max, etc. may need a different name and a different range of admissible values (for tolerance it is [0, 1]).
  2. Tolerance is defined as sampleDeviation * zScore / mean / sqrt(sampleSize). It is not clear whether tolerance is still meaningful if the mean is 0 or close to 0.
  3. The current implementation just extracts the column with the avg function and calculates the mean using samples of the whole table. Suppose the user specifies val df = spark.read.format("qbeast").load("...").where("value > 100").agg(avg("value")).tolerance(0.01). The sampling should apply the specified where condition, otherwise the returned average can be wrong (see the sketch after this list).
  4. zScore is hardcoded; should it be a parameter specified by the user?
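A sketch of the fix for point 3, in plain Spark (the table path, predicate, and sampling fraction are illustrative): the user's where clause must be applied to the sampled data before aggregating.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df = spark.read.format("qbeast").load("/tmp/qbeast-table")
// Keep the user's predicate on the sampled data so the estimated
// average is computed over the same rows the full query would see.
val estimate = df
  .sample(0.01)
  .where("value > 100")
  .agg(avg("value"))
```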

@alexeiakimov (Contributor) commented:

We should also define more precisely what kinds of queries we want to support, so the user has a clear understanding of whether a given query is supported or not. Can we define it in terms of an SQL syntax tree or something similar, maybe a bit informally?

@cugni cugni transferred this issue from another repository Sep 23, 2021
@osopardo1 osopardo1 added enhancement New feature or request help wanted Extra attention is needed labels Sep 28, 2021
@eavilaes eavilaes added this to To Do in Features Nov 2, 2021
@osopardo1 osopardo1 assigned cugni and unassigned osopardo1 Feb 6, 2023
@osopardo1 osopardo1 added documentation Improvements or additions to documentation and removed help wanted Extra attention is needed documentation Improvements or additions to documentation labels Mar 20, 2023