Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgraded to t-digest 3.3. #3634

Merged
merged 8 commits into from
Jun 24, 2022
Merged

Conversation

dblock
Copy link
Member

@dblock dblock commented Jun 20, 2022

Signed-off-by: dblock [email protected]

Description

The upgrade to t-digest 3.3 fixes a number of bugs in calculating percentiles.

Looking at sample output, the old version 3.2 was interpolating data (see #3634 (comment) for an explanation) and producing different (wrong) results, especially in the small sample size. For example, given input of [1, 51, 101, 151] the 25th, 50th and 75th percentiles changed from [26,76,126] to [51,101,151] with this upgrade.

The tests in this PR have been adjusted to reflect the new expected percentiles, and I added both a 2.x mixed cluster test, and made all the other tests select a 3.x node to preserve a trail of this change. I also corrected the assumption that the number of centroids is <= that the number of data points, not =.

Because results change significantly I think this is a 3.x change and should not be back-ported, but open to other considerations.

There are many changes between t-digest 3.2 and 3.3, see tdunning/t-digest#194.

Issues Resolved

Closes #1756.

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@dblock dblock requested review from a team and reta as code owners June 20, 2022 15:30
@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 2a5132460f22fcf4bc4a0f6f1e267467ffb5d9c4
Log 6149

Reports 6149

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 89125fc
Log 6150

Reports 6150

@dblock
Copy link
Member Author

dblock commented Jun 20, 2022

Ok, doesn't look so simple...

org.opensearch.search.aggregations.metrics.InternalTDigestPercentilesRanksTests > testEqualsAndHashcode FAILED
    java.lang.AssertionError: expected:<64> but was:<81>
        at __randomizedtesting.SeedInfo.seed([DF508FEF9F0E8839:AE5FF72250E9C116]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.junit.Assert.assertEquals(Assert.java:633)
        at org.opensearch.search.aggregations.metrics.InternalTDigestPercentilesRanksTests.createTestInstance(InternalTDigestPercentilesRanksTests.java:56)
        at org.opensearch.search.aggregations.metrics.InternalTDigestPercentilesRanksTests.createTestInstance(InternalTDigestPercentilesRanksTests.java:42)

tdunning/t-digest#171 (comment)

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 3c20434
Log 6152

Reports 6152

@dblock
Copy link
Member Author

dblock commented Jun 20, 2022

java.lang.AssertionError: aggregations.percentiles_int.values.25\.0 didn't match expected value:
        aggregations.percentiles_int.values.25\.0: expected Double [26.0] but was Double [51.0]
            at org.opensearch.test.rest.yaml.section.MatchAssertion.doAssert(MatchAssertion.java:115)
            at org.opensearch.test.rest.yaml.section.Assertion.execute(Assertion.java:89)
            at org.opensearch.test.rest.yaml.OpenSearchClientYamlSuiteTestCase.executeSection(OpenSearchClientYamlSuiteTestCase.java:447)

@dblock dblock marked this pull request as draft June 21, 2022 17:50
@dblock
Copy link
Member Author

dblock commented Jun 21, 2022

@tdunning Would you mind assisting here a bit please?

The upgrade from 3.2 to 3.3 produces different percentiles in some scenarios given very simple data. Neither the old data nor the new data is "correct", but that is expected I imagine given that we use t-digest. You can see raw data in https://github.com/opensearch-project/OpenSearch/blob/37651e9b5fe914a99f0abe0a36e10bd46d958691/rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/180_percentiles_tdigest_metric.yml and the diff in this PR for how those changed. The data is just 4 values: 1, 51, 101 and 151, Google Sheets results below.

Data Percentile Result
1 1 2.5
51 5 8.5
101 25 38.5
151 50 76
75 113.5
95 143.5
99 149.5
100 151

I didn't expect such a big difference in a dot release. At the very least I'd like to understand whether this is expected, and whether this is going to have to be released as a major breaking change for OpenSearch users.

More different results with smaller tests: bb9e8f2.

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 37651e9
Log 6185

Reports 6185

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 6b03882
Log 6192

Reports 6192

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure bb9e8f2
Log 6193

Reports 6193

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 78e4c08
Log 6195

Reports 6195

@tdunning
Copy link

tdunning commented Jun 21, 2022 via email

@tdunning
Copy link

Sorry... I missed your very fine explanation.

I understand now that you were doing a regression test against previous behavior and were surprised at a change in this behavior.

The fact is, however, this old behavior was a bug. That bug was fixed.

If we look at the quantile curve for your data, we see this:

image

The circles indicate an open boundary and the filled dots indicate a closed one. Because we retain all of the data, we can't in good faith interpolate. The only question is whether the quantile at exactly 0.25 should be 1 or 51. In t-digest, I settled on the lower value. The old code was interpolating and was just wrong.

The upgrade from 3.2 to 3.3 produces different percentiles in some scenarios given very simple data. Neither the old data nor the new data is "correct", but that is expected I imagine given that we use t-digest. You can see raw data in https://github.com/opensearch-project/OpenSearch/blob/37651e9b5fe914a99f0abe0a36e10bd46d958691/rest-api-spec/src/main/resources/rest-api-spec/test/search.aggregation/180_percentiles_tdigest_metric.yml and the diff in this PR for how those changed. The data is just 4 values: 1, 51, 101 and 151.

Data Percentile Result
1 1 2.5
51 5 8.5
101 25 38.5
151 50 76
75 113.5
95 143.5
99 149.5
100 151
I didn't expect such a big difference in a dot release. At the very least I'd like to understand whether this is expected, and whether this is going to have to be released as a major breaking change for OpenSearch users.

@tdunning
Copy link

In case you are curious, a similar issue arises with the cdf function. There, the graph for you data looks like this:

image

Here, what I have chosen is to use the mid point when you ask for the CDF at exactly a sample point. This gets a bit fancier when there are multiple samples at just the same point. In general, I take the CDF to be the

image

@dblock
Copy link
Member Author

dblock commented Jun 22, 2022

@tdunning Thank you! This is super clear.

@opensearch-ci-bot
Copy link
Collaborator

❌   Gradle Check failure 633595636b789bec83313b4ee346d3d114e6c3f5
Log 6219

Reports 6219

@opensearch-ci-bot
Copy link
Collaborator

✅   Gradle Check success 83be47b
Log 6227

Reports 6227

@dblock
Copy link
Member Author

dblock commented Jun 28, 2022

@kartg I do care about users more than merge conflicts, but I hear you. Any feelings about user impact?

@kavilla
Copy link
Member

kavilla commented Jun 29, 2022

One of the functional tests from OpenSearch Dashboards displayed the incorrect value [link it issue]. We will update the value on main but keep the 2.x branches untouched.

kavilla added a commit to kavilla/OpenSearch-Dashboards-1 that referenced this pull request Jun 29, 2022
Origin:
opensearch-project/OpenSearch#3634

The previous value was actually incorrect after OpenSearch bumped t-digest
the value is now the correct value.

Issue:
opensearch-project#1821

Signed-off-by: Kawika Avilla <[email protected]>
@sharp-pixel
Copy link
Contributor

I settled on the lower value.

Then shouldn't the result be [1, 51, 101]?

[1, 51, 101] is the result I get from Mathematica as well:

vec = {1, 51, 101, 151};
Quantile[vec, #] & /@ {1/4, 1/2, 3/4}
{1, 51, 101}

@dblock
Copy link
Member Author

dblock commented Jun 29, 2022

@tdunning ^

kavilla added a commit to kavilla/OpenSearch-Dashboards-1 that referenced this pull request Jun 29, 2022
Origin:
opensearch-project/OpenSearch#3634

The previous value was actually incorrect after OpenSearch bumped t-digest
the value is now the correct value.

Issue:
opensearch-project#1821

Signed-off-by: Kawika Avilla <[email protected]>
kavilla added a commit to kavilla/OpenSearch-Dashboards-1 that referenced this pull request Jun 29, 2022
Origin:
opensearch-project/OpenSearch#3634

The previous value was actually incorrect after OpenSearch bumped t-digest
the value is now the correct value.

Issue:
opensearch-project#1821

Signed-off-by: Kawika Avilla <[email protected]>
kavilla added a commit to opensearch-project/OpenSearch-Dashboards that referenced this pull request Jun 30, 2022
* [Tests] update expected value for percentile ranks

Origin:
opensearch-project/OpenSearch#3634

The previous value was actually incorrect after OpenSearch bumped t-digest
the value is now the correct value.

Issue:
#1821

Signed-off-by: Kawika Avilla <[email protected]>

* skip inconsistent values

Signed-off-by: Kawika Avilla <[email protected]>

* use slice

Signed-off-by: Kawika Avilla <[email protected]>
imRishN pushed a commit to imRishN/OpenSearch that referenced this pull request Jul 3, 2022
@kavilla
Copy link
Member

kavilla commented Jul 5, 2022

So when updated the values for this test it seemed to get inconsistent values for the 50th, 75th, and 95th percentile for example: https://github.com/opensearch-project/OpenSearch-Dashboards/runs/7120932868?check_suite_focus=true

and

opensearch-project/OpenSearch-Dashboards#1822 (comment)

@dblock
Copy link
Member Author

dblock commented Jul 5, 2022

@kavilla Are you sure? I felt like I was getting something similar, but turned out the tests were seeded with some random value.

In any case if you are sure, open a new issue?

@tdunning
Copy link

tdunning commented Jul 5, 2022

I am happy to comment on the t-digest side of things if somebody can say what the test is actually doing.

@dblock
Copy link
Member Author

dblock commented Jul 6, 2022

I settled on the lower value.

Then shouldn't the result be [1, 51, 101]?

[1, 51, 101] is the result I get from Mathematica as well:

vec = {1, 51, 101, 151};
Quantile[vec, #] & /@ {1/4, 1/2, 3/4}
{1, 51, 101}

@tdunning Could you check out the above, please?

@dblock
Copy link
Member Author

dblock commented Nov 1, 2022

@kavilla Want to open an issue in t-digest re: ^ ?

@tdunning
Copy link

tdunning commented Nov 1, 2022 via email

@dblock
Copy link
Member Author

dblock commented Nov 2, 2022

@tdunning The images didn't make it to GitHub, if you care to edit, but thanks for your explanation!

@sharp-pixel
Copy link
Contributor

From https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/quantile and https://mathworld.wolfram.com/Quantile.html, we could choose between 9 standardized definitions.

To make sure we are comparing the same things - especially in unit tests - we probably should decide on a default one, and optionally enable the choice of the others types.

Mathematica uses type 1 by default, R uses type 7 by default, but both provide the option to choose.

@tdunning
Copy link

tdunning commented Nov 2, 2022

@tdunning The images didn't make it to GitHub, if you care to edit, but thanks for your explanation!

Sorry about that. The image is very similar to what I posted in an earlier comment. I have edited my reply but in editing the response, I had difficulty getting the image to show up.

image

@tdunning
Copy link

tdunning commented Nov 3, 2022

So I have been experimenting a fair bit with the Julia implementation (easier than playing with the Java version because interactive).

I have changed the problem in question a tiny bit to make it more clear what is happening. I am using points at [1,11,15,30].

My first experiment was to verify that the cdf function performs as expected. It does. In particular, the cdf at exactly the sample points shows the desired interpolation behavior. Here is a picture produced by scanning x in small increments to get the blue line and then plotting points at exactly the sample points:

image

The point of real interest, however, was to determine how the quantile function behaves. The following plot shows that the quantile (wider blue line) and cdf (thin gray line) functions lay right on top of each other. Further, evaluating quantile just before and just after q=[0.25,0.5,0.75] (green and purple dots) we see the desired boundary behavior.

image

I would contend that it is hard to do better than this due to inevitable floating point limits.

@dblock , @kavilla , @sharp-pixel what do you think?

Also, I looked into the R and Julia implementations of the quantile function. In fact, they are trying to estimate the theoretical distribution rather than the empirical inverse cdf. This is a different problem entirely. Adding the Julia quantile to the graph shows what I mean

image

The result is far from the empirical inverse cdf function.

@sharp-pixel
Copy link
Contributor

sharp-pixel commented Nov 3, 2022

Thanks @tdunning.

It seems Julia uses type 7 (from https://docs.julialang.org/en/v1/stdlib/Statistics/#Statistics.quantile!), so we just need to pick one quantile function type as the default, document it, and optionally have the choice to override the type.

@tdunning
Copy link

tdunning commented Nov 3, 2022

I am not so sure of that.

The different types of quantile estimation are all geared toward estimating a population quantile function assuming that the data we have is only a sample of that population.

That's an important problem.

But it isn't what t-digest is intended to do.

Instead, t-digest is intended to estimate the cdf and inverse cdf of the data we are given as it actually is. This refers to the empirical distribution as opposed to the population CDF. This is much simpler in many ways than trying to estimate the population, but it can be confusing because of the collision on the name "quantile".

There is clearly a problem here (user confusion is indisputably a problem), but I really think that the correct action here is to fix documentation on both TDigest.quantile and in the overview.

@dblock
Copy link
Member Author

dblock commented Nov 7, 2022

I opened #5115 to discuss the user-facing aspects of this. Because there's no one correct result at the edges, I think we could support multiple strategies to everybody's satisfaction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] o.o..search.aggregations.metrics.TDigestStateTests.testMoreThan4BValues failure
7 participants