feat: New quantile interpolation method & QUANTILE_DISC function in SQL #19139

Merged
merged 14 commits into from
Oct 16, 2024

pomo-mondreganto
Contributor

This is a follow-up to #18047. As suggested in the comment there, the current quantile interpolation methods are lacking, so this PR adds a new interpolation method that works as described here and uses it in the QUANTILE_DISC SQL function. Conformance tests against DuckDB are included.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Oct 8, 2024
@pomo-mondreganto
Contributor Author

cc @alexander-beedie, let's continue here


codecov bot commented Oct 8, 2024

Codecov Report

Attention: Patch coverage is 87.88927% with 35 lines in your changes missing coverage. Please review.

Project coverage is 80.01%. Comparing base (900dc3b) to head (ce001bb).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
.../polars-python/src/lazyframe/visitor/expr_nodes.rs 0.00% 7 Missing ⚠️
...ow/src/legacy/kernels/rolling/no_nulls/quantile.rs 79.16% 5 Missing ⚠️
crates/polars-sql/src/functions.rs 80.00% 4 Missing ⚠️
...arrow/src/legacy/kernels/rolling/nulls/quantile.rs 83.33% 3 Missing ⚠️
crates/polars-core/src/frame/group_by/mod.rs 0.00% 3 Missing ⚠️
...polars-core/src/chunked_array/ops/aggregate/mod.rs 97.82% 2 Missing ⚠️
...s-core/src/chunked_array/ops/aggregate/quantile.rs 94.87% 2 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs 0.00% 2 Missing ⚠️
...s/polars-plan/src/dsl/functions/syntactic_sugar.rs 0.00% 2 Missing ⚠️
crates/polars-python/src/conversion/mod.rs 71.42% 2 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19139      +/-   ##
==========================================
+ Coverage   79.99%   80.01%   +0.01%     
==========================================
  Files        1527     1527              
  Lines      209203   209138      -65     
  Branches     2415     2415              
==========================================
- Hits       167352   167334      -18     
+ Misses      41303    41256      -47     
  Partials      548      548              


@orlp
Collaborator

orlp commented Oct 11, 2024

Alright, correct me if I'm wrong, but (modulo implementation bugs), I believe our current interpolation methods all share the same properties:

  1. They distribute the elements evenly along the number line [0, 1], with the first element at 0 and the last element at 1. This means that, logically speaking, the ith element is found at position i / (n - 1).
  2. If p == i / (n-1) exactly, we return the ith element unchanged. Otherwise we interpolate between the element before it and after it.
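
The two properties above can be sketched in plain Python. This is only an illustration under stated assumptions: `positional_quantile` is a hypothetical helper, not Polars code, and the tie rule used for `"nearest"` is an arbitrary choice made for the sketch.

```python
import math

def positional_quantile(values, p, method="linear"):
    """Hypothetical sketch: element i of the sorted data sits at i / (n - 1)."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")

    pos = p * (n - 1)            # fractional index into the sorted data
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:                 # p == i / (n - 1): return that element unchanged
        return data[lo]

    # Otherwise combine the element before and the element after the position.
    if method == "lower":
        return data[lo]
    if method == "higher":
        return data[hi]
    if method == "nearest":
        # Tie handling at exactly .5 is an assumption of this sketch.
        return data[lo] if pos - lo < 0.5 else data[hi]
    if method == "linear":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError(f"unknown method: {method}")
```

For `[10, 20, 30, 40]` and `p = 0.5` the position is 1.5, so `"lower"` picks 20, `"higher"` picks 30, and `"linear"` returns 25.0.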

In this context our interpolation methods can be visualized as such, as an example for n = 4 elements:

image

This PR proposes to add another interpolation method which breaks the above convention. It does not distribute the elements along the number line [0, 1] starting at 0 and ending at 1; instead it chops up the number line [0, 1] into equal parts, assigning one value to each part in order:

image

Not shown is what to do about the boundaries; my preference would be left-closed if we had to choose.
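
A minimal sketch of the "equal parts" idea, with both boundary conventions spelled out. The function name and the `closed` parameter are hypothetical and exist only for this illustration; the thread does not state which convention the merged code uses.

```python
import math

def equiprobable_quantile(values, p, closed="left"):
    """Hypothetical sketch: split [0, 1] into n equal parts, one per sorted value."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")

    if closed == "left":
        # Part i covers [i/n, (i+1)/n); p == 1 is clamped to the last value.
        idx = min(math.floor(p * n), n - 1)
    elif closed == "right":
        # Part i covers (i/n, (i+1)/n]; p == 0 is clamped to the first value.
        idx = max(math.ceil(p * n) - 1, 0)
    else:
        raise ValueError(f"unknown boundary convention: {closed}")
    return data[idx]
```

The two conventions only disagree when `p` falls exactly on a boundary: for `[10, 20, 30, 40]` and `p = 0.25`, left-closed returns 20 and right-closed returns 10, while `p = 0.6` returns 30 either way.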

Is my understanding correct? If so I have two issues with it:

  1. I think the name Equiprobable is better than Bucket.
  2. I'm not sure this should be an interpolation method instead of a different method altogether, as it does break our above properties that all the other interpolation methods share. If not a different method perhaps we should rename the 'interpolation' parameter to 'method'.

@pomo-mondreganto
Contributor Author

Yes, you're exactly right. I agree that Equiprobable is a better name, I'll update my PR. Regarding the second point, there are a few pieces of information that are relevant to this PR:

  1. The main motivation for this PR is the previous discussion in feat: Quantile function in SQL #18047 regarding the QUANTILE_DISC function. As far as I understand, Polars SQL is trying to be compatible with DuckDB's or PostgreSQL's implementation, and both use equiprobable discrete quantiles under the hood.
  2. I agree with @soerenwolfers in the previous PR that the default expectation of a discrete quantile for a set is that it returns each element with equal probability, and all current interpolation options lack this characteristic. I'd say that the default method should be either Linear or Equiprobable, as the other methods don't seem statistically useful. For example, I'm planning to use the proposed QUANTILE_DISC for metrics, and as I see it the best option for a "good enough" discrete quantile in my charts is Nearest, but I'd switch to Equiprobable if this is merged.
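
To make the "equal probability" point concrete, here is a small sketch that counts how often each of n = 4 elements is selected as p sweeps a fine grid over [0, 1]. The helper name and grid size are assumptions made for the illustration, and the equiprobable rule uses the left-closed convention.

```python
import math
from collections import Counter

def selection_counts(n=4, steps=1200):
    """Hypothetical sketch: how often each index is picked over a grid of p values."""
    nearest = Counter()
    equiprobable = Counter()
    for k in range(steps):
        # Grid midpoints; with the defaults no p lands exactly on a boundary.
        p = (k + 0.5) / steps
        # Nearest: round the position p * (n - 1) to the closest index.
        nearest[math.floor(p * (n - 1) + 0.5)] += 1
        # Equiprobable: index of the equal-width part that contains p.
        equiprobable[min(math.floor(p * n), n - 1)] += 1
    return (
        [nearest[i] for i in range(n)],
        [equiprobable[i] for i in range(n)],
    )

print(selection_counts())
# → ([200, 400, 400, 200], [300, 300, 300, 300])
```

With Nearest, the first and last elements each get only half the share of the interior elements, whereas the equal-parts rule gives every element a quarter of the grid.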

Renaming to "Method" seems good enough (current discrete options like Lower or Higher are not exactly "interpolation" either), but that would be a backward-incompatible change and I don't know what the Polars policy is on that.

@orlp
Collaborator

orlp commented Oct 12, 2024

As long as it's just on the Rust side for now we don't have to worry about the renaming of interpolation to method just yet. And we can do such renames nicely with deprecation.

@pomo-mondreganto
Contributor Author

@orlp I've renamed the Options to Method, Bucket to Equiprobable and updated almost all variable names. The only case I didn't rename is the public-facing Python bindings, as someone could pass interpolation= as a kwarg, which would break.

@pomo-mondreganto
Contributor Author

Is anything else needed here from me?

@orlp
Collaborator

orlp commented Oct 16, 2024

@pomo-mondreganto Looks good to me, @alexander-beedie can you review the SQL bits? Then we can merge this.

Collaborator

@alexander-beedie alexander-beedie left a comment

Looks good to me too; the SQL integration is sound, and the results tie out with other implementations. Very nice addition👌

@orlp orlp merged commit 2736621 into pola-rs:main Oct 16, 2024
20 checks passed