Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undeprecate quantile() for arrays #1159

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ron-wolf
Copy link

Fixes #1150 by undeprecating omission of the broadcast operator . when using quantile, quantile!, and _quantile!.

For reason, see this StatsBase comment. For full discussion, see JuliaStats/StatsBase.jl#586 and JuliaStats/Distributions.jl#1150.

Needed by StatsBase.jl for arrays

More info: JuliaStats/StatsBase.jl#586 (comment)
@codecov-commenter
Copy link

codecov-commenter commented Aug 17, 2020

Codecov Report

Merging #1159 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1159   +/-   ##
=======================================
  Coverage   79.91%   79.91%           
=======================================
  Files         115      115           
  Lines        5905     5905           
=======================================
  Hits         4719     4719           
  Misses       1186     1186           
Impacted Files Coverage Δ
src/deprecates.jl 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 253c3a0...8904276. Read the comment docs.

@nalimilan
Copy link
Member

Thanks. Can you add a test for the method too?

@ron-wolf
Copy link
Author

ron-wolf commented Oct 2, 2020

@nalimilan Sorry, I only just saw this. Where would I add a test? Also, are there any basic methods/tools or best practices for tests in Julia that I might not be aware of?

Tests quantile(collection) by way of iqr(collection). Fixes JuliaStats/Distributions#1150

See JuliaStats#1159 (comment)
@nalimilan
Copy link
Member

Sorry, I meant testing quantile on a distribution, not on a collection (collections are tested in StatsBase). I'm not sure where to put it, maybe in test/utils.jl?

@andreasnoack
Copy link
Member

@ron-wolf Bump. Would you be able to add a tests as suggested in #1159 (comment)?

@nalimilan
Copy link
Member

Bump.

@ron-wolf
Copy link
Author

ron-wolf commented Mar 5, 2021

@nalimilan

Sorry, I meant testing quantile on a distribution, not on a collection (collections are tested in StatsBase).

Looking back on this I was testing it on a distribution after all, not a collection. (If you’ll notice, I’m calling iqr(::Any), which dispatches to a non-weighted quantile, wherever that’s defined.) So would this just require a rename of the file and the test case?

@ron-wolf
Copy link
Author

ron-wolf commented Mar 5, 2021

Actually nevermind, I have completely lost track of what has to be changed and why. Why can't the definition for iqr just be changed to use quantile. instead of quantile? Why are we only undeprecating quantile and not the other implicitly broadcasted operators? In any case, what are the concrete steps involved? “I don’t know” is an acceptable answer.

@ron-wolf
Copy link
Author

ron-wolf commented Mar 5, 2021

I think I may have found a better solution to the point brought up in this comment?:

# we specify AbstractVector{<:Real} as the type for a discrete 
# distribution because it is the most generic type that describes 
# a sortable  sample: a discrete sequence of ordinal values that 
# can be sampled at certain positions

quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
    # ensure x is sorted, then sample it at the location/input p
    sorted || x = sort!(collect(x))
    ... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...

# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
    # ensure x is sorted, then sample it at each of the locations/inputs in p
    sorted || x = sort!(collect(x))
    quantile.(Ref(x), p; sorted=true)
end

# a quick re-dispatch to broadcasting; instead of deprecating 
# this, we just silently correct to the right behavior so 
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)

# TODO: is there some way to deprecate 
#   quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?

which would fix our efficiency woes, while also allowing this simple definition to remain in place:

iqr(x) = let
    q = quantile(x, [.25, .75])
    q[2] - q[1]
end

Are there any other functions which, like you said, have this optimization of sorting and then sampling at each point? If so, perhaps it would be best to enumerate them all in this issue, and then collaborate to write a define-them-all-in-one-go metaprogramming loop.

@matbesancon
Copy link
Member

what's the status here?

@devmotion
Copy link
Member

It seems in Distributions there is no performance improvement if quantile is not broadcasted but would support arrays (correct me if I am wrong). Thus it seems the PR would fix an inconsistency with the use of quantile in StatsBase by introducing an inconsistency with other functions such as pdf, cdf, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?

@nalimilan
Copy link
Member

Thus it seems the PR would fix an inconsistency with the use of quantile in StatsBase by introducing an inconsistency with other functions such as pdf, cdf, etc. in Distributions without any real benefit for Distributions. Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't?

Actually there would be a benefit for Distribution, as currently a deprecation warning is printed when calling iqr on a distribution object. See JuliaStats/StatsBase.jl#586. The problem is that to be efficient on both arrays and distributions, iqr has to call the vectorized method quantile(x, p) rather than quantile.(x, p). One alternative would be to override the definition of broacast for quantile in StatsBase to call the optimized method under the hood (i.e. to sort data only once).

@devmotion
Copy link
Member

devmotion commented Apr 24, 2021

One alternative would be to override the definition of broacast for quantile in StatsBase to call the optimized method under the hood (i.e. to sort data only once).

Another possible approach would be to add something like

function StatsBase.iqr(d::Distribution)
    q = map(Base.Fix1(quantile, d), (0.25, 0.75))
    return q[2] - q[1]
end

or even just

function StatsBase.iqr(d::Distribution)
    q1 = quantile(d, 0.25)
	q2 = quantile(d, 0.75)
    return q2 - q1
end

to Distributions, it seems?

@nalimilan
Copy link
Member

Yeah that would be simpler fix. Not sure whether other functions would need the same change.

@ron-wolf
Copy link
Author

ron-wolf commented Aug 31, 2021

Wouldn't be the most natural solution just be that StatsBase supports arrays (since it improves performance) but Distributions doesn't? —@devmotion

In the interest of getting this done, I think Step 1 is to figure out what state we want the code to end up in—in a module-agnostic fashion. Step 2, then, would be figuring out what belongs to Distributions vs. StatsBase; this can be considered a secondary concern.

Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:

StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)

And this was my sketch:

# we specify AbstractVector{<:Real} as the type for a discrete 
# distribution because it is the most generic type that describes 
# a sortable  sample: a discrete sequence of ordinal values that 
# can be sampled at certain positions

quantile(x::AbstractVector{<:Real}, p::Real; sorted=false) = let
    # ensure x is sorted, then sample it at the location/input p
    sorted || x = sort!(collect(x))
    ... # do your sampling magic!
end
quantile(x::Any, p::Real) = ...

# dispatch on discrete distributions, both sorted & unsorted
quantile(x::AbstractVector{<:Real}, p::AbstractWeights; sorted=false) = let
    # ensure x is sorted, then sample it at each of the locations/inputs in p
    sorted || x = sort!(collect(x))
    quantile.(Ref(x), p; sorted=true)
end

# a quick re-dispatch to broadcasting; instead of deprecating 
# this, we just silently correct to the right behavior so 
# end-users don't have to worry about the distinction
@inline quantile(x::Any, p::AbstractWeights) = quantile.(Ref(x), p)

# TODO: is there some way to deprecate 
#   quantile.(::AbstractVector{<:Real}, p::AbstractWeights),
# which is now radically inefficient?

Any thoughts on whether one of these sketches, or a different sketch, would be the best way forward?

@nalimilan
Copy link
Member

Regarding the actual code to be written, how do we feel about the merits of the two outlines we’ve developed thus far? To reiterate, this code snippet communicates the main thrust of @devmotion’s sketch:

StatsBase.iqr(d::Distribution) = quantile(d, 0.75) - quantile(d, 0.25)

AFAICT that's the best solution.

@ron-wolf
Copy link
Author

ron-wolf commented Sep 3, 2021

Okay. Can someone author a PR over here for me to accept, so as to have the full changeset show up in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Undeprecate no-dot syntax for quantile() and nquantile()
6 participants