Faster computation of quantiles in `describe` #2909

nalimilan · 2021-10-15T20:14:21Z

Computing all quantiles when we only need the median is significantly slower (and it's even worse than it should be due to https://github.com/JuliaLang/Statistics.jl/issues/84).
Also avoid trying to compute quantiles for string columns, since the failure only happens after sorting the vector, which is almost all of the work.

julia> df = DataFrame(x=string.(rand('a':'k', 10_000)));

# current main
julia> @btime describe(df);
  844.866 μs (147 allocations: 88.03 KiB)

# computing only the median
julia> @btime describe(df);
  480.801 μs (146 allocations: 87.92 KiB)

# not computing the median at all (PR)
julia> @btime describe(df);
  198.739 μs (142 allocations: 9.67 KiB)

julia> df = DataFrame(x=rand(1:10, 10_000));

# current main
julia> @btime describe(df);
  181.266 μs (108 allocations: 85.55 KiB)

# PR
julia> @btime describe(df);
  79.170 μs (100 allocations: 85.19 KiB)
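
The two cases benchmarked above can be illustrated with a small sketch. It only uses `median` and `quantile` from the `Statistics` standard library; the vectors `nums` and `strs` are made up for illustration and are not taken from the PR.

```julia
using Statistics

nums = [3, 1, 4, 1, 5, 9, 2, 6]

# The median alone and the 0.5 entry of a full quantile call agree,
# so asking only for the median loses nothing when that is all we need.
qs = quantile(nums, [0.25, 0.5, 0.75])
med = median(nums)

# Strings can be ordered but not averaged, so `median` is expected to throw,
# and it does so only after it has already (partially) sorted its copy.
strs = ["b", "a", "d", "c"]
err = try
    median(strs)
catch e
    e
end
```

Here `qs[2]` and `med` hold the same value, while `err` holds the exception raised for the string vector.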

bkamins

Looks good. We could add a correctness test.
Also note that this assumes that some custom subtype of AbstractString does not define arithmetic operations on it, but I assume it is OK to make this assumption in practice.

nalimilan · 2021-10-15T20:30:07Z

> Looks good. We could add a correctness test.

We already cover this AFAICT.

> Also note that this assumes that some custom subtype of AbstractString does not define arithmetic operations on it, but I assume it is OK to make this assumption in practice.

Yes, in theory that's possible, though I'm not sure what sense it would make... An alternative would be to call middle on the first element when the eltype is concrete to get the error immediately. That would be more general so maybe a better solution actually?
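
A minimal sketch of that alternative, assuming a hypothetical helper named `try_median` (this is not the code that was merged into DataFrames.jl): probe `middle` on the first element when the element type is concrete, and skip the sort entirely if the probe fails.

```julia
using Statistics

# Hypothetical helper for illustration only.
# Returns `nothing` when the median cannot be computed.
function try_median(x::AbstractVector)
    isempty(x) && return nothing
    if isconcretetype(eltype(x))
        # Cheap probe: if `middle` is not defined for this element type,
        # give up before paying for the sort.
        try
            middle(first(x))
        catch
            return nothing
        end
    end
    # Fall back to the full computation, still guarding against failure
    # for non-concrete element types.
    try
        return median(x)
    catch
        return nothing
    end
end
```

With this sketch, `try_median(["b", "a", "c"])` returns `nothing` without sorting, while `try_median([1, 2, 3, 4])` returns the usual median. Because the probe relies on `middle` rather than on the element type, a custom string type that did define the required arithmetic would still be handled.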

pdeffebach · 2021-10-15T20:30:19Z

Can we be even smarter about this and use sorted = true based on some flag we make as we try different quantile! stuff? It's not clear what "partially sorted" means in the documentation.

nalimilan · 2021-10-15T20:32:34Z

> Can we be even smarter about this and use sorted = true based on some flag we make as we try different quantile! stuff? It's not clear what "partially sorted" means in the documentation.

"Partially sorted" refers to partialsort. Basically, the element(s) which are at the quantile's position have to be sorted, others can be anywhere. But I'm not sure how passing sorted=true would help?
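
As a rough illustration of what "partially sorted" means here, the following sketch uses `partialsort!` from Base and `quantile!` from `Statistics` on a made-up vector `v`; the variable names are illustrative only.

```julia
using Statistics

v = [9, 1, 8, 2, 7, 3, 6, 4, 5]
k = (length(v) + 1) ÷ 2      # median position for an odd-length vector

# After this call, position k holds the value it would have in a fully
# sorted vector; the remaining elements are not guaranteed to be in order.
partialsort!(v, k)
at_k = v[k]

# `quantile!` does the same kind of partial sort in place before reading
# the element(s) at the quantile's position.
q = quantile!(copy(v), 0.5)
```

Both `at_k` and `q` equal the median of the data, even though `v` itself may not be fully sorted afterwards.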

pdeffebach · 2021-10-15T20:34:01Z

Ah, I see. Never mind, this looks good.

nalimilan · 2021-10-15T21:26:12Z

I've pushed a commit implementing the more general approach, let me know what you think.

It's funny to note that on Julia >= 1.7, thanks to https://github.com/JuliaLang/Statistics.jl/pull/72, a test on Dates is now able to compute the median, which fails with current DataFrames main because computing the third quartile fails (but the median works as it's an integer value...).

Faster computation of quantiles in describe

528d8a4

bkamins approved these changes Oct 15, 2021

bkamins added the performance label Oct 15, 2021

bkamins added this to the patch milestone Oct 15, 2021

More general approach

c9c249b

bkamins reviewed Oct 16, 2021

bkamins reviewed Oct 16, 2021

bkamins reviewed Oct 16, 2021

bkamins approved these changes Oct 16, 2021

nalimilan merged commit 85fa306 into main Oct 16, 2021

nalimilan deleted the nl/describe2 branch October 16, 2021 17:07
