another take at cut #314

bkamins · 2020-12-09T13:43:57Z

@nalimilan - is this intended:

julia> x = [fill(1,1000); fill(2, 100); fill(3, 10); 4];

julia> levels(cut(x, 2))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> levels(cut(x, 2, allowempty=true))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> cut(x, 3)
ERROR: ArgumentError: cannot compute 3 quantiles: `quantile` returned only 0 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut(x, 3, allowempty=true))
2-element Array{String,1}:
 "Q1: (1.0, 1.0)"
 "Q2: [1.0, 4.0]"

?

bkamins · 2020-12-11T07:34:36Z

another test:

julia> x = repeat(1:3, 5);

julia> freqtable(cut(x, 2))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 3))
3-element Named Array{Int64,1}
Dim1                                                     │ 
─────────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)"     │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q3: [2.33333, 3.0]"     │ 5

julia> freqtable(cut(x, 4))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 5))
ERROR: ArgumentError: cannot compute 5 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> freqtable(cut(x, 6))
4-element Named Array{Int64,1}
Dim1                                                 │ 
─────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)" │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.0)" │ 0
CategoricalValue{String,UInt32} "Q3: [2.0, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q4: [2.33333, 3.0]" │ 5

julia> freqtable(cut(x, 7))
ERROR: ArgumentError: cannot compute 7 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

bkamins · 2020-12-11T08:02:07Z

As a reference in these cases this is what dplyr produces:

> x = c(rep(1, 1000), rep(2, 100), rep(3, 10), 4)
> table(cut_number(x, 2))
Error: Insufficient data values to produce 2 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 3))
Error: Insufficient data values to produce 3 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 4))
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

and

> x = rep(1:3, 5)
> table(cut_number(x, 2))

[1,2] (2,3] 
   10     5 
> table(cut_number(x, 3))

   [1,1.67] (1.67,2.33]    (2.33,3] 
          5           5           5 
> table(cut_number(x, 4)) # same with higher values
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

nalimilan · 2020-12-11T10:07:51Z

We should probably check in the cut(x, ngroups) method that the created array has a number of levels equal to the requested number of groups. The question of what to do in tricky cases when calling cut(x, breaks) directly is more open.
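A minimal sketch of that check, using only the `Statistics` standard library rather than the package's actual internals. The helper name `checked_breaks` and its error message are hypothetical, chosen for illustration. It computes the quantile breaks, drops duplicates, and errors when fewer intervals remain than were requested.

```julia
using Statistics

# Hypothetical helper, not part of CategoricalArrays: returns the distinct
# quantile breaks for `ngroups` groups and throws when duplicated values
# in `x` collapse some of the groups.
function checked_breaks(x::AbstractVector{<:Real}, ngroups::Integer)
    ngroups >= 1 || throw(ArgumentError("ngroups must be positive"))
    # ngroups groups need ngroups + 1 break points (minimum and maximum included).
    probs = range(0, 1; length = ngroups + 1)
    breaks = unique(quantile(x, probs))
    found = length(breaks) - 1
    found == ngroups || throw(ArgumentError(
        "requested $ngroups groups but only $found distinct intervals exist"))
    return breaks
end
```

With `x = repeat(1:3, 5)` this sketch accepts 2 and 3 groups and throws for 4, which matches the dplyr results quoted above.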

bkamins · 2020-12-11T16:20:48Z

Agreed. But as noted on Slack, there are two use cases for cut(x, ngroups):

1. The user wants exactly ngroups groups, and then we should error.
2. The user wants approximately ngroups groups but needs to be sure that the function will not error in production (this is quite common if you have a pipeline preprocessing 10,000 columns and do not want one atypical column to cause an error).

Maybe we can make option 1 the default and option 2 opt-in, in which case cut never errors but tries to do the best it can?
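To make the two modes concrete, here is a rough sketch built on `Statistics` alone. The function name `quantile_breaks` and the `strict` keyword are assumptions made for illustration, not the package's API. Strict mode errors when the exact number of groups cannot be produced; lenient mode returns whatever distinct breaks exist.

```julia
using Statistics

# Hypothetical sketch; `strict` is an illustrative keyword, not an existing option.
function quantile_breaks(x::AbstractVector{<:Real}, ngroups::Integer; strict::Bool = true)
    breaks = unique(quantile(x, range(0, 1; length = ngroups + 1)))
    found = length(breaks) - 1
    if strict && found != ngroups
        throw(ArgumentError("cannot produce exactly $ngroups groups (got $found)"))
    end
    # Lenient mode: at most `ngroups` intervals, no error caused by duplicated values.
    return breaks
end

# A preprocessing pipeline over many columns could opt in to the lenient mode:
columns = [repeat(1:3, 5), collect(1:20)]
all_breaks = [quantile_breaks(col, 4; strict = false) for col in columns]
```

Here the first column yields only two intervals while the second yields the four requested, and neither call throws.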

nalimilan · 2020-12-11T16:28:25Z

Yes, we could, but we would have to check all possible problems. cut(x, ngroups) is simple, but cut(x, breaks) might rely on throwing errors to avoid returning invalid results, so if we change it we have to be very careful.

bkamins · 2020-12-11T17:59:23Z

I meant cut(x, ngroups). If someone passes breaks we should be strict I think.

nalimilan · 2024-12-30T21:48:15Z

This is what we do now since #410. I've checked that we throw an error where dplyr does in the examples above. Discussion continues at #381.

nalimilan closed this as completed Dec 30, 2024
