another take at cut #314

Closed
bkamins opened this issue Dec 9, 2020 · 7 comments

bkamins commented Dec 9, 2020

@nalimilan - is this intended:

julia> x = [fill(1,1000); fill(2, 100); fill(3, 10); 4];

julia> levels(cut(x, 2))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> levels(cut(x, 2, allowempty=true))
1-element Array{String,1}:
 "Q1: [1.0, 4.0]"

julia> cut(x, 3)
ERROR: ArgumentError: cannot compute 3 quantiles: `quantile` returned only 0 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> levels(cut(x, 3, allowempty=true))
2-element Array{String,1}:
 "Q1: (1.0, 1.0)"
 "Q2: [1.0, 4.0]"

?

bkamins commented Dec 11, 2020

another test:

julia> x = repeat(1:3, 5);

julia> freqtable(cut(x, 2))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 3))
3-element Named Array{Int64,1}
Dim1                                                     │ 
─────────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)"     │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q3: [2.33333, 3.0]"     │ 5

julia> freqtable(cut(x, 4))
2-element Named Array{Int64,1}
Dim1                                             │ 
─────────────────────────────────────────────────┼───
CategoricalValue{String,UInt32} "Q1: [1.0, 2.0)" │  5
CategoricalValue{String,UInt32} "Q2: [2.0, 3.0]" │ 10

julia> freqtable(cut(x, 5))
ERROR: ArgumentError: cannot compute 5 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

julia> freqtable(cut(x, 6))
4-element Named Array{Int64,1}
Dim1                                                 │ 
─────────────────────────────────────────────────────┼──
CategoricalValue{String,UInt32} "Q1: [1.0, 1.66667)" │ 5
CategoricalValue{String,UInt32} "Q2: [1.66667, 2.0)" │ 0
CategoricalValue{String,UInt32} "Q3: [2.0, 2.33333)" │ 5
CategoricalValue{String,UInt32} "Q4: [2.33333, 3.0]" │ 5

julia> freqtable(cut(x, 7))
ERROR: ArgumentError: cannot compute 7 quantiles: `quantile` returned only 2 groups due to duplicated values in `x`.Pass `allowempty=true` to allow empty quantiles or choose a lower value for `ngroups`.

bkamins commented Dec 11, 2020

For reference, this is what R's cut_number (a ggplot2 function) produces in these cases:

> x = c(rep(1, 1000), rep(2, 100), rep(3, 10), 4)
> table(cut_number(x, 2))
Error: Insufficient data values to produce 2 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 3))
Error: Insufficient data values to produce 3 bins.
Run `rlang::last_error()` to see where the error occurred.
> table(cut_number(x, 4))
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

and

> x = rep(1:3, 5)
> table(cut_number(x, 2))

[1,2] (2,3] 
   10     5 
> table(cut_number(x, 3))

   [1,1.67] (1.67,2.33]    (2.33,3] 
          5           5           5 
> table(cut_number(x, 4)) # same with higher values
Error: Insufficient data values to produce 4 bins.
Run `rlang::last_error()` to see where the error occurred.

@nalimilan

We should probably check in the cut(x, ngroups) method that the created array has a number of levels equal to the requested number of groups. The question of what to do in tricky cases when calling cut(x, breaks) directly is more open.
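A minimal sketch of such a check, using only the Statistics standard library. The helper names `distinct_quantile_groups` and `require_ngroups` are hypothetical and are not part of CategoricalArrays; the sketch assumes the groups are delimited by the `quantile` break points and that duplicated break points collapse into a single boundary.

```julia
using Statistics

# Hypothetical helper (not part of CategoricalArrays): number of distinct
# intervals defined by the quantile break points for `ngroups` groups.
function distinct_quantile_groups(x::AbstractVector{<:Real}, ngroups::Integer)
    ngroups >= 1 || throw(ArgumentError("ngroups must be at least 1"))
    probs = [i / ngroups for i in 0:ngroups]
    # Duplicated break points collapse, which is what reduces the group count.
    breaks = unique(quantile(x, probs))
    return length(breaks) - 1
end

# Hypothetical strict check: error unless exactly `ngroups` groups exist.
function require_ngroups(x::AbstractVector{<:Real}, ngroups::Integer)
    found = distinct_quantile_groups(x, ngroups)
    found == ngroups || throw(ArgumentError(
        "requested $ngroups groups but the quantiles define only $found"))
    return found
end

x = [fill(1, 1000); fill(2, 100); fill(3, 10); 4]
distinct_quantile_groups(x, 2)                # the median equals the minimum here
distinct_quantile_groups(repeat(1:3, 5), 3)   # three distinct intervals
```

With the first example from this issue, the median is equal to the minimum, so only one interval survives and `require_ngroups(x, 2)` would throw.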

bkamins commented Dec 11, 2020

Agreed. But as noted on Slack, there are two use cases for cut(x, ngroups):

  1. The user wants exactly ngroups groups, in which case we should throw an error if that is not possible.
  2. The user wants approximately ngroups groups but needs to be sure the function will not error in production (this is quite common when a pipeline preprocesses 10,000 columns and you do not want a single atypical column to cause an error).

Maybe we can make option 1 the default and option 2 opt-in, in which case cut never errors but does the best it can?
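The two modes could look roughly like this. This is only a sketch under stated assumptions: the function `quantile_breaks` and its `strict` keyword are invented for illustration and are not the CategoricalArrays API, and the lenient mode simply returns the deduplicated quantile break points instead of erroring.

```julia
using Statistics

# Hypothetical sketch, not the CategoricalArrays API: `strict` is an invented keyword.
function quantile_breaks(x::AbstractVector{<:Real}, ngroups::Integer; strict::Bool=true)
    ngroups >= 1 || throw(ArgumentError("ngroups must be at least 1"))
    breaks = unique(quantile(x, [i / ngroups for i in 0:ngroups]))
    found = length(breaks) - 1
    if strict && found != ngroups
        # Option 1 (default): the caller asked for exactly `ngroups` groups.
        throw(ArgumentError("requested $ngroups groups but only $found are available"))
    end
    # Option 2 (opt-in): return whatever distinct break points exist.
    return breaks
end

# Pipeline-style use: one atypical column does not abort the whole run.
columns = Dict(
    "regular" => collect(1:12),
    "skewed"  => [fill(1, 1000); fill(2, 100); fill(3, 10); 4],
)
breaks_by_column = Dict(name => quantile_breaks(col, 4; strict=false)
                        for (name, col) in columns)
```

In lenient mode a heavily duplicated column can end up with fewer groups than requested (a constant column yields a single break point and therefore no interval at all), so downstream code would still need to handle that case.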

@nalimilan

Yes, we could, but we would have to check all the possible problems. cut(x, ngroups) is simple, but cut(x, breaks) might rely on throwing errors to avoid returning invalid results, so if we change it we have to be very careful.

bkamins commented Dec 11, 2020

I meant cut(x, ngroups). If someone passes breaks, I think we should be strict.

@nalimilan

This is what we do now, since #410. I've checked that we throw an error wherever R's cut_number does in the examples above. Discussion continues at #381.
