-
Notifications
You must be signed in to change notification settings - Fork 7
Description
While reviewing a pull request, I tested the below method with some extra things.
I noticed this:
@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
[
(2,
False,
# Variable contains floats, nans and inf (e.g. "available resources" variable): WORKS
pd.DataFrame({"variable": [7.5, 8.2, 14.9, np.inf]}),
[(7.0, 12.0), (12.0, np.inf)])
],
ids=["unexpected bin limit"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
discretizer = KBinsDiscretizer(n_bins=n_bins,
auto_adapt_bins=auto_adapt_bins)
actual = discretizer._fit_column(data, column_name="variable")
assert actual == expected
Result:
[(8.0, 12.0), (12.0, inf)] != [(7.0, 12.0), (12.0, inf)]
Expected :[(7.0, 12.0), (12.0, inf)]
Actual :[(8.0, 12.0), (12.0, inf)]
The column minimum is 7.5, which is rounded up (!) in KBinsDiscretizer._compute_bins_from_edges() to format the lower (!) bin limit, although it should have been rounded down: column value 7.5 falls in a bin that starts at 7.0, not 8.0.
I thought of just changing
for a, b in zip(bin_edges, bin_edges[1:]):
fmt_a = round(a, precision)
fmt_b = round(b, precision)
to just
for a, b in zip(bin_edges, bin_edges[1:]):
fmt_a = floor(a, precision) # <====
fmt_b = round(b, precision)
but that will break the precision, which can be even negative apparently:
compute the minimal precision of the bin_edges. This can be a negative number, which then rounds numbers to the nearest 10, 100, ...
So minor bug, but floor() is not a good enough solution - I won't fix this right away with a commit in the pull request (which is actually not enough related too), so will log it here and continue with that pull request.