Skip to content

Unexpected bin limit rounding #58

@sandervh14

Description

@sandervh14

While reviewing a pull request, I tested the below method with some extra things.
I noticed this:

@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
                                          [
                                              (2, 
                                               False,
                                               # Variable contains floats, nans and inf (e.g. "available resources" variable): WORKS
                                               pd.DataFrame({"variable": [7.5, 8.2, 14.9, np.inf]}),
                                               [(7.0, 12.0), (12.0, np.inf)])
                                           ],
                                           ids=["unexpected bin limit"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
        discretizer = KBinsDiscretizer(n_bins=n_bins,
                                                        auto_adapt_bins=auto_adapt_bins)

        actual = discretizer._fit_column(data, column_name="variable")

        assert actual == expected

Result:

[(8.0, 12.0), (12.0, inf)] != [(7.0, 12.0), (12.0, inf)]

Expected :[(7.0, 12.0), (12.0, inf)]
Actual :[(8.0, 12.0), (12.0, inf)]

The column minimum is 7.5, which is rounded up (!) in KBinsDiscretizer._compute_bins_from_edges() to format the lower (!) bin limit, although it should have been rounded down: column value 7.5 falls in a bin that starts at 7.0, not 8.0.

I thought of just changing

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = round(a, precision)
            fmt_b = round(b, precision)

to just

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = floor(a, precision)  # <====
            fmt_b = round(b, precision)

but that will break the precision, which can be even negative apparently:

compute the minimal precision of the bin_edges. This can be a negative number, which then rounds numbers to the nearest 10, 100, ...

So minor bug, but floor() is not a good enough solution - I won't fix this right away with a commit in the pull request (which is actually not enough related too), so will log it here and continue with that pull request.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions