Adapt to scikit-learn 1.6 estimator tag changes #11021

Open

jameslamb wants to merge 13 commits into base: master

Conversation

@jameslamb (Contributor) commented Nov 26, 2024:

fixes #10896

In scikit-learn 1.6 (coming very soon), there are some significant changes to estimator tags and estimator checks. This PR makes xgboost compatible with those changes.

In short:

  • the estimator._more_tags() -> dict interface is deprecated, and estimators are encouraged to implement estimator.__sklearn_tags__() -> sklearn.utils.Tags instead (see the sketch after this list)
  • many new checks have been added to parametrize_with_checks(), and others have been made stricter
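
For illustration, here is a minimal sketch of the two interfaces side by side, assuming scikit-learn >= 1.6 is installed; MyRegressor and its tag values are hypothetical, not xgboost's actual settings. Like this PR, it keeps both methods so one codebase can serve older and newer scikit-learn versions.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import Tags  # public in scikit-learn >= 1.6


class MyRegressor(RegressorMixin, BaseEstimator):
    # old-style interface (deprecated in scikit-learn 1.6): return a flat dict
    def _more_tags(self) -> dict:
        return {"non_deterministic": True, "allow_nan": True}

    # new-style interface: start from the parent's Tags object and mutate it
    def __sklearn_tags__(self) -> Tags:
        tags = super().__sklearn_tags__()
        tags.non_deterministic = True
        tags.input_tags.allow_nan = True
        return tags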

Notes for Reviewers

How I tested this

See #10896. I tested these changes locally (M2 macOS, Python 3.11) against the following scikit-learn versions:

  • 1.7.0.dev0 (latest nightly)
  • 1.6.0.rc1 (latest release candidate for 1.6)
  • 1.5.2 (latest stable version)

The patterns proposed here closely follow similar changes we made over in lightgbm.

Shouldn't this require changes on the C++ side?

No. The errors I reported in #10896, which looked like they might be coming from assertions raised on the C++ side, were only showing up because xgboost estimators weren't yet being recognized correctly by scikit-learn's is_regressor() and is_classifier() helpers, and therefore weren't being checked against the correct set of expectations.

ref: https://github.com/scikit-learn/scikit-learn/blob/fa5d7275ba4dd2627b6522e1ec4eaf0f3a2e3c05/sklearn/utils/estimator_checks.py#L385-L387
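
As a quick illustration of what those helpers do (a sketch, not part of the PR; it assumes scikit-learn and an xgboost build with these changes are installed):

from sklearn.base import is_classifier, is_regressor

from xgboost import XGBClassifier, XGBRegressor

# the estimator checks branch on these helpers to decide which set of
# expectations (classifier vs. regressor) an estimator is tested against
assert is_classifier(XGBClassifier())
assert is_regressor(XGBRegressor())
assert not is_regressor(XGBClassifier())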

How to review the tag changes

It may be hard to review changes with assignments to these new nested attributes, like

tags.target_tags.single_output = not tags_dict["multioutput_only"]

See the implementation of sklearn.utils.Tags (and the other *Tags classes), along with their docs, at https://github.com/scikit-learn/scikit-learn/blob/1.6.X/sklearn/utils/_tags.py
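
As a rough mental model for those assignments, the old flat dict keys map onto nested attributes of the new Tags object. The helper below is hypothetical (it is not in the PR), but it shows the shape of the translation:

from sklearn.utils import Tags


def tags_dict_to_sklearn_tags(tags: Tags, tags_dict: dict) -> Tags:
    """Copy a few old-style tag-dict entries onto the nested Tags object."""
    # "multioutput_only" (flat dict key) -> target_tags.single_output (nested)
    tags.target_tags.single_output = not tags_dict.get("multioutput_only", False)
    # "allow_nan" -> input_tags.allow_nan
    tags.input_tags.allow_nan = tags_dict.get("allow_nan", False)
    # "non_deterministic" keeps its name at the top level
    tags.non_deterministic = tags_dict.get("non_deterministic", False)
    return tags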

@@ -1481,7 +1537,7 @@ def _cls_predict_proba(n_classes: int, prediction: PredtT, vstack: Callable) ->
Number of boosting rounds.
""",
)
-class XGBClassifier(XGBModel, XGBClassifierBase):
+class XGBClassifier(XGBClassifierBase, XGBModel):
@jameslamb (Contributor, Author):

As of scikit-learn/scikit-learn#30234 (which will be in scikit-learn 1.6), the estimator checks raise an error like the following:

XGBRegressor is inheriting from mixins in the wrong order. In general, in mixin inheritance, more specialized mixins must come before more general ones. This means, for instance, BaseEstimator should be on the right side of most other mixins. You need to change the order...

That check is new, but it enforced behavior that's been documented in scikit-learn's estimator development docs for a long time. See the "BaseEstimator and mixins" section in https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator.

It is particularly important to notice that mixins should be “on the left” while the BaseEstimator should be “on the right” in the inheritance list for proper MRO.

That new check led to these inheritance-order changes, which in turn led to the XGBModel.get_params() changes.
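
A minimal sketch of the ordering rule the new check enforces; the class names here are illustrative, not xgboost's:

from sklearn.base import BaseEstimator, ClassifierMixin


# compliant: mixins on the left, BaseEstimator on the right,
# so the mixins come first in the MRO
class GoodClassifier(ClassifierMixin, BaseEstimator):
    pass


# flagged by the new check: BaseEstimator before the mixin
class BadClassifier(BaseEstimator, ClassifierMixin):
    pass


print([c.__name__ for c in GoodClassifier.__mro__])
# ['GoodClassifier', 'ClassifierMixin', 'BaseEstimator', 'object']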

demo/**/*.txt
*.dmatrix
.hypothesis
__MACOSX/
model*.json
/tests/python/models/models/
@jameslamb (Contributor, Author):

Noticed some model files left behind from running all the Python tests locally while developing this. These .gitignore rules prevent checking them into source control.

class LabelEncoder:  # type: ignore[no-redef]
    """Dummy class for sklearn.preprocessing.LabelEncoder."""

    pass
@jameslamb (Contributor, Author) commented Nov 26, 2024:

With all of these placeholder classes set to object, the re-arranged inheritance order (https://github.com/dmlc/xgboost/pull/11021/files#r1857885039) results in errors like the following when importing xgboost without scikit-learn installed:

TypeError: Cannot create a consistent method resolution order (MRO) for bases object, XGBModel

(build link)

Having each one be a different class resolves that. I forgot until I saw this test failure that we faced a similar thing a few years ago in LightGBM: microsoft/LightGBM#3192

I'm just copying @StrikerRUS's solution from there; it's worked well for lightgbm.
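
A stripped-down reproduction of the MRO problem, under the assumption that the placeholders mirror compat.py; the stub class names at the end are made up for this sketch:

# if every scikit-learn base is replaced by the same placeholder...
XGBModelBase = object
XGBClassifierBase = object


class XGBModel(XGBModelBase):
    pass


# ...then listing the placeholder before XGBModel asks Python to put `object`
# both before and after XGBModel in the MRO, which is impossible:
try:
    class XGBClassifier(XGBClassifierBase, XGBModel):
        pass
except TypeError as err:
    print(err)  # Cannot create a consistent method resolution order (MRO) ...


# a distinct empty placeholder per scikit-learn base avoids the conflict
class ClassifierStub:
    """Made-up stand-in for sklearn.base.ClassifierMixin."""


class WorkingXGBClassifier(ClassifierStub, XGBModel):
    pass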

@@ -63,6 +63,8 @@ disable = [
"import-error",
"attribute-defined-outside-init",
"import-outside-toplevel",
"too-few-public-methods",
"too-many-ancestors",
@jameslamb (Contributor, Author):

Switching the placeholder classes in the scikit-learn-is-not-available branch of compat.py to individual, differently-named, empty classes (instead of all using object) led to several of these pylint errors in sklearn.py and dask.py:

R0901: Too many ancestors (9/7) (too-many-ancestors)
R0903: Too few public methods (0/2) (too-few-public-methods)

(build link)

It seems that there are already many other places in this codebase where those warnings are being suppressed with # pylint: disable comments. So instead of adding more such comments (some of which would have to share a line with # type: ignore comments for mypy), I'm proposing:

  • just globally ignore these pylint warnings for the whole project
  • remove any existing # pylint: disable comments about them

I don't feel that strongly about this... if you'd prefer to keep suppressing individual cases of these, please let me know and I'll happily switch back to # pylint: disable comments.

@trivialfis (Member):

Looks good to me.

@RAMitchell finds the pylint checks helpful. I myself prefer mypy checks and think pylint is not particularly suitable for ML libraries like XGBoost. In general, I don't have a strong opinion about these "structural" or naming warnings and care mostly about warnings like unused imports or use before definition.

@jameslamb changed the title from "WIP: Adapt to scikit-learn 1.6 estimator tag changes" to "Adapt to scikit-learn 1.6 estimator tag changes" on Nov 26, 2024
@jameslamb (Contributor, Author) commented:

I'm seeing a lot of CI passing, so I think this is ready for review.

@jameslamb marked this pull request as ready for review on November 26, 2024 21:22
@trivialfis (Member) left a review comment:

Thank you for the PR! Overall looks good to me. Some tests for get_params would be appreciated.

@@ -55,20 +56,43 @@ def lazy_isinstance(instance: Any, module: str, name: str) -> bool:
from sklearn.cross_validation import KFold as XGBKFold
from sklearn.cross_validation import StratifiedKFold as XGBStratifiedKFold

# sklearn.utils Tags types can be imported unconditionally once
@trivialfis (Member):

We can do that once the next sklearn is published.

@jameslamb (Contributor, Author):

I don't think we should.

That'd effectively raise xgboost's minimum supported version all the way to scikit-learn>=1.6, because it would result in compat.SKLEARN_INSTALLED being False for scikit-learn < 1.6:

except ImportError:
    SKLEARN_INSTALLED = False

That would make all the estimators unusable on those versions:

if not SKLEARN_INSTALLED:
    raise ImportError(
        "sklearn needs to be installed in order to use this module"
    )
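
Concretely, the compat-layer pattern being discussed is roughly the following simplified sketch (not the exact contents of compat.py):

try:
    from sklearn.base import BaseEstimator as XGBModelBase

    # importing sklearn.utils.Tags here unconditionally would raise
    # ImportError on scikit-learn < 1.6, flipping SKLEARN_INSTALLED to False
    # and disabling every estimator for users on those versions
    SKLEARN_INSTALLED = True
except ImportError:
    SKLEARN_INSTALLED = False

    class XGBModelBase:  # type: ignore[no-redef]
        """Dummy class for sklearn.base.BaseEstimator."""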

class XGBRegressorBase:  # type: ignore[no-redef]
    """Dummy class for sklearn.base.RegressorMixin."""

class LabelEncoder:  # type: ignore[no-redef]
@trivialfis (Member):

We can remove the label encoder for now. It's not used.

@jameslamb (Contributor, Author):

Oh great! I just removed that in a511848.

Noticed that KFold was also unused, so I removed that as well.

@@ -1526,6 +1527,58 @@ def test_tags() -> None:
assert "multioutput" not in tags


# the try-excepts in this test should be removed once xgboost's
@trivialfis (Member):

We can use pytest.mark.skipif to skip tests. Seems simpler.

@jameslamb (Contributor, Author):

I was thinking that it's useful to check that the exact, expected AttributeError is raised... there are so many layers of inheritance involved here that it wouldn't be hard to implement this in a way that accidentally raises some totally unrelated error while accessing a non-existent property or something.

If you read that and still think skipif() would be preferable, I'll happily change to that and remove the try-except; just wanted to explain my thinking.
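
For example, the version-dependent assertion described above could take roughly this shape (a hypothetical sketch, not the test actually added in the PR; it assumes the packaging library is available for version parsing):

import pytest
import sklearn
from packaging.version import parse as parse_version

import xgboost as xgb


def test_sklearn_tags() -> None:
    clf = xgb.XGBClassifier()
    if parse_version(sklearn.__version__) >= parse_version("1.6"):
        tags = clf.__sklearn_tags__()
        assert tags.estimator_type == "classifier"
    else:
        # assert the specific, expected error rather than skipping the test,
        # so an unrelated AttributeError from the inheritance chain still fails
        with pytest.raises(AttributeError):
            clf.__sklearn_tags__()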

@@ -1497,6 +1553,15 @@ def _more_tags(self) -> Dict[str, bool]:
        tags["multilabel"] = True
        return tags

    def __sklearn_tags__(self) -> _sklearn_Tags:
        tags = XGBModel.__sklearn_tags__(self)
        tags.estimator_type = "classifier"
@trivialfis (Member):

Do we need this if we inherit the classifier mixin?

@jameslamb (Contributor, Author):

Ah you are totally right, I don't think we do:

https://github.com/scikit-learn/scikit-learn/blob/e6037ba412ed889a888a60bd6c022990f2669507/sklearn/base.py#L536-L544

I removed this, the corresponding XGBRegressor code, and the imports of ClassifierTags / RegressorTags in a511848.
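
For context, ClassifierMixin.__sklearn_tags__ in scikit-learn 1.6 already sets estimator_type (that is what the linked lines show), so any class inheriting the mixin gets it for free. A small check, assuming scikit-learn >= 1.6:

from sklearn.base import BaseEstimator, ClassifierMixin


class Demo(ClassifierMixin, BaseEstimator):
    pass


# the mixin's __sklearn_tags__ override sets estimator_type = "classifier"
print(Demo().__sklearn_tags__().estimator_type)  # "classifier"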

# 2. Return whatever in `**kwargs`.
# 3. Merge them.
#
# This needs to accommodate being called recursively in the following
@trivialfis (Member):

Could you please help add a test for this? The hierarchy and the Python introspection are getting a bit confusing now. ;-(

@jameslamb (Contributor, Author):

Sure! I just added one in a511848, let me know if there are other conditions you'd like to see tested.

Between that and the existing test:

def test_parameters_access():

I think this behavior should be well-covered.
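
Something along these lines, for instance; this is a hedged sketch of the kind of coverage meant here, not the exact test added in a511848 (custom_kwarg is a made-up extra keyword):

import xgboost as xgb


def test_get_params_round_trip() -> None:
    # constructor arguments should survive get_params() despite the
    # re-arranged inheritance order
    clf = xgb.XGBClassifier(n_estimators=10, max_depth=3, custom_kwarg="x")
    params = clf.get_params()
    assert params["n_estimators"] == 10
    assert params["max_depth"] == 3
    # extra **kwargs are merged into the result as well
    assert params["custom_kwarg"] == "x"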

        params = super().get_params(deep)
        cp = copy.copy(self)
        cp.__class__ = cp.__class__.__bases__[0]
        # if the immediate parent is a mixin, skip it (mixins don't define get_params())
@trivialfis (Member) commented Dec 2, 2024:

Do you think it's more general to check for the get_params attribute instead of checking hardcoded mixins? The current check seems to defeat the purpose of having a polymorphic structure (inheritance).
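
i.e., something along these lines (a sketch of the suggestion under stand-in class names, not the code currently in the PR):

import copy

from sklearn.base import BaseEstimator, ClassifierMixin


class Model(BaseEstimator):
    """Stand-in for XGBModel, whose get_params() walks up the class hierarchy."""

    def get_params(self, deep: bool = True) -> dict:
        params = super().get_params(deep)
        cp = copy.copy(self)
        parent = cp.__class__.__bases__[0]
        # generic check: recurse only if the immediate parent actually
        # provides get_params(), rather than naming mixin classes to skip
        if hasattr(parent, "get_params"):
            cp.__class__ = parent
            params.update(cp.__class__.get_params(cp, deep))
        return params


class Classifier(ClassifierMixin, Model):
    """Stand-in for XGBClassifier: a mixin sits first among the bases."""

    def __init__(self, n_estimators: int = 100):
        self.n_estimators = n_estimators


print(Classifier(n_estimators=10).get_params())  # {'n_estimators': 10}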
