Adapt to scikit-learn 1.6 estimator tag changes #11021

Open

jameslamb wants to merge 13 commits into base: master

Conversation

@jameslamb (Contributor) commented Nov 26, 2024:

fixes #10896

In scikit-learn 1.6 (coming very soon), there are some significant changes to estimator tags and estimator checks. This PR makes xgboost compatible with those changes.

In short:

  • the estimator._more_tags() -> dict interface is deprecated, and estimators are encouraged to implement estimator.__sklearn_tags__() -> sklearn.utils.Tags instead (see the sketch after this list)
  • many new checks have been added to parametrize_with_checks(), and others have been made stricter
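
For illustration, here is a minimal sketch of the two interfaces side by side, assuming scikit-learn >= 1.6 is installed; MyRegressor and its tag values are hypothetical, not xgboost's actual settings. Like this PR, it keeps both methods so one codebase can serve older and newer scikit-learn versions.

from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils import Tags  # public in scikit-learn >= 1.6


class MyRegressor(RegressorMixin, BaseEstimator):
    # old-style interface (deprecated in scikit-learn 1.6): return a flat dict
    def _more_tags(self) -> dict:
        return {"non_deterministic": True, "allow_nan": True}

    # new-style interface: start from the parent's Tags object and mutate it
    def __sklearn_tags__(self) -> Tags:
        tags = super().__sklearn_tags__()
        tags.non_deterministic = True
        tags.input_tags.allow_nan = True
        return tags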

Notes for Reviewers

How I tested this

See #10896. I tested these changes locally (M2 macOS, Python 3.11) against the following scikit-learn versions:

  • 1.7.0.dev0 (latest nightly)
  • 1.6.0.rc1 (latest release candidate for 1.6)
  • 1.5.2 (latest stable version)

The patterns proposed here closely follow similar changes we made over in lightgbm.

Shouldn't this require changes on the C++ side?

No. The errors I reported in #10896, which looked like they might be coming from assertions raised on the C++ side, were only showing up because xgboost estimators weren't yet being recognized correctly by scikit-learn's is_regressor() and is_classifier() helpers, and therefore weren't being checked against the correct set of expectations.

ref: https://github.com/scikit-learn/scikit-learn/blob/fa5d7275ba4dd2627b6522e1ec4eaf0f3a2e3c05/sklearn/utils/estimator_checks.py#L385-L387
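
As a quick illustration of what those helpers do (a sketch, not part of the PR; it assumes scikit-learn and an xgboost build with these changes are installed):

from sklearn.base import is_classifier, is_regressor

from xgboost import XGBClassifier, XGBRegressor

# the estimator checks branch on these helpers to decide which set of
# expectations (classifier vs. regressor) an estimator is tested against
assert is_classifier(XGBClassifier())
assert is_regressor(XGBRegressor())
assert not is_regressor(XGBClassifier())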

How to review the tag changes

It may be hard to review changes with assignments to these new nested attributes, like

tags.target_tags.single_output = not tags_dict["multioutput_only"]

See the implementation of sklearn.utils.Tags (and the other *Tags classes), along with their docs, at https://github.com/scikit-learn/scikit-learn/blob/1.6.X/sklearn/utils/_tags.py
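
As a rough mental model for those assignments, the old flat dict keys map onto nested attributes of the new Tags object. The helper below is hypothetical (it is not in the PR), but it shows the shape of the translation:

from sklearn.utils import Tags


def tags_dict_to_sklearn_tags(tags: Tags, tags_dict: dict) -> Tags:
    """Copy a few old-style tag-dict entries onto the nested Tags object."""
    # "multioutput_only" (flat dict key) -> target_tags.single_output (nested)
    tags.target_tags.single_output = not tags_dict.get("multioutput_only", False)
    # "allow_nan" -> input_tags.allow_nan
    tags.input_tags.allow_nan = tags_dict.get("allow_nan", False)
    # "non_deterministic" keeps its name at the top level
    tags.non_deterministic = tags_dict.get("non_deterministic", False)
    return tags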

@@ -1481,7 +1537,7 @@ def _cls_predict_proba(n_classes: int, prediction: PredtT, vstack: Callable) ->
Number of boosting rounds.
""",
)
-class XGBClassifier(XGBModel, XGBClassifierBase):
+class XGBClassifier(XGBClassifierBase, XGBModel):
@jameslamb (Contributor, Author):

As of scikit-learn/scikit-learn#30234 (which will be in scikit-learn 1.6), the estimator checks raise an error like the following:

XGBRegressor is inheriting from mixins in the wrong order. In general, in mixin inheritance, more specialized mixins must come before more general ones. This means, for instance, BaseEstimator should be on the right side of most other mixins. You need to change the order...

That check is new, but it enforced behavior that's been documented in scikit-learn's estimator development docs for a long time. See the "BaseEstimator and mixins" section in https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator.

It is particularly important to notice that mixins should be “on the left” while the BaseEstimator should be “on the right” in the inheritance list for proper MRO.

That new check led to these inheritance-order changes, which in turn led to the XGBModel.get_params() changes.
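
A minimal sketch of the ordering rule the new check enforces; the class names here are illustrative, not xgboost's:

from sklearn.base import BaseEstimator, ClassifierMixin


# compliant: mixins on the left, BaseEstimator on the right,
# so the mixins come first in the MRO
class GoodClassifier(ClassifierMixin, BaseEstimator):
    pass


# flagged by the new check: BaseEstimator before the mixin
class BadClassifier(BaseEstimator, ClassifierMixin):
    pass


print([c.__name__ for c in GoodClassifier.__mro__])
# ['GoodClassifier', 'ClassifierMixin', 'BaseEstimator', 'object']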

demo/**/*.txt
*.dmatrix
.hypothesis
__MACOSX/
model*.json
/tests/python/models/models/
@jameslamb (Contributor, Author):

Noticed some model files left behind from running all the Python tests locally while developing this. These .gitignore rules prevent checking them into source control.

class LabelEncoder:  # type: ignore[no-redef]
    """Dummy class for sklearn.preprocessing.LabelEncoder."""

    pass
@jameslamb (Contributor, Author) commented Nov 26, 2024:

With all of these placeholder classes set to object, the re-arranged inheritance order (https://github.com/dmlc/xgboost/pull/11021/files#r1857885039) results in errors like the following when importing xgboost without scikit-learn installed:

TypeError: Cannot create a consistent method resolution order (MRO) for bases object, XGBModel

(build link)

Having each one be a different class resolves that. I forgot until I saw this test failure that we faced a similar thing a few years ago in LightGBM: microsoft/LightGBM#3192

I'm just copying @StrikerRUS's solution from there; it's worked well for lightgbm.
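
A stripped-down reproduction of the MRO problem, under the assumption that the placeholders mirror compat.py; the stub class names at the end are made up for this sketch:

# if every scikit-learn base is replaced by the same placeholder...
XGBModelBase = object
XGBClassifierBase = object


class XGBModel(XGBModelBase):
    pass


# ...then listing the placeholder before XGBModel asks Python to put `object`
# both before and after XGBModel in the MRO, which is impossible:
try:
    class XGBClassifier(XGBClassifierBase, XGBModel):
        pass
except TypeError as err:
    print(err)  # Cannot create a consistent method resolution order (MRO) ...


# a distinct empty placeholder per scikit-learn base avoids the conflict
class ClassifierStub:
    """Made-up stand-in for sklearn.base.ClassifierMixin."""


class WorkingXGBClassifier(ClassifierStub, XGBModel):
    pass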

@@ -63,6 +63,8 @@ disable = [
"import-error",
"attribute-defined-outside-init",
"import-outside-toplevel",
"too-few-public-methods",
"too-many-ancestors",
@jameslamb (Contributor, Author):

Switching the placeholder classes in the scikit-learn-is-not-available branch of compat.py to individual, differently-named, empty classes (instead of all using object) led to several of these pylint errors in sklearn.py and dask.py:

R0901: Too many ancestors (9/7) (too-many-ancestors)
R0903: Too few public methods (0/2) (too-few-public-methods)

(build link)

It seems that there are already many other places in this codebase where those warnings are being suppressed with # pylint: disable comments. So instead of adding more such comments (some of which would have to share a line with # type: ignore comments for mypy), I'm proposing:

  • just globally ignore these pylint warnings for the whole project
  • remove any existing # pylint: disable comments about them

I don't feel that strongly about this... if you'd prefer to keep suppressing individual cases of these, please let me know and I'll happily switch back to # pylint: disable comments.

@trivialfis (Member):

Looks good to me.

@RAMitchell finds the pylint checks helpful. I myself prefer mypy checks and think pylint is not particularly suitable for ML libraries like XGBoost. In general, I don't have a strong opinion about these "structural" or naming warnings and care mostly about warnings like unused imports or use before definition.

@jameslamb changed the title from "WIP: Adapt to scikit-learn 1.6 estimator tag changes" to "Adapt to scikit-learn 1.6 estimator tag changes" on Nov 26, 2024
@jameslamb (Contributor, Author) commented:

I'm seeing a lot of CI passing, so I think this is ready for review.

@jameslamb marked this pull request as ready for review on November 26, 2024 21:22
@trivialfis (Member) left a review comment:

Thank you for the PR! Overall looks good to me. Some tests for get_params would be appreciated.

@@ -55,20 +56,43 @@ def lazy_isinstance(instance: Any, module: str, name: str) -> bool:
from sklearn.cross_validation import KFold as XGBKFold
from sklearn.cross_validation import StratifiedKFold as XGBStratifiedKFold

# sklearn.utils Tags types can be imported unconditionally once
@trivialfis (Member):

We can do that once the next sklearn is published.

@jameslamb (Contributor, Author):

I don't think we should.

That'd effectively raise xgboost's minimum supported version all the way to scikit-learn>=1.6, because it would result in compat.SKLEARN_INSTALLED being False for scikit-learn < 1.6:

except ImportError:
    SKLEARN_INSTALLED = False

That would make all the estimators unusable on those versions:

if not SKLEARN_INSTALLED:
    raise ImportError(
        "sklearn needs to be installed in order to use this module"
    )
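
Concretely, the compat-layer pattern being discussed is roughly the following simplified sketch (not the exact contents of compat.py):

try:
    from sklearn.base import BaseEstimator as XGBModelBase

    # importing sklearn.utils.Tags here unconditionally would raise
    # ImportError on scikit-learn < 1.6, flipping SKLEARN_INSTALLED to False
    # and disabling every estimator for users on those versions
    SKLEARN_INSTALLED = True
except ImportError:
    SKLEARN_INSTALLED = False

    class XGBModelBase:  # type: ignore[no-redef]
        """Dummy class for sklearn.base.BaseEstimator."""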

class XGBRegressorBase:  # type: ignore[no-redef]
    """Dummy class for sklearn.base.RegressorMixin."""

class LabelEncoder:  # type: ignore[no-redef]
@trivialfis (Member):

We can remove the label encoder for now. It's not used.

@jameslamb (Contributor, Author):

Oh great! I just removed that in a511848.

Noticed that KFold was also unused, so I removed that as well.

@@ -1526,6 +1527,58 @@ def test_tags() -> None:
assert "multioutput" not in tags


# the try-excepts in this test should be removed once xgboost's
@trivialfis (Member):

We can use pytest.mark.skipif to skip tests. Seems simpler.

@jameslamb (Contributor, Author):

I was thinking that it's useful to check that the exact, expected AttributeError is raised... there are so many layers of inheritance involved here that it wouldn't be hard to implement this in a way that accidentally raises some totally unrelated error while accessing a non-existent property or something.

If you read that and still think skipif() would be preferable, I'll happily change to that and remove the try-except; just wanted to explain my thinking.
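
For example, the version-dependent assertion described above could take roughly this shape (a hypothetical sketch, not the test actually added in the PR; it assumes the packaging library is available for version parsing):

import pytest
import sklearn
from packaging.version import parse as parse_version

import xgboost as xgb


def test_sklearn_tags() -> None:
    clf = xgb.XGBClassifier()
    if parse_version(sklearn.__version__) >= parse_version("1.6"):
        tags = clf.__sklearn_tags__()
        assert tags.estimator_type == "classifier"
    else:
        # assert the specific, expected error rather than skipping the test,
        # so an unrelated AttributeError from the inheritance chain still fails
        with pytest.raises(AttributeError):
            clf.__sklearn_tags__()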

@@ -1497,6 +1553,15 @@ def _more_tags(self) -> Dict[str, bool]:
        tags["multilabel"] = True
        return tags

    def __sklearn_tags__(self) -> _sklearn_Tags:
        tags = XGBModel.__sklearn_tags__(self)
        tags.estimator_type = "classifier"
@trivialfis (Member):

Do we need this if we inherit the classifier mixin?

@jameslamb (Contributor, Author):

Ah you are totally right, I don't think we do:

https://github.com/scikit-learn/scikit-learn/blob/e6037ba412ed889a888a60bd6c022990f2669507/sklearn/base.py#L536-L544

I removed this, the corresponding XGBRegressor code, and the imports of ClassifierTags / RegressorTags in a511848.
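
For context, ClassifierMixin.__sklearn_tags__ in scikit-learn 1.6 already sets estimator_type (that is what the linked lines show), so any class inheriting the mixin gets it for free. A small check, assuming scikit-learn >= 1.6:

from sklearn.base import BaseEstimator, ClassifierMixin


class Demo(ClassifierMixin, BaseEstimator):
    pass


# the mixin's __sklearn_tags__ override sets estimator_type = "classifier"
print(Demo().__sklearn_tags__().estimator_type)  # "classifier"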

# 2. Return whatever in `**kwargs`.
# 3. Merge them.
#
# This needs to accommodate being called recursively in the following
@trivialfis (Member):

Could you please help add a test for this? The hierarchy and the Python introspection are getting a bit confusing now. ;-(

@jameslamb (Contributor, Author):

Sure! I just added one in a511848, let me know if there are other conditions you'd like to see tested.

Between that and the existing test:

def test_parameters_access():

I think this behavior should be well-covered.
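
Something along these lines, for instance; this is a hedged sketch of the kind of coverage meant here, not the exact test added in a511848 (custom_kwarg is a made-up extra keyword):

import xgboost as xgb


def test_get_params_round_trip() -> None:
    # constructor arguments should survive get_params() despite the
    # re-arranged inheritance order
    clf = xgb.XGBClassifier(n_estimators=10, max_depth=3, custom_kwarg="x")
    params = clf.get_params()
    assert params["n_estimators"] == 10
    assert params["max_depth"] == 3
    # extra **kwargs are merged into the result as well
    assert params["custom_kwarg"] == "x"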

        params = super().get_params(deep)
        cp = copy.copy(self)
        cp.__class__ = cp.__class__.__bases__[0]
        # if the immediate parent is a mixin, skip it (mixins don't define get_params())
@trivialfis (Member) commented Dec 2, 2024:

Do you think it's more general to check for the get_params attribute instead of checking hardcoded mixins? The current check seems to defeat the purpose of having a polymorphic structure (inheritance).
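
i.e., something along these lines (a sketch of the suggestion under stand-in class names, not the code currently in the PR):

import copy

from sklearn.base import BaseEstimator, ClassifierMixin


class Model(BaseEstimator):
    """Stand-in for XGBModel, whose get_params() walks up the class hierarchy."""

    def get_params(self, deep: bool = True) -> dict:
        params = super().get_params(deep)
        cp = copy.copy(self)
        parent = cp.__class__.__bases__[0]
        # generic check: recurse only if the immediate parent actually
        # provides get_params(), rather than naming mixin classes to skip
        if hasattr(parent, "get_params"):
            cp.__class__ = parent
            params.update(cp.__class__.get_params(cp, deep))
        return params


class Classifier(ClassifierMixin, Model):
    """Stand-in for XGBClassifier: a mixin sits first among the bases."""

    def __init__(self, n_estimators: int = 100):
        self.n_estimators = n_estimators


print(Classifier(n_estimators=10).get_params())  # {'n_estimators': 10}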
