DOC Replace boston dataset in ensemble.rst (scikit-learn#16876)
lucyleeow authored May 19, 2020
1 parent b632762 commit 3ff3981
Showing 1 changed file with 34 additions and 27 deletions.

doc/modules/ensemble.rst
@@ -555,17 +555,15 @@ for regression which can be specified via the argument
5.00...

The figure below shows the results of applying :class:`GradientBoostingRegressor`
with least squares loss and 500 base learners to the diabetes dataset
(:func:`sklearn.datasets.load_diabetes`).
The plot on the left shows the train and test error at each iteration.
The train error at each iteration is stored in the
:attr:`~GradientBoostingRegressor.train_score_` attribute
of the gradient boosting model. The test error at each iteration can be obtained
via the :meth:`~GradientBoostingRegressor.staged_predict` method which returns a
generator that yields the predictions at each stage. Plots like these can be used
to determine the optimal number of trees (i.e. ``n_estimators``) by early stopping.
The plot on the right shows the impurity-based feature importances which can be
obtained via the ``feature_importances_`` property.
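
For instance, a minimal sketch (assuming a simple train/test split of the
diabetes data; exact values depend on the split) of how these per-iteration
errors and the feature importances could be collected::

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_diabetes(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> est = GradientBoostingRegressor(n_estimators=500).fit(X_train, y_train)
>>> # Training loss at each iteration, stored on the fitted model
>>> train_errors = est.train_score_
>>> # Test error at each iteration, one set of predictions per stage
>>> test_errors = [mean_squared_error(y_test, y_pred)
...                for y_pred in est.staged_predict(X_test)]
>>> # Impurity-based feature importances
>>> importances = est.feature_importances_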

.. figure:: ../auto_examples/ensemble/images/sphx_glr_plot_gradient_boosting_regression_001.png
:target: ../auto_examples/ensemble/plot_gradient_boosting_regression.html
@@ -1348,18 +1346,18 @@ Usage

The following example shows how to fit the VotingRegressor::

>>> from sklearn.datasets import load_diabetes
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.ensemble import VotingRegressor

>>> # Loading some example data
>>> X, y = load_diabetes(return_X_y=True)

>>> # Training regressors
>>> reg1 = GradientBoostingRegressor(random_state=1)
>>> reg2 = RandomForestRegressor(random_state=1)
>>> reg3 = LinearRegression()
>>> ereg = VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)])
>>> ereg = ereg.fit(X, y)
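
The fitted ensemble can then be used for prediction; each individual
regressor's prediction is averaged (a minimal usage sketch, with the output
skipped since the exact values depend on the data)::

>>> ereg.predict(X[:2])  # doctest: +SKIP
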
@@ -1392,26 +1390,30 @@ are stacked together in parallel on the input data. It should be given as a
list of names and estimators::

>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> from sklearn.svm import SVR
>>> from sklearn.neighbors import KNeighborsRegressor
>>> estimators = [('ridge', RidgeCV()),
... ('lasso', LassoCV(random_state=42)),
... ('knr', KNeighborsRegressor(n_neighbors=20,
... metric='euclidean'))]

The `final_estimator` will use the predictions of the `estimators` as input. It
needs to be a classifier or a regressor when using :class:`StackingClassifier`
or :class:`StackingRegressor`, respectively::

>>> from sklearn.ensemble import GradientBoostingRegressor
>>> from sklearn.ensemble import StackingRegressor
>>> final_estimator = GradientBoostingRegressor(
... n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
... random_state=42)
>>> reg = StackingRegressor(
... estimators=estimators,
... final_estimator=final_estimator)

To train the `estimators` and `final_estimator`, the `fit` method needs
to be called on the training data::

>>> from sklearn.datasets import load_diabetes
>>> X, y = load_diabetes(return_X_y=True)
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
... random_state=42)
@@ -1437,21 +1439,21 @@ any other regressor or classifier, exposing a `predict`, `predict_proba`, and
>>> y_pred = reg.predict(X_test)
>>> from sklearn.metrics import r2_score
>>> print('R2 score: {:.2f}'.format(r2_score(y_test, y_pred)))
R2 score: 0.53

Note that it is also possible to get the output of the stacked
`estimators` using the `transform` method::

>>> reg.transform(X_test[:5])
array([[142..., 138..., 146...],
       [179..., 182..., 151...],
       [139..., 132..., 158...],
       [286..., 292..., 225...],
       [126..., 124..., 164...]])
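
Each column corresponds to one of the stacked `estimators`, in the order in
which they were passed (a small sketch checking the shape rather than the
values)::

>>> reg.transform(X_test[:5]).shape  # doctest: +SKIP
(5, 3)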

In practice, a stacking predictor predicts as well as the best predictor of the
base layer, and sometimes even outperforms it by combining the different
strengths of these predictors. However, training a stacking predictor is
computationally expensive.
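
As a quick check, each base estimator can also be fitted and scored on its own
and compared with the stacked model (a hypothetical sketch reusing the split
and the estimators defined above; the scores depend on the data)::

>>> for name, est in estimators:  # doctest: +SKIP
...     print(name, est.fit(X_train, y_train).score(X_test, y_test))
>>> print('stacking', reg.score(X_test, y_test))  # doctest: +SKIP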

.. note::
@@ -1464,22 +1466,27 @@
Multiple stacking layers can be achieved by assigning `final_estimator` to
a :class:`StackingClassifier` or :class:`StackingRegressor`::

>>> final_layer_rfr = RandomForestRegressor(
... n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
>>> final_layer_gbr = GradientBoostingRegressor(
... n_estimators=10, max_features=1, max_leaf_nodes=5, random_state=42)
>>> final_layer = StackingRegressor(
... estimators=[('rf', final_layer_rfr),
... ('gbrt', final_layer_gbr)],
... final_estimator=RidgeCV()
... )
>>> multi_layer_regressor = StackingRegressor(
... estimators=[('ridge', RidgeCV()),
... ('lasso', LassoCV(random_state=42)),
... ('knr', KNeighborsRegressor(n_neighbors=20,
... metric='euclidean'))],
... final_estimator=final_layer
... )
>>> multi_layer_regressor.fit(X_train, y_train)
StackingRegressor(...)
>>> print('R2 score: {:.2f}'
... .format(multi_layer_regressor.score(X_test, y_test)))
R2 score: 0.53
