
Commit

DOC Fixed documents that refer to Bunch object scikit-learn#16438 (scikit-learn#16447)

* Added links to utils.Bunch and fixed format of the docstring in datasets

* Added links to utils.Bunch in sklearn.compose

* Added links to utils.Bunch in sklearn.tree

* Added links to utils.Bunch in sklearn.ensemble

* Added links to utils.Bunch in sklearn.inspection

* Added links to utils.Bunch in sklearn.pipeline

* modified docstring of Bunch

* Added links to utils.Bunch to index.rst of sklearn.datasets

* Fixed some docstrings because the lines are too long

* Fixed some points as reviewed.

* Add links and delete 'for more information...'

* Fixed indent

* Fixed forgotten points.

* Fixed some points as reviewed.
CastaChick authored Feb 27, 2020
1 parent 54cbf42 commit ca78d75
Showing 18 changed files with 249 additions and 205 deletions.
60 changes: 32 additions & 28 deletions doc/datasets/index.rst
@@ -21,46 +21,50 @@ also possible to generate synthetic data.
General dataset API
===================

There are three main kinds of dataset interfaces that can be used to get
datasets, depending on the desired type of dataset.

**The dataset loaders.** They can be used to load small standard datasets,
described in the :ref:`toy_datasets` section.

**The dataset fetchers.** They can be used to download and load larger datasets,
described in the :ref:`real_world_datasets` section.

Both loader and fetcher functions return a :class:`sklearn.utils.Bunch`
object holding at least two items:
an array of shape ``n_samples`` * ``n_features`` with
key ``data`` (except for 20newsgroups) and a numpy array of
length ``n_samples``, containing the target values, with key ``target``.

The Bunch object is a dictionary that exposes its keys as attributes.
For more information about the Bunch object, see :class:`sklearn.utils.Bunch`.
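As a quick sketch of that behavior (assuming scikit-learn is available), one of the toy loaders shows the same value reachable by key or by attribute:

```python
from sklearn.datasets import load_iris

# Loaders return a Bunch: a dict whose keys double as attributes.
iris = load_iris()
assert iris["data"] is iris.data  # same array either way
print(iris.data.shape)            # (150, 4): n_samples x n_features
print(iris.target.shape)          # (150,): one target value per sample
```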

It's also possible for almost all of these functions to constrain the output
to be a tuple containing only the data and the target, by setting the
``return_X_y`` parameter to ``True``.
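For instance (a minimal sketch using one of the toy loaders):

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch and yields a plain (data, target) tuple.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)  # (150, 4) (150,)
```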

The datasets also contain a full description in their ``DESCR`` attribute and
some contain ``feature_names`` and ``target_names``. See the dataset
descriptions below for details.

**The dataset generation functions.** They can be used to generate controlled
synthetic datasets, described in the :ref:`sample_generators` section.

These functions return a tuple ``(X, y)`` consisting of a ``n_samples`` *
``n_features`` numpy array ``X`` and an array of length ``n_samples``
containing the targets ``y``.
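A hedged sketch with one of the generators (the parameter values here are illustrative):

```python
from sklearn.datasets import make_classification

# Generators return a plain (X, y) tuple, not a Bunch.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
print(X.shape, y.shape)  # (100, 20) (100,)
```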

In addition, there are also miscellaneous tools to load datasets of other
formats or from other locations, described in the :ref:`loading_other_datasets`
section.

.. _toy_datasets:

Toy datasets
============

scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.

They can be loaded using the following functions:

@@ -484,17 +488,17 @@ Loading from external datasets
scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
matrices. Other types that are convertible to numeric arrays such as pandas
DataFrame are also acceptable.

Here are some recommended ways to load standard columnar data into a
format usable by scikit-learn:

* `pandas.io <https://pandas.pydata.org/pandas-docs/stable/io.html>`_
provides tools to read data from common formats including CSV, Excel, JSON
and SQL. DataFrames may also be constructed from lists of tuples or dicts.
Pandas handles heterogeneous data smoothly and provides tools for
manipulation and conversion into a numeric array suitable for scikit-learn.
* `scipy.io <https://docs.scipy.org/doc/scipy/reference/io.html>`_
specializes in binary formats often used in scientific computing
  contexts, such as .mat and .arff
* `numpy/routines.io <https://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
for standard loading of columnar data into numpy arrays
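As a minimal sketch of the pandas route (the inline CSV here is a stand-in for a real file path passed to ``read_csv``):

```python
import io

import pandas as pd

# A tiny inline CSV; in practice this would be a path such as "data.csv".
csv = io.StringIO("f1,f2,label\n1.0,2.0,0\n3.0,4.0,1\n")
df = pd.read_csv(csv)
X = df.drop(columns=["label"]).to_numpy()  # numeric feature matrix
y = df["label"].to_numpy()                 # target vector
print(X.shape, y.shape)  # (2, 2) (2,)
```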
@@ -508,18 +512,18 @@ For some miscellaneous data such as images, videos, and audio, you may wish to
refer to:

* `skimage.io <https://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
`Imageio <https://imageio.readthedocs.io/en/latest/userapi.html>`_
for loading images and videos into numpy arrays
* `scipy.io.wavfile.read
<https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.io.wavfile.read.html>`_
for reading WAV files into a numpy array

Categorical (or nominal) features stored as strings (common in pandas DataFrames)
will need converting to numerical features using :class:`sklearn.preprocessing.OneHotEncoder`
or :class:`sklearn.preprocessing.OrdinalEncoder` or similar.
See :ref:`preprocessing`.
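A small sketch of that conversion (the column name is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A string-valued categorical column, as commonly found in a DataFrame.
df = pd.DataFrame({"color": ["red", "green", "red"]})
encoded = OneHotEncoder().fit_transform(df[["color"]])  # sparse by default
print(encoded.toarray())
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
```

The columns follow the alphabetically sorted categories, here ``['green', 'red']``.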

Note: if you manage your own numerical data it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
2 changes: 1 addition & 1 deletion sklearn/compose/_column_transformer.py
@@ -124,7 +124,7 @@ class ColumnTransformer(TransformerMixin, _BaseComposition):
``len(transformers_)==len(transformers)+1``, otherwise
``len(transformers_)==len(transformers)``.
named_transformers_ : :class:`~sklearn.utils.Bunch`
Read-only attribute to access any transformer by given name.
Keys are transformer names and values are the fitted transformer
objects.
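A minimal sketch of that access pattern (the transformer name and data are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 1.0], [2.0, 3.0]])
ct = ColumnTransformer([("scale", StandardScaler(), [0, 1])]).fit(X)

# named_transformers_ is a Bunch: key access and attribute access
# both return the fitted transformer registered under "scale".
print(ct.named_transformers_["scale"] is ct.named_transformers_.scale)  # True
```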
78 changes: 50 additions & 28 deletions sklearn/datasets/_base.py
@@ -163,12 +163,20 @@ def load_files(container_path, description=None, categories=None,
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : list of str
Only present when `load_content=True`.
The raw text data to learn.
target : ndarray
The target labels (integer index).
target_names : list
The names of target classes.
DESCR : str
The full description of the dataset.
filenames : ndarray
The filenames holding the dataset.
"""
target = []
target_names = []
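To illustrate the container layout ``load_files`` expects (directory and file names here are hypothetical), one subdirectory per category:

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a minimal container: one subdirectory per category, one text file each.
root = Path(tempfile.mkdtemp())
for category, text in [("pos", "a good sample"), ("neg", "a bad sample")]:
    subdir = root / category
    subdir.mkdir()
    (subdir / "sample.txt").write_text(text)

bunch = load_files(str(root), encoding="utf-8")
print(bunch.target_names)  # folder names become the class names
print(len(bunch.data))     # 2
```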
@@ -295,8 +303,8 @@ def load_wine(return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (178, 13)
The data matrix. If `as_frame=True`, `data` will be a pandas
@@ -409,8 +417,8 @@ def load_iris(return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (150, 4)
The data matrix. If `as_frame=True`, `data` will be a pandas
@@ -521,8 +529,8 @@ def load_breast_cancer(return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (569, 30)
The data matrix. If `as_frame=True`, `data` will be a pandas
@@ -645,8 +653,8 @@ def load_digits(n_class=10, return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (1797, 64)
The flattened data matrix. If `as_frame=True`, `data` will be
@@ -759,8 +767,8 @@ def load_diabetes(return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (442, 10)
The data matrix. If `as_frame=True`, `data` will be a pandas
@@ -853,8 +861,8 @@ def load_linnerud(return_X_y=False, as_frame=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : {ndarray, dataframe} of shape (20, 3)
The data matrix. If `as_frame=True`, `data` will be a pandas
@@ -943,12 +951,21 @@ def load_boston(return_X_y=False):
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : ndarray of shape (506, 13)
The data matrix.
target : ndarray of shape (506,)
The regression target.
filename : str
The physical location of boston csv dataset.
.. versionadded:: 0.20
DESCR : str
The full description of the dataset.
feature_names : ndarray
The names of the features.
(data, target) : tuple if ``return_X_y`` is True
@@ -1007,10 +1024,15 @@ def load_sample_images():
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
images : list of ndarray of shape (427, 640, 3)
The two sample images.
filenames : list
The filenames for the images.
DESCR : str
The full description of the dataset.
Examples
--------
29 changes: 14 additions & 15 deletions sklearn/datasets/_california_housing.py
@@ -87,21 +87,20 @@ def fetch_california_housing(data_home=None, download_if_missing=True,
Returns
-------
dataset : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : ndarray, shape (20640, 8)
Each row corresponding to the 8 feature values in order.
If ``as_frame`` is True, ``data`` is a pandas object.
target : numpy array of shape (20640,)
Each value corresponds to the average
house value in units of 100,000.
If ``as_frame`` is True, ``target`` is a pandas object.
feature_names : list of length 8
Array of ordered feature names used in the dataset.
DESCR : string
Description of the California housing dataset.
(data, target) : tuple if ``return_X_y`` is True
22 changes: 11 additions & 11 deletions sklearn/datasets/_covtype.py
@@ -81,17 +81,17 @@ def fetch_covtype(data_home=None, download_if_missing=True,
Returns
-------
dataset : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : numpy array of shape (581012, 54)
Each row corresponds to the 54 features in the dataset.
target : numpy array of shape (581012,)
Each value corresponds to one of
the 7 forest covertypes with values
ranging from 1 to 7.
DESCR : str
Description of the forest covertype dataset.
(data, target) : tuple if ``return_X_y`` is True
24 changes: 15 additions & 9 deletions sklearn/datasets/_kddcup99.py
@@ -96,11 +96,15 @@ def fetch_kddcup99(subset=None, data_home=None, shuffle=False,
Returns
-------
data : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : ndarray of shape (494021, 41)
The data matrix to learn.
target : ndarray of shape (494021,)
The regression target for each sample.
DESCR : str
The full description of the dataset.
(data, target) : tuple if ``return_X_y`` is True
@@ -190,13 +194,15 @@ def _fetch_brute_kddcup99(data_home=None,
Returns
-------
dataset : :class:`~sklearn.utils.Bunch`
Dictionary-like object, with the following attributes.
data : numpy array of shape (494021, 41)
Each row corresponds to the 41 features in the dataset.
target : numpy array of shape (494021,)
Each value corresponds to one of the 21 attack types or to the
label 'normal.'.
DESCR : string
Description of the kddcup99 dataset.
"""
