From 5a53892cb8846d7b915d4bcf83f53f35b4a78d35 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Tue, 26 Nov 2019 18:14:32 -0500 Subject: [PATCH 01/23] WIP --- slep011/proposal.rst | 379 +++++++++++++++++++++++++++++++++++++++++++ under_review.rst | 2 +- 2 files changed, 380 insertions(+), 1 deletion(-) create mode 100644 slep011/proposal.rst diff --git a/slep011/proposal.rst b/slep011/proposal.rst new file mode 100644 index 0000000..16dc10e --- /dev/null +++ b/slep011/proposal.rst @@ -0,0 +1,379 @@ +.. _slep_010: + +===================== +SLEP011: random state +===================== + +:Author: Nicolas Hug +:Status: Under review +:Type: Standards Track +:Created: 2019-11-23 + +Current status +============== + +`random_state` parameters are used commonly in estimators, and in CV +splitters. They can be either int, RandomState instances, or None. The +parameter is stored as an attribute in init and never changes, as per our +convention. + +`fit` and `split` usually look like this:: + + def fit(self, X, y): # or split(self, X, y) + rng = check_random_state(self.random_state) + ... + rng.some_method() # directly use instance, e.g. for sampling + some_function_or_class(random_state=rng) # pass it down to other tools + # e.g. if self is a meta-estimator + +`check_random_state` behaves as follows:: + + def check_random_state(x): + if x is an int, return np.RandomState(seed=x) + if x is a RandomState instance, return x + if x is None, return numpy's RandomState singleton instance + +Accepting instances or None comes with different complications, listed below. +`None` is the current default for all estimators / splitters. + + +Problems with passing RandomState instances or None +=================================================== + +The main issue with RandomState instances and None is that it makes the +estimator stateful accross calls to `fit`. Almost all the other issues are +consequences of this fact. 
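As a concrete, runnable reference for the discussion below, the `check_random_state` semantics can be sketched as follows. This is a simplified illustration; the real helper is `sklearn.utils.check_random_state` and handles more cases:

```python
import numpy as np

def check_random_state(x):
    # Simplified sketch of the helper described above; the real
    # implementation lives in sklearn.utils.check_random_state.
    if x is None:
        return np.random.mtrand._rand  # numpy's global RandomState singleton
    if isinstance(x, (int, np.integer)):
        return np.random.RandomState(seed=x)
    if isinstance(x, np.random.RandomState):
        return x
    raise ValueError("%r cannot be used to seed a RandomState instance" % x)

# An int always yields a freshly seeded, reproducible RandomState:
assert (check_random_state(0).randint(100)
        == check_random_state(0).randint(100))
# An instance is passed through unchanged:
rng = np.random.RandomState(42)
assert check_random_state(rng) is rng
```

Note that the int case builds a *new* RNG on every call, while the instance and None cases return a *shared* object — which is exactly where the statefulness issues below come from.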
+
+Statefulness and violation of fit idempotence
+---------------------------------------------
+
+Estimators should be stateless. That is, any call to `fit` should forget
+whatever happened to previous calls to `fit`, provided that no warm-start is
+happening. Whether CV splitters should be stateful or not is debatable, and
+that point is discussed below.
+
+Another related convention is that the `fit` method should be idempotent:
+calling `est.fit(X, y)` and then `est.fit(X, y)` again should yield the same
+model both times.
+
+These properties are key for enforcing reproducibility. We have common
+checks for them.
+
+If a `RandomState` instance or None is used, the idempotency property may be
+violated::
+
+    rng = RandomState(0)
+    est = Est(..., random_state=rng)
+    est.fit(...) # consume rng
+    est.fit(...)
+
+The model inferred the second time isn't the same as the previous one since
+the rng has been consumed during the first call. The statefulness property
+isn't respected either since the first call to `fit` has an impact on the
+second.
+
+The same goes with passing `None`.
+
+A related issue is that the `rng` may be consumed outside of the estimator.
+The estimator isn't "self contained" anymore and its behaviour now depends
+on things that happen outside of it.
+
+Countless bugs
+--------------
+
+Quoting @amueller from `14042
+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs
+have happened over the years because of RandomState instances and None.
+
+These bugs are often hard to find, and some of them are actual data leaks,
+see e.g. `14034
+<https://github.com/scikit-learn/scikit-learn/issues/14034>`_. Some of these
+bugs have been around forever and we just haven't discovered them yet.
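Before looking at a concrete bug, the core mechanism — a shared RNG being consumed across calls to `fit` — can be demonstrated with a toy estimator. This is hypothetical code, not an actual scikit-learn estimator; it only mimics the `fit` pattern from the introduction:

```python
import numpy as np

class ToyEstimator:
    # Hypothetical estimator that draws one random coefficient in fit.
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X=None, y=None):
        if isinstance(self.random_state, np.random.RandomState):
            rng = self.random_state  # shared instance: consumed at each fit
        else:
            rng = np.random.RandomState(self.random_state)  # int or None
        self.coef_ = rng.random_sample()
        return self

shared_rng = np.random.RandomState(0)
est = ToyEstimator(random_state=shared_rng)
first = est.fit().coef_
second = est.fit().coef_
assert first != second  # fit is NOT idempotent with a RandomState instance

est = ToyEstimator(random_state=0)
assert est.fit().coef_ == est.fit().coef_  # an int seed restores idempotency
```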
+ +The bug in `14034 +<https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the +validation subset for early-stopping was computed using `self.random_state`: +in a warm-starting context, that subset may be different accross multiple +calls to `fit`, since the RNG is getting consumed. We instead want the +subset to be the same for all calls to `fit`. The fix was to store a private +seed that was generated the first time `fit` is called. + +This is a typical example of many other similar bugs that we need to +monkey-patch with potentially overly complex logic. + +Cloning may return a different estimator +---------------------------------------- + +`my_clone = clone(est)` returns an *unfitted* estimator whose parameters are +(supposed to be) the same as those that were passed when `est` was +instanciated. Whether +*clone* is a good name for describing this process is another issue, but the +semantic of cloning is scikit-learn is as described above. We can think of +*cloning* as *reset with initial parameters*. + +That semantic is not respected if `est` was instanciated with an instance or +with None:: + + rng = RandomState(0) + est = Est(..., random_state=rng) + est.fit(X, y) # consume RNG here + my_clone = clone(est) + my_clone.fit(X, y) # not the same model! + +`my_clone` isn't the same as what `est` was, since the RNG has been consumed +in-between. Fitting `my_clone` on the same data will not give the same model +as `est`. While this is not desirable when an instance is passed, one might +argue that this is the desired behaviour when None is passed. + +In addition, `est` and `my_clone` are now interdependent because they share the +same `rng` instance. + +Incompatibility with third-party estimators +-------------------------------------------- + +From the snippet above in the introduction, most meta-estimators will +directly pass down `self.random_state` to their sub-estimators. 
Some +third-party libraries only support integers as `random_state`, not instances +or None. See e.g. `15611 +<https://github.com/scikit-learn/scikit-learn/issues/15611>`_ + +CV-Splitters statefulness +------------------------- + +CV-splitters are stateful:: + + rng = np.random.RandomState(0) + cv = KFolds(shuffle=True, random_state=rng) + a = cv.split(X, y) + b = cv.split(X, y) # different from a + +`a` and `b` are different splits, because of how `split` is implemented (see +introduction above). + +This behaviour is inconsistent for two reasons. + +The first one is that if `rng` were an int, then `a` and `b` would have been +equal. Indeed, the behaviour of the CV splitter depends on the +**type** of the `random_state` parameter:: + +- int -> stateless, get the same splits each time you call split() +- None or instance -> stateful, get different splits each time you call split() + +Concretely, we have a method (`split`) whose behaviour depends on the *type* +of a parameter that was passed to `init`. We can argue that this is a common +pattern in object-oriented design, but in the case of the `random_state` +parameter, this is potentially confusing. + +The second inconsistency is that splitters are stateful by design, while we +want our estimators to be stateless. Granted, splitters aren't estimators. +But, quoting `@GaelVaroquaux, +<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_, +consistency is one thing that we are really good at. + +So it is important to have the splitters consistent with the estimators, +w.r.t. the statelessness property. The current behaviour is not necessarily +clear for users. Fixing how random_state is handled in the splitters is one +of the entries in the `Roadmap +<https://scikit-learn.org/dev/roadmap.html>`_. + +Some users may rely on the current behaviour, `to implement e.g. +bootstrapping, +<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_. 
+If we make the splitters stateless, the "old" behaviour can be easily +reproduced by simply creating new CV instances, instead of calling `split` +on the same instance. Instances are dead cheap to create. + +Potential bugs in custom parameter searches +------------------------------------------- + +(This issue is a direct consequence of the splitters being stateful.) + +We have a private API for subclassing BaseSearchCV and implementing custom +parameter search strategies. The contract is that the custom class should +override the `_run_search(evaluate_candidate, ...)` method which itself must +call the `evaluate_candidates()` closure, were `cv.split()` will be called. + +Third-party developers may only *call* `evaluate_candidates()`, not change +its content. Now, since `cv.split()` is called in `evaluate_candadates()`, +that means that `evalute_candidates()` will potentially evaluate its +candidates parameters **on different splits** each time it is called. + +This is a quite subtle issue that third-party developers might easily +overlook. + +Depending on the intended behaviour of the parameter search, this may or may +not be a good thing. This is typically a bug if we implement successive +halving + warm start (details ommitted here, you may refer to `this issue +<https://github.com/scikit-learn/scikit-learn/issues/15125>`_ for some more +details). + +Proposed Solution +================= + +We need a solution that fixes the statefulness of the estimators and the +splitters. + +The proposed solution is to store the *state* of a RandomState instance in +`__init__`:: + + def __init__(self, ..., random_state=None): + self.random_state = check_random_state(random_state).get_state() + +That `random_state` attribute is a tuple with about 620 integers, as returned +by `get_state() +<https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.get_state.html#numpy.random.get_state>`_. 
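What such a stored state looks like, and how restoring it reproduces an identical random stream, can be checked with plain numpy. This is a sketch of the mechanics only, not the actual implementation:

```python
import numpy as np

# Capture the full RNG state once, as the proposed __init__ would.
state = np.random.RandomState(0).get_state()

assert state[0] == 'MT19937'  # algorithm name
assert len(state[1]) == 624   # the ~620 integers mentioned above

def draw(stored_state):
    # What fit/split would do: rebuild an RNG from the stored state.
    rng = np.random.RandomState()
    rng.set_state(stored_state)
    return rng.randint(1000)

# Since the stored state is never mutated, every call gives the same result:
assert draw(state) == draw(state)
```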
+
+That state is then used in `fit` or in `split` as follows::
+
+    def fit(self, X, y): # or split()
+        rng = np.random.RandomState()
+        rng.set_state(self.random_state)
+        # ... use rng as before
+
+Since `self.random_state` is a tuple that never changes, calling `fit` or
+`split` on the same instance always gives the same results.
+
+We want `__init__` and `set_params/get_params` to be consistent. To that end,
+we will need to special-case these methods::
+
+    def get_params(self):
+
+        random_state = np.random.RandomState()
+        random_state.set_state(self.random_state)
+        return {'random_state': random_state, ...}
+
+    def set_params(self, ...):
+
+        self.random_state = check_random_state(random_state).get_state() # same as in init
+
+`clone` does not need to be special-cased, because `get_params` does all the
+work. Note that the following::
+
+    est.set_params(random_state=est.get_params()['random_state'])
+
+behaves as expected and does not change the `random_state` attribute of the
+estimator. However, one should not use::
+
+    est.set_params(random_state=est.random_state)
+
+since `est.random_state` is neither an int, nor None, nor a RandomState
+instance: it is a tuple. We can error with a decent message in that case.
+
+Advantages:
+
+- It fixes the statefulness issue. `fit` is now idempotent. Calling `split` on
+  the same instance gives the same splits. In other words, it does what we
+  want.
+
+- It is relatively simple to implement, and not too intrusive.
+
+- Backward compatibility is preserved between scikit-learn versions. Let A
+  be a version with the current behaviour (say 0.22) and let B be a version
+  where the new behaviour is implemented. The models and the splits obtained
+  will be the same in A and in B. That property may not be respected with
+  other solutions, see below.
+
+- Both RandomState instances and None are still supported. We don't need to
+  deprecate the use of any of them.
+ +- As a bonus, the `self.random_state` attribute is an *actual* random state: + it is the state of some RNG. What we currently call `random_state` is not + a state but a RNG (though this is numpy's fault.) + +Drawbacks: + +- we break our convention that `__init__` should only ever store attributes, as + they are passed in. Note however that this convention is enforced so that + the semantic of `__init__` and `set_params` are the same, and to enable + people to change public attributes without having surprising behaviour. + **This is still respected here.** So this isn't really an issue. + +- there is a subtelty that occurs when passing `None`. `check_random_state` + will return the singleton `np.random.mtrand._rand`, and we will call + `get_state()` on the singleton. The thing is, its state won't change + accross multiple instances unless the singleton is consumed. So if we do + `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually + have exactly the same `random_state` attribute, since the state of the + singleton wasn't changed. One solution would be to consume the singleton's + RNG in `__init__`, with a private helper. + +- The `__repr__()` will need to special-case the `random_state` attribute to + avoid printing a long tuple. + +- We need to store about 620 integers. This is however negligible w.r.t. e.g. + the size of a typical dataset + +- It does not fix the issue about third-party packages only accepting integers. + This can however be fixed in meta-estimator, independently. + +Alternative solutions +===================== + +Store a seed instead of a state +-------------------------------- + +Instead of storing a state from `get_state()`, we could store a randomly +generated seed:: + + def __init__(self, ..., random_state=None): + self.random_state = check_random_state(random_state).randint(0. BIG_INT) + +Then instead of using `set_state` we could just use +`rng = RandomState(seed=self.random_state)` in `fit` or `split`. 
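This alternative can be sketched with another toy estimator. The class and `BIG_INT` are illustrative only; they are not part of the proposal's actual code:

```python
import numpy as np

BIG_INT = 2**31 - 1  # hypothetical upper bound for the drawn seed

class ToySeedEstimator:
    # Sketch of the "store a seed" alternative: draw an int seed once in
    # __init__ and re-seed a fresh RandomState on every fit call.
    def __init__(self, random_state=None):
        if not isinstance(random_state, np.random.RandomState):
            random_state = np.random.RandomState(random_state)
        # Store a plain int instead of a ~620-integer state tuple:
        self.random_state = random_state.randint(0, BIG_INT)

    def fit(self, X=None, y=None):
        rng = np.random.RandomState(seed=self.random_state)
        self.coef_ = rng.random_sample()
        return self

est = ToySeedEstimator(random_state=0)
assert est.fit().coef_ == est.fit().coef_  # fit is idempotent

# But two estimators built from the same shared instance get different
# seeds, since each __init__ consumes the instance:
rng = np.random.RandomState(0)
a, b = ToySeedEstimator(rng), ToySeedEstimator(rng)
assert a.random_state != b.random_state
```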
+ +Advantages: + +- It also fixes the third-party estimators issue, since we would be passing + self.random_state which is an int +- It's cheaper than storing 620 ints +- we only need to subcase `set_params()`. `get_params()` can be left as-is, + same for `clone`. + + +Drawbacks: + +- We want that + `est.set_params(random_state=est.get_params()['random_state'])` does not + change `est.random_state`. With this logic, this is not possible to enforce. + Also, clone will not work was expected. + +- It is not backward compatible between versions. For example if you passed + an int in version A (say 0.22), then in version B (with the new + behaviour), your estimator will not start with the same RNG when `fit` is + called the first time. Same for splitters. For this reason, storing a + state may be preferable. + +Store the state in fit instead of in init +----------------------------------------- + +Instead of storing the output of `get_state()` in `__init__`, we could store it +the first time `fit()` is called. For example:: + + def fit(self): + self._random_state = getattr(self, '_random_state', check_random_state(self.random_state).get_state()) + rng = np.random.RandomState() + rng.set_state(self._random_state) + # ... + +The advantage is that we respect our convention with `__init__`. + +However, `fit` idempotency isn't respected anymore: the first call to `fit` +clearly influences all the other ones. This also introduces a private +attribute, so we would need more intrusive changes to `set_params`, +`get_params`, and `clone`. + + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. 
[1]_ diff --git a/under_review.rst b/under_review.rst index 47ee5f8..6c83081 100644 --- a/under_review.rst +++ b/under_review.rst @@ -4,4 +4,4 @@ SLEPs under review .. toctree:: :maxdepth: 1 - slep010/proposal + slep011/proposal From 926ad397e08e4eaeee94f263b21af68bb0635c69 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Tue, 26 Nov 2019 18:54:19 -0500 Subject: [PATCH 02/23] still WIP --- slep011/proposal.rst | 72 ++++++++++++++++++++++++-------------------- 1 file changed, 39 insertions(+), 33 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 16dc10e..d91e09e 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -1,4 +1,4 @@ -.. _slep_010: +.. _slep_011: ===================== SLEP011: random state @@ -45,7 +45,7 @@ estimator stateful accross calls to `fit`. Almost all the other issues are consequences of this fact. Statefulness and violation of fit idempotence ---------------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Estimators should be stateless. That is, any call to `fit` should forget whatever happened to previous calls to `fit`, provided that no warm-start is @@ -79,7 +79,7 @@ The estimator isn't "self contained" anymore and its behaviour is now dependent on some stuff that happen outside. Countless bugs --------------- +~~~~~~~~~~~~~~ Quoting @amueller from `14042 <https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs @@ -102,7 +102,7 @@ This is a typical example of many other similar bugs that we need to monkey-patch with potentially overly complex logic. Cloning may return a different estimator ----------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ `my_clone = clone(est)` returns an *unfitted* estimator whose parameters are (supposed to be) the same as those that were passed when `est` was @@ -129,7 +129,7 @@ In addition, `est` and `my_clone` are now interdependent because they share the same `rng` instance. 
Incompatibility with third-party estimators --------------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From the snippet above in the introduction, most meta-estimators will directly pass down `self.random_state` to their sub-estimators. Some @@ -138,7 +138,7 @@ or None. See e.g. `15611 <https://github.com/scikit-learn/scikit-learn/issues/15611>`_ CV-Splitters statefulness -------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~ CV-splitters are stateful:: @@ -166,25 +166,23 @@ parameter, this is potentially confusing. The second inconsistency is that splitters are stateful by design, while we want our estimators to be stateless. Granted, splitters aren't estimators. -But, quoting `@GaelVaroquaux, +But, quoting `@GaelVaroquaux <https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_, consistency is one thing that we are really good at. - So it is important to have the splitters consistent with the estimators, w.r.t. the statelessness property. The current behaviour is not necessarily -clear for users. Fixing how random_state is handled in the splitters is one -of the entries in the `Roadmap -<https://scikit-learn.org/dev/roadmap.html>`_. +clear for users. -Some users may rely on the current behaviour, `to implement e.g. -bootstrapping, +Note Fixing how random_state is handled in the splitters is one of the +entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. Some +users may rely on the current behaviour, `to implement e.g. bootstrapping, <https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_. If we make the splitters stateless, the "old" behaviour can be easily reproduced by simply creating new CV instances, instead of calling `split` -on the same instance. Instances are dead cheap to create. +on the same instance. Instances are cheap to create. 
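The splitter statefulness discussed above can be reproduced with a toy shuffle-based splitter. This is hypothetical code mimicking the pattern; the real `KFold` logic differs:

```python
import numpy as np

class ToyShuffleSplit:
    # Toy splitter: split() draws from the RNG, so a shared instance is
    # consumed at each call, just like the CV splitters described above.
    def __init__(self, random_state=None):
        self.random_state = random_state

    def split(self, n_samples):
        if isinstance(self.random_state, np.random.RandomState):
            rng = self.random_state  # shared instance: stateful
        else:
            rng = np.random.RandomState(self.random_state)  # int or None
        return rng.permutation(n_samples)

rng = np.random.RandomState(0)
cv = ToyShuffleSplit(random_state=rng)
a = cv.split(100)
b = cv.split(100)
assert not np.array_equal(a, b)  # instance: different splits each call

cv = ToyShuffleSplit(random_state=0)
assert np.array_equal(cv.split(100), cv.split(100))  # int: same splits
```

Creating a fresh `ToyShuffleSplit(random_state=rng)` before each `split` call recovers the "new splits each time" behaviour, which is the cheap workaround mentioned for bootstrapping-style uses.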
Potential bugs in custom parameter searches -------------------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (This issue is a direct consequence of the splitters being stateful.) @@ -213,6 +211,9 @@ Proposed Solution We need a solution that fixes the statefulness of the estimators and the splitters. +A toy example of the proposed solution is implemented in this `notebook +<https://nbviewer.jupyter.org/gist/NicolasHug/1169ee253a4669ff993c947507ae2cb5>`_. + The proposed solution is to store the *state* of a RandomState instance in `__init__`:: @@ -282,20 +283,22 @@ Advantages: Drawbacks: -- we break our convention that `__init__` should only ever store attributes, as - they are passed in. Note however that this convention is enforced so that - the semantic of `__init__` and `set_params` are the same, and to enable - people to change public attributes without having surprising behaviour. - **This is still respected here.** So this isn't really an issue. +- We break our convention that `__init__` should only ever store attributes, as + they are passed in. Note however that the reason we have this convention + is that we want the semantic of `__init__` and `set_params` are the same, + and we want to enable people to change public attributes without having + surprising behaviour. **This is still respected here.** So this isn't + really an issue. -- there is a subtelty that occurs when passing `None`. `check_random_state` +- There is a subtelty that occurs when passing `None`. `check_random_state` will return the singleton `np.random.mtrand._rand`, and we will call `get_state()` on the singleton. The thing is, its state won't change accross multiple instances unless the singleton is consumed. So if we do `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually have exactly the same `random_state` attribute, since the state of the - singleton wasn't changed. 
One solution would be to consume the singleton's - RNG in `__init__`, with a private helper. + singleton wasn't changed. To circumvent this, the logic in `__init__` and + `set_params` involves a private helper that makes sure the singleton's RNG is + consumed. Please refer to the notebook. - The `__repr__()` will need to special-case the `random_state` attribute to avoid printing a long tuple. @@ -310,13 +313,13 @@ Alternative solutions ===================== Store a seed instead of a state --------------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Instead of storing a state from `get_state()`, we could store a randomly generated seed:: def __init__(self, ..., random_state=None): - self.random_state = check_random_state(random_state).randint(0. BIG_INT) + self.random_state = check_random_state(random_state).randint(0, BIG_INT) Then instead of using `set_state` we could just use `rng = RandomState(seed=self.random_state)` in `fit` or `split`. @@ -326,9 +329,8 @@ Advantages: - It also fixes the third-party estimators issue, since we would be passing self.random_state which is an int - It's cheaper than storing 620 ints -- we only need to subcase `set_params()`. `get_params()` can be left as-is, - same for `clone`. - +- We don't need to artificially consume the singleton's RNG since it is + de-facto consumed anyway. Drawbacks: @@ -340,16 +342,15 @@ Drawbacks: - It is not backward compatible between versions. For example if you passed an int in version A (say 0.22), then in version B (with the new behaviour), your estimator will not start with the same RNG when `fit` is - called the first time. Same for splitters. For this reason, storing a - state may be preferable. + called the first time. Same for splitters. 
-Store the state in fit instead of in init ------------------------------------------ +Store the state in fit/split instead of in init +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Instead of storing the output of `get_state()` in `__init__`, we could store it the first time `fit()` is called. For example:: - def fit(self): + def fit(self): # or split() self._random_state = getattr(self, '_random_state', check_random_state(self.random_state).get_state()) rng = np.random.RandomState() rng.set_state(self._random_state) @@ -363,6 +364,11 @@ attribute, so we would need more intrusive changes to `set_params`, `get_params`, and `clone`. +Execution +========= + +That change might be a 1.0 thing. + References and Footnotes ------------------------ From e5ba859a22e496ef6829554512f770828631727d Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Wed, 27 Nov 2019 08:28:28 -0500 Subject: [PATCH 03/23] more --- slep011/proposal.rst | 129 +++++++++++++++++++++++-------------------- 1 file changed, 70 insertions(+), 59 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index d91e09e..6a311ec 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -1,23 +1,23 @@ .. _slep_011: -===================== -SLEP011: random state -===================== +==================================== +SLEP011: Fixing randomness behaviour +==================================== :Author: Nicolas Hug :Status: Under review :Type: Standards Track -:Created: 2019-11-23 +:Created: 2019-11-27 -Current status -============== +How we currently handle randomness +================================== `random_state` parameters are used commonly in estimators, and in CV splitters. They can be either int, RandomState instances, or None. The parameter is stored as an attribute in init and never changes, as per our convention. 
-`fit` and `split` usually look like this:: +`fit` and `split` methods typically look like this:: def fit(self, X, y): # or split(self, X, y) rng = check_random_state(self.random_state) @@ -33,16 +33,17 @@ convention. if x is a RandomState instance, return x if x is None, return numpy's RandomState singleton instance -Accepting instances or None comes with different complications, listed below. -`None` is the current default for all estimators / splitters. - +Accepting RandomState instances or None comes with different complications, +listed below. `None` is the current default for all estimators / splitters. Problems with passing RandomState instances or None =================================================== The main issue with RandomState instances and None is that it makes the -estimator stateful accross calls to `fit`. Almost all the other issues are -consequences of this fact. +estimators and the splitters stateful accross calls to `fit` or `split`. As +a result, different calls to `fit` or `split` on the same instance yield +different results. Almost all the other issues are consequences of this +fact. Statefulness and violation of fit idempotence ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -72,7 +73,8 @@ the rng has been consumed during the first called. The statefulness property isn't respected either since the first call to `fit` has an impact on the second. -The same goes with passing `None`. +The same goes with passing `None`: you get different models every time you call +`fit`. A related issue is that the `rng` may be consumed outside of the estimator. The estimator isn't "self contained" anymore and its behaviour is now @@ -93,10 +95,10 @@ bugs have been around forever and we just haven't discovered them yet. 
The bug in `14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the validation subset for early-stopping was computed using `self.random_state`: -in a warm-starting context, that subset may be different accross multiple -calls to `fit`, since the RNG is getting consumed. We instead want the -subset to be the same for all calls to `fit`. The fix was to store a private -seed that was generated the first time `fit` is called. +that subset is different accross multiple calls to `fit`, since the RNG is +getting consumed. This is a problem when doing warm-start, because we want +the subset to be the same for all calls to `fit` in this case. The fix was +to store a private seed that was generated the first time `fit` is called. This is a typical example of many other similar bugs that we need to monkey-patch with potentially overly complex logic. @@ -153,7 +155,7 @@ introduction above). This behaviour is inconsistent for two reasons. The first one is that if `rng` were an int, then `a` and `b` would have been -equal. Indeed, the behaviour of the CV splitter depends on the +equal. As a result, the behaviour of the CV splitter depends on the **type** of the `random_state` parameter:: - int -> stateless, get the same splits each time you call split() @@ -173,18 +175,14 @@ So it is important to have the splitters consistent with the estimators, w.r.t. the statelessness property. The current behaviour is not necessarily clear for users. -Note Fixing how random_state is handled in the splitters is one of the -entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. Some -users may rely on the current behaviour, `to implement e.g. bootstrapping, -<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_. -If we make the splitters stateless, the "old" behaviour can be easily -reproduced by simply creating new CV instances, instead of calling `split` -on the same instance. Instances are cheap to create. 
+Note that fixing how random_state is handled in the splitters is one of the +entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. Potential bugs in custom parameter searches ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -(This issue is a direct consequence of the splitters being stateful.) +This issue is a direct consequence of the splitters being stateful. It's also +more advanced than the rest, you may want to skip it. We have a private API for subclassing BaseSearchCV and implementing custom parameter search strategies. The contract is that the custom class should @@ -193,8 +191,8 @@ call the `evaluate_candidates()` closure, were `cv.split()` will be called. Third-party developers may only *call* `evaluate_candidates()`, not change its content. Now, since `cv.split()` is called in `evaluate_candadates()`, -that means that `evalute_candidates()` will potentially evaluate its -candidates parameters **on different splits** each time it is called. +that means that `evalute_candidates()` will evaluate the candidate +parameters **on different splits** each time it is called. This is a quite subtle issue that third-party developers might easily overlook. @@ -209,21 +207,21 @@ Proposed Solution ================= We need a solution that fixes the statefulness of the estimators and the -splitters. +splitters. Most of the remaining issues would be fixed as a consequence. A toy example of the proposed solution is implemented in this `notebook -<https://nbviewer.jupyter.org/gist/NicolasHug/1169ee253a4669ff993c947507ae2cb5>`_. +<https://gist.github.com/NicolasHug/1169ee253a4669ff993c947507ae2cb5>`_. +The bulk of the solution is to manipulate actual random *states*, as +returned by `get_state() +<https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.get_state.html#numpy.random.get_state>`_. 
-The proposed solution is to store the *state* of a RandomState instance in +Specifically, we would store the *state* of a RandomState instance in `__init__`:: def __init__(self, ..., random_state=None): self.random_state = check_random_state(random_state).get_state() -That `random_state` attribute is a tuple with about 620 integers, as returned -by `get_state() -<https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.get_state.html#numpy.random.get_state>`_. - +That `random_state` attribute is a tuple with about 620 integers. That state is then used in `fit` or in `split` as follows:: def fit(self, X, y): # or split() @@ -231,8 +229,8 @@ That state is then used in `fit` or in `split` as follows:: rng.set_state(self.random_state) # ... use rng as before -Since `self.random_state` is a tuple that never changes, calling `fit` or -`split` on the same instance always gives the same results. +Since `self.random_state` is an immutable tuple that never changes, calling +`fit` or `split` on the same instance always gives the same results. We want `__init__` and `set_params/get_params` to be consistent. To that end, we will need to special-case these methods:: @@ -293,7 +291,7 @@ Drawbacks: - There is a subtelty that occurs when passing `None`. `check_random_state` will return the singleton `np.random.mtrand._rand`, and we will call `get_state()` on the singleton. The thing is, its state won't change - accross multiple instances unless the singleton is consumed. So if we do + unless the singleton is consumed. So if we do `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually have exactly the same `random_state` attribute, since the state of the singleton wasn't changed. To circumvent this, the logic in `__init__` and @@ -306,8 +304,8 @@ Drawbacks: - We need to store about 620 integers. This is however negligible w.r.t. e.g. the size of a typical dataset -- It does not fix the issue about third-party packages only accepting integers. 
- This can however be fixed in meta-estimator, independently. +- It does not fix the issue about third-party estimators only accepting + integers. This can however be fixed in each meta-estimator, independently. Alternative solutions ===================== @@ -334,10 +332,12 @@ Advantages: Drawbacks: -- We want that - `est.set_params(random_state=est.get_params()['random_state'])` does not - change `est.random_state`. With this logic, this is not possible to enforce. - Also, clone will not work was expected. +- Since we draw a seed in init (and in `set_params()`), `clone` will not + work as expected. In particular with `my_clone = clone(est)`, my_clone and + est cannot have the same `random_state` attribute. This is the same for + `my_clone.set_params(random_state=est.get_params()['random_state'])`. The + seed will have to be drawn in `set_params`, thus leading to a different + `random_state` attribute. - It is not backward compatible between versions. For example if you passed an int in version A (say 0.22), then in version B (with the new @@ -359,27 +359,38 @@ the first time `fit()` is called. For example:: The advantage is that we respect our convention with `__init__`. However, `fit` idempotency isn't respected anymore: the first call to `fit` -clearly influences all the other ones. This also introduces a private -attribute, so we would need more intrusive changes to `set_params`, -`get_params`, and `clone`. +clearly influences all the other ones. + +This also introduces a private attribute, so we would need more intrusive +changes to `set_params`, `get_params`, and `clone`. + +Execution and considerations +============================ +Making the estimator stateless can be considered a bug fix. However, we are +clearly changing the behaviour of the splitters, and some users may rely on +the current behaviour, `to implement e.g. bootstrapping +<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_. 
+If we make the splitters stateless, the "old" behaviour can be easily +reproduced by simply creating new CV instances, instead of calling `split` +on the same instance. Instances are cheap to create. -Execution -========= +We would need a lot of outreach before introducing that change to let users +know about it. And depending on how comfortable we are with it, this might +be a 1.0 thing. -That change might be a 1.0 thing. -References and Footnotes ------------------------- +.. References and Footnotes +.. ------------------------ -.. [1] Each SLEP must either be explicitly labeled as placed in the public - domain (see this SLEP as an example) or licensed under the `Open - Publication License`_. +.. .. [1] Each SLEP must either be explicitly labeled as placed in the public +.. domain (see this SLEP as an example) or licensed under the `Open +.. Publication License`_. -.. _Open Publication License: https://www.opencontent.org/openpub/ +.. .. _Open Publication License: https://www.opencontent.org/openpub/ -Copyright ---------- +.. Copyright +.. --------- -This document has been placed in the public domain. [1]_ +.. This document has been placed in the public domain. [1]_ From 47011fa1795a164ba25c844b6637358258c72394 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Wed, 27 Nov 2019 08:29:49 -0500 Subject: [PATCH 04/23] title --- slep011/proposal.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 6a311ec..af81a2c 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -1,8 +1,8 @@ .. 
_slep_011:
 
-====================================
-SLEP011: Fixing randomness behaviour
-====================================
+===================================
+SLEP011: Fixing randomness handling
+===================================
 
 :Author: Nicolas Hug
 :Status: Under review

From e3f02ac408d7f577e7d8a828e35bf686cfbc9c90 Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Wed, 27 Nov 2019 08:36:06 -0500
Subject: [PATCH 05/23] some more

---
 slep011/proposal.rst | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index af81a2c..93c8547 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -264,6 +264,10 @@ Advantages:
   the same instance gives the same splits. In other words, it does what we
   want.
 
+- The behaviour is clear and intuitive: the object is fully defined at init,
+  and only at init. Things that happen between init and fit *do not* influence
+  the state of the object.
+
 - It is relatively simple to implement, and not too intrusive.
 
 - Backward compatibility is preserved between scikit-learn versions. Let A

From 18a0f251bce62eb3d5dffddc3d810a67ee2e14d8 Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Wed, 11 Dec 2019 08:33:31 -0500
Subject: [PATCH 06/23] Added abstract

---
 slep011/proposal.rst | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 93c8547..9baa7cd 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -9,8 +9,23 @@ SLEP011: Fixing randomness handling
 :Type: Standards Track
 :Created: 2019-11-27
 
-How we currently handle randomness
-==================================
+Abstract
+========
+
+This SLEP aims at fixing the issues related to how scikit-learn handles
+randomness of estimators and CV splitters.
+
+The proposed solution is to make estimators and splitters stateless, by
+storing the state of the `random_state` parameter that is passed in
+`__init__`.
+
+More than anything, this SLEP's goal is to *inform* the discussions related
+to randomness handling: if we end up going with the status quo (i.e. keep
+estimators and splitters stateful), then at least we are all aware of the
+price we're paying.
+
+Background: How we currently handle randomness
+==============================================
 
 `random_state` parameters are used commonly in estimators, and in CV
 splitters. They can be either int, RandomState instances, or None. The

From 18ed43c438294eea4989bc9178d4d08f09798f44 Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Fri, 4 Sep 2020 17:49:23 -0400
Subject: [PATCH 07/23] Added note about SH forbidding stateful splitters

---
 slep011/proposal.rst | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 9baa7cd..73210e3 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -216,7 +216,10 @@ Depending on the intended behaviour of the parameter search, this may or may
 not be a good thing. This is typically a bug if we implement successive
 halving + warm start (details omitted here, you may refer to `this issue
 <https://github.com/scikit-learn/scikit-learn/issues/15125>`_ for some more
-details).
+details). Currently, the `Successive Halving implementation
+<https://github.com/scikit-learn/scikit-learn/pull/13900>`_ **forbids users
+from using stateful splitters**, e.g. `KFold(5, shuffle=True,
+random_state=None)` is forbidden.
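A minimal sketch of what the abstract above proposes, using a hypothetical class (not scikit-learn code): capturing the RNG *state* in `__init__` makes `fit` idempotent even when a `RandomState` instance is passed.

```python
import numpy as np

class StatelessEstimator:
    """Hypothetical estimator: stores the RNG *state*, not the instance."""

    def __init__(self, random_state=None):
        if not isinstance(random_state, np.random.RandomState):
            random_state = np.random.RandomState(random_state)
        # A plain tuple: it never changes, and never gets "consumed".
        self.random_state = random_state.get_state()

    def fit(self, X):
        rng = np.random.RandomState()
        rng.set_state(self.random_state)  # same stream on every call
        self.coef_ = rng.uniform(size=np.asarray(X).shape[1])
        return self

X = np.zeros((5, 3))
est = StatelessEstimator(random_state=np.random.RandomState(0))
a = est.fit(X).coef_.copy()
b = est.fit(X).coef_.copy()
# a and b are identical: two fits on the same data yield the same model
```

With the legacy design (storing the instance itself), the second `fit` would have consumed the generator and produced a different `coef_`.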
Proposed Solution ================= From c3f54959013ff083ccc233f6d05b5d38016c7b4c Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Fri, 4 Sep 2020 18:15:53 -0400 Subject: [PATCH 08/23] Update slep011/proposal.rst Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com> --- slep011/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 73210e3..bacd7bf 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -116,7 +116,7 @@ the subset to be the same for all calls to `fit` in this case. The fix was to store a private seed that was generated the first time `fit` is called. This is a typical example of many other similar bugs that we need to -monkey-patch with potentially overly complex logic. +patch with potentially overly complex logic. Cloning may return a different estimator ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From 907672c65df2e1db27fa72a62a6e3bab148c5ed1 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Mon, 7 Sep 2020 10:01:49 -0400 Subject: [PATCH 09/23] some important details after chat with Alex --- slep011/proposal.rst | 28 ++++++++++++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 73210e3..5bee03a 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -167,7 +167,19 @@ CV-splitters are stateful:: `a` and `b` are different splits, because of how `split` is implemented (see introduction above). -This behaviour is inconsistent for two reasons. +This means that the following code is very likely incorrect:: + + rng = np.random.RandomState(0) + cv = KFold(shuffle=True, random_state=rng) + estimators = [...] 
# the estimators you want to compare
+    for est in estimators:
+        cross_val_score(est, X, y, cv=cv)
+
+Users might not realize it, but **the estimators will be evaluated on
+different splits**, even though they think they've set the random state by
+passing a carefully crafted instance. They should have passed an int.
+
+The behaviour of the splitters is inconsistent for two reasons.
 
 The first one is that if `rng` were an int, then `a` and `b` would have been
 equal. As a result, the behaviour of the CV splitter depends on the
@@ -216,7 +228,8 @@ Depending on the intended behaviour of the parameter search, this may or may
 not be a good thing. This is typically a bug if we implement successive
 halving + warm start (details omitted here, you may refer to `this issue
 <https://github.com/scikit-learn/scikit-learn/issues/15125>`_ for some more
-details). Currently, the `Successive Halving implementation
+details). Currently, in order to prevent any potential bug and misconception
+from users, the `Successive Halving implementation
 <https://github.com/scikit-learn/scikit-learn/pull/13900>`_ **forbids users
 from using stateful splitters**, e.g. `KFold(5, shuffle=True,
 random_state=None)` is forbidden.
@@ -224,6 +237,10 @@ random_state=None)` is forbidden.
 Proposed Solution
 =================
 
+.. note::
+    This proposed solution is a work in progress and there is room for
+    improvement.
+
 We need a solution that fixes the statefulness of the estimators and the
 splitters. Most of the remaining issues would be fixed as a consequence.
@@ -332,6 +349,13 @@ Drawbacks:
 Alternative solutions
 =====================
 
+Change the default of all random_state from `None` to a hardcoded int
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This doesn't solve much: it might limit pitfalls in users' code, but does not
+address the core issue, which is the statefulness of splitters and
+estimators.
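The splitter statefulness discussed above can be observed directly. This snippet assumes scikit-learn's current (legacy) behaviour, where `split` consumes a shared `RandomState` instance:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# A RandomState instance makes the splitter stateful: each call to split()
# consumes the shared generator and yields different folds.
rng = np.random.RandomState(0)
cv = KFold(n_splits=5, shuffle=True, random_state=rng)
a = [tuple(test) for _, test in cv.split(X)]
b = [tuple(test) for _, test in cv.split(X)]  # not the same folds as `a`

# An int seed makes split() deterministic: identical folds on every call.
cv_int = KFold(n_splits=5, shuffle=True, random_state=0)
c = [tuple(test) for _, test in cv_int.split(X)]
d = [tuple(test) for _, test in cv_int.split(X)]  # same folds as `c`
```

This is exactly why the `cross_val_score` loop above compares estimators on different splits: each `cross_val_score` call triggers its own `cv.split()`.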
+
 Store a seed instead of a state
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From 95b91235330cd7d6de66b3c8caba2c8239731017 Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Thu, 10 Sep 2020 16:42:38 -0400
Subject: [PATCH 10/23] Wrote new requirements section

---
 slep011/proposal.rst | 112 +++++++++++++++++++++++++++++++++++--------
 1 file changed, 91 insertions(+), 21 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index e803273..885fcd3 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -12,25 +12,34 @@ SLEP011: Fixing randomness handling
 Abstract
 ========
 
-This SLEP aims at fixing the issues related to how scikit-learn handles
-randomness of estimators and CV splitters.
+This SLEP aims at fixing various issues related to how scikit-learn handles
+randomness in estimators and Cross-Validation (CV) splitters. These issues
+include hard-to-find bugs, surprising results, and potential misuses and
+misunderstandings from users. They are described in more detail in a
+dedicated section below.
 
+This SLEP is *not* about:
+
+- what the default value of `random_state` should be
+- whether we should allow RandomState instances and None to be passed.
+
+TODO: update this
 The proposed solution is to make estimators and splitters stateless, by
 storing the state of the `random_state` parameter that is passed in
 `__init__`.
 
-More than anything, this SLEP's goal is to *inform* the discussions related
-to randomness handling: if we end up going with the status quo (i.e. keep
-estimators and splitters stateful), then at least we are all aware of the
-price we're paying.
+.. More than anything, this SLEP's goal is to *inform* the discussions related
+.. to randomness handling: if we end up going with the status quo (i.e. keep
+.. estimators and splitters stateful), then at least we are all aware of the
+.. price we're paying.
Background: How we currently handle randomness
 ==============================================
 
 `random_state` parameters are used commonly in estimators, and in CV
 splitters. They can be either int, RandomState instances, or None. The
-parameter is stored as an attribute in init and never changes, as per our
-convention.
+parameter is stored as an attribute in `__init__` and never changes, as per
+our convention.
 
 `fit` and `split` methods typically look like this::
 
@@ -39,7 +48,6 @@ convention.
     ...
         rng.some_method() # directly use instance, e.g. for sampling
         some_function_or_class(random_state=rng) # pass it down to other tools
-        # e.g. if self is a meta-estimator
 
 `check_random_state` behaves as follows::
 
@@ -48,34 +56,96 @@ convention.
     if x is a RandomState instance, return x
     if x is None, return numpy's RandomState singleton instance
 
-Accepting RandomState instances or None comes with different complications,
-listed below. `None` is the current default for all estimators / splitters.
+Accepting RandomState instances or None comes with different complications
+and potentially surprising behaviors that are listed below. `None` is the
+current default for all estimators / splitters.
+
+.. note::
+    Since passing `random_state=None` is equivalent to passing the global
+    `RandomState` instance from `numpy`
+    (`random_state=np.random.mtrand._rand`), we will not explicitly mention
+    None here, but everything that applies to instances also applies to
+    using None.
+
+Key use-cases and requirements
+==============================
+
+.. note::
+    In what follows, *legacy* design means what is supported in the current
+    scikit-learn master branch, i.e. `0.24.dev`.
+
+There are a few use-cases that are made possible by the legacy design and
+that we will want to keep supporting in the new design:
+
+- A. Allow for consistent results across executions.
This is a natural
+  requirement for any implementation: we want users to be able to get
+  consistently reproducible results across multiple program executions. This
+  is currently supported by passing down integers or RandomState instances to
+  every `random_state` parameter.
+- B. Allow for maximum variability and different results across executions.
+  This is currently supported by passing `random_state=None`: multiple
+  program executions yield different results each time.
+- C. CV procedures where the estimator's RNG is exactly the same on each
+  fold. This is currently supported by passing an int to the `random_state`
+  parameter of the estimator.
+- D. CV procedures where the estimator's RNG is different for each fold. This
+  is useful to, e.g., increase the statistical significance of CV results.
+  This is currently supported by passing a RandomState instance or None to the
+  `random_state` parameter of the estimator.
+- E. Obtain *strict* clones, i.e. clones that will yield exactly the same
+  results when called on the same data. This is currently supported by
+  calling `clone` if the estimator's `random_state` is an int.
+- F. Obtain *statistical* clones, i.e. estimators
+  that are identical but that will yield different results, even when called
+  on the same data. This is currently supported by calling `clone` if
+  the estimator's `random_state` is an instance or None.
+- G. Easily obtain different splits with the same characteristics from a
+  splitter. By "same characteristics", we mean same number of splits, same
+  choice of stratification, etc. This is useful to implement e.g.
+  bootstrapping. This is currently supported by calling `split` multiple times
+  on the same `cv` instance, if `cv.random_state` is an instance, or None.
+
+.. note::
+    C and D are currently mutually exclusive: a given estimator can only do C
+    and not D, or only D and not C.
+    Also, C and E are very related: C is supported via creating strict clones
+    (E).
Similarly, D is supported by creating statistical clones (F).
+
+    In the proposed design, all estimators will support C and D. Also, strict
+    and statistical clones can be obtained from any estimator.
 
 Problems with passing RandomState instances or None
 ===================================================
 
+.. note::
+    Some of the issues described here will also be described in the docs
+    if/when `#18363
+    <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ gets merged.
+    Note however that this is a recent PR.
+    In this PR, we recommend that users pass integers everywhere to avoid
+    potential mistakes.
+
 The main issue with RandomState instances and None is that it makes the
-estimators and the splitters stateful accross calls to `fit` or `split`. As
-a result, different calls to `fit` or `split` on the same instance yield
-different results. Almost all the other issues are consequences of this
-fact.
+estimators and the splitters stateful across calls to `fit` and `split`. As
+a result, different calls to `fit` and `split` on the same instance yield
+different results. Almost all the other issues are consequences of this fact.
 
 Statefulness and violation of fit idempotence
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Estimators should be stateless. That is, any call to `fit` should forget
-whatever happened to previous calls to `fit`, provided that no warm-start is
-hapenning. Whether CV splitters should be stateful or not is debatable, and
-that point is discussed below.
+whatever happened to previous calls to `fit`, provided that warm-start is
+off. Whether CV splitters should be stateful or not is debatable, and that
+point is discussed below.
 
 Another related convention is that the `fit` method should be idempotent:
 calling `est.fit(X, y)` and then `est.fit(X, y)` again should yield the same
 model both times.
 
-These properties are key for enforcing reproducibility. We have a common
-checks for them.
+These properties are key for enforcing reproducibility.
There is a dedicated
+check in the `check_estimator` suite.
 
-If a `RandomState` instance or None are used, the idemptency property may be
+If a `RandomState` instance is used, the idempotency property may be
 violated::
 
     rng = RandomState(0)

From 7a4a2fe57e462f1f1078b764a27e765e47eb5546 Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Fri, 18 Sep 2020 16:25:55 -0400
Subject: [PATCH 11/23] Basically re-wrote the entire thing

---
 slep011/proposal.rst | 798 ++++++++++++++++++++++---------------------
 1 file changed, 415 insertions(+), 383 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index 885fcd3..f97b741 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -1,8 +1,8 @@
 .. _slep_011:
 
-===================================
-SLEP011: Fixing randomness handling
-===================================
+======================================
+SLEP011: Fixing randomness ambiguities
+======================================
 
 :Author: Nicolas Hug
 :Status: Under review
@@ -13,52 +13,48 @@ Abstract
 ========
 
 This SLEP aims at fixing various issues related to how scikit-learn handles
-randomness in estimators and Cross-Validation (CV) splitters. These issues
-include hard-to-find bugs, surprising results, and potential misuses and
-misunderstandings from users. They are described in more detail in a
-dedicated section below.
-
-This SLEP is *not* about:
+randomness in estimators and Cross-Validation (CV) splitters. The
+`random_state` design makes estimators and splitters stateful across calls to
+`fit` and `split`, which induces a complex mental model for developers and
+users. Various issues are:
+
+ - hard-to-find bugs
+ - Implicit behavior of CV procedures and meta-estimators: the estimator's
+   `random_state` parameter implicitly (but significantly) influences the
+   behaviour of seemingly unrelated procedures like `cross_val_score`, or
+   meta-estimators like `RFE` or `SequentialFeatureSelection`.
To understand
+   these subtleties, users need to know inner details of these CV procedures
+   and meta-estimators: namely, they need to know *when* `fit` is called.
+ - Similarly, users are required to know how many times `split` is called in
+   a CV procedure (e.g. `GridSearchCV.fit`), or there is a risk that they may
+   make unsound performance comparisons.
+
+These issues are described in more detail in a dedicated section below. This
+SLEP is *not* about:
 
 - what the default value of `random_state` should be
 - whether we should allow RandomState instances and None to be passed.
 
-TODO: update this
-The proposed solution is to make estimators and splitters stateless, by
-storing the state of the `random_state` parameter that is passed in
-`__init__`.
-
-.. More than anything, this SLEP's goal is to *inform* the discussions related
-.. to randomness handling: if we end up going with the status quo (i.e. keep
-.. estimators and splitters stateful), then at least we are all aware of the
-.. price we're paying.
+**The proposed solution is to**:
+
+- Make estimators and splitters stateless, by drawing a random seed in
+  `__init__`.
This seed is used as the sole source of RNG in `fit` and in
+  `split`. Neither `fit` nor `split` needs to be changed. This will
+  allow users and developers to reason about `random_state` with a simpler
+  mental model.
+- Let users explicitly request *strict* or *statistical* clones, with the
+  introduction of a new `use_exact_clones` parameter to each CV procedure and
+  meta-estimator where relevant (name open for discussion).
+
+  This will explicitly expose the choice of the CV and meta-estimator
+  strategy to users, instead of implicitly relying on the type of the
+  estimator's `random_state`. Moreover, unlike in the legacy design, both
+  strategies (using strict or statistical clones) will be supported by any
+  estimator, no matter the type of the `random_state` attribute.
 
-Accepting RandomState instances or None comes with different complications
-and potentially surprising behaviors that are listed below. `None` is the
-current default for all estimators / splitters.
+.. note::
+    The *legacy* design refers to the design as used in the current
+    scikit-learn master branch, i.e. `0.24.dev`.
 
 .. note::
     Since passing `random_state=None` is equivalent to passing the global
     `RandomState` instance from `numpy`
     (`random_state=np.random.mtrand._rand`), we will not explicitly mention
     None here, but everything that applies to instances also applies to
     using None.
 
-Key use-cases and requirements
-==============================
-
-.. note::
-    In what follows, *legacy* design means what is supported in the current
-    scikit-learn master branch, i.e. `0.24.dev`.
-
-There are a few use-cases that are made possible by the legacy design and
-that we will want to keep supporting in the new design:
-
-- A. Allow for consistent results across executions.
This
-  is a natural
-  requirement for any implementation: we want users to be able to get
-  consistently reproducible results across multiple program executions. This
-  is currently supported by passing down integers or RandomState instances to
-  every `random_state` parameter.
-- B. Allow for maximum variability and different results across executions.
-  This is currently supported by passing `random_state=None`: multiple
-  program executions yield different results each time.
-- C. CV procedures where the estimator's RNG is exactly the same on each
-  fold. This is currently supported by passing an int to the `random_state`
-  parameter of the estimator.
-- D. CV procedures where the estimator's RNG is different for each fold. This
-  is useful to, e.g., increase the statistical significance of CV results.
-  This is currently supported by passing a RandomState instance or None to the
-  `random_state` parameter of the estimator.
-- E. Obtain *strict* clones, i.e. clones that will yield exactly the same
-  results when called on the same data. This is currently supported by
-  calling `clone` if the estimator's `random_state` is an int.
-- F. Obtain *statistical* clones, i.e. estimators
-  that are identical but that will yield different results, even when called
-  on the same data. This is currently supported by calling `clone` if
-  the estimator's `random_state` is an instance or None.
-- G. Easily obtain different splits with the same characteristics from a
-  splitter. By "same characteristics", we mean same number of splits, same
-  choice of stratification, etc. This is useful to implement e.g.
-  bootstrapping. This is currently supported by calling `split` multiple times
-  on the same `cv` instance, if `cv.random_state` is an instance, or None.
-
-.. note::
-    C and D are currently mutually exclusive: a given estimator can only do C
-    and not D, or only D and not C.
-    Also, C and E are very related: C is supported via creating strict clones
-    (E). Similarly, D is supported by creating statistical clones (F).
-
-    In the proposed design, all estimators will support C and D.
Also, strict
-    and statistical clones can be obtained from any estimator.
+Issues with the legacy design
+=============================
 
-Problems with passing RandomState instances or None
-===================================================
+Here we describe various issues caused by the legacy design of
+`random_state`. Before reading this section, please make sure to read `#18363
+<https://github.com/scikit-learn/scikit-learn/pull/18363>`_ which provides
+more context, and illustrates some use-cases that will be used here.
 
-.. note::
-    Some of the issues described here will also be described in the docs
-    if/when `#18363
-    <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ gets merged.
-    Note however that this is a recent PR.
-    In this PR, we recommend that users pass integers everywhere to avoid
-    potential mistakes.
+A complex mental model
+----------------------
 
-The main issue with RandomState instances and None is that it makes the
-estimators and the splitters stateful across calls to `fit` and `split`. As
-a result, different calls to `fit` and `split` on the same instance yield
-different results. Almost all the other issues are consequences of this fact.
+The rules that dictate how `random_state` impacts the behaviour of estimators
+and CV splitters are seemingly simple: use an int and get the same results
+across calls, use an instance and get different results. However, things get
+much more complicated when these estimators and splitters are used within
+other procedures like CV procedures or meta-estimators.
 
-Statefulness and violation of fit idempotence
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Users and core devs alike still seem to have a hard time figuring it out.
+Quoting Andreas Müller from `#14042
+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_:
 
-Estimators should be stateless. That is, any call to `fit` should forget
-whatever happened to previous calls to `fit`, provided that warm-start is
Whether CV splitters should be stateful or not is debatable, and that -point is discussed below. + The current design of random_state is often hard to understand and + confusing for users. I think we should rethink how random_state works. -Another related convention is that the `fit` method should be idempotent: -calling `est.fit(X, y)` and then `est.fit(X, y)` again should yield the same -model both times. +As another empirical evidence, the fact the CV splitters may yield different +splits across calls recently came as a surprise to a long-time core dev. -These properties are key for enforcing reproducibility. There is a dedicated -check in the `check_estimator` suite. +At the very least, it should be noted that prior to `#9517 +<https://github.com/scikit-learn/scikit-learn/pull/9517/>`_ (merged in 2018) +and `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ (not +yet merged as of September 2020), most subtleties related to `random_state` +were largely under-documented, and it is likely that a lot of users are +actually oblivious to most pitfalls. -If a `RandomState` instance is used, the idemptency property may be -violated:: +The numerous bugs that `random_state` has caused (:ref:`next section <bugs>`) +is another sign that the legacy design might be too complex, even for core +devs. The :ref:`other following sections <estimator_issues>` describe other +issues that illustrate how `random_state` might be confusing to users as +well. - rng = RandomState(0) - est = Est(..., random_state=rng) - est.fit(...) # consume rng - est.fit(...) +.. _bugs: -The model inferred the second time isn't the same as the previous one since -the rng has been consumed during the first called. The statefulness property -isn't respected either since the first call to `fit` has an impact on the -second. 
+Estimators: bugs caused by statefulness +--------------------------------------- -The same goes with passing `None`: you get different models every time you call -`fit`. +Many bugs have happened over the years because of RandomState instances and +None. Quoting again Andreas Müller from `#14042 +<https://github.com/scikit-learn/scikit-learn/issues/14042>`_: -A related issue is that the `rng` may be consumed outside of the estimator. -The estimator isn't "self contained" anymore and its behaviour is now -dependent on some stuff that happen outside. + There have been countless bugs because of this -Countless bugs -~~~~~~~~~~~~~~ - -Quoting @amueller from `14042 -<https://github.com/scikit-learn/scikit-learn/issues/14042>`_, many bugs -have happened over the years because of RandomState instances and None. +("*This*" = RandomState instances and the implied statefulness of the +estimator). These bugs are often hard to find, and some of them are actual data leaks, -see e.g. `14034 -<https://github.com/scikit-learn/scikit-learn/issues/14034>`_. Some of these -bugs have been around forever and we just haven't discovered them yet. - -The bug in `14034 -<https://github.com/scikit-learn/scikit-learn/issues/14034>`_ is that the -validation subset for early-stopping was computed using `self.random_state`: -that subset is different accross multiple calls to `fit`, since the RNG is -getting consumed. This is a problem when doing warm-start, because we want -the subset to be the same for all calls to `fit` in this case. The fix was -to store a private seed that was generated the first time `fit` is called. - -This is a typical example of many other similar bugs that we need to -patch with potentially overly complex logic. - -Cloning may return a different estimator -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -`my_clone = clone(est)` returns an *unfitted* estimator whose parameters are -(supposed to be) the same as those that were passed when `est` was -instanciated. 
Whether -*clone* is a good name for describing this process is another issue, but the -semantic of cloning is scikit-learn is as described above. We can think of -*cloning* as *reset with initial parameters*. - -That semantic is not respected if `est` was instanciated with an instance or -with None:: - - rng = RandomState(0) - est = Est(..., random_state=rng) - est.fit(X, y) # consume RNG here - my_clone = clone(est) - my_clone.fit(X, y) # not the same model! - -`my_clone` isn't the same as what `est` was, since the RNG has been consumed -in-between. Fitting `my_clone` on the same data will not give the same model -as `est`. While this is not desirable when an instance is passed, one might -argue that this is the desired behaviour when None is passed. - -In addition, `est` and `my_clone` are now interdependent because they share the -same `rng` instance. - -Incompatibility with third-party estimators -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -From the snippet above in the introduction, most meta-estimators will -directly pass down `self.random_state` to their sub-estimators. Some -third-party libraries only support integers as `random_state`, not instances -or None. See e.g. `15611 -<https://github.com/scikit-learn/scikit-learn/issues/15611>`_ - -CV-Splitters statefulness -~~~~~~~~~~~~~~~~~~~~~~~~~ - -CV-splitters are stateful:: +e.g. `#14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_. - rng = np.random.RandomState(0) - cv = KFolds(shuffle=True, random_state=rng) - a = cv.split(X, y) - b = cv.split(X, y) # different from a +These bugs arise because the estimators are stateful across calls to `fit`. +Fixing them usually involves forcing the estimator to be (at least partially) +stateless. For example, `this fix +<https://github.com/scikit-learn/scikit-learn/pull/14999>`_ is to draw a +random seed once and to re-use that seed for data splitting when +early-stopping + warm start is used. It is *not* an obvious bug, nor an +obvious fix. 
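To make this class of bug concrete, here is a hypothetical numpy-only sketch (not scikit-learn code): a consumed RNG changes the early-stopping validation subset between two warm-started fits.

```python
import numpy as np

def validation_subset(rng, n_samples=10, n_val=3):
    # Stand-in for the early-stopping validation split drawn inside fit()
    return set(rng.permutation(n_samples)[:n_val])

rng = np.random.RandomState(0)
first = validation_subset(rng)   # first call to fit() consumes the RNG
second = validation_subset(rng)  # a warm-started refit sees a consumed RNG

# Re-seeding reproduces `first`, showing that any divergence between the two
# "fits" comes purely from the consumed RNG state:
assert validation_subset(np.random.RandomState(0)) == first
```

The two "fits" may thus validate (and train) on different samples, which is exactly the kind of silent data leak described above.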
-`a` and `b` are different splits, because of how `split` is implemented (see
-introduction above).
+Making estimators stateless across calls to `fit` would prevent such bugs
+from happening, and would keep the code-base cleaner.

-This means that the following code is very likely incorrect::
+.. _estimator_issues:

-    rng = np.random.RandomState(0)
-    cv = KFold(shuffle=True, random_state=rng)
-    estimators = [...]  # the estimators you want to compare
-    for est in estimators:
-        cross_val_score(est, X, y, cv=cv)
-
-Users might not realize it, but **the estimators will be evaluated on
-different splits**, even though they think they've set the random state by
-passing a carefuly crafted instance. They should have passed an int.
+Estimators: CV procedures and meta-estimators behaviour is implicitly dictated by the base estimator's `random_state`
+---------------------------------------------------------------------------------------------------------------------

-The behaviour of the splitters is inconsistent for two reasons.
+As described in more detail in `#18363
+<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the following
+snippets run two very different cross-validation procedures::

-The first one is that if `rng` were an int, then `a` and `b` would have been
-equal. As a result, the behaviour of the CV splitter depends on the
-**type** of the `random_state` parameter::

+    cross_val_score(RandomForestClassifier(random_state=0), X, y)
+    cross_val_score(RandomForestClassifier(random_state=np.random.RandomState(0)), X, y)

--  int -> stateless, get the same splits each time you call split()
--  None or instance -> stateful, get different splits each time you call split()
+In the first one (use-case C, see :ref:`key_use_cases`), the estimator RNG is
+the same on each fold and the RF selects the same subset of features across
+folds. In the second one (use-case D), the estimator RNG varies across folds
+and the random subset of selected features will vary.
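The difference between the two calls can be imitated with a numpy-only sketch (a toy stand-in, not the actual `cross_val_score` implementation): cloning per fold re-creates the RNG from an int, but keeps consuming a shared RandomState instance.

```python
import numpy as np

def per_fold_draws(random_state, n_folds=3):
    # Toy stand-in for a CV loop: each fold "clones" the estimator, which
    # copies the random_state parameter as-is, then fit() draws from it.
    draws = []
    for _ in range(n_folds):
        rng = (random_state if isinstance(random_state, np.random.RandomState)
               else np.random.RandomState(random_state))
        draws.append(rng.randint(0, 100))
    return draws

same_each_fold = per_fold_draws(0)                   # int: re-seeded per fold
consumed = per_fold_draws(np.random.RandomState(0))  # instance: consumed across folds
assert len(set(same_each_fold)) == 1  # identical estimator RNG on every fold
assert len(set(consumed)) > 1         # estimator RNG varies across folds
```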
-Concretely, we have a method (`split`) whose behaviour depends on the *type* -of a parameter that was passed to `init`. We can argue that this is a common -pattern in object-oriented design, but in the case of the `random_state` -parameter, this is potentially confusing. +The same is true for any procedure that performs cross-validation (manual CV, +`GridSearchCV`, etc.): **the behaviour of the CV procedure is implicitly +dictated by the estimator's** `random_state`. -The second inconsistency is that splitters are stateful by design, while we -want our estimators to be stateless. Granted, splitters aren't estimators. -But, quoting `@GaelVaroquaux -<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_, -consistency is one thing that we are really good at. -So it is important to have the splitters consistent with the estimators, -w.r.t. the statelessness property. The current behaviour is not necessarily -clear for users. +There are similar issues in meta-estimators, like `RFE` or +`SequentialFeatureSelection`: these are iterative feature selection +algorithms that will use either *exact* or *statistical* clones depending on +the estimator's `random_state`, which leads to significantly different +strategies. Here again, **how they behave is only implicitly dictated by +the estimator's** `random_state`. -Note that fixing how random_state is handled in the splitters is one of the -entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. +In addition, since the estimator's `random_state` type dictates the strategy, +users are bound to one single strategy once the estimator has been created: +it is for example impossible for an estimator to use a different RNG across +folds if that estimator was initialized with an integer. -Potential bugs in custom parameter searches -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +It is unlikely that users have a perfect understanding of these subtleties. 
+For users to actually understand how `random_state` impacts the CV procedures +and meta-estimators, they actually need to know inner details of these: they +need to know where and when `fit` is called, and also when `clone` is called. -This issue is a direct consequence of the splitters being stateful. It's also -more advanced than the rest, you may want to skip it. +There is a very similar problem with CV splitters as described in the next +section. -We have a private API for subclassing BaseSearchCV and implementing custom -parameter search strategies. The contract is that the custom class should -override the `_run_search(evaluate_candidate, ...)` method which itself must -call the `evaluate_candidates()` closure, were `cv.split()` will be called. +CV Splitters: users need to know inner details of CV procedures to avoid erroneous performance comparison +--------------------------------------------------------------------------------------------------------- -Third-party developers may only *call* `evaluate_candidates()`, not change -its content. Now, since `cv.split()` is called in `evaluate_candadates()`, -that means that `evalute_candidates()` will evaluate the candidate -parameters **on different splits** each time it is called. +CV splitters yield different splits every time `split` is called if their +`random_state` is a `RandomState` instance. This means that the following +code doesn't allow fold-to-fold comparison of scores:: -This is a quite subtle issue that third-party developers might easily -overlook. + rng = np.random.RandomState(0) + cv = KFold(shuffle=True, random_state=rng) + estimators = [...] # the estimators you want to compare + scores = { + est: cross_val_score(est, X, y, cv=cv) + for est in estimators + } -Depending on the intended behaviour of the parameter search, this may or may -not be a good thing. 
This is typically a bug if we implement successive
-halving + warm start (details ommitted here, you may refer to `this issue
-<https://github.com/scikit-learn/scikit-learn/issues/15125>`_ for some more
-details). Currently, in order to prevent any potential bug and misconception
-from users, the `Successive Halving implementation
-<https://github.com/scikit-learn/scikit-learn/pull/13900>`_ **forbids users
-from using stateful splitters**, e.g. `KFolds(5, shuffle=True,
-random_state=None)` is forbidden.
+Users might not realize it, but **the estimators will be evaluated on
+different splits**, even though they think they've set the random state by
+passing a carefully crafted instance. This is because `cv.split` was called
+multiple times, yet these calls were hidden inside `cross_val_score`. On
+top of making fold-to-fold comparison impossible, comparing the average
+scores is also not ideal if the number of folds or samples is small.

-Proposed Solution
-=================
+As a consequence, before users can safely report score comparisons, **they
+need to know how many times** `split` **is called**, which should just be an
+implementation detail. While the above example is already error-prone, things
+get harder in more complex tools like `GridSearchCV`: how are users supposed
+to know that `split` is called only once in `GridSearchCV.fit`?

 .. note::
-    This proposed solution is a work in progress and there is room for
-    improvement.
+    This implementation detail about `GridSearchCV.fit` is in fact
+    documented, at the very end of the `User Guide
+    <https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling>`_.

+.. note::
+    In `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_,
+    we recommend that users use integers for CV splitters' `random_state`,
+    effectively making them stateless.
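The statefulness of shuffled splitters can be illustrated with a toy splitter (a sketch mimicking the legacy `KFold(shuffle=True)` pattern, not the real scikit-learn class):

```python
import numpy as np

class ToySplitter:
    # Mimics the legacy pattern: split() resolves self.random_state on
    # each call, consuming it when it is a RandomState instance.
    def __init__(self, random_state=None):
        self.random_state = random_state

    def split(self, n_samples):
        rng = (self.random_state
               if isinstance(self.random_state, np.random.RandomState)
               else np.random.RandomState(self.random_state))
        return rng.permutation(n_samples)  # stand-in for the shuffled folds

cv = ToySplitter(random_state=np.random.RandomState(0))
a, b = cv.split(10), cv.split(10)
assert not np.array_equal(a, b)  # instance: different "splits" each call

cv = ToySplitter(random_state=0)
c, d = cv.split(10), cv.split(10)
assert np.array_equal(c, d)      # int: identical "splits" each call
```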
-A toy example of the proposed solution is implemented in this `notebook
-<https://gist.github.com/NicolasHug/1169ee253a4669ff993c947507ae2cb5>`_.
-The bulk of the solution is to manipulate actual random *states*, as
-returned by `get_state()
-<https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.get_state.html#numpy.random.get_state>`_.
+.. note::
+    Fixing how `random_state` is handled in the splitters is one of the
+    entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.

-Specifically, we would store the *state* of a RandomState instance in
-`__init__`::
+CV splitters: potential bugs in custom parameter searches
+---------------------------------------------------------

-    def __init__(self, ..., random_state=None):
-        self.random_state = check_random_state(random_state).get_state()
+This issue is a direct consequence of the splitters being stateful, and it is
+related to the above issue.

-That `random_state` attribute is a tuple with about 620 integers.
-That state is then used in `fit` or in `split` as follows::
+When a third-party library wants to implement its own parameter
+search strategy, it needs to subclass `BaseSearchCV` and call a built-in
+function `evaluate_candidates(candidates)` once or multiple times.
+`evaluate_candidates()` internally calls `split` once. If
+`evaluate_candidates()` is called more than once, this means that **the
+candidate parameters are evaluated on different splits each time**.

-    def fit(self, X, y):  # or split()
-        rng = np.random.RandomState()
-        rng.set_state(self.random_state)
-        # ... use rng as before
+This is a quite subtle issue that third-party developers might easily
+overlook. Some core devs (Joel Nothman and Nicolas Hug) kept forgetting and
+re-discovering this issue over and over in the `Successive Halving PR
+<https://github.com/scikit-learn/scikit-learn/pull/13900>`_.
-Since `self.random_state` is an immutable tuple that never changes, calling
-`fit` or `split` on the same instance always gives the same results.
+At the very least, this makes fold-to-fold comparison impossible whenever the
+search strategy calls `evaluate_candidates()` more than once. This can
+however cause bigger bugs in other scenarios, e.g. if we implement successive
+halving + warm start (details omitted here, you may refer to `this issue
+<https://github.com/scikit-learn/scikit-learn/issues/15125>`_).

-We want `__init__` and `set_params/get_params` to be consistent. To that end,
-we will need to special-case these methods::
+Currently, in order to prevent any potential future bug and to prevent users
+from making erroneous comparisons, the `Successive Halving implementation
+<https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_
+**forbids users from using stateful splitters**: e.g. `KFold(5, shuffle=True,
+random_state=None)` is forbidden.

-    def get_params(self):
+Other issues
+------------

-        random_state = np.random.RandomState()
-        random_state.set_state(self.random_state)
-        return {'random_state': random_sate, ...}
+Fit idempotency isn't respected
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-    def set_params(self, ...):
+Quoting our `Developer Guidelines
+<https://scikit-learn.org/stable/developers/develop.html#fitting>`_:

-        self.random_state = check_random_state(random_state).get_state()  # same as in init
+    When fit is called, any previous call to fit should be ignored.

-`clone` does not need to be special-cased, because `get_params` does all the
-work. Note that the following::
+This means that ideally, calling `est.fit(X, y)` twice should yield the same
+model both times. We have a check for that in the `check_estimator()` suite:
+`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
+when RandomState instances are used.
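A hypothetical minimal estimator makes the violation easy to see (a numpy-only sketch, not a scikit-learn class):

```python
import numpy as np

class ToyEstimator:
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X=None):
        # Resolve random_state the way check_random_state does (simplified:
        # ints are turned into a fresh RandomState, instances are used as-is)
        rng = (self.random_state
               if isinstance(self.random_state, np.random.RandomState)
               else np.random.RandomState(self.random_state))
        self.coef_ = rng.randint(0, 100)  # stand-in for a randomly fitted model
        return self

# With an int seed, fit is idempotent:
est = ToyEstimator(random_state=0)
assert est.fit().coef_ == est.fit().coef_

# With a shared instance, the second fit sees a consumed RNG:
est = ToyEstimator(random_state=np.random.RandomState(0))
assert est.fit().coef_ != est.fit().coef_
```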
- est.set_params(random_state=est.get_params()['random_state']) +`clone` 's behaviour is implicit +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -behaves as expected and does not change the `random_state` attribute of the -estimator. However, one should not use:: +Depending on the estimator's `random_state` parameter, `clone` will return +an exact clone or a statistical clone. Statistical clones share a common +RandomState instance and thus are inter-dependent, as detailed in `#18363 +<https://github.com/scikit-learn/scikit-learn/pull/18363>`_. This makes +debugging harder, among other things. Until `#18363 +<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the distinction +between exact and statistical clones had never been documented and was up to +users to figure out. - est.set_params(random_state=est.random_state) +.. note:: + While less user-facing, this issue is actually part of the root cause for + the aforementioned issues related to estimators in CV procedures and in + meta-estimators. -since `est.random_state` is neither an int, None or an instance: it is a tuple. -We can error with a decent message in that case. +.. _key_use_cases: -Advantages: +Key use-cases and requirements +============================== -- It fixes the statefullness issue. `fit` is now idempotent. Calling `split` on - the same instance gives the same splits. In other words, it does what we - want. +There are a few use-cases that are made possible by the legacy design and +that we will want to keep supporting in the new design: -- The behaviour is clear and intuitive: the object is fully defined at init, - and only at init. Things that happen between init or fit *do not* influence - the state of the object. +- A. Allow for consistent results across executions. This is a natural + requirement for any implementation: we want users to be able to get + consistently reproducible results across multiple program executions. 
This
+  is currently supported by removing any use of `random_state=None`.
+- B. Allow for maximum variability and different results across executions.
+  This is currently supported by passing `random_state=None`: multiple
+  program executions yield different results each time.
+- C. CV procedures where the estimator's RNG is exactly the same on each
+  fold. This is useful when one wants to make sure that a given seed will
+  work well on new data (whether users should actually do this is
+  debatable...). This is currently supported by passing an int to the
+  `random_state` parameter of the estimator.
+- D. CV procedures where the estimator's RNG is different for each fold. This
+  is useful to increase the statistical significance of CV results. This is
+  currently supported by passing a RandomState instance or None to the
+  `random_state` parameter of the estimator.
+- E. Obtain *strict* clones, i.e. clones that will yield exactly the same
+  results when called on the same data. This is currently supported by
+  calling `clone` if the estimator's `random_state` is an int.
+- F. Obtain *statistical* clones, i.e. estimators
+  that are identical but that will yield different results, even when called
+  on the same data. This is currently supported by calling `clone` if
+  the estimator's `random_state` is an instance or None.
+- G. Easily obtain different splits with the same characteristics from a
+  splitter. By "same characteristics", we mean same number of splits, same
+  choice of stratification, etc. This is useful to implement e.g.
+  bootstrapping. This is currently supported by calling `split` multiple times
+  on the same `cv` instance, if `cv.random_state` is an instance, or None.

 .. note::
+    C and E are very related: C is supported via creating strict clones (E).
+    Similarly, D is supported by creating statistical clones (F).

-- It is relatively simple to implement, and not too intrusive.

-- Backward compatibility is preserved between scikit-learn versions.
Let A
-  be a version with the current behaviour (say 0.22) and let B be a version
-  where the new behaviour is implemented. The models and the splits obtained
-  will be the same in A and in B. That property may not be respected with
-  other solutions, see below.
+    In legacy, C and D are mutually exclusive: a given estimator can only do
+    C and not D, or only D and not C. Also, a given estimator can only
+    produce strict clones or only statistical clones, but not both. In the
+    proposed design, all estimators will support both C and D. Similarly,
+    strict and statistical clones can be obtained from any estimator.

-- Both RandomState instances and None are still supported. We don't need to
-  deprecate the use of any of them.
+Proposed Solution
+=================

-- As a bonus, the `self.random_state` attribute is an *actual* random state:
-  it is the state of some RNG. What we currently call `random_state` is not
-  a state but a RNG (though this is numpy's fault.)
+.. note::
+    This proposed solution is a work in progress and there is room for
+    improvement. Feel free to suggest improvements.

-Drawbacks:
+We want to make estimators and splitters stateless, while also avoiding the
+ambiguity of CV procedures and meta-estimators. We also want to keep
+supporting all the aforementioned use-cases in some way.

-- We break our convention that `__init__` should only ever store attributes, as
-  they are passed in. Note however that the reason we have this convention
+.. note::
+    A toy implementation of the proposed solution with illustration snippets
+    is available in `this notebook
+    <https://nbviewer.jupyter.org/gist/NicolasHug/2db607b01482988fa549eb2c8770f79f>`_.
+
+The proposed solution is to sample a seed from `random_state` at `__init__`
+in estimators and splitters::
+
+    def _sample_seed(random_state):
+        # sample a random seed to be stored as the random_state attribute
+        # ints are passed through
+        if isinstance(random_state, int):
+            return random_state
+        else:
+            return check_random_state(random_state).randint(0, 2**32)
+
+    class MyEstimator(BaseEstimator):
+        def __init__(self, random_state=None):
+            self.random_state = _sample_seed(random_state)
+
+        def set_params(self, random_state=None):
+            self.random_state = _sample_seed(random_state)
+
+    class MySplitter(BaseSplitter):
+        def __init__(self, random_state=None):
+            self.random_state = _sample_seed(random_state)
+
+`fit`, `split`, and `get_params` are unchanged.
+
+In order to **explicitly** support use-cases C and D, CV procedures like
+`cross_val_score` should be updated with a new `use_exact_clones` parameter
+(name up for discussion)::
+
+    def _check_statistical_clone_possible(est):
+        if 'random_state' not in est.get_params():
+            raise ValueError("This estimator isn't random and can only have exact clones")
+
+    def cross_val_score(est, X, y, cv, use_exact_clones=True):
+        # use_exact_clones:
+        # - if True, the same estimator RNG is used on each fold (use-case C)
+        # - if False, a different estimator RNG is used on each fold (use-case D)
+
+        if use_exact_clones:
+            statistical = False
+        else:
+            # need a local RNG so that clones have different random_state attributes
+            _check_statistical_clone_possible(est)
+            statistical = np.random.RandomState(est.random_state)
+
+        return [
+            clone(est, statistical=statistical)
+            .fit(X[train], y[train])
+            .score(X[test], y[test])
+            for train, test in cv.split(X, y)
+        ]
+
+Meta-estimators should be updated in a similar fashion to make their
+different behavior explicit.
The `clone` function needs to be updated so that
+one can explicitly request exact or statistical clones::
+
+    def clone(est, statistical=False):
+        # Return a strict clone or a statistical clone.
+
+        # The statistical parameter can be:
+        # - False: a strict clone is returned
+        # - True: a statistical clone is returned. Its RNG is seeded from `est`
+        # - None, int, or RandomState instance: a statistical clone is returned.
+        #   Its RNG is seeded from `statistical`. This is useful to
+        #   create multiple statistical clones that don't have the same RNG
+
+        params = est.get_params()
+
+        if statistical is not False:
+            # A statistical clone is a clone with a different random_state attribute
+            _check_statistical_clone_possible(est)
+            rng = params['random_state'] if statistical is True else statistical
+            params['random_state'] = _sample_seed(check_random_state(rng))
+
+        return est.__class__(**params)
+
+Use-cases support
+-----------------
+
+Use-cases A and B are supported just like before.
+
+Use-cases C, D, E, F are explicitly supported *and* can be achieved by any
+estimator, no matter its `random_state`. The legacy design can only
+(implicitly) support either C and E, or D and F.
+
+Use-case G can be explicitly supported by creating multiple CV instances,
+each with a different `random_state`::
+
+    rng = np.random.RandomState(0)
+    cvs = [KFold(shuffle=True, random_state=rng)
+           for _ in range(n_bootstrap_iterations)]
+
+Alternatively, we can introduce the notion of statistical clones for splitters,
+and let `clone` support splitters as well. This is however more involved.
+
+Advantages
+----------
+
+- Users do not need to know the internals of CV procedures or meta-estimators
+  anymore. Any potential ambiguity is now made explicit and exposed to them
+  in the documentation. *This is the main point of this SLEP*.
+
+- The mental model is simpler: no matter what is passed as `random_state`
+  (int, RandomState instances, or None), results are constant across calls to
+  `fit` and `split`.
The RandomState instance is mutated **once** (and only
+  once) in `__init__`. Bugs will be less likely to sneak in.
+
+- Estimators and splitters are stateless, and `fit` is now properly
+  idempotent.
+
+- Users now have explicit control over the CV procedures and meta-estimators,
+  instead of implicitly relying on the estimator's `random_state`.
+
+- Since CV splitters always yield the same splits, the chances of performing
+  an erroneous score comparison are limited.
+
+- It is possible to preserve full backward-compatibility of behaviors in
+  cases where an int was passed.
+
+- `fit`, `split`, and `get_params` are unchanged.
+
+Drawbacks
+---------
+
+- We break our convention that `__init__` should only ever store attributes,
+  as they are passed in. Note however that the reason we have this convention
   is that we want the semantics of `__init__` and `set_params` to be the same,
   and we want to enable people to change public attributes without having
   surprising behaviour. **This is still respected here.** So this isn't
   really an issue.

-- There is a subtelty that occurs when passing `None`. `check_random_state`
-  will return the singleton `np.random.mtrand._rand`, and we will call
-  `get_state()` on the singleton. The thing is, its state won't change
-  unless the singleton is consumed. So if we do
-  `a = Est(random_state=None); b = Est(random_state=None)`, a and b actually
-  have exactly the same `random_state` attribute, since the state of the
-  singleton wasn't changed. To circumvent this, the logic in `__init__` and
-  `set_params` involves a private helper that makes sure the singleton's RNG is
-  consumed. Please refer to the notebook.
+- CV procedures and meta-estimators must be updated.

-- The `__repr__()` will need to special-case the `random_state` attribute to
-  avoid printing a long tuple.
+Backward compatibility
+----------------------

-- We need to store about 620 integers. This is however negligible w.r.t. e.g.
-  the size of a typical dataset
+When an integer was used as `random_state`, all backward compatibility in
+terms of results can be preserved (this will depend on the default value of
+the `use_exact_clones` parameter).

-- It does not fix the issue about third-party estimators only accepting
-  integers. This can however be fixed in each meta-estimator, independently.
+Similarly, when an instance was used, results may be backward compatible for
+CV procedures and meta-estimators depending on the default. However,
+successive calls to `fit` will not yield backward-compatible results.
+Similarly for splitters, backward compatibility cannot be preserved.
+
+A deprecation path could be to start introducing the `use_exact_clones`
+parameter without introducing the seed drawing in `__init__` yet. This would
+however require a temporary deprecation of RandomState instances as well.
+
+The easiest way might be to allow for a breaking change.

 Alternative solutions
 =====================

 Change the default of all random_state from `None` to a hardcoded int
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+---------------------------------------------------------------------

 This doesn't solve much: it might limit pitfalls in users' code, but does not
-address the core issue, which is the statefulness of splitters and
-estimators.
-
-Store a seed instead of a state
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Instead of storing a state from `get_state()`, we could store a randomly
-generated seed::
-
-    def __init__(self, ..., random_state=None):
-        self.random_state = check_random_state(random_state).randint(0, BIG_INT)
-
-Then instead of using `set_state` we could just use
-`rng = RandomState(seed=self.random_state)` in `fit` or `split`.
+address the statefulness issues, nor the ambiguity of CV procedures and
+meta-estimators.
-Advantages:
+Stop supporting RandomState instances
+-------------------------------------

-- It also fixes the third-party estimators issue, since we would be passing
-  self.random_state which is an int
-- It's cheaper than storing 620 ints
-- We don't need to artificially consume the singleton's RNG since it is
-  de-facto consumed anyway.
+We would not be able to support use-cases D or G except by passing `None`,
+but then it would be impossible to get reproducible results across executions
+(use-case A).

-Drawbacks:
+Store a seed in fit/split instead of in init
+---------------------------------------------

-- Since we draw a seed in init (and in `set_params()`), `clone` will not
-  work as expected. In particular with `my_clone = clone(est)`, my_clone and
-  est cannot have the same `random_state` attribute. This is the same for
-  `my_clone.set_params(random_state=est.get_params()['random_state'])`. The
-  seed will have to be drawn in `set_params`, thus leading to a different
-  `random_state` attribute.
-
-- It is not backward compatible between versions. For example if you passed
-  an int in version A (say 0.22), then in version B (with the new
-  behaviour), your estimator will not start with the same RNG when `fit` is
-  called the first time. Same for splitters.
-
-Store the state in fit/split instead of in init
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Instead of storing the output of `get_state()` in `__init__`, we could store it
-the first time `fit()` is called. For example::
+Instead of storing a seed in `__init__`, we could store it the first time
+`fit()` is called. For example::

     def fit(self):  # or split()
-        self._random_state = getattr(self, '_random_state', check_random_state(self.random_state).get_state())
-        rng = np.random.RandomState()
-        rng.set_state(self._random_state)
+        if not hasattr(self, '_random_state'):
+            self._random_state = check_random_state(self.random_state).randint(0, 2**32)
+        rng = np.random.RandomState(self._random_state)
         # ...
The advantage is that we respect our convention with `__init__`. However, `fit` idempotency isn't respected anymore: the first call to `fit` -clearly influences all the other ones. +clearly influences all the other ones. The mental model is also not as clear +as the one of the proposed solution: users don't really know when that seed +is going to be drawn unless they know the internals of the procedures they +are using. This also introduces a private attribute, so we would need more intrusive changes to `set_params`, `get_params`, and `clone`. -Execution and considerations -============================ - -Making the estimator stateless can be considered a bug fix. However, we are -clearly changing the behaviour of the splitters, and some users may rely on -the current behaviour, `to implement e.g. bootstrapping -<https://github.com/scikit-learn/scikit-learn/pull/15177#issuecomment-548021786>`_. -If we make the splitters stateless, the "old" behaviour can be easily -reproduced by simply creating new CV instances, instead of calling `split` -on the same instance. Instances are cheap to create. - -We would need a lot of outreach before introducing that change to let users -know about it. And depending on how comfortable we are with it, this might -be a 1.0 thing. - - .. References and Footnotes .. ------------------------ From 0f5ba58c1152ad34bbf920252f504bdbfb2dfdce Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Sat, 19 Sep 2020 11:18:07 -0400 Subject: [PATCH 12/23] moved some sections --- slep011/proposal.rst | 169 ++++++++++++++++++++++++------------------- 1 file changed, 96 insertions(+), 73 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index f97b741..34640a5 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -18,16 +18,17 @@ randomness in estimators and Cross-Validation (CV) splitters. The `fit` and `split`, which induces a complex mental model for developers and users. 
 Various issues are:

-  - hard-to-find bugs
   - Implicit behavior of CV procedures and meta-estimators: the estimator's
     `random_state` parameter implicitly (but significantly) influences the
     behaviour of seemingly unrelated procedures like `cross_val_score`, or
-    meta-estimators like `RFE` or `SequentialFeatureSelection`. To understand
+    meta-estimators like `RFE` and `SequentialFeatureSelection`. To understand
     these subtleties, users need to know inner details of these CV procedures
     and meta-estimators: namely, they need to know *when* `fit` is called.
   - Similarly, users are required to know how many times `split` is called in
     a CV procedure (e.g. `GridSearchCV.fit`), or there is a risk that they may
     make unsound performance comparisons.
+  - hard-to-find bugs, due to the insidious ways in which `random_state`
+    implicitly affects seemingly unrelated procedures.

 These issues are described in more detail in a dedicated section below. This
 SLEP is *not* about:
@@ -46,8 +47,8 @@
   introduction of a new `use_exact_clones` parameter to each CV procedure
   and meta-estimator where relevant (name open for discussion).

-  This will explicitly expose the choice of the CV and meta-estimator
-  strategy to users, instead of implicitly relying on the type of the
+  This will **explicitly** expose the choice of the CV and meta-estimator
+  strategy to users, instead of **implicitly** relying on the type of the
   estimator's `random_state`. Moreover, unlike in the legacy design, both
   strategies (using strict or statistical clones) will be supported by any
   estimator, no matter the type of the `random_state` attribute.
@@ -66,13 +67,8 @@
 Issues with the legacy design
 =============================

-We here describe various issues caused by the legacy design of
-`random_state`.
Before reading this section, please make sure to read `#18363 -<https://github.com/scikit-learn/scikit-learn/pull/18363>`_ which provides -more context, and illustrates some use-cases that will be used here. - -A complex mental model ---------------------- +Introduction: a complex mental model +------------------------------------ The rules that dictate how `random_state` impacts the behaviour of estimators and CV splitters are seemingly simple: use an int and get the same results across calls, use an instance and get different results. Things get however @@ -97,39 +93,17 @@ yet merged as of September 2020), most subtleties related to `random_state` were largely under-documented, and it is likely that a lot of users are actually oblivious to most pitfalls. -The numerous bugs that `random_state` has caused (:ref:`next section <bugs>`) -is another sign that the legacy design might be too complex, even for core -devs. The :ref:`other following sections <estimator_issues>` describe other -issues that illustrate how `random_state` might be confusing to users as -well. - -.. _bugs: - -Estimators: bugs caused by statefulness ---------------------------------------- - -Many bugs have happened over the years because of RandomState instances and -None. Quoting again Andreas Müller from `#14042 -<https://github.com/scikit-learn/scikit-learn/issues/14042>`_: - - There have been countless bugs because of this - -("*This*" = RandomState instances and the implied statefulness of the -estimator). +The next sections describe: -These bugs are often hard to find, and some of them are actual data leaks, -e.g. `#14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_. +- Issues that illustrate how `random_state` implicitly affects CV procedures + and meta-estimators in insidious ways. +- The numerous bugs that `random_state` has caused (:ref:`next section + <bugs>`), which are another sign that the legacy design might be too + complex, even for core devs. -These bugs arise because the estimators are stateful across calls to `fit`.
-Fixing them usually involves forcing the estimator to be (at least partially) -stateless. For example, `this fix -<https://github.com/scikit-learn/scikit-learn/pull/14999>`_ is to draw a -random seed once and to re-use that seed for data splitting when -early-stopping + warm start is used. It is *not* an obvious bug, nor an -obvious fix. - -Making estimators stateless across calls to `fit` would prevent such bugs to -happen, and would keep the code-base cleaner. +Before reading these sections, please make sure to read `#18363 +<https://github.com/scikit-learn/scikit-learn/pull/18363>`_ which provides +more context, and illustrates some use-cases that will be used here. .. _estimator_issues: @@ -138,15 +112,15 @@ Estimators: CV procedures and meta-estimators behaviour is implicitly dictated b As described in more details in `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the following -snippets run two very different cross-validation procedures:: +lines run two very different cross-validation procedures:: cross_val_score(RandomForestClassifier(random_state=0), X, y) cross_val_score(RandomForestClassifier(random_state=np.RandomState(0)), X, y) -In the first one (use-case C, see :ref:`key_use_cases`), the estimator RNG is -the same on each fold and the RF selects the same subset of features across -folds. In the second one (use-case D), the estimator RNG varies across folds -and the random subset of selected features will vary. +In the first one, the estimator RNG is the same on each fold and the RF +selects the same subset of features across folds. In the second one, the +estimator RNG varies across folds and the random subset of selected features +will vary. The same is true for any procedure that performs cross-validation (manual CV, `GridSearchCV`, etc.): **the behaviour of the CV procedure is implicitly @@ -155,9 +129,9 @@ dictated by the estimator's** `random_state`. 
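The "exact clone vs statistical clone" contrast that the next section describes can be simulated with a toy loop (a hedged sketch, not scikit-learn internals): cloning re-creates the estimator from its constructor params, so an int seed produces the same RNG on every fold (use-case C) while a shared RandomState instance keeps being consumed across folds (use-case D):

```python
import numpy as np

def fit_draw(random_state):
    # stand-in for cloning the estimator and calling `fit` on one fold:
    # an int seed yields a fresh identical RNG; a shared instance keeps
    # being consumed across folds
    rng = (np.random.RandomState(random_state)
           if isinstance(random_state, int) else random_state)
    return rng.randint(2**31)  # one draw stands in for the estimator's randomness

# int seed -> exact clones: the same estimator RNG on every fold
draws_int = [fit_draw(0) for _ in range(3)]
assert len(set(draws_int)) == 1

# shared instance -> statistical clones: the RNG differs across folds
shared = np.random.RandomState(0)
draws_inst = [fit_draw(shared) for _ in range(3)]
assert len(set(draws_inst)) == 3
```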
There are similar issues in meta-estimators, like `RFE` or `SequentialFeatureSelection`: these are iterative feature selection algorithms that will use either *exact* or *statistical* clones depending on -the estimator's `random_state`, which leads to significantly different -strategies. Here again, **how they behave is only implicitly dictated by -the estimator's** `random_state`. +the estimator's `random_state`, which leads to two significantly different +strategies. Here again, **how they behave is only implicitly dictated by the +estimator's** `random_state`. In addition, since the estimator's `random_state` type dictates the strategy, users are bound to one single strategy once the estimator has been created: @@ -172,6 +146,8 @@ need to know where and when `fit` is called, and also when `clone` is called. There is a very similar problem with CV splitters as described in the next section. +.. _cv_splitters_issues: + CV Splitters: users need to know inner details of CV procedures to avoid erroneous performance comparison --------------------------------------------------------------------------------------------------------- @@ -214,15 +190,59 @@ to know that `split` is called only once in `GridSearchCV.fit`? Fixing how `random_state` is handled in the splitters is one of the entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. -CV splitters: potential bugs in custom parameter searches ---------------------------------------------------------- +.. _bugs: + +Bugs +---- -This issue is a direct consequence of the splitters being stateful, and it is -related to the above issue. +Many bugs have happened over the years because of RandomState instances and +None. 
Quoting Andreas Müller from `#14042 +<https://github.com/scikit-learn/scikit-learn/issues/14042>`_: + + There have been countless bugs because of this -("*This*" = RandomState instances and the implied statefulness of the +estimator). + +Bugs caused by estimators statefulness +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +These bugs are often hard to find, and some of them are actual data leaks, +e.g. `#14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_. + +They arise because the estimators are stateful across calls to `fit`. Fixing +them usually involves forcing the estimator to be (at least partially) +stateless. A classical bug that happened multiple times is that the +validation set may differ across calls to `fit` in a warm-start + early +stopping context. For example, `this fix +<https://github.com/scikit-learn/scikit-learn/pull/14999>`_ is to draw a +random seed once and to re-use that seed for data splitting when +early-stopping + warm start is used. It is *not* an obvious bug, nor an +obvious fix. + +Making estimators stateless across calls to `fit` would prevent such bugs from +happening, and would keep the code-base cleaner. + +Bugs caused by splitters statefulness +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +`#18431 <https://github.com/scikit-learn/scikit-learn/pull/18431>`_ is a bug +introduced in `SequentialFeatureSelection` that perfectly illustrates the +previous section :ref:`cv_splitters_issues`. The bug was that splitter +statefulness would lead to comparing average scores of candidates that have +been evaluated on different splits. Here again, the fix is to enforce +statelessness of the splitter, e.g. +`KFold(5, shuffle=True, random_state=None)` is forbidden. + +..
note:: This bug was introduced by Nicolas Hug, who is this SLEP's author: it's + very easy to let these bugs sneak in, even when you're trying hard not + to. + +Other potential bugs can happen in the parameter search estimators. When a +third-party library wants to implement its own parameter search strategy, it +needs to subclass `BaseSearchCV` and call a built-in function +`evaluate_candidates(candidates)` once, or multiple times. `evaluate_candidates()` internally calls `split` once. If `evaluate_candidates()` is called more than once, this means that **the candidate parameters are evaluated on different splits each time**. @@ -232,17 +252,18 @@ overlook. Some core devs (Joel Nothman and Nicolas Hug) kept forgetting and re-discovering this issue over and over in the `Successive Halving PR <https://github.com/scikit-learn/scikit-learn/pull/13900>`_. -At the very least, this makes fold-to-fold comparison impossible whenever the -search strategy calls `evaluate_candidates()` more than once. This can -however cause bigger bugs in other scenarios, e.g. if we implement successive -halving + warm start (details ommitted here, you may refer to `this issue +At the very least, this makes fold-to-fold comparison between candidates +impossible whenever the search strategy calls `evaluate_candidates()` more +than once. This can however cause bigger bugs in other scenarios, e.g. if we +implement successive halving + warm start (details omitted here, you may +refer to `this issue <https://github.com/scikit-learn/scikit-learn/issues/15125>`_). -Currently, in order to prevent any potential future bug and to prevent users +In order to prevent any potential future bug and to prevent users from making erroneous comparisons, the `Successive Halving implementation <https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_ -**forbids users from using stateful splitters**: e.g. 
`KFolds(5, shuffle=True, -random_state=None)` is forbidden. +forbids users from using stateful splitters, just like +`SequentialFeatureSelection`. Other issues ------------ @@ -283,7 +304,8 @@ Key use-cases and requirements ============================== There are a few use-cases that are made possible by the legacy design and -that we will want to keep supporting in the new design: +that we will want to keep supporting in the new design. We will refer to +these use-cases in the rest of the document: - A. Allow for consistent results across executions. This is a natural requirement for any implementation: we want users to be able to get @@ -391,9 +413,9 @@ In order to **explicitly** support use-cases C and D, CV procedures like for train, test in cv.split(X, y) ] -Meta-estimators should be updated in a similar fashion to make their -different behavior explicit. The `clone` function needs to be updated so that -one can explicitly request exact or statistical clones:: +Meta-estimators should be updated in a similar fashion to make their two +alternative behaviors explicit. The `clone` function needs to be updated so +that one can explicitly request exact or statistical clones:: def clone(est, statistical=False): # Return a strict clone or a statistical clone. @@ -430,8 +452,9 @@ each with a different `random_state`:: rng = np.RandomState(0) cvs = [KFold(random_state=rng) for _ in range(n_bootstrap_iterations)] -Alternatively, we can introduce the notion of statistical clone for splitters, -and let `clone` support splitters as well. This is however more involved. +CV instances are extremely cheap to create and to store. Alternatively, we +can introduce the notion of statistical clone for splitters, and let `clone` +support splitters as well. This is however more involved. Advantages ---------- @@ -502,9 +525,9 @@ meta-estimators. 
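The bootstrapping pattern above (one cheap CV instance per iteration, each with its own seed) can be sketched with a hypothetical `SeededKFold` — not the scikit-learn class — that draws its seed once at `__init__` and is therefore stateless across calls to `split`:

```python
import numpy as np

class SeededKFold:
    """Hypothetical stateless splitter: the seed is drawn once at __init__."""
    def __init__(self, n_splits, random_state):
        self.n_splits = n_splits
        if isinstance(random_state, int):
            self.seed_ = random_state
        else:  # RandomState instance: consumed once, here and only here
            self.seed_ = random_state.randint(2**31)

    def split(self, X):
        # `split` re-creates its RNG from the stored seed every time,
        # so repeated calls always yield the same folds
        rng = np.random.RandomState(self.seed_)
        return np.array_split(rng.permutation(len(X)), self.n_splits)

X = np.arange(50)
rng = np.random.RandomState(0)
# bootstrapping: one cheap instance per iteration, each with a different seed
cvs = [SeededKFold(5, rng) for _ in range(10)]

# each instance is idempotent: calling `split` twice gives the same folds...
a, b = cvs[0].split(X), cvs[0].split(X)
assert all(np.array_equal(x, y) for x, y in zip(a, b))
# ...while folds differ from one instance to the next
assert not np.array_equal(cvs[0].split(X)[0], cvs[1].split(X)[0])
```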
Stop supporting RandomState instances ------------------------------------- -We would not be able to support use-cases D or G except by passing `None`, -but then it would be impossible to get reproducible results across executions -(use-case A) +We would not be able to support use-cases D, F, and G, except by passing +`None`, but then it would be impossible to get reproducible results across +executions (use-case A) Store a seed in fit/split instead of in init --------------------------------------------- From dd646b24921945d4582142c1e23687bd9b65d029 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Sat, 19 Sep 2020 11:32:10 -0400 Subject: [PATCH 13/23] more about grid search --- slep011/proposal.rst | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 34640a5..f5d2037 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -123,8 +123,24 @@ estimator RNG varies across folds and the random subset of selected features will vary. The same is true for any procedure that performs cross-validation (manual CV, -`GridSearchCV`, etc.): **the behaviour of the CV procedure is implicitly -dictated by the estimator's** `random_state`. +`GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`: +when a RandomState instance is used, a new RNG is used on each fold, **but +also for each candidate**:: + + fold 0: use different RNG across candidates + fold 1: use different RNG across candidates (different RNGs from fold 0) + ... +Users might actually expect that the RNG would be different on each fold, but +still constant across candidates, i.e. something like:: + + fold 0: use same RNG for all candidates + fold 1: use same RNG for all candidates (different RNG from fold 0) + fold 2: use same RNG for all candidates (different RNG from folds 0 and 0) + etc... + +How are users supposed to understand or even acknowledge those differences? 
+The core problem here is that **the behaviour of the CV procedure is +implicitly dictated by the estimator's** `random_state`. There are similar issues in meta-estimators, like `RFE` or `SequentialFeatureSelection`: these are iterative feature selection From 7aa7c2017386be495172f898784c87a540587312 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Sat, 19 Sep 2020 11:34:30 -0400 Subject: [PATCH 14/23] more on grid search --- slep011/proposal.rst | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index f5d2037..c7f5545 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -127,20 +127,27 @@ The same is true for any procedure that performs cross-validation (manual CV, when a RandomState instance is used, a new RNG is used on each fold, **but also for each candidate**:: - fold 0: use different RNG across candidates - fold 1: use different RNG across candidates (different RNGs from fold 0) - ... + fold 0: use different RNGs across candidates + fold 1: use different RNGs across candidates (different RNGs from fold 0) + fold 2: use different RNGs across candidates (different RNGs from folds 0 and 1) + etc... + Users might actually expect that the RNG would be different on each fold, but still constant across candidates, i.e. something like:: fold 0: use same RNG for all candidates fold 1: use same RNG for all candidates (different RNG from fold 0) - fold 2: use same RNG for all candidates (different RNG from folds 0 and 0) + fold 2: use same RNG for all candidates (different RNG from folds 0 and 1) etc... -How are users supposed to understand or even acknowledge those differences? -The core problem here is that **the behaviour of the CV procedure is -implicitly dictated by the estimator's** `random_state`. +.. note:: + This strategy is in fact not even supported right now: neither integers, + RandomState instances or None can achieve this. 
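The per-fold, per-candidate behaviour described above can be made concrete with a toy loop (a hypothetical simplification of a grid search, not `GridSearchCV`'s actual code): with a RandomState instance, every clone-and-fit consumes the shared stream, so the RNG differs for every (fold, candidate) pair:

```python
import numpy as np

def simulated_search(random_state, n_candidates=2, n_folds=3):
    """Hypothetical sketch of a grid-search loop, not GridSearchCV's code."""
    draws = {}
    for fold in range(n_folds):
        for cand in range(n_candidates):
            # what clone + fit amounts to for the estimator's RNG:
            rng = (np.random.RandomState(random_state)
                   if isinstance(random_state, int) else random_state)
            draws[(fold, cand)] = rng.randint(2**31)  # one draw stands in for `fit`
    return draws

with_int = simulated_search(0)
assert len(set(with_int.values())) == 1  # one identical RNG everywhere

with_inst = simulated_search(np.random.RandomState(0))
# a fresh RNG per (fold, candidate) pair -- not merely one per fold
assert len(set(with_inst.values())) == 6
```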
+ +Unfortunately, there is now way for users to figure out what strategy is used +until they look at the code, and it is not just a documentation problem. The +core problem here is that **the behaviour of the CV procedure is implicitly +dictated by the estimator's** `random_state`. There are similar issues in meta-estimators, like `RFE` or `SequentialFeatureSelection`: these are iterative feature selection From 832f023237c4ee7e95d2890c07a2cf3e244d321e Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Sat, 19 Sep 2020 14:25:54 -0400 Subject: [PATCH 15/23] minor comments --- slep011/proposal.rst | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index c7f5545..04bf79e 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -23,7 +23,8 @@ users. Various issues are: behaviour of seemingly unrelated procedures like `cross_val_score`, or meta-estimators like `RFE` and `SequentialFeatureSelection`. To understand these subtleties, users need to know inner details of these CV procedures - and meta-estimators: namely, they need to know *when* `fit` is called. + and meta-estimators: namely, they need to know *when* `fit` and `clone` + are called. - Similarly, users are required to know how many times `split` is called in a CV procedure (e.g `GridSearchCV.fit`), or there is a risk that they may make unsound performance comparison. @@ -43,15 +44,15 @@ SLEP is *not* about: `split`. Neither `fit` nor `split` need to be changed. This will allow users and developers to reason about `random_state` with a simpler mental model. -- Let users explicitly request *strict* or *statistical* clones, with the - introduction of a new `use_exact_clones` parameter to each CV procedure and - meta-estimator where relevant (name open for discussion). 
+- Let users explicitly choose the strategy executed by CV procedures and + meta-estimators, by introducing new parameters, where relevant. A candidate + parameter name is `use_exact_clones` (name open for discussion). This will **explicitly** expose the choice of the CV and meta-estimator strategy to users, instead of **implicitly** relying on the type of the estimator's `random_state`. Moreover, unlike in the legacy design, both - strategies (using strict or statistical clones) will be supported by any - estimator, no matter the type of the `random_state` attribute. + all strategies will be supported by any estimator, no matter the type of + the `random_state` attribute. .. note:: The *legacy* design refers to the design as used in the current @@ -97,9 +98,8 @@ The next sections describe: - Issues that illustrate how `random_state` implictly affects CV procedures and meta-estimators in insiduous ways. -- The numerous bugs that `random_state` has caused (:ref:`next section - <bugs>`), which are another sign that the legacy design might be too - complex, even for core devs. +- Examples of bugs that `random_state` has caused, which are another sign + that the legacy design might be too complex, even for core devs. Before reading these sections, please make sure to read `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ which provides @@ -145,8 +145,8 @@ still constant across candidates, i.e. something like:: RandomState instances or None can achieve this. Unfortunately, there is now way for users to figure out what strategy is used -until they look at the code, and it is not just a documentation problem. The -core problem here is that **the behaviour of the CV procedure is implicitly +until they look at the code. It is not just a documentation problem. The core +problem here is that **the behaviour of the CV procedure is implicitly dictated by the estimator's** `random_state`. 
There are similar issues in meta-estimators, like `RFE` or `SequentialFeatureSelection`: these are iterative feature selection @@ -533,7 +533,8 @@ A deprecation path could be to start introducing the `use_exact_clones` parameters without introducing the seed drawing in `__init__` yet. This would however require a temporary deprecation of RandomState instances as well. -The easiest way might be allow for a breaking change. +The easiest way might be to allow for a breaking change, which means that the +SLEP would be implemented for a new major version of scikit-learn. Alternative solutions ===================== From bcb3dd19f6248ded50131f766c1b7692ce4f3b6f Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Mon, 21 Sep 2020 09:26:47 -0400 Subject: [PATCH 16/23] Added new alternative solution --- slep011/proposal.rst | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 04bf79e..2ca64c6 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -410,8 +410,7 @@ in estimators and splitters:: `fit`, `split`, and `get_params` are unchanged. In order to **explicitly** support use-cases C and D, CV procedures like -`cross_val_score` should be updated with a new `use_exact_clones` parameter -(name up for discussion):: +`cross_val_score` should be updated with a new `use_exact_clones` parameter:: def _check_statistical_clone_possible(est): if 'random_state' not in est.get_params(): @@ -436,9 +435,17 @@ In order to **explicitly** support use-cases C and D, CV procedures like for train, test in cv.split(X, y) ] +.. note:: + The name `use_exact_clones` is just a placeholder for now, and the final + name is up for discussion. A more descriptive name for `cross_val_score` + could be e.g. `estimator_randomness={'constant', 'splitwise'}`. The name + doesn't have to be the same throughout all CV procedures. + Meta-estimators should be updated in a similar fashion to make their two -alternative behaviors explicit.
The `clone` function needs to be updated so -that one can explicitly request exact or statistical clones:: +alternative behaviors explicit. + +As can be seen from the above snippet, the `clone` function needs to be +updated so that one can explicitly request exact or statistical clones:: def clone(est, statistical=False): # Return a strict clone or a statistical clone. @@ -460,6 +467,10 @@ that one can explicitly request exact or statistical clones:: return est.__class__(**params) +Note how one can pass a RandomState instance to `statistical` in order to +obtain a sequence of estimators that have different RNGs. This is used in +particular in `cross_val_score`. + Use-cases support ----------------- @@ -575,6 +586,21 @@ are using. This also introduces a private attribute, so we would need more intrusive changes to `set_params`, `get_params`, and `clone`. +Add a new parameter to estimators instead of adding a parameter to CV procedures and meta-estimators +---------------------------------------------------------------------------------------------------- + +Instead of updating each CV procedure and meta-estimator with the +`use_exact_clones` parameter, we could add this parameter to the +estimators that have a `random_state` attribute, and let `clone` detect what +kind of clone it needs to output depending on the estimators' corresponding +attribute. + +However, the strategy used by CV procedures and meta-estimators would still be +somewhat implicit, and making these strategies explicit is one of the main +goals of this SLEP. It is also easier to document and to expose the different +possible strategies of a CV procedure when there is a dedicated parameter in +that procedure. + .. References and Footnotes ..
------------------------ From c806293ca6593e292cf314fb1fa282c72987d0d0 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Mon, 21 Sep 2020 09:52:14 -0400 Subject: [PATCH 17/23] minor additions --- slep011/proposal.rst | 71 +++++++++++++++++++++++++------------------- 1 file changed, 41 insertions(+), 30 deletions(-) diff --git a/slep011/proposal.rst b/slep011/proposal.rst index 2ca64c6..de9834f 100644 --- a/slep011/proposal.rst +++ b/slep011/proposal.rst @@ -29,7 +29,7 @@ users. Various issues are: a CV procedure (e.g `GridSearchCV.fit`), or there is a risk that they may make unsound performance comparison. - hard-to-find bugs, due to the insiduous ways in which `random_state` - implicity affects seemingly unrelated procedures. + affects estimators and splitters. These issues are described in more details in a dedicated section below. This SLEP is *not* about: @@ -50,9 +50,9 @@ SLEP is *not* about: This will **explicitly** expose the choice of the CV and meta-estimator strategy to users, instead of **implicitly** relying on the type of the - estimator's `random_state`. Moreover, unlike in the legacy design, both - all strategies will be supported by any estimator, no matter the type of - the `random_state` attribute. + estimator's `random_state`. Moreover, unlike in the legacy design, all + strategies will be supported by any estimator, no matter the type of the + `random_state` attribute. .. note:: The *legacy* design refers to the design as used in the current @@ -62,8 +62,8 @@ SLEP is *not* about: Since passing `random_state=None` is equivalent to passing the global `RandomState` instance from `numpy` (`random_state=np.random.mtrand._rand`), we will not explicitly mention - None here, but everything that applies to instances also applies to - using None. + None in most cases, but everything that applies to instances also applies + to using None. 
Issues with the legacy design ============================= @@ -74,8 +74,8 @@ Introduction: a complex mental model The rules that dictate how `random_state` impacts the behaviour of estimators and CV splitters are seemingly simple: use an int and get the same results across calls, use an instance and get different results. Things get however -much more complicated when these estimators and splitters are used within -other procedures like CV procedures or meta-estimators. +much more complicated when these calls to `fit` and `split` are embedded +within other procedures like CV procedures or meta-estimators. Users and core devs alike still seem to have a hard time figuring it out. Quoting Andreas Müller from `#14042 @@ -120,7 +120,8 @@ lines run two very different cross-validation procedures:: In the first one, the estimator RNG is the same on each fold and the RF selects the same subset of features across folds. In the second one, the estimator RNG varies across folds and the random subset of selected features -will vary. +will vary. In other words, the CV strategy that `cross_val_score` uses +implicitly depends on the type of the estimators' `random_state` parameter. The same is true for any procedure that performs cross-validation (manual CV, `GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`: @@ -175,8 +176,8 @@ CV Splitters: users need to know inner details of CV procedures to avoid erroneo --------------------------------------------------------------------------------------------------------- CV splitters yield different splits every time `split` is called if their -`random_state` is a `RandomState` instance. This means that the following -code doesn't allow fold-to-fold comparison of scores:: +`random_state` is a RandomState instance. 
This means that the following code +doesn't allow fold-to-fold comparison of scores:: rng = np.random.RandomState(0) cv = KFold(shuffle=True, random_state=rng) @@ -201,8 +202,10 @@ to know that `split` is called only once in `GridSearchCV.fit`? .. note:: This implementation detail about `GridSearchCV.fit` is in fact - documented, at the very end of the `User Guide + documented, but only at the very end of the cross-validation `User Guide <https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling>`_. + It is not documented where it should be, that is, in the hyper-parameter + tuning User Guide or in the docstrings. .. note:: In `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_, @@ -225,7 +228,7 @@ None. Quoting Andreas Müller from `#14042 There have been countless bugs because of this ("*This*" = RandomState instances and the implied statefulness of the -estimator). +estimators). Bugs caused by estimators statefulness ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -266,8 +269,8 @@ Other potential bugs can happen in the parameter search estimators. When a third-party library wants to implement its own parameter search strategy, it needs to subclass `BaseSearchCV` and call a built-in function `evaluate_candidates(candidates)` once, or multiple times. -`evaluate_candidates()` internally calls `split` once. If -`evaluate_candidates()` is called more than once, this means that **the +`evaluate_candidates` internally calls `split` once. If +`evaluate_candidates` is called more than once, this means that **the candidate parameters are evaluated on different splits each time**. This is a quite subtle issue that third-party developers might easily @@ -276,17 +279,19 @@ re-discovering this issue over and over in the `Successive Halving PR <https://github.com/scikit-learn/scikit-learn/pull/13900>`_.
At the very least, this makes fold-to-fold comparison between candidates -impossible whenever the search strategy calls `evaluate_candidates()` more +impossible whenever the search strategy calls `evaluate_candidates` more than once. This can however cause bigger bugs in other scenarios, e.g. if we implement successive halving + warm start (details omitted here, you may refer to `this issue <https://github.com/scikit-learn/scikit-learn/issues/15125>`_). In order to prevent any potential future bug and to prevent users -from making erroneous comparisons, the `Successive Halving implementation +from making erroneous comparisons between the candidates' scores, the +`Successive Halving implementation <https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_ forbids users from using stateful splitters, just like -`SequentialFeatureSelection`. +`SequentialFeatureSelection` (see the note in the docstring for the `cv` +parameter). Other issues ------------ @@ -307,9 +312,10 @@ when RandomState instances are used. `clone` 's behaviour is implicit ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Depending on the estimator's `random_state` parameter, `clone` will return -an exact clone or a statistical clone. Statistical clones share a common -RandomState instance and thus are inter-dependent, as detailed in `#18363 +Much like CV procedures and meta-estimators, what `clone` returns implicitly +depends on the estimators' `random_state`: it may return an exact clone or a +statistical clone. Statistical clones share a common RandomState instance and +thus are inter-dependent, as detailed in `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_. This makes debugging harder, among other things.
Until `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the distinction @@ -390,7 +396,7 @@ in estimators and splitters:: def _sample_seed(random_state): # sample a random seed to be stored as the random_state attribute - # ints are passe-through + # ints are passed-through if isinstance(random_state, int): return random_state else: @@ -419,7 +425,7 @@ In order to **explicitly** support use-cases C and D, CV procedures like def cross_val_score(est, X, y, cv, use_exact_clones=True): # use_exact_clones: # - if True, the same estimator RNG is used on each fold (use-case C) - # - if False, the different estimator RNG are used on each fold (use-case D) + # - if False, different estimator RNGs are used on each fold (use-case D) if use_exact_clones: statistical = False @@ -439,7 +445,8 @@ In order to **explicitly** support use-cases C and D, CV procedures like The name `use_exact_clones` is just a placeholder for now, and the final name is up for discussion. A more descriptive name for `cross_val_score` could be e.g. `estimator_randomness={'constant', 'splitwise'}`. The name - doesn't have to be the same throughout all CV procedures. + doesn't have to be the same throughout all CV procedures, and it should + accurately describe the alternative strategies that are possible. Meta-estimators should be updated in a similar fashion to make their two alternative behaviors explicit. @@ -467,9 +474,9 @@ updated so that one can explicitly request exact or statistical clones:: return est.__class__(**params) -Note how one can pass a RandomState instance to `statistical` in order to -obtain a sequence of estimators that have different RNGs. This is used in -particular in `cross_val_score`. +Note how one can pass a RandomState instance as the `statistical` parameter, +in order to obtain a sequence of estimators that have different RNGs. This is +used in particular in the above `cross_val_score`.
Use-cases support
-----------------
@@ -495,7 +502,8 @@ Advantages
 
 - Users do not need to know the internals of CV procedures or meta estimators
   anymore. Any potential ambiguity is now made explicit and exposed to them
-  in the documentation. *This is the main point of this SLEP*.
+  by a new parameter, which will also make documentation easier.
+  *This is the main point of this SLEP*.
 
 - The mental model is simpler: no matter what is passed as `random_state`
   (int, RandomState instances, or None), results are constant across calls to
@@ -526,7 +534,9 @@ Drawbacks
   surprising behaviour. **This is still respected here.** So this isn't
   really an issue.
 
-- CV procedures and meta-estimators must be updated.
+- CV procedures and meta-estimators must be updated. There is however no way
+  around this, if we want to explicitly expose the different possible
+  strategies.
 
 Backward compatibility
 ----------------------
@@ -562,7 +572,8 @@ Stop supporting RandomState instances
 
 We would not be able to support use-cases D, F, and G, except by passing
 `None`, but then it would be impossible to get reproducible results across
-executions (use-case A)
+executions (use-case A). Nor does this address the statefulness and
+ambiguity issues previously mentioned.
 
 Store a seed in fit/split instead of in init
 ---------------------------------------------

From d017d749295c3d22fbb791a208e653d0caed243f Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Tue, 22 Sep 2020 17:11:08 -0400
Subject: [PATCH 18/23] updated deprecation path section

---
 slep011/proposal.rst | 155 ++++++++++++++++++++++++++++---------------
 1 file changed, 100 insertions(+), 55 deletions(-)

diff --git a/slep011/proposal.rst b/slep011/proposal.rst
index de9834f..f67ac9a 100644
--- a/slep011/proposal.rst
+++ b/slep011/proposal.rst
@@ -45,8 +45,9 @@ SLEP is *not* about:
 
   allow users and developers to reason about `random_state` with a simpler
   mental model.
- Let users explicitly choose the strategy executed by CV procedures and - meta-estimators, by introducing new parameters, where relevant. A candidate - parameter name is `use_exact_clones` (name open for discussion). + meta-estimators, by introducing new parameters in those procedures, where + relevant. A candidate parameter name is `use_exact_clones` (name open for + discussion). This will **explicitly** expose the choice of the CV and meta-estimator strategy to users, instead of **implicitly** relying on the type of the @@ -122,6 +123,8 @@ selects the same subset of features across folds. In the second one, the estimator RNG varies across folds and the random subset of selected features will vary. In other words, the CV strategy that `cross_val_score` uses implicitly depends on the type of the estimators' `random_state` parameter. +This is unexpected, since the behaviour of `cross_val_predict` should +preferably be determined by parameters passed to `cross_val_predict`. The same is true for any procedure that performs cross-validation (manual CV, `GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`: @@ -145,22 +148,23 @@ still constant across candidates, i.e. something like:: This strategy is in fact not even supported right now: neither integers, RandomState instances or None can achieve this. -Unfortunately, there is now way for users to figure out what strategy is used +Unfortunately, there is no way for users to figure out what strategy is used until they look at the code. It is not just a documentation problem. The core problem here is that **the behaviour of the CV procedure is implicitly dictated by the estimator's** `random_state`. There are similar issues in meta-estimators, like `RFE` or `SequentialFeatureSelection`: these are iterative feature selection -algorithms that will use either *exact* or *statistical* clones depending on -the estimator's `random_state`, which leads to two significantly different -strategies. 
Here again, **how they behave is only implicitly dictated by the +algorithms that will use either *exact* or *statistical* clones at each +iteration, depending on the estimator's `random_state`. Exact and statistical +clones lead to two significantly different strategies. Here again, the +behavior of these meta-estimators **is only implicitly dictated by the estimator's** `random_state`. -In addition, since the estimator's `random_state` type dictates the strategy, -users are bound to one single strategy once the estimator has been created: -it is for example impossible for an estimator to use a different RNG across -folds if that estimator was initialized with an integer. +In addition, since the sub-estimator's `random_state` type dictates the +strategy, users are bound to one single strategy once the estimator has been +created: it is for example impossible for an estimator to use a different RNG +across folds if that estimator was initialized with an integer. It is unlikely that users have a perfect understanding of these subtleties. For users to actually understand how `random_state` impacts the CV procedures @@ -270,7 +274,7 @@ third-party library wants to implement its own parameter search strategy, it needs to subclass `BaseSearchCV` and call a built-in function `evaluate_candidates(candidates)` once, or multiple times. `evaluate_candidates` internally calls `split` once. If -`evalute_candidates` is called more than once, this means that **the +`evaluate_candidates` is called more than once, this means that **the candidate parameters are evaluated on different splits each time**. This is a quite subtle issue that third-party developers might easily @@ -285,13 +289,14 @@ implement successive halving + warm start (details ommitted here, you may refer to `this issue <https://github.com/scikit-learn/scikit-learn/issues/15125>`_). 
-In order to prevent any potential future bug and to prevent users -from making erroneous comparisons between the candidates scores, the -`Successive Halving implementation -<https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_ -forbids users from using stateful splitters, just like -`SequentialFeatureSelection` (see the note in the docstring for the `cv` -parameter). +.. note:: + In order to prevent any potential future bug and to prevent users from + making erroneous comparisons between the candidates scores, the + `Successive Halving implementation + <https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_ + forbids users from using stateful splitters, just like + `SequentialFeatureSelection` (see the note in the docstring for the `cv` + parameter). Other issues ------------ @@ -312,7 +317,7 @@ when RandomState instances are used. `clone` 's behaviour is implicit ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Much like CV procedures and meta-estimatorws, what clone returns implicitly +Much like CV procedures and meta-estimators, what `clone` returns implicitly depends on the estimators' `random_state`: it may return an exact clone or a statistical clone. Statistical clones share a common RandomState instance and thus are inter-dependent, as detailed in `#18363 @@ -364,15 +369,18 @@ these use-cases in the rest of the document: choice of stratification, etc. This is useful to implement e.g. boostrapping. This is currently supported by calling `split` multiple times on the same `cv` instance, if `cv.random_state` is an instance, or None. + This can also be achieved by just creating multiple splitter instances with + different `random_state` values (which could be integers). .. note:: - C and E are very related: C is supported via creating strict clones (E). 
- Similarly, D is supported by creating statistical clones (F). - - In legacy, C and D are mutually exclusive: a given estimator can only do - C and not D, or only D and not C. Also, a given estimator can only - produce strict clones or only statistical clones, but not both. In the - proposed design, all estimators will support both C and D. Similarly, + C and E are very related: in `cross_val_score`, C is supported by + creating strict clones (E). Similarly, D is supported by creating + statistical clones (F). + + In the legacy design, C and D are mutually exclusive: a given estimator + can only do C and not D, or only D and not C. Also, a given estimator can + only produce strict clones or only statistical clones, but not both. In + the proposed design, all estimators will support both C and D. Similarly, strict and statistical clones can be obtained from any estimator. Proposed Solution @@ -452,7 +460,8 @@ Meta-estimators should be updated in a similar fashion to make their two alternative behaviors explicit. As can be seen from the above snippet, the `clone` function needs to be -updated so that one can explicitly request exact or statistical clones:: +updated so that one can explicitly request exact or statistical clones +(use-cases E and F):: def clone(est, statistical=False): # Return a strict clone or a statistical clone. @@ -503,7 +512,7 @@ Advantages - Users do not need to know the internals of CV procedures or meta estimators anymore. Any potential ambiguity is now made explicit and exposed to them by a new parameter, which will also make documentation easier. - *This is the main point of this SLEP*. + *Removing ambiguity is the main point of this SLEP*. - The mental model is simpler: no matter what is passed as `random_state` (int, RandomState instances, or None), results are constant across calls to @@ -514,48 +523,84 @@ Advantages idempotent. 
- Users now have explicit control on the CV procedures and meta-estimators, - instead of implicitly relying on the estimator's `random_state`. + instead of implicitly relying on the estimator's `random_state`. These + procedures are not ambiguous anymore. - Since CV splitters always yield the same splits, the chances of performing erroneous score comparison is limited. -- It is possible to preserve full backward-compatibility of behaviors in - cases where an int was passed. - - `fit`, `split`, and `get_params` are unchanged. Drawbacks --------- - We break our convention that `__init__` should only ever store attributes, - as they are passed in. Note however that the reason we have this convention - is that we want the semantic of `__init__` and `set_params` are the same, - and we want to enable people to change public attributes without having - surprising behaviour. **This is still respected here.** So this isn't - really an issue. + as they are passed in (for integers, this convention is still respected). + Note however that the reason we have this convention is that we want the + semantic of `__init__` and `set_params` to be the same. **This is still + respected here.** So this isn't really an issue. - CV procedures and meta-estimators must be updated. There is however no way around this, if we want to explicitly expose the different possible strategies. -Backward compatibility ----------------------- - -When an integer was used as `random_state`, all backward compatibility in -terms of results can be preserved (this will depend on the default value of -the `use_exact_clones` parameter) - -Similarly, when an instance was used, results may be backward compatible for -CV procedures and meta-estimators depending on the default. However, -successive calls to `fit` will not yield backward-compatible results. -Similarly for splitters, backward compatibility cannot be preserved. 
-
-A deprecation path could be to start introducing the `use_exact_clones`
-parameters without introducing the seed drawing in `__init__` yet. This would
-however require a temporary deprecation of RandomState instances as well.
-
-The easiest way might be allow for a breaking change, which means that the
-SLEP would be implemented for a new major version of scikit-learn.
+Backward compatibility and possible roadmap for adoption
+--------------------------------------------------------
+
+There are two main changes proposed here:
+
+1. Making estimators and splitters stateless
+2. Introducing the `use_exact_clones` (or alternative names) parameter to CV
+   procedures and meta-estimators.
+
+A possible deprecation path for 1) would be to warn the user the second time
+`fit` (or `split`) is called, when `random_state` is an instance. There is no
+need for a warning if `random_state` is an integer since estimators and
+splitters are already stateless in this case. The warning would suggest that
+users explicitly create a statistical clone (using `clone`) instead of calling
+`fit` twice on the same instance. For splitters, we would just suggest the
+creation of a new instance.
+
+However, making estimators stateless without supporting item 2) would mean
+dropping support for use-case D. So item 2) should be implemented before item
+1).
+
+Item 2) can be implemented in a backward-compatible fashion by introducing
+the `use_exact_clones` parameter where relevant with a default 'warn', which
+would tell users that they should be setting this parameter explicitly from
+now on::
+
+    def cross_val_score(..., use_exact_clones='warn'):
+        if use_exact_clones == 'warn':
+            warn("The use_exact_clones parameter was introduced in ..."
+                 "you should explicitly set it to True or False...")
+            # proceed as before in a backward compatible way
+        elif use_exact_clones is True:
+            # build exact clones (probably by sampling a seed from the estimator's random_state)
+        else:
+            # build statistical clones (probably by creating a local RandomState instance)
+
+However, if 1) isn't implemented yet, this is not fully satisfactory: if
+estimators are still stateful, this means for example that the seed drawn in
+the `elif` block would be different across calls to `cross_val_score`, since
+the estimator's `random_state` would have been mutated. This is not
+desirable: once 1) is implemented and the estimators become stateless,
+multiple calls to `cross_val_score` would yield consistent results (as
+expected).
+
+As a result, 1) must be implemented before 2). But since 2) must also be
+implemented before 1)... we have a chicken-and-egg problem.
+
+The path with the least amount of friction would likely be to implement both
+1) and 2) at the same time, and accept the fact that `cross_val_score` and
+similar procedures might temporarily give different results across calls, in
+cases where the estimator is still stateful (i.e. if its `random_state` is an
+instance). The amount of user code whose results may change once the
+estimators become stateless is likely to be quite small. This may be
+acceptable with a reasonable amount of outreach.
+
+An easier but more brutal adoption path is to just break backward
+compatibility and release this in a new major version, e.g. 1.0 or 2.0.
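The ordering problem can be reproduced with plain numpy: a seed drawn from a still-stateful `RandomState` instance changes between draws, which is why the seed sampled inside `cross_val_score` would differ across calls until estimators become stateless. This is a minimal illustration, not scikit-learn code:

```python
import numpy as np

# an estimator created with random_state=RandomState(0) effectively holds this object
rng = np.random.RandomState(0)

# a first call to the CV procedure draws a seed for its exact clones...
seed_first_call = int(rng.randint(np.iinfo(np.int32).max))
# ...and mutates the instance, so a second call draws a different seed
seed_second_call = int(rng.randint(np.iinfo(np.int32).max))
assert seed_first_call != seed_second_call

# a fresh instance with the same seed replays the exact same draw, which is
# the consistent behaviour that stateless estimators would restore
assert int(np.random.RandomState(0).randint(np.iinfo(np.int32).max)) == seed_first_call
```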
Alternative solutions ===================== From 33fdd12436d024102f4975e201c41ebc6ade7646 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Fri, 12 May 2023 11:20:35 +0100 Subject: [PATCH 19/23] ajbdfajdfbjladgljadg --- {slep011 => slep021}/proposal.rst | 26 ++++++++++++++++++++++---- 1 file changed, 22 insertions(+), 4 deletions(-) rename {slep011 => slep021}/proposal.rst (97%) diff --git a/slep011/proposal.rst b/slep021/proposal.rst similarity index 97% rename from slep011/proposal.rst rename to slep021/proposal.rst index f67ac9a..7ffa08e 100644 --- a/slep011/proposal.rst +++ b/slep021/proposal.rst @@ -1,13 +1,31 @@ -.. _slep_011: +.. _slep_021: ====================================== -SLEP011: Fixing randomness ambiguities +SLEP021: Fixing randomness ambiguities ====================================== -:Author: Nicolas Hug +:Author: Needs a champion :Status: Under review :Type: Standards Track -:Created: 2019-11-27 +:Created: 2020-09-01 + +.. warning:: + + Update from 2023: This SLEP was written a long time ago, when the + understanding was that `RandomState` instances are **shared** across + clones. `It turns out it's not the case + <https://github.com/scikit-learn/scikit-learn/issues/26148>`_ and the + instances are instead copied. So a few of the claims made in this SLEP may + be slightly confusing. What this SLEP claims is happening for `RandomState` + instances is in fact happening only when passing `None`, and `RandomState` + instances behave mostly like `int` in CV procedures. Nonetheless, a lot of + its content is still relevant. + + Note: this SLEP is a complete re-write of the abandonned `SLEP 11 + <https://github.com/scikit-learn/enhancement_proposals/pull/24>`_ and + proposes a different solution. 
+ + Abstract ======== From 9917a9e5c5c6f58bc912766293c613aac33dfd68 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Fri, 12 May 2023 11:24:29 +0100 Subject: [PATCH 20/23] AJBFAJDBF --- slep021/proposal.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/slep021/proposal.rst b/slep021/proposal.rst index 7ffa08e..655a743 100644 --- a/slep021/proposal.rst +++ b/slep021/proposal.rst @@ -15,11 +15,13 @@ SLEP021: Fixing randomness ambiguities understanding was that `RandomState` instances are **shared** across clones. `It turns out it's not the case <https://github.com/scikit-learn/scikit-learn/issues/26148>`_ and the - instances are instead copied. So a few of the claims made in this SLEP may - be slightly confusing. What this SLEP claims is happening for `RandomState` - instances is in fact happening only when passing `None`, and `RandomState` - instances behave mostly like `int` in CV procedures. Nonetheless, a lot of - its content is still relevant. + instances are instead copied. So a few of the claims made in this SLEP are + wrong, and may be slightly confusing. What this SLEP claims is happening for + `RandomState` instances is in fact happening only when passing `None`, and + `RandomState` instances behave mostly like `int` in CV procedures. + + Nonetheless, a lot of its content is still relevant. At the very least, the + :ref:`key_use_cases` section may help framing discussions. 
Note: this SLEP is a complete re-write of the abandonned `SLEP 11 <https://github.com/scikit-learn/enhancement_proposals/pull/24>`_ and From 66216e62e3723af2890437ad4da3f2af9b818e32 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Fri, 12 May 2023 11:27:03 +0100 Subject: [PATCH 21/23] 21 -> 22 --- index.rst | 1 + {slep021 => slep022}/proposal.rst | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) rename {slep021 => slep022}/proposal.rst (99%) diff --git a/index.rst b/index.rst index ff7d43c..77ba421 100644 --- a/index.rst +++ b/index.rst @@ -25,6 +25,7 @@ slep012/proposal slep017/proposal slep019/proposal + slep022/proposal .. toctree:: :maxdepth: 1 diff --git a/slep021/proposal.rst b/slep022/proposal.rst similarity index 99% rename from slep021/proposal.rst rename to slep022/proposal.rst index 655a743..5bce266 100644 --- a/slep021/proposal.rst +++ b/slep022/proposal.rst @@ -1,7 +1,7 @@ -.. _slep_021: +.. _slep_022: ====================================== -SLEP021: Fixing randomness ambiguities +SLEP022: Fixing randomness ambiguities ====================================== :Author: Needs a champion From c7e3105896a462c5d5364263a7e7a0f2292727f3 Mon Sep 17 00:00:00 2001 From: Nicolas Hug <contact@nicolas-hug.com> Date: Wed, 14 Jun 2023 21:36:52 +0100 Subject: [PATCH 22/23] Some cleanups --- slep022/proposal.rst | 580 ++++++++++++++++++++++--------------------- 1 file changed, 295 insertions(+), 285 deletions(-) diff --git a/slep022/proposal.rst b/slep022/proposal.rst index 5bce266..888501c 100644 --- a/slep022/proposal.rst +++ b/slep022/proposal.rst @@ -11,22 +11,16 @@ SLEP022: Fixing randomness ambiguities .. warning:: - Update from 2023: This SLEP was written a long time ago, when the - understanding was that `RandomState` instances are **shared** across - clones. `It turns out it's not the case - <https://github.com/scikit-learn/scikit-learn/issues/26148>`_ and the - instances are instead copied. 
So a few of the claims made in this SLEP are
-    wrong, and may be slightly confusing. What this SLEP claims is happening for
-    `RandomState` instances is in fact happening only when passing `None`, and
-    `RandomState` instances behave mostly like `int` in CV procedures.
-
-    Nonetheless, a lot of its content is still relevant. At the very least, the
-    :ref:`key_use_cases` section may help framing discussions.
-
     Note: this SLEP is a complete re-write of the abandonned `SLEP 11
     <https://github.com/scikit-learn/enhancement_proposals/pull/24>`_ and
-    proposes a different solution.
+    proposes a **different** solution.
+
+    Play around with `this notebook
+    <https://nbviewer.jupyter.org/gist/NicolasHug/2db607b01482988fa549eb2c8770f79f>`_,
+    which illustrates the proposed solution.
+
+    Please, **please** take some time to read and understand this SLEP
+    before taking part in discussions.


Abstract
========
@@ -79,278 +73,20 @@ SLEP022: Fixing randomness ambiguities
The *legacy* design refers to the design as used in the current scikit-learn
master branch, i.e. `0.24.dev`.

-.. note::
-    Since passing `random_state=None` is equivalent to passing the global
-    `RandomState` instance from `numpy`
-    (`random_state=np.random.mtrand._rand`), we will not explicitly mention
-    None in most cases, but everything that applies to instances also applies
-    to using None.

Issues with the legacy design
=============================

-Introduction: a complex mental model
-------------------------------------
-
-The rules that dictate how `random_state` impacts the behaviour of estimators
-and CV splitters are seemingly simple: use an int and get the same results
-across calls, use an instance and get different results. Things get however
-much more complicated when these calls to `fit` and `split` are embedded
-within other procedures like CV procedures or meta-estimators.
-
-Users and core devs alike still seem to have a hard time figuring it out.
-Quoting Andreas Müller from `#14042 -<https://github.com/scikit-learn/scikit-learn/issues/14042>`_: - - The current design of random_state is often hard to understand and - confusing for users. I think we should rethink how random_state works. - -As another empirical evidence, the fact the CV splitters may yield different -splits across calls recently came as a surprise to a long-time core dev. - -At the very least, it should be noted that prior to `#9517 -<https://github.com/scikit-learn/scikit-learn/pull/9517/>`_ (merged in 2018) -and `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ (not -yet merged as of September 2020), most subtleties related to `random_state` -were largely under-documented, and it is likely that a lot of users are -actually oblivious to most pitfalls. - -The next sections describe: - -- Issues that illustrate how `random_state` implictly affects CV procedures - and meta-estimators in insiduous ways. -- Examples of bugs that `random_state` has caused, which are another sign - that the legacy design might be too complex, even for core devs. - -Before reading these sections, please make sure to read `#18363 -<https://github.com/scikit-learn/scikit-learn/pull/18363>`_ which provides -more context, and illustrates some use-cases that will be used here. - -.. 
_estimator_issues: - -Estimators: CV procedures and meta-estimators behaviour is implicitly dictated by the base estimator's `random_state` ---------------------------------------------------------------------------------------------------------------------- - -As described in more details in `#18363 -<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the following -lines run two very different cross-validation procedures:: - - cross_val_score(RandomForestClassifier(random_state=0), X, y) - cross_val_score(RandomForestClassifier(random_state=np.RandomState(0)), X, y) - -In the first one, the estimator RNG is the same on each fold and the RF -selects the same subset of features across folds. In the second one, the -estimator RNG varies across folds and the random subset of selected features -will vary. In other words, the CV strategy that `cross_val_score` uses -implicitly depends on the type of the estimators' `random_state` parameter. -This is unexpected, since the behaviour of `cross_val_predict` should -preferably be determined by parameters passed to `cross_val_predict`. - -The same is true for any procedure that performs cross-validation (manual CV, -`GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`: -when a RandomState instance is used, a new RNG is used on each fold, **but -also for each candidate**:: - - fold 0: use different RNGs across candidates - fold 1: use different RNGs across candidates (different RNGs from fold 0) - fold 2: use different RNGs across candidates (different RNGs from folds 0 and 1) - etc... - -Users might actually expect that the RNG would be different on each fold, but -still constant across candidates, i.e. something like:: - - fold 0: use same RNG for all candidates - fold 1: use same RNG for all candidates (different RNG from fold 0) - fold 2: use same RNG for all candidates (different RNG from folds 0 and 1) - etc... - -.. 
note:: - This strategy is in fact not even supported right now: neither integers, - RandomState instances or None can achieve this. - -Unfortunately, there is no way for users to figure out what strategy is used -until they look at the code. It is not just a documentation problem. The core -problem here is that **the behaviour of the CV procedure is implicitly -dictated by the estimator's** `random_state`. - -There are similar issues in meta-estimators, like `RFE` or -`SequentialFeatureSelection`: these are iterative feature selection -algorithms that will use either *exact* or *statistical* clones at each -iteration, depending on the estimator's `random_state`. Exact and statistical -clones lead to two significantly different strategies. Here again, the -behavior of these meta-estimators **is only implicitly dictated by the -estimator's** `random_state`. - -In addition, since the sub-estimator's `random_state` type dictates the -strategy, users are bound to one single strategy once the estimator has been -created: it is for example impossible for an estimator to use a different RNG -across folds if that estimator was initialized with an integer. - -It is unlikely that users have a perfect understanding of these subtleties. -For users to actually understand how `random_state` impacts the CV procedures -and meta-estimators, they actually need to know inner details of these: they -need to know where and when `fit` is called, and also when `clone` is called. - -There is a very similar problem with CV splitters as described in the next -section. - -.. _cv_splitters_issues: - -CV Splitters: users need to know inner details of CV procedures to avoid erroneous performance comparison ---------------------------------------------------------------------------------------------------------- - -CV splitters yield different splits every time `split` is called if their -`random_state` is a RandomState instance. 
This means that the following code -doesn't allow fold-to-fold comparison of scores:: - - rng = np.random.RandomState(0) - cv = KFold(shuffle=True, random_state=rng) - estimators = [...] # the estimators you want to compare - scores = { - est: cross_val_score(est, X, y, cv=cv) - for est in estimators - } - -Users might not realize it, but **the estimators will be evaluated on -different splits**, even though they think they've set the random state by -passing a carefuly crafted instance. This is because `cv.split` was called -multiple times, yet these calls were hidden inside of `cross_val_score`. On -top of impossible fold-to-fold comparison, comparing the average scores is -also not ideal if the number of folds or samples is small. - -As a consequence, before users can safely report score comparisons, **they -need to know how many times** `split` **is called**, which should just be an -implementation detail. While the above example is already error-prone, things -get harder in more complex tools like `GridSearchCV`: how are users supposed -to know that `split` is called only once in `GridSearchCV.fit`? - -.. note:: - This implementation detail about `GridSearchCV.fit` is in fact - documented, but only at the very end of the cross-validation `User Guide - <https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling>`_. - It is not documented where it shoud be, that is, in the hyper-parameter - tuning User Guide or in the docstrings. - -.. note:: - In `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_, - we recommend users to use integers for CV splitters' `random_state`, - effectively making them stateless. - -.. note:: - Fixing how `random_state` is handled in the splitters is one of the - entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_. - -.. _bugs: - -Bugs ----- - -Many bugs have happened over the years because of RandomState instances and -None. 
Quoting Andreas Müller from `#14042 -<https://github.com/scikit-learn/scikit-learn/issues/14042>`_: - - There have been countless bugs because of this - -("*This*" = RandomState instances and the implied statefulness of the -estimators). - -Bugs caused by estimators statefulness -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -These bugs are often hard to find, and some of them are actual data leaks, -e.g. `#14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_. +TL;DR: nobody, `not even core devs +<https://github.com/scikit-learn/scikit-learn/issues/26148>`_, understands how +randomness works in scikit-learn. We think we do, but we don't. -They arise because the estimators are stateful across calls to `fit`. Fixing -them usually involves forcing the estimator to be (at least partially) -stateless. A classical bug that happened multiple times is that the -validation set may differ across calls to `fit` in a warm-start + early -stopping context. For example, `this fix -<https://github.com/scikit-learn/scikit-learn/pull/14999>`_ is to draw a -random seed once and to re-use that seed for data splitting when -early-stopping + warm start is used. It is *not* an obvious bug, nor an -obvious fix. +There are two key problems, that this SLEP solves: -Making estimators stateless across calls to `fit` would prevent such bugs to -happen, and would keep the code-base cleaner. +- Estimators are stateful across calls to `fit()`. +- Users can't properly control what kind of CV procedure their code is running. -Bugs caused by splitters statefulness -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +More details below in :ref:`issues`. -`#18431 <https://github.com/scikit-learn/scikit-learn/pull/18431>`_ is a bug -introduced in `SequentialFeatureSelection` that perfectly illustrates the -previous section :ref:`cv_splitters_issues`. The bug was that splitter -statefulness would lead to comparing average scores of candidates that have -been evaluated on different splits. 
Here again, the fix is to enforce -statelessness of the splitter, e.g. -`KFolds(5, shuffle=True, random_state=None)` is forbidden. - -.. note:: - This bug was introduced by Nicolas Hug, who is this SLEP's author: it's - very easy to let these bugs sneak in, even when you're trying hard not - to. - -Other potential bugs can happen in the parameter search estimators. When a -third-party library wants to implement its own parameter search strategy, it -needs to subclass `BaseSearchCV` and call a built-in function -`evaluate_candidates(candidates)` once, or multiple times. -`evaluate_candidates` internally calls `split` once. If -`evaluate_candidates` is called more than once, this means that **the -candidate parameters are evaluated on different splits each time**. - -This is a quite subtle issue that third-party developers might easily -overlook. Some core devs (Joel Nothman and Nicolas Hug) kept forgetting and -re-discovering this issue over and over in the `Successive Halving PR -<https://github.com/scikit-learn/scikit-learn/pull/13900>`_. - -At the very least, this makes fold-to-fold comparison between candidates -impossible whenever the search strategy calls `evaluate_candidates` more -than once. This can however cause bigger bugs in other scenarios, e.g. if we -implement successive halving + warm start (details ommitted here, you may -refer to `this issue -<https://github.com/scikit-learn/scikit-learn/issues/15125>`_). - -.. note:: - In order to prevent any potential future bug and to prevent users from - making erroneous comparisons between the candidates scores, the - `Successive Halving implementation - <https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_ - forbids users from using stateful splitters, just like - `SequentialFeatureSelection` (see the note in the docstring for the `cv` - parameter). 
- -Other issues ------------- - -Fit idempotency isn't respected -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Quoting our `Developer Guidelines -<https://scikit-learn.org/stable/developers/develop.html#fitting>`_: - - When fit is called, any previous call to fit should be ignored. - -This means that ideally, calling `est.fit(X, y)` should yield the same model -twice. We have a check for that in the `check_estimator()` suite: -`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated -when RandomState instances are used. - -`clone` 's behaviour is implicit -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Much like CV procedures and meta-estimators, what `clone` returns implicitly -depends on the estimators' `random_state`: it may return an exact clone or a -statistical clone. Statistical clones share a common RandomState instance and -thus are inter-dependent, as detailed in `#18363 -<https://github.com/scikit-learn/scikit-learn/pull/18363>`_. This makes -debugging harder, among other things. Until `#18363 -<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the distinction -between exact and statistical clones had never been documented and was up to -users to figure out. - -.. note:: - While less user-facing, this issue is actually part of the root cause for - the aforementioned issues related to estimators in CV procedures and in - meta-estimators. .. _key_use_cases: @@ -370,20 +106,24 @@ these use-cases in the rest of the document: program executions yield different results each time. - C. CV procedures where the estimator's RNG is exactly the same on each fold. This is useful when one wants to make sure that a given seed will - work well on new data (whether users should actually do this is - debatable...). This is currently supported by passing an int to the - `random_state` parameter of the estimator. + work well on their data. This is a legit requirement as long as the user is OK + for the estimator to perform well just by chance. 
This is currently supported + by passing an int + **or a `RandomState` instance** to the `random_state` parameter of the + estimator. - D. CV procedures where the estimator's RNG is different for each fold. This is useful to increase the statistical significance of CV results. This is - currently supported by passing a RandomState instance or None to the - `random_state` parameter of the estimator. + currently **only** supported by passing None as the `random_state` parameter + of the estimator - and that throws reproducibilty out the window. **In the + legacy design, A and D are incompatible.** - E. Obtain *strict* clones, i.e. clones that will yield exactly the same results when called on the same data. This is currently supported by - calling `clone` if the estimator's `random_state` is an int. + calling `clone` if the estimator's `random_state` is an int or a `RandomState` + instance. - F. Obtain *statistical* clones, i.e. estimators that are identical but that will yield different results, even when called on the same data. This is currently supported by calling `clone` if - the estimator's `random_state` is an instance or None. + the estimator's `random_state` is None. **F and A are incompatible.** - G. Easily obtain different splits with the same characteristics from a splitter. By "same characteristics", we mean same number of splits, same choice of stratification, etc. This is useful to implement e.g. @@ -567,6 +307,12 @@ Drawbacks Backward compatibility and possible roadmap for adoption -------------------------------------------------------- +.. warning:: + + I wrote this a long time ago. BC considerations might be easier / looser if + we choose to only support this proposal on the new numpy RNG objects. I + haven't thought about it in depth. + There are two main changes proposed here: 1. Making estimators and splitters stateless @@ -677,6 +423,270 @@ goal of this SLEP. 
 It is also easier to document and to expose the different possible
 strategies of a CV procedure when there is a dedicated parameter in that
 procedure.
 
+.. _issues:
+
+Appendix: Issues with the legacy design
+=======================================
+
+Introduction: a complex mental model
+------------------------------------
+
+The rules that dictate how `random_state` impacts the behaviour of estimators
+and CV splitters are seemingly simple: use an int or a `RandomState` instance
+and get the same results across calls; use None and get different results.
+Things get much more complicated, however, when these calls to `fit` and
+`split` are embedded within other procedures such as CV procedures or
+meta-estimators.
+
+Users and core devs alike still seem to have a hard time figuring it out.
+Quoting Andreas Müller from `#14042
+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_:
+
+    The current design of random_state is often hard to understand and
+    confusing for users. I think we should rethink how random_state works.
+
+As further empirical evidence, the fact that CV splitters may yield different
+splits across calls recently came as a surprise to a long-time core dev.
+
+At the very least, it should be noted that prior to `#9517
+<https://github.com/scikit-learn/scikit-learn/pull/9517/>`_ (merged in 2018)
+and `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_ (not
+yet merged as of September 2020), most subtleties related to `random_state`
+were largely under-documented, and it is likely that a lot of users are
+actually oblivious to most pitfalls.
+
+The next sections describe:
+
+- Issues that illustrate how `random_state` implicitly affects CV procedures
+  and meta-estimators in insidious ways.
+- Examples of bugs that `random_state` has caused, which are another sign
+  that the legacy design might be too complex, even for core devs.
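As an editorial aside, the "seemingly simple" rules can be made concrete with a minimal sketch. `check_random_state` is the real helper from `sklearn.utils`; the `draw` function is a hypothetical stand-in for the random draws that `fit` or `split` perform internally:

```python
import numpy as np
from sklearn.utils import check_random_state

def draw(random_state):
    # Hypothetical stand-in for what fit()/split() do internally:
    # resolve random_state into an RNG, then consume that RNG.
    rng = check_random_state(random_state)
    return rng.randint(10**9, size=3).tolist()

# int: a fresh RandomState is built from the seed on every call,
# so results are reproducible across calls.
assert draw(0) == draw(0)

# RandomState instance: every call advances the shared stream, so a
# second call yields different results (the instance is stateful).
rng = np.random.RandomState(0)
assert draw(rng) != draw(rng)

# None: numpy's global RandomState singleton is used; results differ
# across calls and across program executions.
```

The instance and None cases are exactly what makes `fit` non-idempotent under the legacy design.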
+
+Before reading these sections, please make sure to read `#18363
+<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, which provides
+more context and illustrates some use-cases that will be used here.
+
+.. _estimator_issues:
+
+Estimators: CV procedures and meta-estimators behaviour is implicitly dictated by the base estimator's `random_state`
+---------------------------------------------------------------------------------------------------------------------
+
+As described in more detail in `#18363
+<https://github.com/scikit-learn/scikit-learn/pull/18363>`_, the following
+lines run two very different cross-validation procedures::
+
+    cross_val_score(RandomForestClassifier(random_state=0), X, y)
+    cross_val_score(RandomForestClassifier(random_state=np.random.RandomState(0)), X, y)
+
+In the first one, the estimator RNG is the same on each fold and the RF
+selects the same subset of features across folds. In the second one, the
+estimator RNG varies across folds and the random subset of selected features
+will vary. In other words, the CV strategy that `cross_val_score` uses
+implicitly depends on the type of the estimator's `random_state` parameter.
+This is unexpected, since the behaviour of `cross_val_score` should
+preferably be determined by parameters passed to `cross_val_score` itself.
+
+The same is true for any procedure that performs cross-validation (manual CV,
+`GridSearchCV`, etc.). Things are particularly ambiguous in `GridSearchCV`:
+when None is used, a new RNG is used on each fold, **but
+also for each candidate**::
+
+    fold 0: use different RNGs across candidates
+    fold 1: use different RNGs across candidates (different RNGs from fold 0)
+    fold 2: use different RNGs across candidates (different RNGs from folds 0 and 1)
+    etc...
+
+Users might actually expect that the RNG would be different on each fold, but
+still constant across candidates, i.e. something like::
+
+    fold 0: use same RNG for all candidates
+    fold 1: use same RNG for all candidates (different RNG from fold 0)
+    fold 2: use same RNG for all candidates (different RNG from folds 0 and 1)
+    etc...
+
+.. note::
+    This strategy is in fact not even supported right now: neither integers,
+    RandomState instances, nor None can achieve it.
+
+Unfortunately, there is no way for users to figure out which strategy is used
+short of reading the code. It is not just a documentation problem. The core
+problem here is that **the behaviour of the CV procedure is implicitly
+dictated by the estimator's** `random_state`.
+
+There are similar issues in meta-estimators like `RFE` or
+`SequentialFeatureSelection`: these are iterative feature selection
+algorithms that use either *exact* or *statistical* clones at each
+iteration, depending on the estimator's `random_state`. Exact and statistical
+clones lead to two significantly different strategies. Here again, the
+behaviour of these meta-estimators **is only implicitly dictated by the
+estimator's** `random_state`.
+
+In addition, since the sub-estimator's `random_state` type dictates the
+strategy, users are bound to a single strategy once the estimator has been
+created: it is for example impossible for an estimator to use a different RNG
+across folds if that estimator was initialized with an integer.
+
+It is unlikely that users have a perfect understanding of these subtleties.
+For users to actually understand how `random_state` impacts CV procedures
+and meta-estimators, they need to know the inner details of these tools: they
+need to know where and when `fit` is called, and also when `clone` is called.
+
+There is a very similar problem with CV splitters, as described in the next
+section.
+
+.. _cv_splitters_issues:
+
+CV Splitters: users need to know inner details of CV procedures to avoid erroneous performance comparison
+---------------------------------------------------------------------------------------------------------
+
+CV splitters yield different splits every time `split` is called if their
+`random_state` is a RandomState instance. This means that the following code
+doesn't allow fold-to-fold comparison of scores::
+
+    rng = np.random.RandomState(0)
+    cv = KFold(shuffle=True, random_state=rng)
+    estimators = [...]  # the estimators you want to compare
+    scores = {
+        est: cross_val_score(est, X, y, cv=cv)
+        for est in estimators
+    }
+
+Users might not realize it, but **the estimators will be evaluated on
+different splits**, even though they think they've set the random state by
+passing a carefully crafted instance. This is because `cv.split` was called
+multiple times, yet these calls were hidden inside of `cross_val_score`. On
+top of making fold-to-fold comparison impossible, comparing the average
+scores is also not ideal if the number of folds or samples is small.
+
+As a consequence, before users can safely report score comparisons, **they
+need to know how many times** `split` **is called**, which should just be an
+implementation detail. While the above example is already error-prone, things
+get harder in more complex tools like `GridSearchCV`: how are users supposed
+to know that `split` is called only once in `GridSearchCV.fit`?
+
+.. note::
+    This implementation detail about `GridSearchCV.fit` is in fact
+    documented, but only at the very end of the cross-validation `User Guide
+    <https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling>`_.
+    It is not documented where it should be, that is, in the hyper-parameter
+    tuning User Guide or in the docstrings.
+
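As an editorial aside, the splitter statefulness described above is easy to observe directly. A minimal sketch, assuming numpy and scikit-learn are available; `first_test_fold` is a hypothetical helper, not a scikit-learn API:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((100, 2))  # dummy data; only the number of samples matters

def first_test_fold(cv):
    # Hypothetical helper: test indices of the first fold of a fresh split() call.
    train_idx, test_idx = next(cv.split(X))
    return test_idx.tolist()

# int seed: the splitter is stateless, every call to split() yields the same folds.
cv_int = KFold(n_splits=5, shuffle=True, random_state=0)
assert first_test_fold(cv_int) == first_test_fold(cv_int)

# RandomState instance: each split() call consumes the RNG, so two procedures
# sharing this cv object (e.g. two cross_val_score calls) see different splits.
cv_rng = KFold(n_splits=5, shuffle=True, random_state=np.random.RandomState(0))
assert first_test_fold(cv_rng) != first_test_fold(cv_rng)
```

This is exactly why score dictionaries built by looping `cross_val_score` over a shared, instance-seeded `cv` do not allow fold-to-fold comparison.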
+.. note::
+    In `#18363 <https://github.com/scikit-learn/scikit-learn/pull/18363>`_,
+    we recommend that users use integers for CV splitters' `random_state`,
+    effectively making the splitters stateless.
+
+.. note::
+    Fixing how `random_state` is handled in the splitters is one of the
+    entries in the `Roadmap <https://scikit-learn.org/dev/roadmap.html>`_.
+
+.. _bugs:
+
+Bugs
+----
+
+Many bugs have happened over the years because of RandomState instances and
+None. Quoting Andreas Müller from `#14042
+<https://github.com/scikit-learn/scikit-learn/issues/14042>`_:
+
+    There have been countless bugs because of this
+
+("*This*" = RandomState instances and the implied statefulness of the
+estimators).
+
+Bugs caused by estimator statefulness
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These bugs are often hard to find, and some of them are actual data leaks,
+e.g. `#14034 <https://github.com/scikit-learn/scikit-learn/issues/14034>`_.
+
+They arise because the estimators are stateful across calls to `fit`. Fixing
+them usually involves forcing the estimator to be (at least partially)
+stateless. A classical bug that happened multiple times is that the
+validation set may differ across calls to `fit` in a warm-start + early
+stopping context. For example, `this fix
+<https://github.com/scikit-learn/scikit-learn/pull/14999>`_ is to draw a
+random seed once and to re-use that seed for data splitting when
+early-stopping + warm start is used. It is *not* an obvious bug, nor an
+obvious fix.
+
+Making estimators stateless across calls to `fit` would prevent such bugs
+from happening, and would keep the code-base cleaner.
+
+Bugs caused by splitter statefulness
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`#18431 <https://github.com/scikit-learn/scikit-learn/pull/18431>`_ is a bug
+introduced in `SequentialFeatureSelection` that perfectly illustrates the
+previous section, :ref:`cv_splitters_issues`. The bug was that splitter
+statefulness would lead to comparing average scores of candidates that have
+been evaluated on different splits. Here again, the fix is to enforce
+statelessness of the splitter, e.g.
+`KFold(5, shuffle=True, random_state=None)` is forbidden.
+
+.. note::
+    This bug was introduced by Nicolas Hug, who is this SLEP's author: it's
+    very easy to let these bugs sneak in, even when you're trying hard not
+    to.
+
+Other potential bugs can happen in the parameter search estimators. When a
+third-party library wants to implement its own parameter search strategy, it
+needs to subclass `BaseSearchCV` and call a built-in function,
+`evaluate_candidates(candidates)`, once or multiple times.
+`evaluate_candidates` internally calls `split` once. If
+`evaluate_candidates` is called more than once, this means that **the
+candidate parameters are evaluated on different splits each time**.
+
+This is quite a subtle issue that third-party developers might easily
+overlook. Some core devs (Joel Nothman and Nicolas Hug) kept forgetting and
+re-discovering this issue over and over in the `Successive Halving PR
+<https://github.com/scikit-learn/scikit-learn/pull/13900>`_.
+
+At the very least, this makes fold-to-fold comparison between candidates
+impossible whenever the search strategy calls `evaluate_candidates` more
+than once. It can however cause bigger bugs in other scenarios, e.g. if we
+implement successive halving + warm start (details omitted here; you may
+refer to `this issue
+<https://github.com/scikit-learn/scikit-learn/issues/15125>`_).
+
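As an editorial aside, the warm-start / validation-split class of bug can be sketched with a toy estimator. This is illustration only: `ToyEstimator` is hypothetical and does not exist in scikit-learn; only `check_random_state` is a real helper:

```python
import numpy as np
from sklearn.utils import check_random_state

class ToyEstimator:
    """Toy early-stopping-like estimator: carves out a validation set in fit()."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X):
        # The validation indices depend on the RNG state *at call time*.
        rng = check_random_state(self.random_state)
        self.val_idx_ = rng.permutation(len(X))[:10]
        return self

X = np.zeros((50, 3))

# int seed: fit() is idempotent, both calls pick the same validation set.
est = ToyEstimator(random_state=0)
assert (est.fit(X).val_idx_ == est.fit(X).val_idx_).all()

# RandomState instance: the second fit() continues the stream, so a
# warm-started refit would validate (and early-stop) on different samples.
est = ToyEstimator(random_state=np.random.RandomState(0))
assert not (est.fit(X).val_idx_ == est.fit(X).val_idx_).all()
```

The second case is, in miniature, the bug class that the seed-drawing fix mentioned above guards against.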
+.. note::
+    In order to prevent any potential future bug, and to prevent users from
+    making erroneous comparisons between candidate scores, the
+    `Successive Halving implementation
+    <https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV>`_
+    forbids users from using stateful splitters, just like
+    `SequentialFeatureSelection` (see the note in the docstring for the `cv`
+    parameter).
+
+Other issues
+------------
+
+Fit idempotency isn't respected
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Quoting our `Developer Guidelines
+<https://scikit-learn.org/stable/developers/develop.html#fitting>`_:
+
+    When fit is called, any previous call to fit should be ignored.
+
+This means that, ideally, calling `est.fit(X, y)` twice should yield the same
+model both times. We have a check for that in the `check_estimator()` suite:
+`check_fit_idempotent()`. Clearly, this fit-idempotency property is violated
+when None is used.
+
+`clone`'s behaviour is implicit
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Much like CV procedures and meta-estimators, what `clone` returns implicitly
+depends on the estimator's `random_state`: it may return an exact clone or a
+statistical clone.
+
+.. note::
+    While less user-facing, this issue is actually part of the root cause of
+    the aforementioned issues related to estimators in CV procedures and in
+    meta-estimators.
+
 .. References and Footnotes
 .. ------------------------

From fa814436a4693ef759e3a034a22c88bacd1e4d7c Mon Sep 17 00:00:00 2001
From: Nicolas Hug <contact@nicolas-hug.com>
Date: Thu, 15 Jun 2023 10:39:34 +0100
Subject: [PATCH 23/23] typo

---
 slep022/proposal.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/slep022/proposal.rst b/slep022/proposal.rst
index 888501c..ef0ce28 100644
--- a/slep022/proposal.rst
+++ b/slep022/proposal.rst
@@ -16,7 +16,7 @@ SLEP022: Fixing randomness ambiguities
 proposes a **different** solution.
 
     Play around with `this notebook
Play around with `this notebook - <https://nbviewer.jupyter.org/gist/NicolasHug/2db607b01482988fa549eb2c8770f79f>`_. + <https://nbviewer.jupyter.org/gist/NicolasHug/2db607b01482988fa549eb2c8770f79f>`_ which illustrates the proposed solution. Please, **please** take some time to read and understand this SLEP