Merge branch 'master' into prioritized-experience-replay

araffin authored Jan 30, 2024
2 parents a043cfd + 620e58e commit f6accf9
Showing 11 changed files with 228 additions and 55 deletions.
15 changes: 8 additions & 7 deletions docs/guide/callbacks.rst
@@ -29,24 +29,25 @@ You can find two examples of custom callbacks in the documentation: one for saving
:param verbose: Verbosity level: 0 for no output, 1 for info messages, 2 for debug messages
"""
def __init__(self, verbose=0):
def __init__(self, verbose: int = 0):
super().__init__(verbose)
# Those variables will be accessible in the callback
# (they are defined in the base class)
# The RL model
# self.model = None # type: BaseAlgorithm
# An alias for self.model.get_env(), the environment used for training
# self.training_env = None # type: Union[gym.Env, VecEnv, None]
# self.training_env # type: VecEnv
# Number of times the callback was called
# self.n_calls = 0 # type: int
# num_timesteps = n_envs * n times env.step() was called
# self.num_timesteps = 0 # type: int
# local and global variables
# self.locals = None # type: Dict[str, Any]
# self.globals = None # type: Dict[str, Any]
# self.locals = {} # type: Dict[str, Any]
# self.globals = {} # type: Dict[str, Any]
# The logger object, used to report things in the terminal
# self.logger = None # stable_baselines3.common.logger
# # Sometimes, for event callback, it is useful
# # to have access to the parent object
# self.logger # type: stable_baselines3.common.logger.Logger
# Sometimes, for event callback, it is useful
# to have access to the parent object
# self.parent = None # type: Optional[BaseCallback]
def _on_training_start(self) -> None:
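
A minimal sketch of plugging such a callback into training (the ``CustomCallback`` name is an assumed placeholder for the class defined above):

.. code-block:: python

    from stable_baselines3 import PPO

    # Any BaseCallback subclass can be passed to `learn()`
    model = PPO("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=10_000, callback=CustomCallback(verbose=1))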
2 changes: 1 addition & 1 deletion docs/guide/examples.rst
@@ -364,7 +364,7 @@ Atari Games

Training an RL agent on Atari games is straightforward thanks to the ``make_atari_env`` helper function.
It will do `all the preprocessing <https://danieltakeshi.github.io/2016/11/25/frame-skipping-and-preprocessing-for-deep-q-networks-on-atari-2600-games/>`_
and multiprocessing for you. To install the Atari environments, run the command ``pip install gym[atari, accept-rom-license]`` to install the Atari environments and ROMs, or install Stable Baselines3 with ``pip install stable-baselines3[extra]`` to install this and other optional dependencies.
and multiprocessing for you. To install the Atari environments and ROMs, run ``pip install gymnasium[atari,accept-rom-license]``, or install Stable Baselines3 with ``pip install stable-baselines3[extra]`` to get these and other optional dependencies.

.. image:: ../_static/img/colab-badge.svg
:target: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/atari_games.ipynb
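
A minimal usage sketch (the environment id, algorithm, and hyperparameters below are illustrative assumptions, not part of this change):

.. code-block:: python

    from stable_baselines3 import A2C
    from stable_baselines3.common.env_util import make_atari_env
    from stable_baselines3.common.vec_env import VecFrameStack

    # Create 4 Atari environments in parallel, with the standard Atari preprocessing applied
    vec_env = make_atari_env("PongNoFrameskip-v4", n_envs=4, seed=0)
    # Stack 4 consecutive frames, as is common practice for Atari
    vec_env = VecFrameStack(vec_env, n_stack=4)

    model = A2C("CnnPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=10_000)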
85 changes: 48 additions & 37 deletions docs/guide/export.rst
@@ -31,53 +31,52 @@ to do inference in another framework.
Export to ONNX
-----------------

As of June 2021, ONNX format `doesn't support <https://github.com/onnx/onnx/issues/3033>`_ exporting models that use the ``broadcast_tensors`` functionality of pytorch. So in order to export the trained stable-baseline3 models in the ONNX format, we need to first remove the layers that use broadcasting. This can be done by creating a class that removes the unsupported layers.

The following examples are for ``MlpPolicy`` only, and are general examples. Note that you have to preprocess the observation the same way stable-baselines3 agent does (see ``common.preprocessing.preprocess_obs``).
If you are using PyTorch 2.0+ and ONNX Opset 14+, you can easily export SB3 policies using the following code:

For PPO, assuming a shared feature extractor.

.. warning::

The following example is for continuous actions only.
When using discrete or binary actions, you must do some `post-processing <https://github.com/DLR-RM/stable-baselines3/blob/f3a35aa786ee41ffff599b99fa1607c067e89074/stable_baselines3/common/policies.py#L621-L637>`_
to obtain the action (e.g., convert action logits to action).
The following returns normalized actions and doesn't include the `post-processing <https://github.com/DLR-RM/stable-baselines3/blob/a9273f968eaf8c6e04302a07d803eebfca6e7e86/stable_baselines3/common/policies.py#L370-L377>`_ step that is done with continuous actions
(clip or unscale the action to the correct space).


.. code-block:: python
import torch as th
from typing import Tuple
from stable_baselines3 import PPO
from stable_baselines3.common.policies import BasePolicy
class OnnxablePolicy(th.nn.Module):
def __init__(self, extractor, action_net, value_net):
class OnnxableSB3Policy(th.nn.Module):
def __init__(self, policy: BasePolicy):
super().__init__()
self.extractor = extractor
self.action_net = action_net
self.value_net = value_net
self.policy = policy
def forward(self, observation):
# NOTE: You may have to process (normalize) observation in the correct
# way before using this. See `common.preprocessing.preprocess_obs`
action_hidden, value_hidden = self.extractor(observation)
return self.action_net(action_hidden), self.value_net(value_hidden)
def forward(self, observation: th.Tensor) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
# NOTE: Preprocessing is included, but postprocessing
# (clipping/unscaling actions) is not.
# If needed, you also need to transpose the images so that they are channel first
# use deterministic=False if you want to export the stochastic policy
# policy() returns `actions, values, log_prob` for PPO
return self.policy(observation, deterministic=True)
# Example: model = PPO("MlpPolicy", "Pendulum-v1")
PPO("MlpPolicy", "Pendulum-v1").save("PathToTrainedModel")
model = PPO.load("PathToTrainedModel.zip", device="cpu")
onnxable_model = OnnxablePolicy(
model.policy.mlp_extractor, model.policy.action_net, model.policy.value_net
)
onnx_policy = OnnxableSB3Policy(model.policy)
observation_size = model.observation_space.shape
dummy_input = th.randn(1, *observation_size)
th.onnx.export(
onnxable_model,
onnx_policy,
dummy_input,
"my_ppo_model.onnx",
opset_version=9,
opset_version=17,
input_names=["input"],
)
@@ -93,7 +92,13 @@ For PPO, assuming a shared feature extractor.
observation = np.zeros((1, *observation_size)).astype(np.float32)
ort_sess = ort.InferenceSession(onnx_path)
action, value = ort_sess.run(None, {"input": observation})
actions, values, log_prob = ort_sess.run(None, {"input": observation})
print(actions, values, log_prob)
# Check that the predictions are the same
with th.no_grad():
print(model.policy(th.as_tensor(observation), deterministic=True))
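
As the warning above notes, the exported policy does not clip or unscale continuous actions. A minimal post-processing sketch for continuous actions (assuming a ``Box`` action space, as for ``Pendulum-v1``):

.. code-block:: python

    import numpy as np

    # Clip the raw actions to the bounds of the action space,
    # mirroring what `model.predict()` does for Box action spaces
    low, high = model.action_space.low, model.action_space.high
    clipped_actions = np.clip(actions, low, high)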
For SAC the procedure is similar. The example shown only exports the actor network as the actor is sufficient to roll out the trained policies.
@@ -108,23 +113,16 @@ For SAC the procedure is similar. The example shown only exports the actor network
class OnnxablePolicy(th.nn.Module):
def __init__(self, actor: th.nn.Module):
super().__init__()
# Removing the flatten layer because it can't be onnxed
self.actor = th.nn.Sequential(
actor.latent_pi,
actor.mu,
# For gSDE
# th.nn.Hardtanh(min_val=-actor.clip_mean, max_val=actor.clip_mean),
# Squash the output
th.nn.Tanh(),
)
self.actor = actor
def forward(self, observation: th.Tensor) -> th.Tensor:
# NOTE: You may have to process (normalize) observation in the correct
# way before using this. See `common.preprocessing.preprocess_obs`
return self.actor(observation)
# NOTE: You may have to postprocess (unnormalize) actions
# to the correct bounds (see commented code below)
return self.actor(observation, deterministic=True)
# Example: model = SAC("MlpPolicy", "Pendulum-v1")
SAC("MlpPolicy", "Pendulum-v1").save("PathToTrainedModel.zip")
model = SAC.load("PathToTrainedModel.zip", device="cpu")
onnxable_model = OnnxablePolicy(model.policy.actor)
@@ -134,7 +132,7 @@ For SAC the procedure is similar. The example shown only exports the actor network
onnxable_model,
dummy_input,
"my_sac_actor.onnx",
opset_version=9,
opset_version=17,
input_names=["input"],
)
@@ -147,10 +145,23 @@ For SAC the procedure is similar. The example shown only exports the actor network
observation = np.zeros((1, *observation_size)).astype(np.float32)
ort_sess = ort.InferenceSession(onnx_path)
action = ort_sess.run(None, {"input": observation})
scaled_action = ort_sess.run(None, {"input": observation})[0]
print(scaled_action)
# Post-process: rescale to correct space
# Rescale the action from [-1, 1] to [low, high]
# low, high = model.action_space.low, model.action_space.high
# post_processed_action = low + (0.5 * (scaled_action + 1.0) * (high - low))
# Check that the predictions are the same
with th.no_grad():
print(model.actor(th.as_tensor(observation), deterministic=True))
For more discussion around the topic, please refer to `GH#383 <https://github.com/DLR-RM/stable-baselines3/issues/383>`_ and `GH#1349 <https://github.com/DLR-RM/stable-baselines3/issues/1349>`_.


For more discussion around the topic refer to this `issue. <https://github.com/DLR-RM/stable-baselines3/issues/383>`_

Trace/Export to C++
-------------------
8 changes: 7 additions & 1 deletion docs/guide/rl_tips.rst
@@ -252,6 +252,12 @@ A better solution would be to use a squashing function (cf ``SAC``) or a Beta distribution
Tips and Tricks when implementing an RL algorithm
=================================================

.. note::

We have a `video on YouTube about reliable RL <https://www.youtube.com/watch?v=7-PUg9EAa3Y>`_ that covers
this section in more detail. You can also find the `slides online <https://araffin.github.io/slides/tips-reliable-rl/>`_.


When you try to reproduce a RL paper by implementing the algorithm, the `nuts and bolts of RL research <http://joschu.net/docs/nuts-and-bolts.pdf>`_
by John Schulman are quite useful (`video <https://www.youtube.com/watch?v=8EcdaCk9KaQ>`_).

@@ -282,4 +288,4 @@ in RL with discrete actions:
3. Pong (one of the easiest Atari game)
4. other Atari games (e.g. Breakout)

.. _SBX: https://github.com/araffin/sbx
.. _SBX: https://github.com/araffin/sbx
84 changes: 84 additions & 0 deletions docs/guide/vec_envs.rst
@@ -96,6 +96,90 @@ SB3 VecEnv API is actually close to Gym 0.21 API but differs from the Gym 0.26+ API:
``vec_env.env_method("method_name", args1, args2, kwargs1=kwargs1)`` and ``vec_env.set_attr("attribute_name", new_value)``.


Modifying Vectorized Environments Attributes
--------------------------------------------

If you plan to `modify the attributes of an environment <https://github.com/DLR-RM/stable-baselines3/issues/1573>`_ while it is used (e.g., modifying an attribute specifying the task carried out for a portion of training when doing multi-task learning, or
a parameter of the environment dynamics), you must expose a setter method.
In fact, directly accessing the environment attribute in the callback can lead to unexpected behavior because environments can be wrapped (using gym or VecEnv wrappers, the ``Monitor`` wrapper being one example).

Consider the following example for a custom env:

.. code-block:: python
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.env_util import make_vec_env
class MyMultiTaskEnv(gym.Env):
def __init__(self):
super().__init__()
"""
A state and action space for robotic locomotion.
The multi-task twist is that the policy would need to adapt to different terrains, each with its own
friction coefficient, mu.
The friction coefficient is the only parameter that changes between tasks.
mu is a scalar between 0 and 1, and during training a callback is used to update mu.
"""
...
def step(self, action):
# Do something: depending on the action and the current value of mu, the next state is computed
return self._get_obs(), reward, done, truncated, info
def set_mu(self, new_mu: float) -> None:
# Note: this value should be used only at the next reset
self.mu = new_mu
# Example of wrapped env
# env is of type <TimeLimit<OrderEnforcing<PassiveEnvChecker<CartPoleEnv<CartPole-v1>>>>>
env = gym.make("CartPole-v1")
# To access the base env, without wrapper, you should use `.unwrapped`
# or env.get_wrapper_attr("gravity") to include wrappers
env.unwrapped.gravity
# SB3 uses VecEnv for training, where `env.unwrapped.x = new_value` cannot be used to set an attribute
# therefore, you should expose a setter like `set_mu` to properly set an attribute
vec_env = make_vec_env(MyMultiTaskEnv)
# Print current mu value
# Note: you should use vec_env.env_method("get_wrapper_attr", "mu") in Gymnasium v1.0
print(vec_env.env_method("get_wrapper_attr", "mu"))
# Change `mu` attribute via the setter
vec_env.env_method("set_mu", 0.1)
In this example ``env.mu`` cannot be accessed/changed directly because it is wrapped in a ``VecEnv`` and because it could be wrapped with other wrappers (see `GH#1573 <https://github.com/DLR-RM/stable-baselines3/issues/1573>`_ for a longer explanation).
Instead, the callback should use the ``set_mu`` method via the ``env_method`` method for Vectorized Environments.

.. code-block:: python
from itertools import cycle

from stable_baselines3.common.callbacks import BaseCallback
class ChangeMuCallback(BaseCallback):
"""
This callback changes the value of mu during training, looping
through a list of values until training is aborted.
The environment is implemented so that the impact of changing
the value of mu mid-episode is visible only after the episode is over
and the reset method has been called.
""""
def __init__(self):
super().__init__()
# An iterator that contains the different values of the friction coefficient
self.mus = cycle([0.1, 0.2, 0.5, 0.13, 0.9])
def _on_step(self):
# Note: in practice, you should not change this value at every step
# but rather depending on some events/metrics like agent performance/episode termination
# both accessible via the `self.logger` or `self.locals` variables
self.training_env.env_method("set_mu", next(self.mus))
This callback can then be used to safely modify environment attributes during training since
it calls the environment setter method.
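
For instance, a sketch of wiring this callback into training (the algorithm choice is an illustrative assumption):

.. code-block:: python

    from stable_baselines3 import PPO

    vec_env = make_vec_env(MyMultiTaskEnv)
    model = PPO("MlpPolicy", vec_env)
    # At each step, the callback calls `set_mu` on every sub-environment via `env_method`
    model.learn(total_timesteps=10_000, callback=ChangeMuCallback())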


Vectorized Environments Wrappers
--------------------------------

63 changes: 62 additions & 1 deletion docs/misc/changelog.rst
@@ -3,6 +3,65 @@
Changelog
==========

Release 2.3.0a1 (WIP)
--------------------------

Breaking Changes:
^^^^^^^^^^^^^^^^^
- The default hyperparameters of ``TD3`` and ``DDPG`` have been changed to be more consistent with ``SAC``

.. code-block:: python
# SB3 < 2.3.0 default hyperparameters
# model = TD3("MlpPolicy", env, train_freq=(1, "episode"), gradient_steps=-1, batch_size=100)
# SB3 >= 2.3.0:
model = TD3("MlpPolicy", env, train_freq=1, gradient_steps=1, batch_size=256)
.. note::

Two inconsistencies remain: the default network architecture for ``TD3/DDPG`` is ``[400, 300]`` instead of ``[256, 256]`` for SAC (for backward compatibility reasons, see `report on the influence of the network size <https://wandb.ai/openrlbenchmark/sbx/reports/SBX-TD3-Influence-of-policy-net--Vmlldzo2NDg1Mzk3>`_) and the default learning rate is 1e-3 instead of 3e-4 for SAC (for performance reasons, see `W&B report on the influence of the lr <https://wandb.ai/openrlbenchmark/sbx/reports/SBX-TD3-RL-Zoo-v2-3-0a0-vs-SB3-TD3-RL-Zoo-2-2-1---Vmlldzo2MjUyNTQx>`_).



- The default ``learning_starts`` parameter of ``DQN`` has been changed to be consistent with the other off-policy algorithms


.. code-block:: python
# SB3 < 2.3.0 default hyperparameters, 50_000 corresponded to the Atari default hyperparameters
# model = DQN("MlpPolicy", env, learning_starts=50_000)
# SB3 >= 2.3.0:
model = DQN("MlpPolicy", env, learning_starts=100)
New Features:
^^^^^^^^^^^^^

Bug Fixes:
^^^^^^^^^^

`SB3-Contrib`_
^^^^^^^^^^^^^^

`RL Zoo`_
^^^^^^^^^

`SBX`_ (SB3 + Jax)
^^^^^^^^^^^^^^^^^^

Deprecations:
^^^^^^^^^^^^^

Others:
^^^^^^^

Documentation:
^^^^^^^^^^^^^^
- Added a paragraph on modifying vectorized environment parameters via setters (@fracapuano)
- Updated callback code example
- Updated export to ONNX documentation; it is now much simpler to export SB3 models with newer ONNX opsets!
- Added video link to "Practical Tips for Reliable Reinforcement Learning" video

Release 2.2.1 (2023-11-17)
--------------------------
**Support for options at reset, bug fixes and better error messages**
@@ -91,6 +150,8 @@ Documentation:
^^^^^^^^^^^^^^
- Updated RL Tips and Tricks (include recommendation for evaluation, added links to DroQ, ARS and SBX).
- Fixed various typos and grammar mistakes
- Added PokemonRedExperiments to the project page
- Fixed an out-of-date command for installing Atari in examples

Release 2.1.0 (2023-08-17)
--------------------------
@@ -1489,7 +1550,7 @@ And all the contributors:
@flodorner @KuKuXia @NeoExtended @PartiallyTyped @mmcenta @richardwu @kinalmehta @rolandgvc @tkelestemur @mloo3
@tirafesi @blurLake @koulakis @joeljosephjin @shwang @rk37 @andyshih12 @RaphaelWag @xicocaio
@diditforlulz273 @liorcohen5 @ManifoldFR @mloo3 @SwamyDev @wmmc88 @megan-klaiber @thisray
@tfederico @hn2 @LucasAlegre @AptX395 @zampanteymedio @JadenTravnik @decodyng @ardabbour @lorenz-h @mschweizer @lorepieri8 @vwxyzjn
@tfederico @hn2 @LucasAlegre @AptX395 @zampanteymedio @fracapuano @JadenTravnik @decodyng @ardabbour @lorenz-h @mschweizer @lorepieri8 @vwxyzjn
@ShangqunYu @PierreExeter @JacopoPan @ltbd78 @tom-doerr @Atlis @liusida @09tangriro @amy12xx @juancroldan
@benblack769 @bstee615 @c-rizz @skandermoalla @MihaiAnca13 @davidblom603 @ayeright @cyprienc
@wkirgsn @AechPro @CUN-bjy @batu @IljaAvadiev @timokau @kachayev @cleversonahum