Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Rendering with EvalCallback does not render the initial or final state #1692

Open
5 tasks done
kylesayrs opened this issue Sep 23, 2023 · 13 comments
Open
5 tasks done
Labels
bug Something isn't working

Comments

@kylesayrs
Copy link
Contributor

kylesayrs commented Sep 23, 2023

🐛 Bug

  1. EvalCallback does not render initial state of the episode. This is an issue for environments which have dynamic/random initial states, as it makes it unclear to the user what the state looked like before the agent performed its first step.

  2. EvalCallback does not render the last step of the episode, instead rendering the reset state. This robs the user of seeing the last (often crucial) step and is confusing for new users.

This behavior is because the environment reset is performed within the DummyVecEnv.step function. I have a local implementation that fixes this issue and clearly renders both initial and final states, but I'd first like to confirm that this behavior should indeed be treated as a bug worthy of a feature/fix.

To Reproduce

from typing import Optional

import numpy as np

from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.envs import BitFlippingEnv


# A user should not be required to subclass their environment to debug the EvalCallback
class ClearerRenderEnv(BitFlippingEnv):
    def step(self, *args, **kwargs):
        pre_state = self.state.copy()
        rets = super().step(*args, **kwargs)
        print(f"step    : {pre_state} -> {self.state} | {self.desired_goal}")

        return rets
    
    def reset(self, *args, **kwargs):
        if not hasattr(self, "state"):
            return super().reset(*args, **kwargs)
        
        pre_state = self.state.copy()
        rets = super().reset(*args, **kwargs)
        print(f"reset   : {pre_state} -> {self.state} | {self.desired_goal} ")

        return rets

    def render(self) -> Optional[np.ndarray]:
        if self.render_mode == "rgb_array":
            return self.state.copy()
        print(f"rendered:          {self.state} | {self.desired_goal}")


if __name__ == "__main__":
    eval_callback = EvalCallback(
        Monitor(ClearerRenderEnv(n_bits=2)),
        n_eval_episodes=1,
        eval_freq=10_000,
        render=True,
    )

    environment = make_vec_env(
        BitFlippingEnv,
        env_kwargs={"n_bits": 2},
        n_envs=1
    )

    model = DQN(
        "MultiInputPolicy",
        environment,
        learning_starts=0,
        learning_rate=0.1,
        verbose=2
    )
    
    model.learn(
        total_timesteps=10_000,
        log_interval=None,
        callback=eval_callback,
        progress_bar=True,
    )

Relevant log output / Error message

Notice the first state [0, 0] is never rendered and neither is the last step, only the reset frame (in this case, also [0, 0])

step    : [0 0] -> [0 1] | [1 1]
rendered:          [0 1] | [1 1]
step    : [0 1] -> [1 1] | [1 1]
reset   : [1 1] -> [0 0] | [1 1] 
rendered:          [0 0] | [1 1]
Eval num_timesteps=10000, episode_reward=-1.00 +/- 0.00
Episode length: 2.00 +/- 0.00
Success rate: 100.00%

For a new user only looking at the rendered messages, this would be very confusing. Here's another example where it seems like the agent changed two values at once.

step    : [1 1] -> [1 0] | [1 1]
rendered:          [1 0] | [1 1]
step    : [1 0] -> [1 1] | [1 1]
reset   : [1 1] -> [0 1] | [1 1] 
rendered:          [0 1] | [1 1]
Eval num_timesteps=10, episode_reward=-1.00 +/- 0.00
Episode length: 2.00 +/- 0.00
Success rate: 100.00%

System Info

No response

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
@kylesayrs kylesayrs added the bug Something isn't working label Sep 23, 2023
@araffin
Copy link
Member

araffin commented Sep 25, 2023

Hello,
I guess the issue comes from evaluate_policy() where should call render() before the call to predict().
For the final, I'm not sure as with VecEnv, they are reset automatically.

@kylesayrs
Copy link
Contributor Author

The evalcallback already specially constructs an environment for evaluation, and this environment should act slightly differently than typical VecEnvs (for example, stop stepping after reaching a terminal state and do not reset) . Therefore my current plan to fix is to subclass VecEnv and overload the step_wait function to step to not reset automatically.

I'll have to create a check in the step function to make sure that step is never called with a done environment

@kylesayrs
Copy link
Contributor Author

Side note, there's another discussion to be had about parallelizing the eval environment, aka passing a n_envs argument to evaluate_policy

@araffin
Copy link
Member

araffin commented Sep 26, 2023

I'll have to create a check in the step function to make sure that step is never called with a done environment
Side note, there's another discussion to be had about parallelizing the eval environment, aka passing a n_envs argument to evaluate_policy

The two are related, evaluate_policy() already supports multiple envs, and that's because of VecEnv precisely: #402

and this environment should act slightly differently than typical VecEnvs

if you want that behavior, you should use a gym.Env instead.
As I don't see rendering the last step as critical (rendering is usually for debug only), I would call render() before predict but add a note in the doc about the fact that the last timestep is not.

@kylesayrs
Copy link
Contributor Author

The two are related, evaluate_policy() already supports multiple envs, and that's because of VecEnv precisely.

evaluate_policy has the capacity to support multiple envs, but currently hard codes just one.

The two methods that should be overridden by the EvalVecEnv subclass are VecEnv.reset and VecEnv.step.
If we pass both of these methods the render argument, reset will ensure that the first frame is rendered and step will ensure that render() is called after the predicted step is taken in all environments but before any of the environments reset.

Pseudocode

def reset(render=false):
    rets = super().reset()
    if render:
        self.render()

    return rets

def step(render=false):
    self.step_async()
    for env_idx in range(self.num_envs):
        ...
        self.envs[env_idx].step(...)
        ...
    
    if render:
        self.render()
    
    for env_idx in range(self.num_envs):
        ...
        self.envs[env_idx].reset()
        ...

@araffin
Copy link
Member

araffin commented Sep 27, 2023

evaluate_policy has the capacity to support multiple envs, but currently hard codes just one.

What makes you think that?

Pseudocode

so if you want to do something like that, gym wrapper will help you achieve what you want, no need for a new VecEnv.

@kylesayrs
Copy link
Contributor Author

What makes you think that?

common/evaluation.py:62 env = DummyVecEnv([lambda: env])
common/envs/dummy_vec_env.py:42 super().__init__(len(env_fns), env.observation_space, env.action_space)
common/envs/base_vec_env.py:63 self.num_envs = num_envs

Therefore env.num_envs is always 1. Is it possible I'm misreading something?

@kylesayrs
Copy link
Contributor Author

Ah, I guess a user could pass a VecEnv that is already instantiated with multiple environments. I missed this in the documentation.

If a vector env is passed in, this divides the episodes to evaluate onto the different elements of the vector env

so if you want to do something like that, gym wrapper will help you achieve what you want, no need for a new VecEnv.

Are you proposing a EvalGymWrapper after the environment has been vectorized?

if not isinstance(env, VecEnv):
        env = DummyVecEnv([lambda: env])  # type: ignore[list-item, return-value]
env = EvalWrapper(env)

@araffin
Copy link
Member

araffin commented Sep 27, 2023

Ah, I guess a user could pass a VecEnv that is already instantiated with multiple environments. I missed this in the documentation.

yes

Are you proposing a EvalGymWrapper after the environment has been vectorized?

No, I'm just saying that if a user wants to render the very last step, he/she can wrap the Gym env before creating a VecEnv with a Gym wrapper that calls render before and after the step, in order to render the final step.

Rendering the first step will be solved by moving the call to render() in the evaluation function.

@kylesayrs
Copy link
Contributor Author

kylesayrs commented Sep 28, 2023

Shouldn't rendering all the steps and states be the default behavior of the EvalCallback? I personally find a lot of value in rendering all the frames both debugging the env and the agent's learning.

@kylesayrs
Copy link
Contributor Author

kylesayrs commented Sep 28, 2023

I see how my previous implementation wouldn't work if the function is passed a VecEnv, rather than the base gym environment.

One change could to be to enforce that the caller pass a base Gym environment and a n_envs, then the function constructs a EvalVecEnv which inherits from VecEnv and implements correct rendering with the specified number of environments.

We can continue to support VecEnvs, but display a warning that rendering won't behave properly.

I believe this implementation would allow first and last frames to be rendered, preserve multi-environment evaluation, and reduce complexity for the user since they not longer have to construct a vec_env themselves. What do you think?

@kylesayrs
Copy link
Contributor Author

This might be better done with a wrapper. Still working on a PR...

@jonomon
Copy link

jonomon commented Oct 7, 2023

I was missing the last render.
As a workaround, I added a "terminal_render" value in self.buf_infos of the DummyVecEnv

# https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/vec_env/dummy_vec_env.py#L69
def step_wait(self) -> VecEnvStepReturn:
  ...
  if self.buf_dones[env_idx]:
      # save final observation where user can get it, then reset
      self.buf_infos[env_idx]["terminal_observation"] = obs
      self.buf_infos[env_idx]["terminal_render"] = self.envs[env_idx].render() #<---- THIS LINE
      obs, self.reset_infos[env_idx] = self.envs[env_idx].reset()

In the eval callback, I included a conditional render:

if done:
    screen = self._eval_env.buf_infos[0]["terminal_render"]
else:
    screen = self._eval_env.render()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants