
[Bug]: Higher memory usage on sequential training runs #1966

Open · 5 tasks done
NickLucche opened this issue on Jul 9, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

@NickLucche (Contributor) commented on Jul 9, 2024

🐛 Bug

Hey,
thanks a lot for your work!
I am trying to debug an apparent memory leak (or at least higher-than-expected memory usage) when running the training code multiple times, but I can't pinpoint its cause.
I've boiled down my problem to the snippet below. Basically, when starting sequential training runs I see higher memory consumption than with a single run, whereas I would expect all resources to be released once the PPO object is garbage collected.
I believe the only real difference in this example is the observation and action space, which mimic my use case.

Single-run memory usage, model.learn(total_timesteps=500_000):
[memory usage plot]

Multi-run memory usage, model.learn(total_timesteps=25_000) run N times (crashes early due to OOM):
[memory usage plot]

To Reproduce

import gymnasium as gym
from gymnasium.wrappers.time_limit import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import SubprocVecEnv
from gymnasium.spaces import MultiDiscrete, Box
import numpy as np

OBS_SHAPE = (320, 320, 2)

class DummyEnv(gym.Env):
    """Minimal env returning random observations; the spaces mimic the real use case."""

    def __init__(self, *args, **kwargs):
        super().__init__()
        self.action_space = MultiDiscrete([6 * 8 + 1, 2, 2], dtype=np.uint8)
        self.observation_space = Box(low=0, high=255, shape=OBS_SHAPE, dtype=np.uint8)

    def reset(self, *args, **kwargs):
        return self.observation_space.sample(), {}

    def step(self, action, *args, **kwargs):
        assert self.action_space.contains(action), f"{action} ({type(action)}) invalid"
        state = self.observation_space.sample()
        reward = 0.0
        done = False
        return state, reward, done, False, {}
        

def make_env():
    env = DummyEnv()
    env = TimeLimit(env, 100)
    return Monitor(env)

def train(ts):
    # Each call builds a fresh VecEnv and PPO model; both should be released when it returns
    vec_env = SubprocVecEnv([make_env for _ in range(12)])
    model = PPO("CnnPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=ts)
    model.get_env().close()

if __name__ == "__main__":
    # 20 short sequential runs (compare with the single long run below)
    for i in range(20):
        print("Starting", i)
        train(25_000)
        print(i, "finished")
    # Single long run for comparison:
    # train(500_000)

Relevant log output / Error message

No response

System Info

- OS: Linux-6.8.7-arch1-1-x86_64-with-glibc2.39 # 1 SMP PREEMPT_DYNAMIC Wed, 17 Apr 2024 15:20:28 +0000
- Python: 3.11.8
- Stable-Baselines3: 2.3.0
- PyTorch: 2.3.0+cu121
- GPU Enabled: True
- Numpy: 1.26.4
- Cloudpickle: 3.0.0
- Gymnasium: 0.29.1

Checklist

  • My issue does not relate to a custom gym environment. (Use the custom gym env template instead)
  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • I have provided a minimal and working example to reproduce the bug
  • I've used the markdown code blocks for both code and stack traces.
NickLucche added the bug label on Jul 9, 2024
@araffin (Member) commented on Jul 10, 2024

Hello,
why would you do that instead of using a callback, for instance?
I'm also wondering why you recreate the environment every time instead of just calling learn(..., reset_num_timesteps=False) (see our doc)?

Does the higher memory usage also happen when using a DummyVecEnv?
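
For reference, a minimal sketch of the reset_num_timesteps pattern suggested above (it reuses make_env and the timestep counts from the reproduction snippet; keeping a single model across runs is the point being illustrated, not the reporter's original code):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

vec_env = SubprocVecEnv([make_env for _ in range(12)])
model = PPO("CnnPolicy", vec_env, verbose=1)
for i in range(20):
    # Continue training the same model; keep the timestep counter across calls
    model.learn(total_timesteps=25_000, reset_num_timesteps=False)
vec_env.close()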

@NickLucche (Contributor, Author) commented

The code is structured that way because my (actual) environment depends on some initial seed/state, which I can use to simulate ~"unseen" data and test generalization. It's then very straightforward to re-use the regular training script to train on different "splits"/conditions, i.e. just calling train in a loop like that.

But I believe the use case is of secondary importance here if there's some actual unreleased resource we could address (assuming there's no blunt mistake on my side).

Does the higher memory usage also happen when using a DummyVecEnv?

Yep, it still happens even with

from stable_baselines3.common.env_util import make_vec_env
vec_env = make_vec_env(make_env, n_envs=12)

I decreased the observation space to 256x256x2 to better highlight the ramp-up before my system's OOM:
[memory usage plot]

@araffin (Member) commented on Jul 10, 2024

depends on some initial seed/state, which I can use to simulate ~"unseen" data and test generalization

.reset(seed=...) is made for that normally (for a VecEnv, call .seed() and then do a reset).
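
A minimal sketch of that seeding pattern (variable names are taken from the reproduction snippet; the seed value is only illustrative):

# Plain gymnasium env: pass the seed to reset()
obs, info = env.reset(seed=123)

# VecEnv: set the seed first, then reset to apply it
vec_env.seed(seed=123)
obs = vec_env.reset()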

@NIvo172 commented on Sep 19, 2024

I have the same issue.

  • OS: Ubuntu 22.04.5 LTS
  • Python: 3.10.12
  • Stable-Baselines3: 2.3.0
  • PyTorch: 2.4.1+cu121
  • GPU Enabled: True
  • Numpy: 2.1.1
  • Cloudpickle: 3.0.0
  • Gymnasium: 0.29.1
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   139    683.0 MiB    683.0 MiB           1       @profile
   140                                             def collect_rollouts(
   141                                                 self,
   142                                                 env: VecEnv,
   143                                                 callback: BaseCallback,
   144                                                 rollout_buffer: RolloutBuffer,
   145                                                 n_rollout_steps: int,
   146                                             ) -> bool:
   147                                                 """
   148                                                 Collect experiences using the current policy and fill a ``RolloutBuffer``.
   149                                                 The term rollout here refers to the model-free notion and should not
   150                                                 be used with the concept of rollout used in model-based RL or planning.
   151                                         
   152                                                 :param env: The training environment
   153                                                 :param callback: Callback that will be called at each step
   154                                                     (and at the beginning and end of the rollout)
   155                                                 :param rollout_buffer: Buffer to fill with rollouts
   156                                                 :param n_rollout_steps: Number of experiences to collect per environment
   157                                                 :return: True if function returned with at least `n_rollout_steps`
   158                                                     collected, False if callback terminated rollout prematurely.
   159                                                 """
   160    683.0 MiB      0.0 MiB           1           assert self._last_obs is not None, "No previous observation was provided"
   161                                                 # Switch to eval mode (this affects batch norm / dropout)
   162    683.0 MiB      0.0 MiB           1           self.policy.set_training_mode(False)
   163                                         
   164    683.0 MiB      0.0 MiB           1           n_steps = 0
   165    683.2 MiB      0.2 MiB           1           rollout_buffer.reset()
   166                                                 # Sample new weights for the state dependent exploration
   167    683.2 MiB      0.0 MiB           1           if self.use_sde:
   168                                                     self.policy.reset_noise(env.num_envs)
   169                                         
   170    683.2 MiB      0.0 MiB           1           callback.on_rollout_start()
   171                                         
   172  13234.2 MiB      0.0 MiB        1025           while n_steps < n_rollout_steps:
   173  13222.5 MiB      0.0 MiB        1024               if self.use_sde and self.sde_sample_freq > 0 and n_steps % self.sde_sample_freq == 0:
   174                                                         # Sample a new noise matrix
   175                                                         self.policy.reset_noise(env.num_envs)
   176                                         
   177  13222.5 MiB      0.0 MiB        2048               with th.no_grad():
   178                                                         # Convert to pytorch tensor or to TensorDict
   179  13222.5 MiB      0.0 MiB        1024                   obs_tensor = obs_as_tensor(self._last_obs, self.device)
   180  13222.5 MiB    432.8 MiB        1024                   actions, values, log_probs = self.policy(obs_tensor)
   181  13222.5 MiB      0.0 MiB        1024               actions = actions.cpu().numpy()
   182                                         
   183                                                     # Rescale and perform action
   184  13222.5 MiB      0.0 MiB        1024               clipped_actions = actions
   185                                         
   186  13222.5 MiB      0.0 MiB        1024               if isinstance(self.action_space, spaces.Box):
   187                                                         if self.policy.squash_output:
   188                                                             # Unscale the actions to match env bounds
   189                                                             # if they were previously squashed (scaled in [-1, 1])
   190                                                             clipped_actions = self.policy.unscale_action(clipped_actions)
   191                                                         else:
   192                                                             # Otherwise, clip the actions to avoid out of bound error
   193                                                             # as we are sampling from an unbounded Gaussian distribution
   194                                                             clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
   195                                         
   196  13222.5 MiB     25.3 MiB        1024               new_obs, rewards, dones, infos = env.step(clipped_actions)
   197                                         
   198  13222.5 MiB      0.0 MiB        1024               self.num_timesteps += env.num_envs
   199                                         
   200                                                     # Give access to local variables
   201  13222.5 MiB      0.0 MiB        1024               callback.update_locals(locals())
   202  13222.5 MiB      0.0 MiB        1024               if not callback.on_step():
   203                                                         return False
   204                                         
   205  13222.5 MiB      0.0 MiB        1024               self._update_info_buffer(infos, dones)
   206  13222.5 MiB      0.0 MiB        1024               n_steps += 1
   207                                         
   208  13222.5 MiB      0.0 MiB        1024               if isinstance(self.action_space, spaces.Discrete):
   209                                                         # Reshape in case of discrete action
   210  13222.5 MiB      0.0 MiB        1024                   actions = actions.reshape(-1, 1)
   211                                         
   212                                                     # Handle timeout by bootstraping with value function
   213                                                     # see GitHub issue #633
   214  13222.5 MiB      0.0 MiB       50176               for idx, done in enumerate(dones):
   215  13222.5 MiB      0.0 MiB       49152                   if (
   216  13222.5 MiB      0.0 MiB       49152                       done
   217  13222.5 MiB      0.0 MiB        4560                       and infos[idx].get("terminal_observation") is not None
   218  13222.5 MiB      0.0 MiB        2280                       and infos[idx].get("TimeLimit.truncated", False)
   219                                                         ):
   220                                                             terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
   221                                                             with th.no_grad():
   222                                                                 terminal_value = self.policy.predict_values(terminal_obs)[0]  # type: ignore[arg-type]
   223                                                             rewards[idx] += self.gamma * terminal_value
   224                                         
   225  13234.2 MiB  12095.8 MiB        2048               rollout_buffer.add(
   226  13222.5 MiB      0.0 MiB        1024                   self._last_obs,  # type: ignore[arg-type]
   227  13222.5 MiB      0.0 MiB        1024                   actions,
   228  13222.5 MiB      0.0 MiB        1024                   rewards,
   229  13222.5 MiB      0.0 MiB        1024                   self._last_episode_starts,  # type: ignore[arg-type]
   230  13222.5 MiB      0.0 MiB        1024                   values,
   231  13222.5 MiB      0.0 MiB        1024                   log_probs,
   232                                                     )
   233  13234.2 MiB     -2.8 MiB        1024               self._last_obs = new_obs  # type: ignore[assignment]
   234  13234.2 MiB      0.0 MiB        1024               self._last_episode_starts = dones
   235                                         
   236  13234.2 MiB      0.0 MiB           2           with th.no_grad():
   237                                                     # Compute value for the last timestep
   238  13234.2 MiB      0.0 MiB           1               values = self.policy.predict_values(obs_as_tensor(new_obs, self.device))  # type: ignore[arg-type]
   239                                         
   240  13234.2 MiB      0.0 MiB           1           rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)
   241                                         
   242  13234.2 MiB      0.0 MiB           1           callback.update_locals(locals())
   243                                         
   244  13234.2 MiB      0.0 MiB           1           callback.on_rollout_end()
   245                                         
   246  13234.2 MiB      0.0 MiB           1           return True

I profiled the memory usage of the code; I guess the buffer needs to be reset somewhere, but that isn't done?
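
For context, the per-line table above looks like memory_profiler output; a sketch of how such a profile can be reproduced (it assumes memory_profiler is installed and reuses the helpers from the reproduction snippet; in the report above the decorator was presumably applied to collect_rollouts inside stable-baselines3 itself):

from memory_profiler import profile  # pip install memory-profiler

@profile
def train_once(ts):
    # The per-line memory table is printed when the decorated function returns
    vec_env = SubprocVecEnv([make_env for _ in range(12)])
    model = PPO("CnnPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=ts)
    model.get_env().close()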

@NIvo172 commented on Sep 20, 2024

@NickLucche I think the problem is not SB3; I think it's that Python does not free the memory in the loop.

In your loop you train 20 different models, right? In mine I run 20 iterations and load the last save each time.

for train_rounds in range(20):
    if exists(model_savestate):
        model = PPO.load(model_savestate)
    else:
        model = PPO()

    model.learn()
    model.save()

This went OOM. I did the following:

import gc

for train_rounds in range(20):
    if exists(model_savestate):
        model = PPO.load(model_savestate)
    else:
        model = PPO()

    model.learn()
    model.save()
    del model     # Dereference the old model
    gc.collect()  # Force a garbage-collection pass

Without the gc.collect() call the memory is still not freed. Maybe this helps you.
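
Applied to the reproduction snippet from the first post, the same idea would look roughly like this (a sketch reusing make_env, SubprocVecEnv and PPO from that script; the torch.cuda.empty_cache() call is an extra, assumed step for releasing cached GPU memory, not something suggested above):

import gc
import torch as th

def train(ts):
    vec_env = SubprocVecEnv([make_env for _ in range(12)])
    model = PPO("CnnPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=ts)
    model.get_env().close()

if __name__ == "__main__":
    for i in range(20):
        train(25_000)
        gc.collect()            # break any reference cycles left over from the last run
        th.cuda.empty_cache()   # release cached CUDA blocks back to the driver (assumed extra step)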
