[Bug]: Higher memory usage on sequential training runs #1966
Comments
Hello, does the higher memory usage happen also when using a […]?
I have the same issue.
Line # Mem usage Increment Occurrences Line Contents
=============================================================
139 683.0 MiB 683.0 MiB 1 @profile
140 def collect_rollouts(
141 self,
142 env: VecEnv,
143 callback: BaseCallback,
144 rollout_buffer: RolloutBuffer,
145 n_rollout_steps: int,
146 ) -> bool:
147 """
148 Collect experiences using the current policy and fill a ``RolloutBuffer``.
149 The term rollout here refers to the model-free notion and should not
150 be used with the concept of rollout used in model-based RL or planning.
151
152 :param env: The training environment
153 :param callback: Callback that will be called at each step
154 (and at the beginning and end of the rollout)
155 :param rollout_buffer: Buffer to fill with rollouts
156 :param n_rollout_steps: Number of experiences to collect per environment
157 :return: True if function returned with at least `n_rollout_steps`
158 collected, False if callback terminated rollout prematurely.
159 """
160 683.0 MiB 0.0 MiB 1 assert self._last_obs is not None, "No previous observation was provided"
161 # Switch to eval mode (this affects batch norm / dropout)
162 683.0 MiB 0.0 MiB 1 self.policy.set_training_mode(False)
163
164 683.0 MiB 0.0 MiB 1 n_steps = 0
165 683.2 MiB 0.2 MiB 1 rollout_buffer.reset()
166 # Sample new weights for the state dependent exploration
167 683.2 MiB 0.0 MiB 1 if self.use_sde:
168 self.policy.reset_noise(env.num_envs)
169
170 683.2 MiB 0.0 MiB 1 callback.on_rollout_start()
171
172 13234.2 MiB 0.0 MiB 1025 while n_steps < n_rollout_steps:
173 13222.5 MiB 0.0 MiB 1024 if self.use_sde and self.sde_sample_freq > 0 and n_steps % self.sde_sample_freq == 0:
174 # Sample a new noise matrix
175 self.policy.reset_noise(env.num_envs)
176
177 13222.5 MiB 0.0 MiB 2048 with th.no_grad():
178 # Convert to pytorch tensor or to TensorDict
179 13222.5 MiB 0.0 MiB 1024 obs_tensor = obs_as_tensor(self._last_obs, self.device)
180 13222.5 MiB 432.8 MiB 1024 actions, values, log_probs = self.policy(obs_tensor)
181 13222.5 MiB 0.0 MiB 1024 actions = actions.cpu().numpy()
182
183 # Rescale and perform action
184 13222.5 MiB 0.0 MiB 1024 clipped_actions = actions
185
186 13222.5 MiB 0.0 MiB 1024 if isinstance(self.action_space, spaces.Box):
187 if self.policy.squash_output:
188 # Unscale the actions to match env bounds
189 # if they were previously squashed (scaled in [-1, 1])
190 clipped_actions = self.policy.unscale_action(clipped_actions)
191 else:
192 # Otherwise, clip the actions to avoid out of bound error
193 # as we are sampling from an unbounded Gaussian distribution
194 clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)
195
196 13222.5 MiB 25.3 MiB 1024 new_obs, rewards, dones, infos = env.step(clipped_actions)
197
198 13222.5 MiB 0.0 MiB 1024 self.num_timesteps += env.num_envs
199
200 # Give access to local variables
201 13222.5 MiB 0.0 MiB 1024 callback.update_locals(locals())
202 13222.5 MiB 0.0 MiB 1024 if not callback.on_step():
203 return False
204
205 13222.5 MiB 0.0 MiB 1024 self._update_info_buffer(infos, dones)
206 13222.5 MiB 0.0 MiB 1024 n_steps += 1
207
208 13222.5 MiB 0.0 MiB 1024 if isinstance(self.action_space, spaces.Discrete):
209 # Reshape in case of discrete action
210 13222.5 MiB 0.0 MiB 1024 actions = actions.reshape(-1, 1)
211
212 # Handle timeout by bootstraping with value function
213 # see GitHub issue #633
214 13222.5 MiB 0.0 MiB 50176 for idx, done in enumerate(dones):
215 13222.5 MiB 0.0 MiB 49152 if (
216 13222.5 MiB 0.0 MiB 49152 done
217 13222.5 MiB 0.0 MiB 4560 and infos[idx].get("terminal_observation") is not None
218 13222.5 MiB 0.0 MiB 2280 and infos[idx].get("TimeLimit.truncated", False)
219 ):
220 terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
221 with th.no_grad():
222 terminal_value = self.policy.predict_values(terminal_obs)[0] # type: ignore[arg-type]
223 rewards[idx] += self.gamma * terminal_value
224
225 13234.2 MiB 12095.8 MiB 2048 rollout_buffer.add(
226 13222.5 MiB 0.0 MiB 1024 self._last_obs, # type: ignore[arg-type]
227 13222.5 MiB 0.0 MiB 1024 actions,
228 13222.5 MiB 0.0 MiB 1024 rewards,
229 13222.5 MiB 0.0 MiB 1024 self._last_episode_starts, # type: ignore[arg-type]
230 13222.5 MiB 0.0 MiB 1024 values,
231 13222.5 MiB 0.0 MiB 1024 log_probs,
232 )
233 13234.2 MiB -2.8 MiB 1024 self._last_obs = new_obs # type: ignore[assignment]
234 13234.2 MiB 0.0 MiB 1024 self._last_episode_starts = dones
235
236 13234.2 MiB 0.0 MiB 2 with th.no_grad():
237 # Compute value for the last timestep
238 13234.2 MiB 0.0 MiB 1 values = self.policy.predict_values(obs_as_tensor(new_obs, self.device)) # type: ignore[arg-type]
239
240 13234.2 MiB 0.0 MiB 1 rollout_buffer.compute_returns_and_advantage(last_values=values, dones=dones)
241
242 13234.2 MiB 0.0 MiB 1 callback.update_locals(locals())
243
244 13234.2 MiB 0.0 MiB 1 callback.on_rollout_end()
245
246 13234.2 MiB 0.0 MiB 1 return True

I profiled the memory usage of the code. I guess the buffer needs to be reset somewhere, but it's not done?
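For reference, the line-by-line report above looks like output from the memory_profiler package. Below is only a minimal sketch of how such a report can be produced, assuming memory_profiler is installed; monkey-patching collect_rollouts here is purely for illustration (the same effect is achieved by adding @profile to the method in the source):

from memory_profiler import profile

from stable_baselines3 import PPO
from stable_baselines3.common.on_policy_algorithm import OnPolicyAlgorithm

# Wrap the method of interest with memory_profiler's decorator.
OnPolicyAlgorithm.collect_rollouts = profile(OnPolicyAlgorithm.collect_rollouts)

model = PPO("MlpPolicy", "CartPole-v1", n_steps=1024)
model.learn(total_timesteps=1024)  # the per-line memory report is printed when the function returns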
@NickLucche I think the problem is not SB3; I think it's that Python does not free the memory in the loop. In your loop, you train 20 different models? In mine I run 20 loops and load the last save:

for train_rounds in range(20):
    if exists(model_savestate):
        model = PPO.load(model_savestate)
    else:
        model = PPO()
    model.learn()
    model.save()

This went OOM. I did the following:

import gc

for train_rounds in range(20):
    if exists(model_savestate):
        model = PPO.load(model_savestate)
    else:
        model = PPO()
    model.learn()
    model.save()
    del model     # Dereference old model
    gc.collect()  # Force free memory

Without the […]
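A small sketch (not from the thread) of how one could verify whether memory is actually released between rounds, measuring the process RSS with psutil; the environment name and hyperparameters are placeholders:

import gc
import os

import psutil
from stable_baselines3 import PPO

process = psutil.Process(os.getpid())

for train_round in range(20):
    model = PPO("MlpPolicy", "CartPole-v1")
    model.learn(total_timesteps=25_000)
    model.save("ppo_checkpoint")
    del model     # drop the last reference to the model and its buffers
    gc.collect()  # force collection of unreachable objects
    rss_mib = process.memory_info().rss / 2**20
    print(f"round {train_round}: RSS = {rss_mib:.1f} MiB")

If the RSS keeps growing even with the explicit del and gc.collect(), the leak is likely held by a reference outside the loop rather than by delayed garbage collection.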
🐛 Bug
Hey, thanks a lot for your work!
I am trying to debug an apparent memory leak/higher memory usage when running the training code multiple times, but I can't pinpoint its cause.
I've boiled down my problem to the snippet below. Basically, when starting sequential training runs I get higher memory consumption than with a single run, when I would expect all resources to be released after the PPO object is collected. I believe the only real difference in this example is the observation and action space, which mimics my use case.
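As a sanity check on that expectation, a weak reference can show whether the PPO object itself is collected. This is only a minimal sketch, not part of the original report; CartPole-v1 is a stand-in environment:

import gc
import weakref

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1")
model_ref = weakref.ref(model)
model.learn(total_timesteps=10_000)

del model
gc.collect()
print("PPO object collected:", model_ref() is None)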
Single run memory usage: model.learn(total_timesteps=500_000)
Multi run memory usage: model.learn(total_timesteps=25_000), repeated N times. Crashes early due to OOM.

To Reproduce
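The original reproduction snippet is not preserved in this capture; the following is only a sketch of the kind of loop described above, with CartPole-v1 standing in for the reporter's environment (the larger observation/action spaces are not reproduced):

from stable_baselines3 import PPO

# Single run: one long training session.
model = PPO("MlpPolicy", "CartPole-v1")
model.learn(total_timesteps=500_000)

# Multi run: N shorter sessions, each with a freshly created PPO object.
# According to the report, memory grows across iterations until OOM.
for _ in range(20):
    model = PPO("MlpPolicy", "CartPole-v1")
    model.learn(total_timesteps=25_000)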
Relevant log output / Error message
No response
System Info
Checklist