[Question] fps drops significantly over time #54

oxkitsune · 2024-09-23T13:47:00Z

Question

I'm training a simple PPO agent in a MuJoCo Gym env, but I've noticed that the FPS goes down significantly over the course of training. It keeps dropping even after the updates have started.

I'm wondering what could be the cause of this?
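One way to check whether the slowdown is real or an artifact of averaging is to compare a windowed FPS with a cumulative one. The sketch below assumes the logged FPS is a running average since the start of training (which, as far as I understand, is how SB3 computes `time/fps`). The function names and the sample values are hypothetical and only illustrate the arithmetic.

```python
from typing import List, Tuple

# Each sample is (wall_clock_seconds, total_timesteps), recorded periodically.
Sample = Tuple[float, int]


def windowed_fps(samples: List[Sample]) -> List[float]:
    """FPS between consecutive samples (reflects the current speed)."""
    fps = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        dt = t1 - t0
        fps.append((s1 - s0) / dt if dt > 0 else float("nan"))
    return fps


def cumulative_fps(samples: List[Sample]) -> List[float]:
    """FPS averaged since the first sample (a running average)."""
    t_start, s_start = samples[0]
    return [(s - s_start) / (t - t_start) for t, s in samples[1:] if t > t_start]


# Made-up samples: every 1000 steps takes twice as long as the previous 1000.
samples = [(0.0, 0), (1.0, 1000), (3.0, 2000), (7.0, 3000)]
print(windowed_fps(samples))    # per-window speed
print(cumulative_fps(samples))  # running average, which lags behind
```

If the windowed value keeps falling, the environment or the update step is genuinely getting slower; if only the cumulative value drifts, the drop is mostly an averaging effect.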

Additional context

I'm training on GPU, although running on CPU yields similar results.
I'm training with the following parameters:
- n_envs: 1
- n_epochs: 10
- n_steps: 1024
I've read [Question] Why does FPS tend to decrease during training? DLR-RM/stable-baselines3#1597, but I don't feel like any of the comments there apply to my case.
Dependency versions:
- Python 3.11
- gymnasium 1.0 (master branch)
- sbx 0.15
- sb3 2.3.2

The following charts show CPU usage and FPS over time. This particular run took 17h on a machine with:

cpu: 13th Gen Intel i9-13900KF (32) @ 6.000GHz
gpu: NVIDIA GeForce RTX 4090

Checklist

I have read the documentation (required)
I have checked that there is no similar issue in the repo (required)


araffin · 2024-10-07T14:04:14Z

What MuJoCo version do you use?
SB3 is only fully compatible with Gymnasium 0.29.1 for now.

Do you have the same issue with a built-in MuJoCo env (for instance HalfCheetah-v4)?

I haven't experienced that so far. I might do some runs and log them in the Open RL Benchmark later.

oxkitsune · 2024-10-26T13:53:09Z

Yes, it happens for the built-in MuJoCo envs as well. I think I have narrowed it down to my use of SubprocVecEnv instead of DummyVecEnv: SubprocVecEnv seems to leak memory (or something similar), which gradually fills up the memory on my machine.
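To back up a suspected leak with numbers, the resident memory of the main process and of each worker process can be sampled periodically. This is a minimal, Linux-only sketch that reads `VmRSS` from `/proc/<pid>/status`; the helper names are hypothetical and it is not part of SB3 or SBX.

```python
import os
import re
from typing import Optional


def parse_vm_rss_kib(status_text: str) -> Optional[int]:
    """Extract VmRSS (in kiB) from the contents of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def rss_kib(pid: int) -> Optional[int]:
    """Resident set size of a process in kiB, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status", encoding="utf-8") as f:
            return parse_vm_rss_kib(f.read())
    except OSError:
        # Not on Linux, or the process has already exited.
        return None


if __name__ == "__main__":
    # Sample the current process; worker PIDs can be passed in the same way.
    print(f"pid={os.getpid()} rss={rss_kib(os.getpid())} kiB")
```

Calling `rss_kib` every few thousand steps for the training process and for the worker PIDs (listed, for example, with `pgrep -P <main pid>`) would show which process is growing, if any.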

araffin · 2024-10-30T08:02:39Z

I'm doing many runs on MuJoCo envs and I cannot see the effect you describe so far: https://wandb.ai/openrlbenchmark/sbx/runs/99wrpkc7?nw=nwuseraraffin (other runs are available in https://wandb.ai/openrlbenchmark/sbx/).

To reproduce, use the train script from the readme (using RL Zoo) and:

python train.py --algo ppo --env HalfCheetah-v4 --seed 3831217417 --eval-freq 25000 --verbose 0 --n-eval-envs 5 --eval-episodes 20 --log-interval 100 -c hyperparams/ppo.py -P --track

Note: I'm running with JAX_PLATFORMS=cpu CUDA_VISIBLE_DEVICES= (CPU only).

The hyperparameter file is simply:

# Default hyperparameters for SB3 are tuned for MuJoCo
default_hyperparams = dict(
    n_envs=1,
    n_timesteps=int(1e6),
    policy="MlpPolicy",
    policy_kwargs={},
    normalize=True,
)

hyperparams = {}

for env_id in [
    "HalfCheetah-v4",
    "Humanoid-v4",
    "HalfCheetahBulletEnv-v0",
    "Ant-v4",
    "Hopper-v4",
    "Walker2d-v4",
    "Swimmer-v4",
]:
    hyperparams[env_id] = default_hyperparams

EDIT: I'll try with multiple envs later

araffin · 2024-10-30T08:07:36Z

Note: when using multiple envs, you should probably adjust n_steps so that the number of transitions collected per update (n_envs * n_steps) stays constant.

For instance:

JAX_PLATFORMS=cpu CUDA_VISIBLE_DEVICES= python train.py --algo ppo \
--env HalfCheetah-v4 -P -param n_envs:4 n_steps:512 --vec-env subproc \
--verbose 0 --eval-freq -1 -c hyperparams/ppo.py

This uses 4 envs instead of 1: the default n_steps is 2048, so 2048 / 4 = 512 keeps n_envs * n_steps unchanged.
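The same arithmetic for other numbers of envs can be sketched as follows. The helper name `n_steps_per_env` is hypothetical; it only divides the single-env rollout length by the number of envs and refuses values that do not divide evenly.

```python
def n_steps_per_env(total_rollout_size: int, n_envs: int) -> int:
    """Return n_steps such that n_envs * n_steps == total_rollout_size."""
    if n_envs <= 0:
        raise ValueError("n_envs must be positive")
    if total_rollout_size % n_envs != 0:
        raise ValueError(
            f"{total_rollout_size} is not divisible by n_envs={n_envs}"
        )
    return total_rollout_size // n_envs


# 2048 is the default n_steps mentioned above for a single env.
for n_envs in (1, 2, 4, 8):
    print(f"n_envs={n_envs} -> n_steps={n_steps_per_env(2048, n_envs)}")
```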

oxkitsune · 2024-11-21T01:16:39Z

It seems there is some sort of (memory?) leak when using SubprocVecEnv on Linux. Switching to DummyVecEnv fixed the issue.

oxkitsune added the "question" label (Further information is requested) · Sep 23, 2024

araffin mentioned this issue Oct 30, 2024

[Question] Try to use sbx on meltingpot env #58
