Parallel collection and evaluation #143

Open
gliese876b opened this issue Nov 18, 2024 · 9 comments

@gliese876b

Can _evaluation_loop use SyncDataCollector for non-vectorized envs so that the evaluation is also parallel?

While running on Melting Pot envs, increasing n_envs_per_worker definitely improves execution time, but the evaluation steps take almost 3 times longer than a regular iteration (I have evaluation_episodes: 10) since the evaluation is sequential.

Making test_env a SerialEnv could solve the issue.
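
For illustration, here is a minimal sketch of that idea (untested against the current BenchMARL code; env_func, policy and max_episode_steps are placeholders for the task factory, the trained actor and the episode length): batch the evaluation episodes into one vectorized test env and run a single rollout, so the episodes step together instead of one after another.

import torch
from torchrl.envs import SerialEnv  # or ParallelEnv

evaluation_episodes = 10
test_env = SerialEnv(evaluation_episodes, env_func)  # env_func: placeholder one-task factory

with torch.no_grad():
    eval_td = test_env.rollout(
        max_steps=max_episode_steps,  # placeholder, e.g. 1000 for Harvest
        policy=policy,                # placeholder for the trained actor
        break_when_any_done=False,    # keep stepping until every sub-env finishes
    )
# eval_td has a leading batch dimension of size evaluation_episodes,
# one sub-env per evaluation episode.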

@matteobettini
Collaborator

matteobettini commented Nov 24, 2024

Hello!

Thanks for opening this and sorry for the delay in answering.
It gives me the chance to talk about something I have been wanting to address.

For vectorized environments, both collection and evaluation are already done on a batch of environments.

In other environments, right now, both collection and evaluation are sequential in the number of environments.

Collection:

SerialEnv(self.config.n_envs_per_worker(self.on_policy), env_func),

Evaluation:
for eval_episode in range(self.config.evaluation_episodes):

Allowing both of these to be switched to parallel has long been on the TODO list: #94

This could be as simple as changing SerialEnv to ParallelEnv, but it also has certain implications that have to be checked.
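
For concreteness, a rough sketch of that change (untested; env_func stands for the per-task factory that is already built for collection):

from torchrl.envs import ParallelEnv, SerialEnv

def make_collection_env(n_envs_per_worker, env_func, use_parallel=False):
    # SerialEnv steps its sub-envs one after another in the caller's process;
    # ParallelEnv launches one worker process per sub-env and steps them together.
    env_cls = ParallelEnv if use_parallel else SerialEnv
    return env_cls(n_envs_per_worker, env_func)

The evaluation loop could similarly be collapsed into a single rollout over a vectorized test env, as sketched in the issue description above.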

This is at the top of the todo list, so I think I will get to it when I have time.

Regarding your specific case: in Melting Pot, changing n_envs_per_worker should not change much, as it will collect sequentially anyway. Maybe the reason evaluation takes so much longer is rendering? Try testing with rendering disabled (it is in the experiment config).
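
(For reference, a hedged sketch of disabling it programmatically; I am assuming the field is called render on ExperimentConfig, so double-check the name in your version.)

from benchmarl.experiment import ExperimentConfig

experiment_config = ExperimentConfig.get_from_yaml()  # load the default experiment config
experiment_config.render = False  # assumed field name; disables rendering during evaluation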

@matteobettini matteobettini pinned this issue Nov 24, 2024
@matteobettini matteobettini changed the title Parallel Evaluation Parallel collection and evaluation Nov 24, 2024
@gliese876b
Author

Hello!

Thanks for the response. It is good to know that the issue is at the top of the todo list.

You are right that the collection and evaluation are done sequentially.

Just to follow your suggestion, I changed SerialEnv to ParallelEnv, but it led to many errors, so I stopped.

Also, I definitely see execution time improvements when I increase n_envs_per_worker from 2 to 20, but I guess that has something to do with the reset method of the Melting Pot envs.

Here is an example run of IQN on the Harvest env with 10 agents, with off_policy_n_envs_per_worker: 20, evaluation_interval: 50_000, evaluation_episodes: 10, off_policy_collected_frames_per_batch: 2000.

0%|          | 0/2500 [00:00<?, ?it/s].../logger.py:100: UserWarning: No episode terminated this iteration and thus the episode rewards will be NaN, this is normal if your horizon is longer then one iteration. Learning is proceeding fine.The episodes will probably terminate in a future iteration.
  warnings.warn(

mean return = nan:   0%|          | 1/2500 [05:48<241:38:32, 348.10s/it].../logger.py:100: UserWarning: No episode terminated this iteration and thus the episode rewards will be NaN, this is normal if your horizon is longer then one iteration. Learning is proceeding fine.The episodes will probably terminate in a future iteration.
  warnings.warn(

mean return = nan:   0%|          | 2/2500 [06:38<120:12:07, 173.23s/it]
mean return = nan:   0%|          | 3/2500 [07:29<81:18:35, 117.23s/it] 
mean return = nan:   0%|          | 4/2500 [08:20<63:04:10, 90.97s/it] 
mean return = nan:   0%|          | 5/2500 [09:11<53:00:30, 76.49s/it]
mean return = nan:   0%|          | 6/2500 [10:02<46:59:21, 67.83s/it]
mean return = nan:   0%|          | 7/2500 [10:52<43:04:41, 62.21s/it]
mean return = nan:   0%|          | 8/2500 [11:43<40:39:06, 58.73s/it]
mean return = nan:   0%|          | 9/2500 [12:36<39:21:00, 56.87s/it]
mean return = -88.84220123291016:   0%|          | 10/2500 [13:58<44:37:06, 64.51s/it]
mean return = nan:   0%|          | 11/2500 [14:49<41:44:54, 60.38s/it]               
mean return = nan:   0%|          | 12/2500 [15:41<39:56:15, 57.79s/it]
mean return = nan:   1%|          | 13/2500 [16:32<38:38:37, 55.94s/it]
mean return = nan:   1%|          | 14/2500 [17:24<37:44:07, 54.65s/it]
mean return = nan:   1%|          | 15/2500 [18:16<37:08:39, 53.81s/it]
mean return = nan:   1%|          | 16/2500 [19:08<36:42:25, 53.20s/it]
mean return = nan:   1%|          | 17/2500 [20:00<36:24:12, 52.78s/it]
mean return = nan:   1%|          | 18/2500 [20:52<36:17:42, 52.64s/it]
mean return = nan:   1%|          | 19/2500 [21:45<36:22:00, 52.77s/it]
mean return = -108.3295669555664:   1%|          | 20/2500 [23:08<42:33:57, 61.79s/it]
mean return = nan:   1%|          | 21/2500 [23:59<40:23:30, 58.66s/it]               
mean return = nan:   1%|          | 22/2500 [24:51<39:01:50, 56.70s/it]
mean return = nan:   1%|          | 23/2500 [25:43<38:02:08, 55.28s/it]
mean return = nan:   1%|          | 24/2500 [26:35<37:23:44, 54.37s/it]
mean return = nan:   1%|          | 25/2500 [32:32<99:44:40, 145.08s/it]  ------------------> evaluation
mean return = nan:   1%|          | 26/2500 [33:23<80:13:32, 116.74s/it]
mean return = nan:   1%|          | 27/2500 [34:13<66:28:06, 96.76s/it] 
mean return = nan:   1%|          | 28/2500 [35:04<57:02:45, 83.08s/it]
mean return = nan:   1%|          | 29/2500 [35:55<50:19:59, 73.33s/it]
mean return = -130.4561309814453:   1%|          | 30/2500 [37:15<51:42:54, 75.37s/it]

There is also an increase in execution time when episodes end. I guess that, in the end, it cancels out the improvement on regular iterations.

@matteobettini
Collaborator

Just to follow your suggestion, I changed SerialEnv to ParallelEnv, but it led to many errors, so I stopped.

Ok, that is what I was afraid of. In theory they should be interchangeable, but in practice they are my first cause of migraines (hence why we only have serial for now). When I gather some courage I'll look into it.

Regarding the other part of the message: is there anything different from what you expected / anything I can help with?

@gliese876b
Author

Nope. Thanks for the quick responses.
Good luck!

@matteobettini
Collaborator

I'll just keep this open until the feature lands.

@matteobettini matteobettini reopened this Nov 25, 2024
@gliese876b
Author

I revisited the issue and you were right that switching from SerialEnv to ParallelEnv works!

Apparently, the problem was with how I pass some env config params to the env creator function. I guess ParallelEnv does not copy the task config the way SerialEnv does. I changed the way I pass the args, removed the hydra option, and now it works.
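
For anyone hitting the same thing, a minimal sketch of the kind of change that worked (names here are illustrative, not the actual BenchMARL code): pass the task parameters to the env factory as plain Python data, for example via functools.partial, so the callable carries everything it needs into each ParallelEnv worker process instead of relying on a captured Hydra config.

from functools import partial
from torchrl.envs import ParallelEnv

def make_task_env(task_params):
    # Illustrative factory: build the env inside the worker process,
    # using only the plain-Python arguments passed in.
    from torchrl.envs.libs.meltingpot import MeltingpotEnv
    return MeltingpotEnv(task_params["substrate"])

# Each ParallelEnv worker calls env_func in its own process, so the factory
# must carry everything it needs explicitly.
env_func = partial(make_task_env, {"substrate": "commons_harvest__open"})
collection_env = ParallelEnv(20, env_func)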

@matteobettini
Collaborator

Nice! Would you be able to share your solution in a PR? Also, maybe you could open an issue in torchrl outlining where the serial and parallel envs differ in a way you did not expect.

@gliese876b
Author

gliese876b commented Dec 1, 2024

Well, in terms of collection time, ParallelEnv improves things a lot.

However, after checking the results, I can see a big change in learning performance. I ran some more tests with the config below (on IQL), changing only SerialEnv vs. ParallelEnv, and somehow learning is very poor when I use ParallelEnv.

off_policy_collected_frames_per_batch: 2000
off_policy_n_envs_per_worker: 20
off_policy_n_optimizer_steps: 20
off_policy_train_batch_size: 128
off_policy_memory_size: 20000
off_policy_init_random_frames: 0

I thought the only difference was that SerialEnv steps the 20 envs in sequence whereas ParallelEnv steps them in separate processes. Note that an episode ends only after 1000 steps are taken.

I am not sure whether this originates from Melting Pot and is due to asynchronous collection from the envs.

@matteobettini
Collaborator

matteobettini commented Dec 1, 2024

Oh no, that does not sound good. I feared something like this. I'll need to take a look.

We need to identify where this deviation first occurs.

Maybe the first approach would be to test with a non-learned deterministic policy and see if the result differs between the two envs.
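
A minimal sketch of that check (untested; env_func is a placeholder for the Melting Pot task factory, and the ("agents", ...) keys are assumptions, so read the real keys off the env specs): roll out the same constant all-zeros policy in a SerialEnv and a ParallelEnv and compare the returns.

from torchrl.envs import ParallelEnv, SerialEnv

def make_constant_policy(env):
    zero_action = env.full_action_spec.zero()  # fixed "no-op" action for every agent
    def policy(td):
        # Write the same constant action at every step, so any difference
        # between the two rollouts comes from the envs rather than the policy.
        return td.update(zero_action.clone())
    return policy

for env_cls in (SerialEnv, ParallelEnv):
    env = env_cls(2, env_func)  # env_func: placeholder task factory
    env.set_seed(0)
    td = env.rollout(
        max_steps=1000,
        policy=make_constant_policy(env),
        break_when_any_done=False,
    )
    # Reward key is an assumption; adjust it to your env's group/reward keys.
    returns = td.get(("next", "agents", "reward")).sum(dim=1)  # sum over time
    print(env_cls.__name__, returns.mean().item())
    env.close()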
