
Fix issues with distributed training #80

Merged
8 commits merged into ManifoldRG:master on Apr 20, 2024

Conversation

@eihli (Contributor) commented Feb 13, 2024

Addresses Issue #33

Log to wandb through the accelerator so that we don't send duplicate logs, one from each process. I know we had logging tucked inside a check for accelerator.is_main_process, but that wasn't working for some reason I can't explain at the time I'm writing this message.
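
For reference, roughly what logging through the accelerator looks like; the project name and logged keys below are placeholders, not the ones this repo uses:

import torch
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers(project_name="placeholder-project", config={"lr": 1e-4})

for step in range(10):
    loss = torch.rand(1).item()  # stand-in for the real training loss
    # accelerator.log only emits from the main process, so each metric is sent
    # once instead of once per worker.
    accelerator.log({"train/loss": loss}, step=step)

accelerator.end_training()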

Also fixes an issue where wrapping a customized model in DistributedDataParallel hides access to its custom attributes/methods. You can get at them by reaching through to the original module with model.module.custom_attribute. I haven't deeply investigated the consequences of this. What does the DDP wrapper do? What do you skip by reaching through? If you're just accessing a scalar argument value, like context_length, then I imagine it's safe. But what if you're accessing some custom data loading functionality?
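
A toy illustration of the attribute problem (ToyPolicy is made up for the example; the real class is GatoPolicy):

import torch.nn as nn

class ToyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.context_length = 1024  # custom scalar attribute

    def predict_control(self, x):  # custom method
        return self.linear(x)

# Inside a distributed run the model gets wrapped, roughly:
#   model = torch.nn.parallel.DistributedDataParallel(ToyPolicy())
# After that, model.context_length and model.predict_control raise AttributeError,
# since attribute lookup on the wrapper only falls back to parameters, buffers,
# and submodules, while model.module.context_length and
# model.module.predict_control(x) still work.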

You can see a wandb training run here


@@ -145,7 +153,7 @@ def evaluate(self, model: GatoPolicy, n_iterations, deterministic=True, promptle
 # trim to context length
 input_dict[self.obs_str] = input_dict[self.obs_str][-context_timesteps:,]
 input_dict[self.action_str] = input_dict[self.action_str][-context_timesteps:,]
-action = model.predict_control(input_dict, task=self, deterministic=deterministic)
+action = model.module.predict_control(input_dict, task=self, deterministic=deterministic)
@eihli (Contributor, Author) commented:
Having a hard time thinking of a good way to handle this. When the training is launched with accelerate launch train.py, then you need to access model.module.predict... But when it's launched with python train.py, then you need just model.predict.... I'd hate to have if conditionals all over the place.
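
Concretely, the branching that would otherwise creep into every call site (just a sketch, reusing the call from the diff above):

if hasattr(model, "module"):  # accelerate launch: model is wrapped in DDP
    action = model.module.predict_control(input_dict, task=self, deterministic=deterministic)
else:                         # plain python train.py: model is the bare GatoPolicy
    action = model.predict_control(input_dict, task=self, deterministic=deterministic)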

@eihli (Contributor, Author) commented:

Oh! Clearly! https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

Attributes of the wrapped module

After wrapping a Module with DataParallel, the attributes of the module (e.g. custom methods) became inaccessible. This is because DataParallel defines a few new members, and allowing other attributes might lead to clashes in their names. For those who still want to access the attributes, a workaround is to use a subclass of DataParallel as below.

import torch.nn as nn

class MyDataParallel(nn.DataParallel):
    def __getattr__(self, name):
        # Fall back to the wrapped module's attributes when the
        # DataParallel wrapper itself doesn't define the name.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)

@eihli (Contributor, Author) commented:

Scratch that. Now I remember. We're getting the DistributedDataParallel wrapper by way of Hugging Face's Accelerate library, so we'd need to make the change there. Not quite so clear.

@eihli (Contributor, Author) commented:

Oh. Here we go:

import torch.nn as nn

class GatoPolicy(nn.Module):
    # ...
    @property
    def module(self):
        # Mirror the .module attribute that DDP exposes, so callers can always
        # write model.module whether or not the policy is wrapped.
        return self

Add that and just use model.module everywhere. That ought to work for both Accelerated runs and non-distributed runs.
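
A quick runnable check of the idea; Policy below is just a stand-in for GatoPolicy, and nn.DataParallel stands in for the wrapper accelerate applies, since both expose a .module attribute:

import torch.nn as nn

# Minimal stand-in for GatoPolicy with the proposed property (illustration only).
class Policy(nn.Module):
    @property
    def module(self):
        return self

    def predict_control(self, x):
        return x

policy = Policy()
assert policy.module is policy   # non-distributed run: the property returns self

wrapped = nn.DataParallel(policy)
assert wrapped.module is policy  # the wrapper's own .module attribute wins

# Either way, model.module.predict_control(...) reaches the underlying method.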

@bhavul (Contributor) commented Apr 20, 2024

Good job, thanks @eihli for fixing this.

@bhavul merged commit 71d1a9d into ManifoldRG:master on Apr 20, 2024