Deepspeed integration #4693
Conversation
I did not carefully examine the difference between the existing trainer and the DeepSpeed one, but it looks like they are almost the same?
Yes, they are very similar. The main differences are that DeepSpeed's model engine handles things like gradient accumulation, gradient norm clipping, and schedulers, so I removed a lot of that functionality and modified the backprop step. I also may need to adjust the checkpointing. This is an MWE, but like I was talking about in the issue thread, I think it could be further optimized by avoiding the direct use of their model engine, which I'll take a look at next.
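For context, here is a rough sketch of the engine setup (the config keys and the exact `deepspeed.initialize` signature below are assumptions that vary by DeepSpeed version, so treat it as a sketch rather than a drop-in snippet); it illustrates why gradient accumulation, clipping, and the LR scheduler live in the engine rather than the trainer:

```python
import deepspeed
import torch

# Minimal sketch, not from this PR: a toy model plus a config dict whose keys
# tell the DeepSpeed engine to own gradient accumulation and clipping.
model = torch.nn.Linear(10, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,  # engine accumulates; trainer does not
    "gradient_clipping": 1.0,          # engine clips; trainer does not
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize returns an engine that wraps forward/backward/step.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```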
You said it's slow because of a bunch of logging and metrics stuff. Do you think the issue is TensorBoard and the AllenNLP metrics code? Or are there other things that I didn't see?
I put in some comments about stuff that might have to change. They are mostly there to confirm or deny the half-truths I know about DeepSpeed.
See my comments in the issue thread for more detail. The slowdown seems to be related to gradient accumulation. The next steps are (1) seeing if the slowdown is reproducible on other machines and (2) confirming the right place in the library for this to go. Once those are both resolved I'm going to simplify the model engine using
@dirkgr I think this is ready to take a look at. Some notes thus far:
This is a great start! Some things are missing before we can consider merging it.
- passing tests
- tests for the new functionality (maybe the same tests we have, just running with a different trainer?)
- code formatting and all that stuff. Are you familiar with `black` and `flake8`?
- left-over debug code
- Maybe we can do something about the massive code duplication? It doesn't have to be a primary goal, but there might be low-hanging fruit. In particular, could you highlight for me where the differences are between `GradientDescentTrainer` and `DeepspeedTrainer`? If they are all in the configuration and initialization, maybe we can come up with a much lighter-weight approach to getting this done.
allennlp/training/__init__.py
```python
# try:
#     from allennlp.training.deepspeed import DeepspeedTrainer
# except ImportError:
#     warnings.warn('Deepspeed plugin not installed. Ignoring.')
```
Is this leftover debug code?
Depends on how we want to include this. Based on my experience, I wouldn't recommend making deepspeed a required dependency. If we're doing the `pip install allennlp[deepspeed]` thing, this could be replaced/updated (not sure offhand how that gets handled, but I can look for some examples).
If you don't mind doing the work to make it optional, then let's make it optional.
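A minimal sketch of the optional-import guard being discussed here (essentially the commented-out lines above, made active):

```python
import warnings

try:
    # Only available when the user installed the deepspeed extra.
    from allennlp.training.deepspeed import DeepspeedTrainer  # noqa: F401
except ImportError:
    warnings.warn("Deepspeed plugin not installed. Ignoring.")
```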
```python
# Model will need a weight file to load;
# not sure if ZeRO stage 2 will mess this up
if not os.path.isfile(model_path):
    torch.save(model_state, model_path)
```
This would be good to know. Have you tried the checkpointing logic? Does it work?
The checkpointing works for saving; it's able to go through the training process E2E, doing the checkpointing and so on. I'm just not sure how model parallelism affects this part, i.e., whether it's saving the entire model state or just the state local to that device. I imagine that this could be validated in a test case.
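A sketch of the kind of validation mentioned above (the helper name and the weight-file path are hypothetical), checking that the file written by the checkpointer really contains the full, non-sharded model state:

```python
import torch

def assert_full_state_saved(model_path: str, fresh_model: torch.nn.Module) -> None:
    # Load the weights the checkpointer wrote on this worker.
    saved_state = torch.load(model_path, map_location="cpu")
    expected = fresh_model.state_dict()
    # If model parallelism or ZeRO sharding leaked into the save, some keys
    # will be missing or have partitioned (smaller) shapes.
    missing = set(expected) - set(saved_state)
    assert not missing, f"Missing parameters in saved state: {sorted(missing)}"
    for name, tensor in expected.items():
        assert saved_state[name].shape == tensor.shape, f"Shape mismatch for {name}"
```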
That seems important. I don't want to make a release claiming that this works and then have it fail in a fairly common use case.
```python
        engine_path = checkpoints[-1]
        return engine_path, model_path, training_state_path

    def restore_checkpoint(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:
```
I didn't make sure, but a lot of these functions look identical to the ones from the regular checkpointer. Can you derive from that one and just override the methods that have differences?
It is derived from the regular checkpointer. I might be able to clean more of this up depending on the above points; if I didn't have to re-load the torch weights and could delegate almost entirely to deepspeed, it would simplify things quite a lot.
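For illustration, a hedged sketch (the class and method names here are hypothetical, not the PR's actual code) of what delegating almost entirely to DeepSpeed's engine checkpoint API could look like:

```python
from typing import Any, Dict, Optional, Tuple

class DeepspeedCheckpointerSketch:
    """Hypothetical: let the DeepSpeed engine own (de)serialization."""

    def __init__(self, serialization_dir: str, model_engine) -> None:
        self._serialization_dir = serialization_dir
        self._model_engine = model_engine

    def save_checkpoint(self, epoch: int, trainer_state: Dict[str, Any]) -> None:
        # The engine writes model/optimizer/engine state under a tag;
        # trainer-specific state rides along as client_state.
        self._model_engine.save_checkpoint(
            self._serialization_dir, tag=f"epoch_{epoch}", client_state=trainer_state
        )

    def restore_checkpoint(self) -> Tuple[Optional[str], Dict[str, Any]]:
        # load_checkpoint returns (path, client_state); path is None if
        # no checkpoint was found.
        path, client_state = self._model_engine.load_checkpoint(self._serialization_dir)
        return path, client_state or {}
```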
How do you find out whether you can do this?
I suspect that DeepSpeed will work best the more stuff we delegate to it.
These are special
Thanks for looking it over! I'll start linting everything and getting the tests up and running (we can probably re-use the existing Trainer tests, yeah). As for the code duplication, the biggest overlaps are:

With Deepspeed, their training looks like:

```python
for batch in data_loader:
    loss = model_engine(**batch)   # <- same as GDTrainer, but different attribute
    model_engine.backward(loss)    # <- different
    model_engine.step()            # <- different
```

The

And then almost all of

I think I'll be able to inherit:

```python
@Trainer.register("deepspeed", constructor="from_partial_objects")
class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)
```

I don't immediately see any problems with that, so if that sounds good that should help reduce duplication too. Overall, the problematic overlap is where I just have to call one thing differently, like
More or less, as far as I understand they're heavily optimized CUDA kernels that help for things like long sequences / are more efficient in general.
```python
class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)
```

I don't understand this. Do you mean you would not be able to do this normal thing?

```python
class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(...):
        super().__init__(...)
```
If it's too difficult to avoid duplicating code, then let's not do it. I looked at the code for
Still working on deduplicating code (and linting). I was able to get a lot reduced (almost the entire constructor) by lying to the

Just so I know explicitly how far I should go: should I just not touch anything at all inside of the

Some nice news is that deepspeed is now pip installable, so it's a lot easier to get everything configured.
I'm fine with small modifications to the regular trainer, but what you're proposing sounds like a bigger deal, so let's hold off on that. We may want to revisit when and if DeepSpeed proves permanently useful.
Got all the typechecks out of the way, phew. I've also managed to cut out a lot of duplicated code, I think! The remainder is almost entirely checkpointing-related. For loading/saving, there's a bit of duplication here and there but nothing overwhelming, and the rest is delegated to deepspeed. The last thing that would be really, really nice to get around would be the
I could be wrong with my limited understanding of DDP, but as far as I can tell, this causes a fatal hang for deepspeed, which also calls

Do you think there's a clean solution to this? That's ~150 LOC duplicated for the removal of 4 lines, which isn't great. Is there a way that this could be delegated to the checkpointer itself, perhaps? Once that's settled, it should just be tests / dependencies left to do.
Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of

I mean, one easy way would just be to set an environment variable

If deepspeed doesn't have some sort of global context (like
Sounds good. I think DeepSpeed might set some environment variables itself, similarly to torch, so I'll poke around to see if we can use one of those. If not, we can just duplicate for now and I'll move on to testing.
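A minimal sketch of the environment-variable idea (the variable name `ALLENNLP_DEEPSPEED` is hypothetical; it would have to be set when the DeepSpeed trainer starts up, unless DeepSpeed turns out to set a usable variable itself):

```python
import os

# Hypothetical flag set by the DeepSpeed trainer (or launcher) at startup.
_DEEPSPEED_ENV_FLAG = "ALLENNLP_DEEPSPEED"

def in_deepspeed_context() -> bool:
    """Return True if this process has been marked as running under DeepSpeed."""
    return os.environ.get(_DEEPSPEED_ENV_FLAG, "0") == "1"
```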
Took a holiday break from this while our cluster was down for maintenance for a bit. Turns out that checkpointing/barrier issue might be more complicated than I thought, but I'm not sure if it's something to do with our cluster (the networking seems buggy).

Outside of that, I have a basic test working (well, when the above works), but it's more complicated than the existing trainer tests because all it's really doing is testing distributed training, which requires
Collecting memory usage often defers to shelling out to

Distributed training has tests. It's tested from the test for the training command, not on the trainer directly: https://github.com/allenai/allennlp/blob/main/tests/commands/train_test.py#L191

You could do the same thing for the DeepSpeed trainer. Just test it end-to-end, not individually.
Ah I think I see the real issue here. It's not the logging itself hanging.
Any subsequent distributed operation, including the

So circling back to the code duplication issue, it's not so much being in a deepspeed context that I need to check for, it's something like:

```python
# trainer.py => _try_train
if self._checkpointer is not None and self._checkpointer.call_on_rank(self._rank):
    ...

# checkpointer
class Checkpointer(Registrable):
    def call_on_rank(self, rank: int) -> bool:
        return rank == 0

# deepspeed checkpointer
class DeepspeedCheckpointer(Checkpointer):
    @overrides
    def call_on_rank(self, rank: int) -> bool:
        return True
```

So if you're open to adding some sort of flag like that to the base checkpointer, that would solve the issue. Or, we could check if it's a deepspeed checkpointer:

```python
if self._checkpointer is not None and (self._master or isinstance(self._checkpointer, DeepspeedCheckpointer)):
```

But that might cause some circular import issues / issues for those who want to install without deepspeed. Either one would let me completely eliminate my override of
Yep, this worked perfectly, thanks. Exact same as
Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?
It makes sense to me even abstractly that something like DeepSpeed can only be accurately tested on a multi-GPU box.
Their typical training loop is something like (source):

```python
# load checkpoint
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
    if step % args.save_interval:
        ckpt_id = loss.item()
        model_engine.save_checkpoint(args.save_dir, ckpt_id)
```

Note that all of these calls are made on every worker, so every worker enters the
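To make the deadlock described here concrete, a sketch (not code from the PR) of what happens if a framework instead guards the save with a rank check while `save_checkpoint` performs collective communication internally:

```python
import torch.distributed as dist

def save_only_on_rank_zero(model_engine, save_dir: str, tag: str) -> None:
    # Anti-pattern sketch: if save_checkpoint() internally gathers partitioned
    # state from all ranks, rank 0 blocks inside it waiting for peers that never
    # call it, while every other rank blocks at the barrier below -- a deadlock.
    if dist.get_rank() == 0:
        model_engine.save_checkpoint(save_dir, tag)
    dist.barrier()
```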
I see. We can always determine our rank, right? So we could just move the rank check into the checkpointer. The regular checkpointer will say
Yeah, that should work perfectly, I'll give it a try.
Hey, is there any news regarding this PR?
@epwalsh is testing it as we speak!
@dirkgr @yanaiela @jacobdanovitch any blockers on this PR? Happy to help answer any DeepSpeed-related questions that might be causing issues here.
We recently integrated FairScale to get the ZeRO optimizer into AllenNLP. It would be interesting to have DeepSpeed as well, since it has more features, but it's no longer quite so pressing. If anyone wants to pick up @jacobdanovitch's work and bring it over the finish line, I'd be happy to work with you.
Draft for #4634. Small change to `allennlp.training.metrics.categorical_accuracy` addresses my comment in #4623. Still very rough, but functional. Example config: gist