Deepspeed integration #4693
Conversation
I did not carefully examine the difference between the existing trainer and the DeepSpeed one, but it looks like they are almost the same?
Yes, they are very similar. The main differences are that DeepSpeed's model engine handles things like gradient accumulation, gradient norm clipping, and schedulers, so I removed a lot of that functionality and modified the backprop step. I also may need to adjust the checkpointing. This is an MWE, but like I was talking about in the issue thread, I think it could be further optimized by avoiding the direct use of their model engine, which I'll take a look at next.
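For context, here is a rough sketch of the engine setup (the config keys and the exact `deepspeed.initialize` signature below are assumptions that vary by DeepSpeed version, so treat it as a sketch rather than a drop-in snippet); it illustrates why gradient accumulation, clipping, and the LR scheduler live in the engine rather than the trainer:

```python
import deepspeed
import torch

# Minimal sketch, not from this PR: a toy model plus a config dict whose keys
# tell the DeepSpeed engine to own gradient accumulation and clipping.
model = torch.nn.Linear(10, 2)
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,  # engine accumulates; trainer does not
    "gradient_clipping": 1.0,          # engine clips; trainer does not
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize returns an engine that wraps forward/backward/step.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```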
You said it's slow because of a bunch of logging and metrics stuff. Do you think the issue is TensorBoard and the AllenNLP metrics code? Or are there other things that I didn't see?
I put in some comments about stuff that might have to change. They are mostly there to confirm or deny the half-truths I know about DeepSpeed.
See my comments in the issue thread for more detail. The slowdown seems to be related to gradient accumulation. The next steps are (1) seeing if the slowdown is reproducible on other machines and (2) confirming the right place in the library for this to go. Once those are both resolved I'm going to simplify the model engine using
@dirkgr I think this is ready to take a look at. Some notes thus far:
This is a great start! Some things are missing before we can consider merging it.
- passing tests
- tests for the new functionality (maybe the same tests we have, just running with a different trainer?)
- code formatting and all that stuff. Are you familiar with `black` and `flake8`?
- left-over debug code
- Maybe we can do something about the massive code duplication? It doesn't have to be a primary goal, but there might be low-hanging fruit. In particular, could you highlight for me where the differences are between `GradientDescentTrainer` and `DeepspeedTrainer`? If they are all in the configuration and initialization, maybe we can come up with a much lighter-weight approach to getting this done.
allennlp/training/__init__.py
```python
# try:
#     from allennlp.training.deepspeed import DeepspeedTrainer
# except ImportError:
#     warnings.warn('Deepspeed plugin not installed. Ignoring.')
```
Is this leftover debug code?
Depends on how we want to include this. Based on my experience, I wouldn't recommend making deepspeed a required dependency. If we're doing the `pip install allennlp[deepspeed]` thing, this could be replaced/updated (not sure offhand how that gets handled, but I can look for some examples).
If you don't mind doing the work to make it optional, then let's make it optional.
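A minimal sketch of the optional-import guard being discussed here (essentially the commented-out lines above, made active):

```python
import warnings

try:
    # Only available when the user installed the deepspeed extra.
    from allennlp.training.deepspeed import DeepspeedTrainer  # noqa: F401
except ImportError:
    warnings.warn("Deepspeed plugin not installed. Ignoring.")
```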
```python
# Model will need a weight file to load;
# not sure if ZeRO stage 2 will mess this up
if not os.path.isfile(model_path):
    torch.save(model_state, model_path)
```
This would be good to know. Have you tried the checkpointing logic? Does it work?
The checkpointing works for saving; it's able to go through the training process E2E, doing the checkpointing and so on. I'm just not sure how model parallelism affects this part, i.e., whether it's saving the entire model state or just the state local to that device. I imagine that this could be validated in a test case.
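A sketch of the kind of validation mentioned above (the helper name and the weight-file path are hypothetical), checking that the file written by the checkpointer really contains the full, non-sharded model state:

```python
import torch

def assert_full_state_saved(model_path: str, fresh_model: torch.nn.Module) -> None:
    # Load the weights the checkpointer wrote on this worker.
    saved_state = torch.load(model_path, map_location="cpu")
    expected = fresh_model.state_dict()
    # If model parallelism or ZeRO sharding leaked into the save, some keys
    # will be missing or have partitioned (smaller) shapes.
    missing = set(expected) - set(saved_state)
    assert not missing, f"Missing parameters in saved state: {sorted(missing)}"
    for name, tensor in expected.items():
        assert saved_state[name].shape == tensor.shape, f"Shape mismatch for {name}"
```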
That seems important. I don't want to make a release claiming that this works and then have it fail in a fairly common use case.
```python
        engine_path = checkpoints[-1]
        return engine_path, model_path, training_state_path

    def restore_checkpoint(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:
```
I didn't make sure, but a lot of these functions look identical to the ones from the regular checkpointer. Can you derive from that one and just override the methods that have differences?
It is derived from the regular checkpointer. I might be able to clean more of this up depending on the above points; if I didn't have to re-load the torch weights and could delegate almost entirely to deepspeed, it would simplify things quite a lot.
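For illustration, a hedged sketch (the class and method names here are hypothetical, not the PR's actual code) of what delegating almost entirely to DeepSpeed's engine checkpoint API could look like:

```python
from typing import Any, Dict, Optional, Tuple

class DeepspeedCheckpointerSketch:
    """Hypothetical: let the DeepSpeed engine own (de)serialization."""

    def __init__(self, serialization_dir: str, model_engine) -> None:
        self._serialization_dir = serialization_dir
        self._model_engine = model_engine

    def save_checkpoint(self, epoch: int, trainer_state: Dict[str, Any]) -> None:
        # The engine writes model/optimizer/engine state under a tag;
        # trainer-specific state rides along as client_state.
        self._model_engine.save_checkpoint(
            self._serialization_dir, tag=f"epoch_{epoch}", client_state=trainer_state
        )

    def restore_checkpoint(self) -> Tuple[Optional[str], Dict[str, Any]]:
        # load_checkpoint returns (path, client_state); path is None if
        # no checkpoint was found.
        path, client_state = self._model_engine.load_checkpoint(self._serialization_dir)
        return path, client_state or {}
```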
How do you find out whether you can do this?
I suspect that DeepSpeed will work best the more stuff we delegate to it.
These are special
Thanks for looking it over! I'll start linting everything and getting the tests up and running (we can probably re-use the existing Trainer tests, yeah). As for the code duplication, the biggest overlaps are:

With Deepspeed, their training looks like:

```python
for batch in data_loader:
    loss = model_engine(**batch)   # <- same as GDTrainer, but different attribute
    model_engine.backward(loss)    # <- different
    model_engine.step()            # <- different
```

The

And then almost all of

I think I'll be able to inherit:

```python
@Trainer.register("deepspeed", constructor="from_partial_objects")
class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)
```

I don't immediately see any problems with that, so if that sounds good that should help reduce duplication too. Overall, the problematic overlap is where I just have to call one thing differently, like
More or less, as far as I understand they're heavily optimized CUDA kernels that help for things like long sequences / are more efficient in general.
```python
class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)
```

I don't understand this. Do you mean you would not be able to do this normal thing?

```python
class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(...):
        super().__init__(...)
```
If it's too difficult to avoid duplicating code, then let's not do it. I looked at the code for
Still working on deduplicating code (and linting). I was able to get a lot reduced (almost the entire constructor) by lying to the

Just so I know explicitly how far I should go: should I just not touch anything at all inside of the

Some nice news is that deepspeed is now pip installable, so it's a lot easier to get everything configured.
I'm fine with small modifications to the regular trainer, but what you're proposing sounds like a bigger deal, so let's hold off on that. We may want to revisit when and if DeepSpeed proves permanently useful.
Got all the typechecks out of the way, phew. I've also managed to cut out a lot of duplicated code, I think! The remainder is almost entirely checkpointing-related. For loading/saving, there's a bit of duplication here and there but nothing overwhelming, and the rest is delegated to deepspeed. The last thing that would be really, really nice to get around would be the
I could be wrong with my limited understanding of DDP, but as far as I can tell, this causes a fatal hang for deepspeed, which also calls

Do you think there's a clean solution to this? That's ~150 LOC duplicated for the removal of 4 lines, which isn't great. Is there a way that this could be delegated to the checkpointer itself, perhaps? Once that's settled, it should just be tests / dependencies left to do.
Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of

I mean, one easy way would just be to set an environment variable

If deepspeed doesn't have some sort of global context (like
Sounds good. I think DeepSpeed might set some environment variables itself, similarly to torch, so I'll poke around to see if we can use one of those. If not, we can just duplicate for now and I'll move on to testing.
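A minimal sketch of the environment-variable idea (the variable name `ALLENNLP_DEEPSPEED` is hypothetical; it would have to be set when the DeepSpeed trainer starts up, unless DeepSpeed turns out to set a usable variable itself):

```python
import os

# Hypothetical flag set by the DeepSpeed trainer (or launcher) at startup.
_DEEPSPEED_ENV_FLAG = "ALLENNLP_DEEPSPEED"

def in_deepspeed_context() -> bool:
    """Return True if this process has been marked as running under DeepSpeed."""
    return os.environ.get(_DEEPSPEED_ENV_FLAG, "0") == "1"
```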
Took a holiday break from this while our cluster was down for maintenance for a bit. Turns out that checkpointing/barrier issue might be more complicated than I thought, but I'm not sure if it's something to do with our cluster (the networking seems buggy).

Outside of that, I have a basic test working (well, when the above works), but it's more complicated than the existing trainer tests because all it's really doing is testing distributed training, which requires
Collecting memory usage often defers to shelling out to

Distributed training has tests. It's tested from the test for the training command, not on the trainer directly: https://github.com/allenai/allennlp/blob/main/tests/commands/train_test.py#L191

You could do the same thing for the DeepSpeed trainer. Just test it end-to-end, not individually.
Ah I think I see the real issue here. It's not the logging itself hanging.
Any subsequent distributed operation, including the

So circling back to the code duplication issue, it's not so much being in a deepspeed context that I need to check for, it's something like:

```python
# trainer.py => _try_train
if self._checkpointer is not None and self._checkpointer.call_on_rank(self._rank):
    ...

# checkpointer
class Checkpointer(Registrable):
    def call_on_rank(self, rank: int) -> bool:
        return rank == 0

# deepspeed checkpointer
class DeepspeedCheckpointer(Checkpointer):
    @overrides
    def call_on_rank(self, rank: int) -> bool:
        return True
```

So if you're open to adding some sort of flag like that to the base checkpointer, that would solve the issue. Or, we could check if it's a deepspeed checkpointer:

```python
if self._checkpointer is not None and (self._master or isinstance(self._checkpointer, DeepspeedCheckpointer)):
```

But that might cause some circular import issues / issues for those who want to install without deepspeed. Either one would let me completely eliminate my override of
Yep, this worked perfectly, thanks. Exact same as
Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?
It makes sense to me even abstractly that something like DeepSpeed can only be accurately tested on a multi-GPU box.
Their typical training loop is something like (source):

```python
# load checkpoint
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()
    if step % args.save_interval:
        ckpt_id = loss.item()
        model_engine.save_checkpoint(args.save_dir, ckpt_id)
```

Note that all of these calls are made on every worker, so every worker enters the
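To make the deadlock described here concrete, a sketch (not code from the PR) of what happens if a framework instead guards the save with a rank check while `save_checkpoint` performs collective communication internally:

```python
import torch.distributed as dist

def save_only_on_rank_zero(model_engine, save_dir: str, tag: str) -> None:
    # Anti-pattern sketch: if save_checkpoint() internally gathers partitioned
    # state from all ranks, rank 0 blocks inside it waiting for peers that never
    # call it, while every other rank blocks at the barrier below -- a deadlock.
    if dist.get_rank() == 0:
        model_engine.save_checkpoint(save_dir, tag)
    dist.barrier()
```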
I see. We can always determine our rank, right? So we could just move the rank check into the checkpointer. The regular checkpointer will say
Yeah, that should work perfectly, I'll give it a try.
Hey, is there any news regarding this PR?
@epwalsh is testing it as we speak!
@dirkgr @yanaiela @jacobdanovitch any blockers on this PR? Happy to help answer any DeepSpeed-related questions that might be causing issues here.
We recently integrated FairScale to get the ZeRO optimizer into AllenNLP. It would be interesting to have DeepSpeed as well, since it has more features, but it's no longer quite so pressing. If anyone wants to pick up @jacobdanovitch's work and bring it over the finish line, I'd be happy to work with you.
Draft for #4634. Small change to `allennlp.training.metrics.categorical_accuracy` addresses my comment in #4623. Still very rough, but functional. Example config: gist