Skip to content
This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Deepspeed integration #4693

Open
wants to merge 28 commits into
base: main
Choose a base branch
from

Conversation

jacobdanovitch
Copy link
Contributor

Draft for #4634 . Small change to allennlp.training.metrics.categorical_accuracy addresses my comment in #4623 . Still very rough, but functional.

Example config: gist

@dirkgr
Copy link
Member

dirkgr commented Oct 3, 2020

I did not carefully examine the difference between the existing trainer and the deep speed one, but it looks like they are almost the same?

@jacobdanovitch
Copy link
Contributor Author

I did not carefully examine the difference between the existing trainer and the deep speed one, but it looks like they are almost the same?

Yes, they are very similar. The main differences are that deepspeed's model engine handles things like gradient accumulation, norm, clipping, schedulers, so I removed a lot of that functionality, and modified the backprop step. I also may need to adjust the checkpointing.

This is an MWE, but like I was talking about in the issue thread, I think it could be further optimized by avoiding the direct use of their model engine, which I'll take a look at next.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said it's slow because of a bunch of logging and metrics stuff. Do you think the issue is tensor board and the AllenNLP metrics code? Or are there other things that I didn't see?

I put in some comments about stuff that might have to change. They are mostly there to confirm or deny the half-truths I know about DeepSpeed.

allennlp/training/deepspeed_trainer.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed_trainer.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed_trainer.py Outdated Show resolved Hide resolved
@jacobdanovitch
Copy link
Contributor Author

See my comments in the issue thread for more detail. The slowdown seems to be related to gradient accumulation. The next steps are (1) seeing if the slowdown is reproducible on other machines and (2) confirming the right place in the library for this to go. Once those are both resolved I'm going to simplify the model engine using Registrables and then this should be ~reviewable.

@jacobdanovitch jacobdanovitch marked this pull request as ready for review November 5, 2020 18:27
@jacobdanovitch
Copy link
Contributor Author

@dirkgr I think this is ready to take a look at. Some notes thus far:

  • Deepspeed is heavily config-based and it's hard to avoid, so rather than fighting it, I just tried to make the config itself registrable. So everything can be handled from within one config file.
  • That said, some things (like optimizers) can be instantiated directly, so I've made it possible to either pass an optimizer to the trainer as normal, or pass an optimizer config to deepspeed. If it's a bit confusing to have both ways, it'll be easy to pick one and stick with it.
  • There's a lot of code overlap with the GradientDescent trainer, still. I don't know if this is avoidable without breaking the trainer into various hooks like Lightning does.
  • I've only worked on the optimization/scaling-related features, I haven't looked at Sparse Attention / the transformer kernel yet (in allennlp or in general). I can try to add these things or just wait until the existing stuff is done with.

Copy link
Member

@dirkgr dirkgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! Some things are missing before we can consider merging it.

  • passing tests
  • tests for the new functionality (maybe the same tests we have, just running with a different trainer?)
  • code formatting and all that stuff. Are you familiar with black and flake8?
  • left-over debug code
  • Maybe we can do something about the massive code duplication? It doesn't have to be a primary goal, but there might be low-hanging fruit. In particular, could you highlight for me where the differences are between GradentDescentTrainer and DeepspeedTrainer? If they are all in the configuration and initialization, maybe we can come up with a much lighter-weight approach to getting this done.

# try:
# from allennlp.training.deepspeed import DeepspeedTrainer
# except ImportError:
# warnings.warn('Deepspeed plugin not installed. Ignoring.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this leftover debug code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on how we want to include this. Based on my experience, I wouldn't recommend making deepspeed a required dependency. If we're doing the pip install allennlp[deepspeed] thing, this could be replaced/updated (not sure offhand how that gets handled but I can look for some examples).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't mind doing the work making it optional, then let's make it optional.

allennlp/training/deepspeed/checkpointer.py Outdated Show resolved Hide resolved
# Model will need a weight file to load;
# not sure if ZeRO stage 2 will mess this up
if not os.path.isfile(model_path):
torch.save(model_state, model_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be good to know. Have you tried the checkpointing logic? Does it work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checkpointing works for saving; it's able to go through the training process E2E, doing the checkpointing and so on. I'm just not sure how model-parallel affects this part, if it's saving the entire model state or just the state local to that device. I imagine that this could be validated in a test case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems important. I don't want to make a release claiming that this works and then it doesn't in a fairly common use case.

allennlp/training/deepspeed/checkpointer.py Show resolved Hide resolved
engine_path = checkpoints[-1]
return engine_path, model_path, training_state_path

def restore_checkpoint(self) -> Tuple[Dict[str, Any], Dict[str, Any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't make sure, but a lot of these functions look identical to the ones from the regular checkpointer. Can you derive from that one and just override the methods that have differences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is derived from the regular checkpointer. I might be able to clean more of this up depending on the above points; if I didn't have to re-load the torch weights and could delegate almost entirely to deepspeed, it would simplify things quite a lot.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you find out whether you can do this?

I suspect that deep speed will work best the more stuff we delegate to it.

allennlp/training/deepspeed/trainer.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed/trainer.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed/trainer.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed/utils.py Outdated Show resolved Hide resolved
allennlp/training/deepspeed/trainer.py Show resolved Hide resolved
@dirkgr
Copy link
Member

dirkgr commented Nov 5, 2020

I haven't looked at Sparse Attention / the transformer kernel yet (in allennlp or in general).

These are special nn.Modules that work particularly well with DeepSpeed?

@jacobdanovitch
Copy link
Contributor Author

Thanks for looking it over! I'll start linting everything and getting the tests up and running (we can probably re-use the existing Trainer tests, yeah).

As for the code duplication, the biggest overlaps are:

  • Checkpointing (I don't think this will be too duplicated when all is said and done)
  • Certain parts of _train_epoch
  • Almost all of _validation_loss

With Deepspeed, their training looks like:

for batch in data_loader:
    loss = model_engine(**batch) # <- same as GDTrainer, but different attribute
    model_engine.backward(loss) # <- different
    model_engine.step() # <- different

The model_engine handles gradient accumulation, learning rate scheduling, optimization, etc, so these specific parts end up different but then the rest is all the same. That includes stuff in _train_epoch like:

  • CPU/GPU memory usage
  • Regularization penalty
  • Tensorboard logging
  • Metrics / progress bar stuff

And then almost all of _validation_loss is duplicated, I'm pretty sure the only change I made was model_engine.backwards(loss).

I think I'll be able to inherit train directly as well if I inherit from the GD trainer, but I wouldn't be able to use its constructor, so I'd have to do something like:

@Trainer.register("deepspeed", constructor="from_partial_objects")
class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)

I don't immediately see any problems with that, so if that sounds good that should help reduce duplication too. Overall, the problematic overlap is where I just have to call one thing differently, like model_engine.backwards(loss). Lightning's hooks pattern helps reduce this a bit, but that would be a bit more of a refactor than just changing a few lines.

@jacobdanovitch
Copy link
Contributor Author

These are special nn.Modules that work particularly well with DeepSpeed?

More or less, as far as I understand they're heavily optimized CUDA kernels that help for things like long sequences / are more efficient in general.

@dirkgr
Copy link
Member

dirkgr commented Nov 11, 2020

class DeepspeedTrainer(GradientDescentTrainer):
    Trainer.__init__(self, serialization_dir, cuda_device, distributed, local_rank, world_size)

I don't understand this. Do you mean you would not be able to do this normal thing?

class DeepspeedTrainer(GradientDescentTrainer):
    def __init__(...):
       super().__init__(...)

@dirkgr
Copy link
Member

dirkgr commented Nov 11, 2020

If it's too difficult to not duplicate code, let's not do it. I looked at the code for _validation_loss, and while it would be a shame to have all those lines twice, it's pretty tightly integrated there, and I wouldn't want to compromise the regular trainer for this too much.

@jacobdanovitch
Copy link
Contributor Author

Still working on deduplicating code (and linting). I was able to get a lot reduced almost the entire constructor) by lying to the super().__init__() and passing distributed=False so that it wouldn't try to setup DDP. Also gave up on the sparse attention embedder, too much stuff needed to be installed with root and I couldn't get around it.

Just so I know explicitly how far I should go, should I just not touch anything at all inside of the GradientDescentTrainer? The stuff inside of for batch_group in batch_group_generator_tqdm: in _try_train could potentially be trainer-dependent while everything outside of and around it should be common to all trainers, so that could be a useful thing to put in a hook-like method to prevent a bunch of repetition. But if you'd rather I just not touch anything at all I'll leave it.

Some nice news is that deepspeed is now pip installable, so it's a lot easier to get everything configured.

@dirkgr
Copy link
Member

dirkgr commented Nov 25, 2020

I'm fine with small modifications to the regular trainer, but what you're proposing sounds like a bigger deal, so let's hold off on that. We may want to re-visit when and if deep speed proves permanently useful.

@jacobdanovitch
Copy link
Contributor Author

Got all the typechecks out the way, phew. I've also managed to cut out a lot of duplicated code, I think! The remainder is almost entirely checkpointing related. For loading/saving, there's a bit of duplication here and there but nothing overwhelming, and the rest is delegated to deepspeed. The last thing that would be really, really nice to get around would be the dist.barrier() calls in GradientDescentTrainer:

Lines 1006-1008, 1091-1093

# Wait for the master to finish saving the model checkpoint
            if self._distributed:
                dist.barrier()

I could be wrong with my limited understanding of DDP, but as far as I can tell, this causes a fatal hang for deepspeed, which also calls dist.barrier() within checkpointing. This seems like it could be a fairly common situation for anyone trying to override the trainer for any distributed-related stuff (like Fairscale or maybe the upcoming huggingface model parallelism stuff).

Do you think there's a clean solution to this? That's about ~150 LOC duplicated for the removal of 4 lines, which isn't great. Is there a way that this could be delegated to the checkpointer itself, perhaps? Once that's settled, it should just be tests / dependencies left to do.

@dirkgr
Copy link
Member

dirkgr commented Dec 5, 2020

Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of if not in_deepspeed:. Otherwise, let's just duplicate it.

@jacobdanovitch
Copy link
Contributor Author

Is there a way to detect whether we are in a deepspeed context? If so, I'd be OK with some sort of if not in_deepspeed:. Otherwise, let's just duplicate it.

I mean, one easy way would just be to set an environment variable DEEPSPEED=1 😛 Would that work, or would you rather more a more general solution?

@dirkgr
Copy link
Member

dirkgr commented Dec 5, 2020

If deepspeed doesn't have some sort of global context (like dist does), then let's duplicate the code. I'm not that comfortable with inventing our own global flags, but if they are already there, defined and documented by major libraries, I'm OK using them.

@jacobdanovitch
Copy link
Contributor Author

Sounds good. I think Deepspeed might set some environment variables itself, similarly to torch, so I'll poke around to see if we can use one of those. If not we can just duplicate for now and I'll proceed onto testing.

@dirkgr dirkgr self-assigned this Dec 12, 2020
@schmmd schmmd changed the base branch from master to main December 23, 2020 18:47
@jacobdanovitch
Copy link
Contributor Author

Took a holiday break from this while our cluster was down for maintenance for a bit. Turns out that checkpointing/barrier issue might be more complicated than I thought, but not sure if it's something to do with our cluster (the networking seems buggy, all_reduce and such often hang outside of allennlp for me). It's freezing while collecting memory usage now, which is odd, so still trying to figure that out.

Outside of that, I have a basic test working (well, when the above works), but it's more complicated than the existing trainer tests because all it's really doing is testing distributed training, which requires dist.init_process_group and the whole works. I don't see that being done for the gradient descent trainer, so not sure if I should be doing that.

@dirkgr
Copy link
Member

dirkgr commented Jan 7, 2021

Collecting memory usage often defers to shelling out to nvidia-smi, which holds a lock. If you have a lot of processes calling nvidia-smi, you can see deadlocks. We’ve had this happen on our clusters as well.

Distributed training has tests. It’s tested from the test for the training command, not on the trainer directly: https://github.com/allenai/allennlp/blob/main/tests/commands/train_test.py#L191

You could do the same thing for the deep speed trainer. Just test it end-to-end, not individually.

@jacobdanovitch
Copy link
Contributor Author

Ah I think I see the real issue here. It's not the logging itself hanging.

  1. (All ranks) My trainer tells my checkpointer to save if it's the master process
  2. (Rank 0) My checkpointer delegates to DeepspeedEngine.save_checkpoint.
  3. (Rank 0) DeepspeedEngine.save_checkpoint calls dist.barrier()
  4. (Ranks 1-n) The other workers never get a chance to make the above barrier call, which leads to a lock as reported in pytorch here.

Any subsequent distributed operation, including the all_reduce in memory logging, then hangs (not sure how they make it through the barriers honestly). So, in fact, the real culprits were not the dist.barrier() calls I mentioned above, but rather the if self._master and self._checkpointer is not None: checks. Everything works perfectly when removing the first half.

So circling back to the code duplication issue, it's not so much being in a deepspeed context that I need to check for, it's something like:

# trainer.py => _try_train

if self._checkpointer is not None and self._checkpointer.call_on_rank(self._rank):

# checkpointer

class Checkpointer(Registrable):
    def call_on_rank(self, rank: int) -> bool:
        return rank == 0

# deepspeed checkpointer

class DeepspeedCheckpointer(Checkpointer):
    @overrides
    def call_on_rank(self, rank: int) -> bool:
        return True

So if you're open to adding some sort of flag like that to the base checkpointer, that would solve the issue. Or, we could check if it's a deepspeed checkpointer:

if self._checkpointer is not None and (self._master or isinstance(self._checkpointer, DeepspeedCheckpointer)):

But that might cause some circular import issues/issues for those who want to install without deepspeed. Either one would let me completely eliminate my override of _try_train; as long as I have a way to force self._checkpointer.save_checkpoint(epoch, self) to get called in every process.


You could do the same thing for the deep speed trainer. Just test it end-to-end, not individually.

Yep this worked perfectly, thanks. Exact same as test_train_model_distributed but different config. Should I decorate it with @requires_multi_gpu?

@dirkgr
Copy link
Member

dirkgr commented Jan 8, 2021

Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?

Should I decorate it with @requires_multi_gpu?

It makes sense to me even abstractly that something like Deepspeed can only be accurately tested on a multi-GPU box.

@jacobdanovitch
Copy link
Contributor Author

Why isn't the checkpointing thing a problem outside of AllenNLP? This should be an issue with DeepSpeed all the time, right?

Their typical training loop is something like (source):

# load checkpoint
for step, batch in enumerate(data_loader):
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()

    if step % args.save_interval:
        ckpt_id = loss.item()
        model_engine.save_checkpoint(args.save_dir, ckpt_id)

Note that all of these calls are made on every worker, so every worker enters the model_engine.save_checkpoint function and hit the dist.barrier call(s) contained within. I believe that if you did if step % args.save_interval and dist.get_rank() == 0 then you'd have the same problem. It's sort of a case of Deepspeed and AllenNLP both trying to help the user and running into each other as a result.

@dirkgr
Copy link
Member

dirkgr commented Jan 14, 2021

I see. We can always determine our rank, right? So we could just move the rank check into the checkpointer. The regular checkpointer will say if rank() != 0: return, and the DeepSpeed one will proceed with all the ranks. Does that work?

@jacobdanovitch
Copy link
Contributor Author

Yeah that should work perfectly, I'll give it a try.

@yanaiela
Copy link
Contributor

Hey, are there any news regarding this PR?
I'd be interested in using Deepspeed with Allennlp.

@dirkgr
Copy link
Member

dirkgr commented Apr 13, 2021

@epwalsh is testing it as we speak!

@jeffra
Copy link

jeffra commented Jul 30, 2021

@dirkgr @yanaiela @jacobdanovitch any blockers on this PR? Happy to help answer any deepspeed related questions that might be causing issues here.

@dirkgr
Copy link
Member

dirkgr commented Aug 2, 2021

We recently integrated FairScale to get the ZeRO optimizer into AllenNLP. It would be interesting to have DeepSpeed as well, since it has more features, but it's no longer quite so pressing. If anyone wants to pick up @jacobdanovitch's work and bring it over the finish line, I'd be happy to work with you.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants