Replies: 2 comments
-
DeepSpeed-Inference includes two features: model parallelism and kernel fusion. Kernel fusion poses little problem for deployment because it is a simple module replacement. However, since model parallelism uses multiple processes, it must be deployed differently. Regarding model-parallel deployment, there are two implementation methods.
If this function is not implemented in DeepSpeed, you can implement it yourself by referring to the implementations above.
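The core difficulty described above is that a web server runs in a single process, while a model-parallel engine needs every rank to execute the forward pass together. A common workaround is to have rank 0 own the HTTP endpoint and forward each incoming request to the other ranks before they all compute. Below is a minimal, hypothetical sketch of that broadcast/gather pattern using plain `multiprocessing` pipes as a stand-in for `torch.distributed` or DeepSpeed's own communication layer; the function and variable names are illustrative, not DeepSpeed APIs.

```python
# Hypothetical sketch of the "rank 0 serves, all ranks compute" pattern.
# Plain multiprocessing stands in for torch.distributed / DeepSpeed here.
import multiprocessing as mp

def worker(rank, conn):
    """One model-parallel rank: waits for inputs, computes its shard."""
    while True:
        prompt = conn.recv()          # blocks until rank 0 broadcasts
        if prompt is None:            # shutdown signal
            break
        # stand-in for this rank's slice of the forward pass
        conn.send(f"rank{rank}:{prompt.upper()}")

def serve(prompt, conns):
    """Called from the web handler on rank 0: broadcast, then gather."""
    for conn in conns:                # broadcast the request to all ranks
        conn.send(prompt)
    return [conn.recv() for conn in conns]  # gather shard results

if __name__ == "__main__":
    world_size = 2
    conns, procs = [], []
    for rank in range(world_size):
        parent, child = mp.Pipe()
        p = mp.Process(target=worker, args=(rank, child))
        p.start()
        conns.append(parent)
        procs.append(p)

    # This call is what a Flask/gunicorn view on rank 0 would make.
    print(serve("hello", conns))

    for conn in conns:
        conn.send(None)               # shut the workers down
    for p in procs:
        p.join()
```

In a real deployment, `serve` would sit inside the web framework's request handler, and the broadcast would typically use `torch.distributed.broadcast` so that non-zero ranks can sit in a receive loop waiting for work.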
-
Hi @vdantu, @hyunwoongko, we are developing a new feature to access the DeepSpeed Inference engine via RESTful APIs and just submitted a PR. Any feedback is welcome.
-
Is there a recommended way to integrate deepspeed-inference with gunicorn/flask based framework?