[V1] AsyncLLM Implementation #9826
…s-proto # Conflicts: # vllm/v1/engine/llm_engine.py # vllm/v1/tokenizer/detokenizer.py
Thanks for the great work!
"VLLM_ENABLE_V1_MULTIPROCESSING":
lambda: bool(int(os.getenv("VLLM_ENABLE_V1_MULTIPROCESSING", "0"))),
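For readers unfamiliar with this pattern, here is a minimal sketch of how the expression in the diff evaluates. The helper name `v1_multiprocessing_enabled` and its `environ` parameter are hypothetical and exist only to make the behavior easy to try; the expression itself is the one from the diff.

```python
import os


def v1_multiprocessing_enabled(environ=os.environ) -> bool:
    # Hypothetical helper wrapping the expression from the diff.
    # Unset -> "0" -> int 0 -> False; "1" -> int 1 -> True.
    # Non-integer strings such as "true" raise ValueError in int().
    return bool(int(environ.get("VLLM_ENABLE_V1_MULTIPROCESSING", "0")))


if __name__ == "__main__":
    print(v1_multiprocessing_enabled({}))
    print(v1_multiprocessing_enabled({"VLLM_ENABLE_V1_MULTIPROCESSING": "1"}))
```

Because the value goes through `int()`, the flag has to be set to an integer string; `VLLM_ENABLE_V1_MULTIPROCESSING=true` would fail rather than enable the feature.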
QQ: In which case should we turn this on?
`VLLM_ENABLE_V1_MULTIPROCESSING=1` enables multiprocessing for `EngineCore` inside `LLM` (multiprocessing is always used for `AsyncLLM` right now). It is faster than the current implementation.
We will want to enable `VLLM_ENABLE_V1_MULTIPROCESSING=1`, but right now it is a problem for `LLM` since we cannot `spawn` without an `if __name__ == "__main__"` guard. We left solving this issue for follow-up work.
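To illustrate the constraint, here is a minimal sketch of a script that starts a background process with the `spawn` start method. The function `engine_core_stub` is a hypothetical stand-in and is not vLLM code. With `spawn`, the child re-imports the main module, so any process creation at module level would run again in the child; the guard prevents that.

```python
import multiprocessing as mp


def engine_core_stub(conn):
    # Hypothetical stand-in for a background engine process.
    conn.send("ready")
    conn.close()


def main():
    ctx = mp.get_context("spawn")
    parent, child = ctx.Pipe()
    proc = ctx.Process(target=engine_core_stub, args=(child,))
    proc.start()
    child.close()  # the parent keeps only its own end of the pipe
    msg = parent.recv()
    proc.join()
    return msg


# Without this guard, the spawned child would re-run the process creation
# when it imports this module, and Python raises a RuntimeError.
if __name__ == "__main__":
    print(main())
```

A library cannot add this guard on the user's behalf, which is why the issue shows up for `LLM` used from a plain user script.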
@robertgshaw2-neuralmagic I'm glad to see your optimized PR. I found some problems during testing and wanted to ask for advice. I set up llama2-7b, 1 GPU, batch=256, and used the V1 engine for testing and analysis. Comparing that test with your PR, the token gap is analyzed as follows: I am very happy that the new implementation has removed the token enqueue and dequeue time, but I found that update_schedule and schedule take longer in the new version. There is no major change in the total gap time.
Hey @lixiaolx - thanks for taking a look. I am having a hard time understanding your analysis - could you clarify?
Thanks @lixiaolx, nice profiles! What you observe is not unexpected, since the scheduling logic currently contends for the GIL with the IPC message serialization/deserialization. Our intention is to improve this very soon, but doing the IPC work in a separate thread is still a big win as a first step, since much of that work overlaps with parts of the critical loop that don't contend for the GIL, primarily the forward pass on the GPU.
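The overlap described above can be sketched roughly as follows. This is a toy under stated assumptions, not the PR's code: `forward_pass_stub` uses `time.sleep` (which releases the GIL) as a stand-in for waiting on the GPU, `pickle` is a stand-in for whatever IPC serialization is used, and all function names are hypothetical.

```python
import pickle
import queue
import threading
import time


def forward_pass_stub(seconds):
    # Stand-in for the GPU forward pass: sleeping releases the GIL,
    # so the IO thread can run while the main loop waits.
    time.sleep(seconds)


def serialize_outputs(out_q, results):
    # IO thread: serializes step outputs off the critical loop.
    while True:
        item = out_q.get()
        if item is None:  # sentinel: no more outputs
            break
        results.append(pickle.dumps(item))


def run_steps(num_steps=3, step_seconds=0.01):
    out_q = queue.Queue()
    results = []
    io_thread = threading.Thread(
        target=serialize_outputs, args=(out_q, results), daemon=True
    )
    io_thread.start()
    for step in range(num_steps):
        forward_pass_stub(step_seconds)  # GIL is free during this wait
        out_q.put({"step": step, "token_ids": [step]})
    out_q.put(None)
    io_thread.join()
    return [pickle.loads(blob) for blob in results]


if __name__ == "__main__":
    print(run_steps())
```

Serialization of step N can proceed while the main loop waits on step N+1. Pure-Python scheduling work in the main loop would still compete with the IO thread for the GIL, which matches the longer schedule times reported in the profiles.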
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]>
Thank you very much for your answer. I tried comparing against this solution. If the GIL problem is solved, the remaining gap time would be 2-3 ms according to the calculation above.
@robertgshaw2-neuralmagic @njhill Hello, does this PR support multiple GPUs? When testing llama2-70b on 8 GPUs, the server log got stuck here.
@lixiaolx the V1 path is still in an alpha state and does not yet support multiple GPUs, but will soon.
@njhill, is there any plan for this asynchronous scheduling?
OK, thank you.
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Sumit Dubey <[email protected]>
Not yet. Our plan is to optimize other aspects first, since it will be complex to combine this with certain other optimizations.
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Maxime Fournioux <[email protected]>
Signed-off-by: Nick Hill <[email protected]> Signed-off-by: [email protected] <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Nick Hill <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Signed-off-by: Tyler Michael Smith <[email protected]>
Sorry to bother you, but I’d like to ask how you added nvtx to analyze the time overhead of these function calls?
SUMMARY:
- `AsyncLLM` in V1 - better overlapping of GPU and CPU

TODO:
- io threads

FOLLOW UP PRS:
- `AsyncLLM` and `LLMEngine` tests (abort, stop string, other unit)
- `LLM` by default (need to figure out a way around fork) - currently, need to set `VLLM_ENABLE_V1_MULTIPROCESSING=1`

DIAGRAM:
- `EngineCoreClient` class that is used by the `AsyncLLM` to interact with the `EngineCore`, but the overall architecture is close to what we have.
- `output_handler_loop` to `EngineCore`
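
As a rough illustration of the shape described in the diagram notes, the sketch below shows a front end that talks to an engine through a client object and uses an output handler loop to route results back to each request. Only the names `EngineCoreClient` and `output_handler_loop` come from the description; `EngineCoreClientStub`, `generate`, `demo`, and all method names are hypothetical, and the stub runs in-process rather than over IPC.

```python
import asyncio


class EngineCoreClientStub:
    """Hypothetical in-process stand-in for an engine client."""

    def __init__(self):
        self._outputs = asyncio.Queue()

    async def add_request(self, request_id, prompt):
        # Pretend the engine emits one output per whitespace-separated word.
        for word in prompt.split():
            await self._outputs.put((request_id, word, False))
        await self._outputs.put((request_id, "", True))

    async def get_output(self):
        return await self._outputs.get()


async def output_handler_loop(client, streams):
    # Route each engine output to the queue of the request that owns it.
    while True:
        request_id, text, finished = await client.get_output()
        await streams[request_id].put((text, finished))


async def generate(client, streams, request_id, prompt):
    streams[request_id] = asyncio.Queue()
    await client.add_request(request_id, prompt)
    pieces = []
    while True:
        text, finished = await streams[request_id].get()
        if finished:
            break
        pieces.append(text)
    del streams[request_id]
    return pieces


async def demo():
    client = EngineCoreClientStub()  # created inside the running loop
    streams = {}
    handler = asyncio.create_task(output_handler_loop(client, streams))
    results = await asyncio.gather(
        generate(client, streams, "a", "hello world"),
        generate(client, streams, "b", "one two three"),
    )
    handler.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A single handler loop keeps all reads from the engine in one place, so concurrent requests never compete for the same output stream.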