[Model] DeepSeek-V3 Enhancements #11539

Open · 10 tasks
simon-mo opened this issue Dec 27, 2024 · 22 comments
Labels: new model, performance

@simon-mo (Collaborator) commented Dec 27, 2024

This issue tracks follow-up enhancements after initial support for the DeepSeek-V3 model. Please feel free to chime in and contribute!

@july8023 commented:

If I want to deploy the ~600B-parameter DeepSeek-V3 with vLLM on RTX 4090s, are there any restrictions? How many RTX 4090s do I need at a minimum?

@fsaudm commented Dec 31, 2024

Is inference on A100s supported? What about quantization?

@mphilippnv commented:

DeepSeek-V3 doesn't appear to support pipeline parallelism. I get this error when attempting to deploy to two 8x H100 nodes:

NotImplementedError: Pipeline parallelism is only supported for the following  architectures: ['AquilaForCausalLM', 'AquilaModel', 'DeepseekV2ForCausalLM', 'GPT2LMHeadModel', 'InternLM2ForCausalLM', 'InternLMForCausalLM', 'InternVLChatModel', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'NemotronForCausalLM', 'Phi3ForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration'].

I'm using --tensor-parallel-size 8 --pipeline-parallel-size 2

@simon-mo (Collaborator, Author) commented:

@july8023 It should work on 4090s; generally the model takes about 600 GB of memory, and then you want about 100-300 GB for the KV cache, so feel free to plan around that.
@fsaudm A100s are not supported because this model requires FP8 tensor cores.
@mphilippnv Which version of vLLM are you using? You might need to update to v0.6.6 or higher.
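
As a rough back-of-the-envelope for the RTX 4090 question, a sketch using only the numbers above (the 24 GB per card figure is the only added assumption, and it ignores activation/overhead headroom):

```python
# Rough GPU-count estimate for serving DeepSeek-V3 on RTX 4090s.
# Assumptions (from the comment above): ~600 GB of weights, 100-300 GB of KV cache,
# 24 GB per RTX 4090, and no allowance yet for activations or fragmentation.
weights_gb = 600
kv_cache_gb_range = (100, 300)
gb_per_gpu = 24

for kv_gb in kv_cache_gb_range:
    total_gb = weights_gb + kv_gb
    gpus = -(-total_gb // gb_per_gpu)  # ceiling division
    print(f"{kv_gb} GB KV cache -> ~{total_gb} GB total -> at least {gpus} x RTX 4090")
# Prints roughly 30 and 38 cards; plan for extra headroom on top of that.
```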

@fsaudm commented Dec 31, 2024

@simon-mo Right, A100s don't support fp8. Would the arg --dtype bfloat16 suffice? If not, I found a bf16 version on Hugging Face; any insights on whether that would work?

@simon-mo (Collaborator, Author) commented:

The model currently does not support --dtype bfloat16 because it is natively trained in fp8. Can you point me to the bf16 version?

@fsaudm commented Dec 31, 2024

@simon-mo on HF: https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main

On the official repo they provide a script to cast fp8 to bf16, but of course you can't run it on A100s... my guess is that a good soul did the conversion and uploaded it to HF. See section 6 of the repo:

https://github.com/deepseek-ai/DeepSeek-V3

@simon-mo (Collaborator, Author) commented:

vLLM does support this bf16 model on A100. It looks like the config.json properly removed quantization_config, so it should already work.
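
For anyone who wants to try this, a minimal sketch with vLLM's offline Python API (assumptions: the community opensourcerelease/DeepSeek-V3-bf16 checkpoint linked above, and a cluster with enough aggregate A100 memory for the ~1.3 TB of bf16 weights; the parallelism value is a placeholder, not a tested configuration):

```python
# Minimal sketch (not an official recipe): serving the community bf16 checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="opensourcerelease/DeepSeek-V3-bf16",  # config.json ships without quantization_config
    dtype="bfloat16",
    trust_remote_code=True,
    tensor_parallel_size=32,        # placeholder, e.g. 4 nodes x 8 A100s joined in a Ray cluster
    max_model_len=16384,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain mixture-of-experts routing in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```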

@mphilippnv commented Dec 31, 2024

> (quoting @simon-mo's reply above)

Using v0.6.6

EDIT: Apologies, I was using 0.6.2. Redeploying helm chart with 0.6.6.post1. Will see how it goes.

@fsaudm commented Dec 31, 2024

Does anyone know of a working example of serving DeepSeek-V3 on A100s with vLLM? I'll try later, but any hints or help would be very much appreciated.

@JamesBVMNetwork commented:

Hi everyone,
I'm encountering the following error when trying to run the image vllm/vllm-openai:v0.6.6.post1 on a node equipped with 8x H100 SXM GPUs:

ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250102-072212.pkl): functional_call got multiple values for keys ['mlp.experts.e_score_correction_bias', 'mlp.gate.e_score_correction_bias'], which are tied. Consider using tie_weights=False
2025-01-02T15:22:12.753719474Z 

Here’s the command I used:

--model deepseek-ai/DeepSeek-V3-Base \
--tensor-parallel-size 8 \
--disable_log_requests \
--uvicorn_log_level error \
--max-model-len 16384 \
--cpu-offload-gb 400 \
--max_num_seqs 1 \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--enforce-eager

Does anyone have suggestions or solutions for resolving this issue?

Thanks in advance!

@glowwormX commented:

> (quoting @JamesBVMNetwork's error report above)

I've had this problem, too. Is there a solution?

@ishaandatta commented:

> I've had this problem, too. Is there a solution?

I was getting this error too; it got resolved by removing CPU offloading... hoping for an explanation.

Also, any suggestions for increasing token throughput and context length? We're stuck at 6 tokens/second and a max 10k context length despite 1600 GB of VRAM. I am currently running with tensor + pipeline parallelism on 5 nodes (4x A100 80 GB each). The VMs are without InfiniBand.

Would having InfiniBand (i.e. higher inter-node bandwidth and lower latency) be the main way to increase token throughput? And for context lengths > 40k, how much more VRAM would be required?
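
For reference, a sketch of the same engine arguments as the failing command quoted earlier, with CPU offload dropped (the workaround described here); the values simply mirror that command and are not a tuned configuration:

```python
# Sketch of the launch above with CPU offload removed (the workaround reported here).
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3-Base",
    tensor_parallel_size=8,
    max_model_len=16384,
    max_num_seqs=1,
    trust_remote_code=True,
    gpu_memory_utilization=0.95,
    enforce_eager=True,
    # cpu_offload_gb deliberately omitted: enabling it triggered the
    # "functional_call got multiple values for ... which are tied" error above.
)
```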

@shaowei-su commented:

> (quoting @ishaandatta's comment above)

Hi @ishaandatta, could you share which model version you are using? I'm getting errors complaining that the fp8e4nv data type is not supported on CUDA arch < 89 when loading the model on A100 GPUs. Or maybe you are on the bf16 version (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main)? Thanks!

@merlintang commented:

We also ran into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?

@lhl commented Jan 9, 2025

> We also ran into very slow token processing speed, around 3 tokens/s, even though we use H100s and IB. Any suggestions?

I found tp=16 to be about 2x faster than pp=2, tp=8 with 2x H100 nodes. Here's my testing: https://llm-tracker.info/DeepSeek-V3-Testing

Here's vLLM vs SGLang at concurrency=64 atm:

[chart: vLLM vs SGLang output token throughput at concurrency=64]

Note: in some of my testing I found that vLLM has some stop-token errors in the output (that SGLang doesn't have).
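
For anyone trying to reproduce the tp=16 setup, a rough configuration sketch (assumptions: a Ray cluster already joins the two 8x H100 nodes, and the values below are placeholders rather than lhl's exact settings):

```python
# Rough sketch of a 2-node, tp=16 deployment (placeholder values; assumes `ray start`
# has already been run so that both nodes belong to one Ray cluster).
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=16,              # 2 nodes x 8 H100s
    distributed_executor_backend="ray",   # use Ray for multi-node execution
    trust_remote_code=True,
    max_model_len=16384,
    gpu_memory_utilization=0.90,
)
```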

@fan-niu commented Jan 9, 2025

Same issue. I used 16 H100 GPUs, set TP=16, deployed using Ray in k8s, and enabled the IB network. I made a simple curl request with 10 input tokens and 242 output tokens; it took 44 seconds (roughly 5.5 output tokens/s). Can anyone help me figure out why?

@merlintang commented:

Are the perf issues related to the MoE optimization? It is not included in the current version, is it?

@ishaandatta commented Jan 9, 2025

@shaowei-su I'm using the bf16 version you linked.

@lhl thank you for sharing this! I'm currently using tp=4 pp=6 as we're aiming for context lengths > 64k.
Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM and around 10 for SGLang?
If so, I am wondering how deepseek-chat is able to achieve its throughput; I measured it at over 60 output tokens/sec.

@lhl commented Jan 10, 2025

> Just to clarify, your benchmarks indicate ~5 output tokens/s on vLLM and around 10 for SGLang?

For bs=1, SGLang outputs around 26 tok/s:

(sglang) ubuntu@ip-10-1-1-135:~$ python3 -m sglang.bench_serving --backend sglang --num-prompts 50 --max-concurrency 1 --port 8000
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=8000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=50, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=1, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)

#Input tokens: 10354
#Output tokens: 11509
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [07:20<00:00,  8.82s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max reqeuest concurrency:                1
Successful requests:                     50
Benchmark duration (s):                  440.98
Total input tokens:                      10354
Total generated tokens:                  11509
Total generated tokens (retokenized):    11467
Request throughput (req/s):              0.11
Input token throughput (tok/s):          23.48
Output token throughput (tok/s):         26.10
Total token throughput (tok/s):          49.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   8819.11
Median E2E Latency (ms):                 4817.32
---------------Time to First Token----------------
Mean TTFT (ms):                          318.37
Median TTFT (ms):                        259.02
P99 TTFT (ms):                           1658.59
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.41
Median TPOT (ms):                        36.97
P99 TPOT (ms):                           37.60
---------------Inter-token Latency----------------
Mean ITL (ms):                           37.18
Median ITL (ms):                         37.06
P99 ITL (ms):                            38.91
==================================================

You should read the infrastructure section of the DeepSeek technical report; they deploy in 320-GPU blocks with specialized/separated functions.

That being said, there are certainly optimizations that can be made for "regular" inference. On vLLM, when doing throughput optimization, with some tuning I can generate >7000 tok/s on a single H100 node for a Llama 3 70B-class model at c=512. DSv3 has about half the activations, and at c=512 SGLang currently tops out at about 1100 tok/s on 2x H100 nodes (vLLM is about half of that). You could imagine that there might be a 5-10x throughput optimization available, based naively on activations per forward pass. This is before spec decode like EAGLE or Medusa is factored in.

@fan-niu commented Jan 10, 2025

@simon-mo Is there any way or plan to improve the speed of vLLM on DeepSeek-V3? Thanks a lot!

@panpan0000 (Contributor) commented:

We also see 3 tokens/s on 16x H20 with TP=8, PP=2.
