[Performance Report] Dual EPYC 9654 + Dual RTX 3090 on Windows Server #828

Open

PC-DOS opened this issue Mar 6, 2025 · 7 comments

@PC-DOS

PC-DOS commented Mar 6, 2025

First of all, thanks to the developers and contributors of the KT framework. Yesterday I tried building KT on Windows and loading the Q4_K_M quantization of DeepSeek-R1, so here is a score report.

This thread has been moved to Discussions 833.

Hardware:

  • CPU: EPYC 9654 x2 (NPS0, cTDP/cPPT set to 400 W, SMT on, AVX-512 on)
  • RAM: DDR5 RECC 32 GB 4800 MT/s x16 (8 DIMMs per socket)
  • GPU: NVIDIA RTX 3090 x2
  • Platform: ASUS RS720A-E12-RS12U-G

Software:

  • OS: Windows Server 2022 Build 10.0.20348.2700 (Hyper-V enabled)
  • GPU driver: 561.09
  • CUDA: CUDA Toolkit 12.6
  • Node.js: v22.14.0
  • CMake: 4.0.0-rc2
  • Ninja: v1.12.1
  • NUMA support was not enabled at build time because of memory constraints.

Model:

  • GGUF: Q4_K_M quantization of DeepSeek R1 671b from the Ollama repository
  • Launch command (see the sketch after this list for how the --cpu_infer value relates to the core count): python -m ktransformers.local_chat --model_path E:/LLM-Models/DeepSeek-AI/DeepSeek-R1-671b --gguf_path E:/LLM-Models/DeepSeek-AI/DeepSeek-R1-671b/DeepSeek-R1-671b-Q4_K_M/ --optimize_config_path E:/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml --max_new_tokens 8192 --cpu_infer 190
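
For reference, a minimal sketch of how the --cpu_infer value above relates to the machine's core count; leaving two cores free for the OS is an assumption added here (a common rule of thumb), not something the report states:

import psutil  # third-party: pip install psutil

physical_cores = psutil.cpu_count(logical=False)  # 2 x EPYC 9654 (96 cores each) -> 192
reserved_for_os = 2                                # assumed headroom for the system
cpu_infer = physical_cores - reserved_for_os       # 190, matching --cpu_infer above
print(f"--cpu_infer {cpu_infer}")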

AIDA64 V7.35.7000 memory and cache benchmark results:

  • Memory read: 415.11 GB/s
  • Memory write: 459.93 GB/s
  • Memory copy: 426.91 GB/s
  • Memory latency: 159.2 ns

Ollama baseline performance:

  • DeepSeek R1 671b loaded with Ollama 0.5.7, num_gpu set to 4, i.e. 4 layers offloaded to the GPUs.
  • Invoked from Open WebUI; prompt: “请为我比较AMD EPYC 9654和Intel Xeon 8490H” (compare the AMD EPYC 9654 and the Intel Xeon 8490H for me).
  • Metrics (durations are reported in nanoseconds; see the worked conversion after this block):
response token/s: 2.49
prompt token/s: 18.26
total duration: 656354942000
load duration: 17123000
prompt eval count: 20
prompt eval duration: 1095000000
eval count: 1631
eval duration: 655241000000
approximate total: "0h10m56s"
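
As a cross-check, a minimal sketch (plain arithmetic on the numbers above, assuming Ollama's durations are in nanoseconds, which the "approximate total" line confirms) that recovers the reported rates:

eval_count = 1631
eval_duration_ns = 655_241_000_000              # eval duration from the metrics above
prompt_count = 20
prompt_duration_ns = 1_095_000_000
print(eval_count / (eval_duration_ns / 1e9))      # ≈ 2.49 tokens/s, the response rate
print(prompt_count / (prompt_duration_ns / 1e9))  # ≈ 18.26 tokens/s, the prompt rate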

KT performance:

  • Prompt: same as in the Ollama baseline.
  • During inference: CPU utilization 71%, system RAM usage 395 GB, GPU0 VRAM 5.6 GB, GPU1 VRAM 6.4 GB.
  • Two runs with the same prompt, both at roughly 5 tok/s eval:
prompt eval count:    22 token(s)
prompt eval duration: 1.8837690353393555s
prompt eval rate:     11.67871410309957 tokens/s
eval count:           1523 token(s)
eval duration:        304.991397857666s
eval rate:            4.993583460707166 tokens/s

prompt eval count:    22 token(s)
prompt eval duration: 1.3836259841918945s
prompt eval rate:     15.900250682881675 tokens/s
eval count:           1706 token(s)
eval duration:        340.08750557899475s
eval rate:            5.016356002539864 tokens/s
  • With --cpu_infer changed to 80, same prompt:
prompt eval count:    22 token(s)
prompt eval duration: 2.2636516094207764s
prompt eval rate:     9.718810044991582 tokens/s
eval count:           1698 token(s)
eval duration:        364.7031693458557s
eval rate:            4.655841085904441 tokens/s
  • For comparison, a single-GPU run with the default --optimize_config_path (the flag omitted at launch), same prompt; in this configuration system RAM usage was 394 GB, GPU0 VRAM 10.8 GB, CPU load 73%, GPU0 load 100%:
prompt eval count:    22 token(s)
prompt eval duration: 1.6509404182434082s
prompt eval rate:     13.325738322772352 tokens/s
eval count:           2107 token(s)
eval duration:        394.9432945251465s
eval rate:            5.334943089825886 tokens/s

Performance after switching to NPS1:

  • AIDA64 V7.35.7000 memory and cache benchmark results:
    • Memory read: 479.50 GB/s
    • Memory write: 467.11 GB/s
    • Memory copy: 468.82 GB/s
    • Memory latency: 113.3 ns
  • With --cpu_infer set to 190, single-GPU run with the default --optimize_config_path, same prompt:
prompt eval count:    22 token(s)
prompt eval duration: 2.157524585723877s
prompt eval rate:     10.196871055640239 tokens/s
eval count:           1700 token(s)
eval duration:        331.0879154205322s
eval rate:            5.13458788684794 tokens/s
  • With --cpu_infer set to 180, single-GPU run with the default --optimize_config_path, same prompt; NUMA node 0 sat at roughly 50% utilization and NUMA node 1 at roughly 10% (see the sketch after this block for a quick way to confirm the NUMA layout Windows exposes):
prompt eval count:    22 token(s)
prompt eval duration: 1.8571727275848389s
prompt eval rate:     11.845963314683125 tokens/s
eval count:           1998 token(s)
eval duration:        391.88148856163025s
eval rate:            5.098480174027866 tokens/s
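
As an aside, a minimal sketch (assuming a Windows host, as in this report) for confirming how many NUMA nodes Windows exposes after changing the NPS setting; GetNumaHighestNodeNumber is a standard kernel32 API:

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
highest_node = wintypes.ULONG(0)
if kernel32.GetNumaHighestNodeNumber(ctypes.byref(highest_node)):
    # On this dual-socket board, NPS0 typically shows up as a single node and NPS1 as two.
    print(f"NUMA nodes visible to Windows: {highest_node.value + 1}")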

Discussion:

  • Single-GPU performance is slightly higher than multi-GPU, which seems consistent with the conclusion in an earlier issue (EPYC 7601 * 2 + 512G ddr4 + 双七彩虹火神3090测试完成,5tokens #610).
  • Compared with the numbers other people have posted, this may be the weakest pair of 9654s yet (tongue in cheek).
  • The bottleneck from not fully populating the memory channels looks significant; see the bandwidth arithmetic after this list.
  • Windows itself probably also costs some performance.
  • After switching to NPS1, memory read bandwidth improved somewhat, but the overall eval rate stayed within the margin of error.
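
A minimal sketch of the channel-population arithmetic behind the memory-bottleneck point; the 12-channels-per-socket figure is the SP5 platform spec, and the one-DIMM-per-channel layout is an assumption added here, not stated in the report:

channel_bw = 4800e6 * 8 / 1e9           # DDR5-4800 x 8 bytes per transfer ≈ 38.4 GB/s per channel
populated_channels = 16                 # 8 DIMMs per socket x 2 sockets, assuming one DIMM per channel
total_channels = 24                     # 12 channels per socket on SP5 (assumed platform spec)
print(populated_channels * channel_bw)  # ≈ 614 GB/s theoretical peak with the current population
print(total_channels * channel_bw)      # ≈ 922 GB/s theoretical peak if fully populated
# The measured AIDA64 read bandwidth (415-480 GB/s) is well below even the 16-channel ceiling.
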
@Pb-207

Pb-207 commented Mar 6, 2025

NPS needs to be set to 1, otherwise frequent cross-socket traffic will severely limit speed. Not fully populating the memory is also one of the causes.

@luzamm

luzamm commented Mar 6, 2025

How is this even slower than my single-socket 7763?

@Azure-Tang
Contributor

Not fully populating the memory really does hurt compute bandwidth quite a bit... I would suggest filling all the slots, haha.

@KMSorSMS
Contributor

KMSorSMS commented Mar 7, 2025

[Redacted for shorter thread]

We recommend that this kind of report be included in the discussion section. Next time, could you pin this report on it?
Btw, congratulations, and we appreciate your effort.

@PC-DOS
Author

PC-DOS commented Mar 7, 2025

[Redacted for shorter thread]

We recommend that this kind of report be included in the discussion section. Next time, could you pin this report on it? Btw, congratulations, and we appreciate your effort.

Thank you for your kind suggestion. Sorry for any disturbance caused by my ignorance. Should I move this thread to the Discussions section now?

@KMSorSMS
Contributor

KMSorSMS commented Mar 7, 2025

[Redacted for shorter thread]

We recommend that this kind of report be included in the discussion section. Next time, could you pin this report on it? Btw, congratulations, and we appreciate your effort.

Thank you for your kind suggestion. Sorry for any disturbance caused by my ignorance. Should I move this thread to the Discussions section now?

That's best, but you can leave the current discussion here. We will close it in a few days as a reminder to others.

@PC-DOS
Author

PC-DOS commented Mar 7, 2025

[Redacted for shorter thread]

That's best, but you can leave the current discussion here. We will close it in a few days as a reminder to others.

Moved, thank you for your support. ;-)
