
[Feature] Implement Tensor Parallelism for Multi-Socket Systems #812

Open
2 tasks
yeungtuzi opened this issue Mar 5, 2025 · 7 comments

Comments

@yeungtuzi

Checklist

  • 1. If you are raising a problem rather than proposing a new feature, please start a discussion at https://github.com/kvcache-ai/ktransformers/discussions instead; otherwise this issue will be closed.
  • 2. To facilitate community discussion, I will use Chinese/English or attach an English/Chinese translation (if posting in another language). Non-English/Chinese content without a translation may be closed.

Motivation

Place only a portion of each layer's weights on each socket's CPU and use only that CPU's local memory; the QPI interconnect bus should be used only for CPU-to-CPU communication.

Related resources

No response

@ubergarm
Contributor

ubergarm commented Mar 5, 2025

@yeungtuzi


On each socket's CPU, place only a portion of each layer's weights and use only that CPU's local memory; the QPI interconnect bus is used only for communication between CPUs.

Yes, to take advantage of NUMA systems you will need some optimizations. You can already use data parallelism today by specifying USE_NUMA=1 with ktransformers, which copies the entire set of weights into RAM twice, once into each socket's local memory.
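To illustrate the trade-off, here is a minimal sketch in plain numpy (toy, hypothetical sizes; nothing ktransformers-specific) of what that data-parallel layout means: every NUMA node keeps a full copy of the weights and answers its own requests, so there is no per-layer cross-socket traffic but total resident RAM doubles.

```python
import numpy as np

# Toy, hypothetical dimensions -- not the real model's.
HIDDEN, LAYERS, NODES = 1024, 4, 2

# Data parallel (what USE_NUMA=1 gives you today): every NUMA node
# holds a full copy of every layer's weights in its local memory.
full_weights = [np.random.randn(HIDDEN, HIDDEN).astype(np.float16)
                for _ in range(LAYERS)]
per_node_weights = [[w.copy() for w in full_weights] for _ in range(NODES)]

def forward(weights, x):
    """Run the toy model end to end on one node; no cross-node traffic."""
    for w in weights:
        x = np.maximum(x @ w, 0)  # matmul + ReLU standing in for a layer
    return x

# Each node serves its own request completely independently.
requests = [np.random.randn(1, HIDDEN).astype(np.float16) for _ in range(NODES)]
outputs = [forward(per_node_weights[n], requests[n]) for n in range(NODES)]

bytes_per_node = sum(w.nbytes for w in full_weights)
print(f"weights per node: {bytes_per_node / 2**20:.1f} MiB, "
      f"total resident: {NODES * bytes_per_node / 2**20:.1f} MiB")
```

Throughput for concurrent requests scales this way, but a single request never gets faster and the weights are resident twice, which is exactly what the tensor parallel request above is trying to avoid.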

The QPI is probably not even fast enough. I have tested vLLM and llama.cpp on the best dual-socket Intel Xeon 6980P, and performance is worse on dual socket than on a single socket. (Unfortunately I do not have a GPU on that system to test ktransformers.)

If you want to dig into it yourself, there is a lot of discussion on multi-NUMA performance for both Intel Xeon and AMD Epyc here. The easiest thing for today is to buy AMD Epyc and enable NPS0 in the BIOS to sidestep the issue; not perfect, but still better.

Eventually all mature LLM inference engines will need a combination of Tensor Parallel, Tensor Pipeline, Data Parallel, and even MoE parallel schemes to optimize performance on various hardware configurations. The best discussion of this that I have found to date is in this vLLM Office Hours video on YouTube, linked directly to the part where they talk about it.

Until these features are added, the benefit of multiple CPU sockets is not worth the cost today. Keep in mind that even with these features, you will likely need specialized RDMA networking hardware to communicate across multiple systems.

Have a great day!

@better319

Is it possible to split a large model, similar to the human left and right brain hemispheres?
Each NUMA node would only handle its own portion.

@ubergarm
Contributor

ubergarm commented Mar 5, 2025

@better319


Is it possible to partition a large model, similar to how the human left and right brain hemispheres operate, such that each NUMA node handles its own dedicated portion?

This is a cute sentiment. The human brain does not work quite like this, though; it is not so much a device that creates consciousness as it is like an antenna that picks up the universal consciousness. Refer to the concept of Nirmāṇakāya dancing with Saṃbhogakāya as Dharmakāya. But for fun, let's run with your analogy between Tensor Parallel and the human brain. The "All Reduce" step required to assemble the full result after the concurrent stages could be analogous to the corpus callosum. It is possible for a human to function with severed hemispheres, as shown in interesting psychological studies, but the final output cannot be interpreted in the same way without the missing information.

tl;dr: until the latency of recombining partial results is low enough, there is no advantage to concurrent operations.
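To make that concrete, here is a minimal, hedged sketch (plain numpy, toy sizes, both "nodes" simulated in one process) of column-parallel tensor parallelism: each node stores half of a layer's weight matrix and computes a partial activation, and the partials have to be reassembled across the interconnect before the next layer can run.

```python
import numpy as np

HIDDEN, NODES = 1024, 2  # toy, hypothetical sizes

rng = np.random.default_rng(0)
w = rng.standard_normal((HIDDEN, HIDDEN), dtype=np.float32)
x = rng.standard_normal((1, HIDDEN), dtype=np.float32)

# Tensor parallel: split the weight matrix column-wise, one shard per node,
# so each node keeps only 1/NODES of the layer in its local memory.
shards = np.split(w, NODES, axis=1)

# Each node computes its partial activation independently...
partials = [x @ shard for shard in shards]   # each is (1, HIDDEN // NODES)

# ...but the full activation is needed before the next layer can start, so
# the partials must be gathered (or all-reduced, for row-parallel splits)
# across the interconnect after every layer. This is the recombination step.
gathered = np.concatenate(partials, axis=1)
assert np.allclose(gathered, x @ w, atol=1e-3)

print(f"bytes exchanged per layer per token: {gathered.nbytes}")
```

The traffic itself is small (roughly on the order of a megabyte or two per token for realistic hidden sizes and layer counts), so the pain point across sockets is usually the per-layer synchronization latency rather than raw QPI/UPI bandwidth, which is the "until the latency of recombining partial results is low enough" point above.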

There are different goals for optimizations to consider:

  1. Speed of single user generation
  2. Aggregate token generation across a queue of multiple users' inputs
  3. Fitting larger models into the smaller RAM of individual NUMA nodes

There are differing strategies for each of these situations, each with its own trade-offs and challenges. It is not a simple task, and the entire world is racing to implement solutions as we enjoy this moment and a simple breath together.

Have a wonderful day!

@better319


Maybe my earlier description was not very accurate, and I do not know this area well, so let me give an example.
Think of a large model as a library staffed by 2 CPUs (librarians). The model's contents should be divided into regions according to some pattern, so each librarian can easily find the knowledge they need by following that pattern.

These two people do not necessarily need to communicate very much.

Could inference then be understood as a reader asking each of the two librarians for a portion of the books, and doing the reasoning in the reader's own brain (the GPU)?

Wouldn't that greatly reduce each librarian's workload?

@ubergarm
Contributor

ubergarm commented Mar 5, 2025

@better319

Anyone who has owned two cats knows you have to buy twice as much cat food to feed them.

@Atream
Contributor

Atream commented Mar 7, 2025

We are already working on optimizing TP. Initially, we'll support TP between the two NUMA nodes of AMX CPUs. Then we'll support TP between GPUs and CPUs, between GPUs, and for CPUs without AMX. Once this is solved, the performance of two NUMA nodes will double compared to a single NUMA node, and it won't require double the memory overhead.

Atream closed this as completed Mar 7, 2025
Atream reopened this Mar 7, 2025
@ubergarm
Contributor

ubergarm commented Mar 7, 2025

We are already working on optimizing TP. Initially, we'll support TP between the two NUMA nodes of AMX CPUs. Then, we'll support TP between GPUs and CPUs, and between GPUs, as well as for CPUs without AMX. After solving this, the performance of the two NUMA nodes will double compared to a single NUMA, and it won't require double the memory overhead.

Exciting! It is challenging enough to get TP working between two GPUs without NVLink, so I am excited to see how it goes across CPU NUMA nodes / sockets. I just learned about an Intel Xeon 6 BIOS option (e.g. on the dual-socket 6980P) called SNC, which is enabled (or auto) by default and splits each CPU into 3 NUMA nodes. Setting SNC=Disable should put it into HEX/UMA mode so that each CPU appears as a single NUMA node, which I am pretty sure is basically equivalent to NPS1 mode on dual-socket AMD Epyc.

This, in combination with USE_NUMA=1, is probably the fastest configuration available today on supported dual-socket systems. Unfortunately, I do not have a GPU on this system to confirm.
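For anyone toggling these BIOS options, here is a small hedged sketch (standard Linux sysfs paths only, nothing ktransformers-specific) to confirm how many NUMA nodes the OS actually sees and how much local RAM each one has:

```python
import glob
import os

# The kernel exposes each NUMA node as /sys/devices/system/node/nodeN.
nodes = sorted(glob.glob("/sys/devices/system/node/node[0-9]*"),
               key=lambda p: int(p.rsplit("node", 1)[-1]))
print(f"NUMA nodes visible to the OS: {len(nodes)}")

for node in nodes:
    with open(os.path.join(node, "meminfo")) as f:
        # First line looks like: "Node 0 MemTotal:  792219088 kB"
        total_kb = int(f.readline().split()[3])
    print(f"{os.path.basename(node)}: {total_kb / 2**20:.1f} GiB local RAM")
```

On a dual-socket 6980P you would expect 6 nodes with SNC enabled and 2 with SNC=Disable; `numactl --hardware` reports the same information.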

Thank you and the KT team for all your effort!

https://www.phoronix.com/review/xeon-6980p-snc3-hex
