[Feature] Implement Tensor Parallelism for Multi-Socket Systems #812
Comments
Yes, to take advantage of NUMA systems you will need some optimizations. You can already use Data Parallel today by specifying … when running ktransformers. The QPI is probably not even fast enough: I have tested vLLM and llama.cpp on the best dual-socket Intel Xeon 6980P, and performance is worse on dual socket than on single socket. (Unfortunately I do not have a GPU on this system to test ktransformers.) If you want to explore it yourself, there is a lot of discussion of multi-NUMA performance for both Intel Xeon and AMD Epyc here. The easiest thing today is to buy an AMD Epyc and enable … in the BIOS.

Eventually all mature LLM inference engines will need a combination of Tensor Parallel, Pipeline Parallel, Data Parallel, and even MoE parallel schemes to optimize performance on various hardware configurations. The best discussion of this that I have found to date is in this vLLM Office Hours YouTube video, linked directly to the part where they talk about it. Until these features are added, the benefit of multi-socket CPUs is not worth the cost today. Keep in mind that even with these features, you will likely need specialized RDMA networking hardware to communicate across multiple systems. Have a great day!
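As a hedged illustration of that data-parallel setup (this is not ktransformers' actual API; the core map and the worker function are hypothetical), one independent replica per NUMA node might look like this in Python:

```python
import os
from multiprocessing import Process

# Hypothetical core map: on a real machine, read it from `lscpu` or
# /sys/devices/system/node instead of hard-coding it.
NUMA_NODES = {0: range(0, 64), 1: range(64, 128)}

def replica(node_id: int, cores: range, requests: list) -> None:
    # Pin this worker to one socket's cores (Linux-only call); memory the
    # worker then allocates lands on the local node via first-touch.
    os.sched_setaffinity(0, cores)
    # A real worker would load a full model copy here; each replica
    # serves its own share of requests with no cross-socket traffic.
    for req in requests:
        print(f"node {node_id} handled: {req}")

if __name__ == "__main__":
    batches = [["prompt A", "prompt B"], ["prompt C", "prompt D"]]
    procs = [Process(target=replica, args=(node, cores, batch))
             for (node, cores), batch in zip(NUMA_NODES.items(), batches)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Note the trade-off this sketch makes visible: throughput scales with the number of sockets, but every replica holds a full copy of the model, so memory use doubles.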
Is it possible to split a large model in two, something like the left and right hemispheres of the human brain?
This is a cute sentiment. The human brain does not work quite like this, though: it is not so much a device that creates consciousness as an antenna that picks up the universal consciousness. Refer to the concept of Nirmāṇakāya dancing with Saṃbhogakāya as Dharmakāya. Still, for fun, let's run with your analogy between Tensor Parallel and the human brain. The "All Reduce" step required to assemble the full result after the concurrent stages could be analogous to the Corpus callosum. As some interesting psychological studies have shown, a human can still function with severed hemispheres, but the final output cannot be interpreted in the same way without the missing information.

tl;dr: until the latency of recombining partial results is low enough, there is no advantage to concurrent operations.

There are different goals to consider when optimizing, with differing strategies for each situation, and each comes with its own trade-offs and challenges. It is not a simple task, and the entire world is racing to implement solutions as we enjoy this moment and a simple breath together. Have a wonderful day!
Maybe my earlier description was not accurate, and I do not know much about this field. The two of them would not necessarily need much communication. Could inference be understood like this: a reader asks two librarians each for a portion of the books, then does the reasoning with the reader's own brain (the GPU)? Wouldn't that greatly reduce each one's workload?
Anyone who has raised two cats knows you have to buy twice as much cat food to feed them.
We are already working on optimizing TP. Initially, we'll support TP between the two NUMA nodes of AMX CPUs. Then we'll support TP between GPUs and CPUs, between GPUs, and for CPUs without AMX. Once that is solved, the performance of two NUMA nodes will be double that of a single NUMA node, without requiring double the memory overhead.
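To make the memory point concrete, here is a minimal sketch of row-sharded tensor parallelism in NumPy; the sharding scheme and the summation standing in for the all-reduce are illustrative, not ktransformers internals:

```python
# Row-parallel tensor parallelism in miniature: each shard owns half the
# weight rows (so memory per shard is halved), computes a partial matmul,
# and an all-reduce (here: a plain sum) reconstructs the full output.
import numpy as np

rng = np.random.default_rng(0)
hidden_in, hidden_out = 8, 4
x = rng.standard_normal(hidden_in)            # activations for one token
W = rng.standard_normal((hidden_in, hidden_out))

# Shard the weight along the input dimension, one shard per NUMA node.
W0, W1 = np.split(W, 2, axis=0)               # each node stores half of W
x0, x1 = np.split(x, 2)                       # matching halves of x

partial0 = x0 @ W0                            # computed on node 0
partial1 = x1 @ W1                            # computed on node 1

# The only cross-socket traffic is this reduction of partial outputs,
# which is what the QPI link would carry in a real implementation.
y_tp = partial0 + partial1                    # "all-reduce" of partials
assert np.allclose(y_tp, x @ W)               # matches the unsharded layer
```

Each node stores only its shard of `W`, so total weight memory across the two nodes equals that of the unsharded layer, unlike data parallelism, which replicates the full model per node.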
Exciting! It is challenging enough to get TP working between two GPUs without NVLink, so I am excited to see how it goes across CPU NUMA nodes / sockets. I just learned about the Intel Xeon 6 sub-NUMA clustering BIOS options (e.g. SNC3 on the dual-socket 6980P): https://www.phoronix.com/review/xeon-6980p-snc3-hex. Thank you and the KT team for all your effort!
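As a side note, a quick way to check how many NUMA nodes SNC actually exposes is to read Linux's sysfs; a small sketch (Linux-only, using the standard `/sys/devices/system/node` paths) might look like this:

```python
# List the NUMA nodes the BIOS exposes (with SNC enabled, one socket
# appears as several nodes), with each node's CPU list and memory size.
from pathlib import Path

def numa_topology() -> dict:
    nodes = {}
    for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        meminfo = (node / "meminfo").read_text()
        # meminfo lines look like "Node 0 MemTotal:  65823724 kB"
        total_kb = next(line.split()[-2] for line in meminfo.splitlines()
                        if "MemTotal" in line)
        nodes[node.name] = {"cpus": cpulist, "mem_total_kb": total_kb}
    return nodes

if __name__ == "__main__":
    for name, info in numa_topology().items():
        print(f"{name}: cpus={info['cpus']} mem={info['mem_total_kb']} kB")
```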
Checklist
Motivation
Place only a portion of each layer's weights on each CPU socket, have each CPU use only its local memory, and use the QPI interconnect only for inter-CPU communication.
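For a sense of what that QPI-only communication would cost, here is a hedged back-of-envelope estimate in Python; the model shape, reduction count, link bandwidth, and latency figures are all assumptions for illustration, not measurements:

```python
# Rough estimate of cross-socket cost for the scheme above: row-parallel
# TP needs a reduction of one hidden-state-sized vector per sharded
# matmul per token. All numbers below are illustrative assumptions.
layers = 61                  # assumed layer count (DeepSeek-R1-like)
hidden = 7168                # assumed hidden size
bytes_per_val = 2            # bf16 activations
reduces_per_layer = 2        # assumed: attention output + MLP output

per_token_bytes = layers * reduces_per_layer * hidden * bytes_per_val
print(f"cross-socket traffic: {per_token_bytes / 1e6:.2f} MB per token")

# Bandwidth is rarely the binding constraint for single-token decode...
link_bw = 48e9               # assumed ~48 GB/s usable cross-socket
print(f"bandwidth-bound ceiling: {link_bw / per_token_bytes:.0f} tok/s")

# ...the per-reduction synchronization latency usually is:
reduce_latency = 5e-6        # assumed 5 us per cross-socket reduction
latency_ceiling = 1 / (layers * reduces_per_layer * reduce_latency)
print(f"latency-bound ceiling: {latency_ceiling:.0f} tok/s")
```

On these assumed numbers, raw bandwidth leaves plenty of headroom; it is the per-reduction synchronization latency that sets the practical ceiling, which matches the earlier point that concurrency only pays off once the cost of recombining partial results is low enough.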
Related resources
No response