Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified _replace_module in auto_tp.py: the change keeps the Qwen2-MoE layers 'shared_expert_gate' and 'gate' as their original type torch.nn.Linear instead of converting them to LinearLayer. As a result, their weights are not split across multiple HPU/GPU cards, which allows Qwen2-MoE to run on multiple HPU/GPU cards. Because the 'gate' weights are not sharded, no all-gather operations are needed for them, which may improve performance.

---------

Co-authored-by: Logan Adams <[email protected]>
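A minimal sketch of the keep-as-is rule described above, using stand-in classes rather than the actual DeepSpeed or torch types (Linear, ShardedLinear, KEEP_AS_LINEAR, and replace_modules below are all illustrative names, not DeepSpeed API):

```python
class Linear:
    """Stand-in for torch.nn.Linear: weights replicated whole on every card."""
    def __init__(self, name):
        self.name = name

class ShardedLinear:
    """Stand-in for a tensor-parallel LinearLayer: weights split across cards."""
    def __init__(self, inner):
        self.inner = inner

# Leaf layer names that must keep the plain Linear type for Qwen2-MoE.
KEEP_AS_LINEAR = {"gate", "shared_expert_gate"}

def replace_modules(modules):
    """Map each named layer to either the original Linear (kept whole,
    no all-gather needed) or a sharded wrapper (weights split per card)."""
    out = {}
    for name, mod in modules.items():
        leaf = name.rsplit(".", 1)[-1]
        if leaf in KEEP_AS_LINEAR:
            out[name] = mod                  # keep torch.nn.Linear as-is
        else:
            out[name] = ShardedLinear(mod)   # convert to tensor-parallel layer
    return out
```

In the real _replace_module the decision is made while recursing over the model's submodules; the sketch only shows the name-based exemption that the commit adds.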