
[perf] Improve Prefill Performance by Removing Redundant Padding and Optimizing Alltoall Communication #949


Closed

Conversation

SlightwindSec

What this PR does / why we need it?

This PR improves Prefill performance by making two key optimizations:

  1. Removing redundant padding before Flash Attention: This reduces unnecessary computation during attention operations.
  2. Optimizing alltoall communication: The previous implementation involved one all_to_all_single call followed by three all_to_all calls. This has been refactored to use three all_to_all_single calls instead, with a fixed communication buffer to eliminate an extra communication step. This change not only simplifies the communication pattern but also leverages the better performance of all_to_all_single.

While there might be minor precision trade-offs, a coefficient of 2 for the fixed communication buffer is an empirically sound choice that maintains accuracy even when the expert ID distribution is imbalanced.
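
As a rough, hypothetical sketch of the second optimization (not the PR's actual code; the function name `dispatch_with_fixed_buffer`, the `CAPACITY_FACTOR` constant, and the tensor shapes are illustrative assumptions), dispatching MoE tokens with `all_to_all_single` and a fixed per-rank capacity removes the need for a separate size-exchange step:

```
import torch
import torch.distributed as dist

CAPACITY_FACTOR = 2  # the empirically chosen coefficient discussed above


def dispatch_with_fixed_buffer(hidden_states: torch.Tensor,
                               ep_size: int,
                               ep_group=None) -> torch.Tensor:
    """Exchange token activations across EP ranks with a fixed capacity.

    Every rank sends and receives exactly `capacity` rows per peer, so the
    extra communication round that would normally exchange per-rank split
    sizes can be skipped; unused slots stay zero-padded.
    """
    num_tokens, hidden = hidden_states.shape
    capacity = CAPACITY_FACTOR * ((num_tokens + ep_size - 1) // ep_size)

    send_buf = hidden_states.new_zeros(ep_size * capacity, hidden)
    # ... scatter the tokens routed to rank r into
    #     send_buf[r * capacity:(r + 1) * capacity], truncated to capacity ...

    recv_buf = torch.empty_like(send_buf)
    # Equal splits on every rank, so no output_split_sizes/input_split_sizes
    # are needed and no size-exchange collective has to run first.
    dist.all_to_all_single(recv_buf, send_buf, group=ep_group)
    return recv_buf
```

In a sketch like this, the precision trade-off comes from the fixed capacity: if routing is extremely imbalanced, tokens beyond roughly twice the average per-rank load would be truncated, which is why the value of the coefficient matters.

For the first optimization, the effect of removing padding can be sketched in plain PyTorch (a generic `scaled_dot_product_attention` loop, not the Ascend attention kernels; `prefill_attention_unpadded` is a hypothetical helper):

```
import torch
import torch.nn.functional as F


def prefill_attention_unpadded(q, k, v, seq_lens):
    """q, k, v: [total_tokens, num_heads, head_dim], packed without padding;
    seq_lens: per-request lengths summing to total_tokens."""
    outputs, start = [], 0
    for n in seq_lens:
        sl = slice(start, start + n)
        # Attend over exactly n real tokens per request; no pad tokens are
        # ever materialized or computed on.
        o = F.scaled_dot_product_attention(
            q[sl].transpose(0, 1).unsqueeze(0),
            k[sl].transpose(0, 1).unsqueeze(0),
            v[sl].transpose(0, 1).unsqueeze(0),
            is_causal=True,
        )
        outputs.append(o.squeeze(0).transpose(0, 1))
        start += n
    return torch.cat(outputs, dim=0)
```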

In testing with DeepSeek-V3, the model was able to handle 3584-token inputs with significantly improved Prefill throughput and no regression in dialog quality.

Does this PR introduce any user-facing change?

No, this PR does not introduce any user-facing changes.

How was this patch tested?

  • Verified correct generation behavior with DeepSeek-V3 model.
  • Prefill performance was benchmarked with 3584-token inputs, showing noticeable speed improvements.
  • Ensured that output quality remains consistent under typical workloads.

MengqingCao and others added 4 commits May 26, 2025 10:33
…oject#945)

Adjust inputbatch to be compatible with the latest vLLM, as the kvcache
group feature has been redone in vllm-project/vllm#18593

---------

Signed-off-by: MengqingCao <[email protected]>
### What this PR does / why we need it?

This is continuing work from vllm-project#716.
This PR adds a workflow to build and release the wheel, and also releases
the source to PyPI.
There are 3 conditions that trigger the workflow:

1. PR to `main` and `*-dev`
2. push to `main` and `*-dev`
3. push tag with name of `v*`

Release to PyPI is only done under condition 3. Under conditions 1 and 2,
the workflow generates the .tar.gz, builds the .whl, and uploads them as
GitHub artifacts, but does not release.

Update:
The .whl will also be built and uploaded to GitHub artifacts by a scheduled task.


### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
All trigger conditions were tested on my fork of the repo.

---------

Signed-off-by: Shuqiao Li <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
Co-authored-by: Yikun Jiang <[email protected]>
… FusedMoEParallelConfig when using vLLM 0.9.0 (vllm-project#961)

### What this PR does / why we need it?
This PR fixes accuracy issues introduced by the code that adapts to
`FusedMoEParallelConfig` in vLLM 0.9.0. The `tp_size` used to split
weights was passed incorrectly. The root cause is that the vLLM community
and vLLM-Ascend use different methods to decide whether to use Expert
Parallel.

vLLM:
vLLM uses a flag, `enable_expert_parallel`, to indicate whether to use EP,
and uses the following code to decide `ep_size`:
```
        use_ep = (dp_size_ * tp_size_ > 1
                  and vllm_parallel_config.enable_expert_parallel)

        dp_size = dp_size_
        dp_rank = get_dp_group().rank_in_group if dp_size > 1 else 0
        tp_size, tp_rank = flatten_tp_across_dp(dp_rank)

        if not use_ep:
            return FusedMoEParallelConfig(tp_size=tp_size,
                                          tp_rank=tp_rank,
                                          dp_size=dp_size,
                                          dp_rank=dp_rank,
                                          ep_size=1,
                                          ep_rank=0,
                                          use_ep=False)
        # DP + EP / TP + EP / DP + TP + EP
        assert use_ep
        # In EP, each device owns a set of experts fully. There is no tensor
        # parallel update tp_size, tp_rank, ep_size and ep_rank to reflect that.
        ep_size = tp_size
        ep_rank = tp_rank
        return FusedMoEParallelConfig(tp_size=1,
                                      tp_rank=0,
                                      dp_size=dp_size,
                                      dp_rank=dp_rank,
                                      ep_size=ep_size,
                                      ep_rank=ep_rank,
                                      use_ep=True)
```

vLLM-Ascend:
vLLM-Ascend uses `etp` to specify Tensor Parallel in MoE.
```
            self.ep_size = get_ep_group().world_size
            self.tp_size = get_etp_group().world_size
            self.dp_size = (dp_size if dp_size is not None else
                            get_dp_group().world_size)
```

So there will be conflicts if we simply combine these two pieces of code, as illustrated below.
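
To make the conflict concrete, here is a small hypothetical example (the numbers and variable names are illustrative and not taken from the PR) of how the two conventions can disagree about the `tp_size` used to split MoE weights:

```
# Hypothetical setup: 8 devices, tensor_parallel_size=8, vLLM-Ascend etp=1,
# with the MoE layers meant to run with expert parallelism only.
dp_size, tp_size, etp_size = 1, 8, 1

# vLLM-Ascend convention: MoE tensor parallelism is taken from the etp group,
# so each expert's weights stay whole on one device.
ascend_moe_tp_size = etp_size             # -> 1
ascend_moe_ep_size = tp_size // etp_size  # -> 8

# vLLM 0.9.0 convention: EP is only used when enable_expert_parallel is set.
# If the flag is not set, FusedMoEParallelConfig keeps tp_size == 8, and the
# weight loader would split each expert's weights 8 ways.
enable_expert_parallel = False
use_ep = dp_size * tp_size > 1 and enable_expert_parallel
vllm_moe_tp_size = 1 if use_ep else tp_size  # -> 8

# Passing vLLM's tp_size where the Ascend code expects the etp-based value
# splits the weights with the wrong factor, which is the kind of accuracy
# issue this PR fixes.
assert ascend_moe_tp_size != vllm_moe_tp_size
```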

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: angazenn <[email protected]>
Co-authored-by: angazenn <[email protected]>