@CUHKSZzxy commented Oct 30, 2025

Modifications

  1. Expose deepep env var

With the default DeepEP buffer SM count (num_sms), multi-node deployments on H200 fail with the assertion below. We therefore expose this setting as an environment variable so that users can configure it. A value that works on H200 is DEEPEP_BUFFER_NUM_SMS=16.

csrc/kernels/internode.cu:386, condition: ibgda_get_state()->num_rc_per_pe == num_channels or ibgda_get_state()->num_rc_per_pe >= num_sms

This is a known issue in DeepEP.
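As a minimal sketch of how such an environment variable might be read and validated, consider the helper below. Only the variable name DEEPEP_BUFFER_NUM_SMS comes from this PR; the function name `read_deepep_buffer_num_sms` and the choice to return `None` when the variable is unset are assumptions for illustration, not the code this PR adds.

```python
import os
from typing import Mapping, Optional

# Variable name taken from the PR description.
ENV_NAME = "DEEPEP_BUFFER_NUM_SMS"


def read_deepep_buffer_num_sms(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[int]:
    """Hypothetical helper: parse the SM count from the environment.

    Returns None when the variable is unset or empty, so the caller can
    keep whatever default the library uses.
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None or raw.strip() == "":
        return None
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{ENV_NAME} must be an integer, got {raw!r}") from exc
    if value <= 0:
        raise ValueError(f"{ENV_NAME} must be positive, got {value}")
    return value


if __name__ == "__main__":
    # Example: run with DEEPEP_BUFFER_NUM_SMS=16 set in the shell.
    print(read_deepep_buffer_num_sms())
```

Returning `None` rather than a hard-coded fallback keeps the sketch from assuming what the default SM count is.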

  2. Fix DeepGEMM
  • Pin DeepGEMM to an older version (1.0.0+03d0be3) to stay compatible with dlblas.
  • Add the JIT compilation dependencies that this specific DeepGEMM commit requires.
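To check at runtime that the pinned build is the one actually installed, something like the following sketch could be used. The version string is the one stated above; the distribution name `deep_gemm` and both helper names are assumptions for illustration and may not match how the package is published.

```python
from importlib import metadata
from typing import Optional

# Version string taken from the PR description.
PINNED_VERSION = "1.0.0+03d0be3"


def installed_version(dist_name: str = "deep_gemm") -> Optional[str]:
    """Return the installed version of dist_name, or None if it is absent.

    The default distribution name is an assumption, not confirmed by the PR.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_pinned(installed: Optional[str], pinned: str = PINNED_VERSION) -> bool:
    """True only when the installed version string equals the pinned one."""
    return installed == pinned


if __name__ == "__main__":
    found = installed_version()
    if not is_pinned(found):
        print(f"expected DeepGEMM {PINNED_VERSION}, found {found}")
```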

TODO

  • Test and try the DeepEP stable release v1.2.1
  • Test and try the DeepGEMM stable release

@CUHKSZzxy changed the title from "Fix ep" to "Fix ep deployment issues" Oct 30, 2025
@CUHKSZzxy marked this pull request as draft October 30, 2025 03:12
@windreamer self-requested a review October 30, 2025 03:47