TL/MLX5: fix fences in a2a's WQEs #1069

Open
wants to merge 4 commits into master

Conversation

samnordmann
Collaborator

samnordmann commented Jan 9, 2025

[Edited]

What

Fix fences in WQEs.

QPs used by each node leader:

  • a QP with a loop-back connection, which we call the "UMR QP"
  • one QP per remote peer (plus one with a loop-back connection), which we call the "RDMA QPs"

Here are the different WQEs in the algorithm:

  1. UMR WQE posted to the "UMR QP". The CPU then blocks until completion of this WQE.
  2. Transpose WQE + RDMA write WQE + atomic fetch_and_add WQE, all posted to the "RDMA QP"
  3. Wait-on-data WQE posted to the "UMR QP"

Conclusion regarding flags (see the sketch after this list):

  • UMR WQE doesn't need any fence since it is the first WQE posted.
  • The transpose WQE, which consumes the UMR, ~~needs a small fence.~~ doesn't need any fence.
  • RDMA write, which consumes the transpose, needs a small fence; a small fence is enough because it only uses local data.
  • The atomic fetch_and_add WQE ~~needs a strong fence because it signals completion of the RDMA write to the remote peer~~ doesn't need any fence.
  • Wait-on-data doesn't need any fence.
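
For illustration only, here is a minimal sketch (not the PR's code; the enum and the helper a2a_fence_flags are hypothetical names) of how the conclusions above could translate into the fm_ce_se flags of each WQE's ctrl segment, using the MLX5_WQE_CTRL_* constants from infiniband/mlx5dv.h:

#include <infiniband/mlx5dv.h> /* MLX5_WQE_CTRL_CQ_UPDATE, MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE */
#include <stdint.h>

/* Hypothetical labels for the WQEs listed above. */
enum a2a_wqe {
    WQE_UMR,          /* posted first on the UMR QP; CPU waits for its CQE */
    WQE_TRANSPOSE,    /* consumes the UMR, which has already completed     */
    WQE_RDMA_WRITE,   /* consumes the transpose output (local data only)   */
    WQE_ATOMIC_FADD,  /* executed in order after the write on the same QP  */
    WQE_WAIT_ON_DATA, /* posted on the UMR QP                              */
};

/* Sketch: fence/completion flags per WQE, following the conclusions above. */
static uint8_t a2a_fence_flags(enum a2a_wqe wqe)
{
    switch (wqe) {
    case WQE_UMR:
        /* No fence needed; request a CQE so the CPU can block on completion. */
        return MLX5_WQE_CTRL_CQ_UPDATE;
    case WQE_RDMA_WRITE:
        /* Small fence: the dependency (transpose output) is local data only. */
        return MLX5_WQE_CTRL_INITIATOR_SMALL_FENCE;
    case WQE_TRANSPOSE:
    case WQE_ATOMIC_FADD:
    case WQE_WAIT_ON_DATA:
    default:
        return 0; /* no fence */
    }
}

Note that, under these conclusions, no WQE ends up needing the strong fence (MLX5_WQE_CTRL_FENCE).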

@raminudelman

raminudelman commented Jan 9, 2025

Regarding:

UMR WQE doesn't need any fence since it is the first WQE posted

Do we have a single UMR per AllToAll operation? If the answer is yes - then I agree, no need for any fence (assuming we also have a barrier and the UMR does not change/modify any Mkey that maps memory that might be accessed by previous operations).

@samnordmann
Collaborator Author

Regarding:

UMR WQE doesn't need any fence since it is the first WQE posted

Do we have a single UMR per AllToAll operation? If the answer is yes - then I agree, no need for any fence (assuming we also have a barrier and the UMR does not change/modify any Mkey that maps memory that might be accessed by previous operations).

We have two UMRs per AllToAll, one for the send key and one for the recv key. Indeed, the src and recv buffers are assumed to be ready when the collective is called, and the UMRs are the first WQEs posted.

struct mlx5dv_qp_ex * mqp = mlx5dv_qp_ex_from_ibv_qp_ex(qp_ex);
struct mlx5_wqe_ctrl_seg * ctrl;
struct mlx5_wqe_umr_ctrl_seg * umr_ctrl_seg;
/* fm_ce_se carries the fence (FM), completion (CE) and solicited-event (SE) flags of the WQE's ctrl segment */
uint8_t fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;

It hangs because, if you don't set CE to 0x2 (as MLX5_WQE_CTRL_CQ_UPDATE does), you won't get a CQE for the WQE you're posting...
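
As a hedged illustration of this point (not part of the PR): MLX5_WQE_CTRL_CQ_UPDATE encodes CE = 0x2 in bits [3:2] of the ctrl segment's fm_ce_se byte, which is what requests a CQE for the posted WQE.

#include <infiniband/mlx5dv.h>
#include <assert.h>
#include <stdint.h>

/* Illustration only: MLX5_WQE_CTRL_CQ_UPDATE sets CE = 0x2 in bits [3:2] of
 * fm_ce_se; without it the HCA generates no CQE for this WQE, so a CPU
 * waiting on that completion would hang. */
static void check_cq_update_requests_cqe(void)
{
    uint8_t fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
    assert(((fm_ce_se >> 2) & 0x3) == 0x2); /* CE == 0x2 -> CQE is generated */
}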

@samnordmann
Collaborator Author

samnordmann commented Jan 10, 2025

Regarding:

UMR WQE doesn't need any fence since it is the first WQE posted

Do we have a single UMR per AllToAll operation? If the answer is yes - then I agree, no need for any fence (assuming we also have a barrier and the UMR does not change/modify any Mkey that maps memory that might be accessed by previous operations).

We have two UMRs per AllToAll, one for the send key and one for the recv key. Indeed, the src and recv buffers are assumed to be ready when the collective is called, and the UMRs are the first WQEs posted.

@raminudelman

Sorry, my previous PR description and analysis was misleading. I edited the description and changed the flags further; please see the last commit.

@raminudelman

I'm not sure why:

wait-on-data posted to "UMR QP"

Since, IIUC, it can also be posted on an "RDMA QP", but I don't think it matters.

Another note:
A clear optimization opportunity left on the table here is the fact that the CPU blocks until the UMR WQE is completed. This latency is exposed (from UCC's AllToAll perspective...). I would want to see a breakdown of how much latency we expose here at different system scales. If it's <5% of the AllToAll latency, there is probably no need to try to optimize it, but if it's more, I think it might be worthwhile. I expect it to be more visibly exposed at small scale than in large-scale systems.

@raminudelman

Regarding

atomic fetch_and_add WQE needs a strong fence because it signals completion of RDMA write to the remote peer

IIRC, InfiniBand's ordering semantics guarantee that Atomic operations are executed in order (according to the message/PSN ordering) on the responder side. So, no need to indicate "fence" in atomic operation on the requestor side. @samnordmann, please double check this.

@samnordmann
Collaborator Author

I'm not sure why:

wait-on-data posted to "UMR QP"

Since, IIUC, it can also be posted on an "RDMA QP", but I don't think it matters.

Right, it's just implemented this way, but it doesn't have to be. In principle, it also allows the Wait-on-Data and the RDMA operations to be processed in parallel, even though I doubt it brings a concrete benefit.

Another note:
A clear optimization opportunity left on the table here is the fact that the CPU blocks until the UMR WQE is completed. This latency is exposed (from UCC's AllToAll perspective...). I would want to see a breakdown of how much latency we expose here at different system scales. If it's <5% of the AllToAll latency, there is probably no need to try to optimize it, but if it's more, I think it might be worthwhile. I expect it to be more visibly exposed at small scale than in large-scale systems.

If we want to avoid blocking the CPU, it means we rely on the QP ordering. So, IIUC, what you suggest is relevant only for WQEs posted on the UMR QP, which has a loop-back connection (does it need to be loop-back? need to double-check). So the only possible WQEs we can post there are the local Transpose + LDMA, right?

Besides, when reusing the same src/dst buffer, the registration is cached, so the UMR is amortized across iterations. So the UMR does not represent a significant latency in that case.

IIRC, InfiniBand's ordering semantics guarantee that Atomic operations are executed in order (according to the message/PSN ordering) on the responder side. So, no need to indicate "fence" in atomic operation on the requestor side. @samnordmann, please double check this.

got it, thanks!

@manjugv
Contributor

manjugv commented Jan 22, 2025

@raminudelman is this good from your perspective?

@raminudelman

@manjugv ,

I had a talk with @samnordmann last week, and this PR is indeed good and fixes some inefficiency. There is still room for optimization here, but before implementing further changes (which might require much more effort), @samnordmann and I agreed that deeper profiling of the current performance and bottlenecks is required to really know where optimizations should be applied.
