UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474
Conversation
Can we also avoid the fence if all previously issued operations have already completed?
test/gtest/ucp/test_ucp_rma.cc (Outdated)

```cpp
if (op == OP_ATOMIC) {
    perform_nbx_with_fence(op, sbuf.ptr(), sizeof(uint32_t),
                           (uint64_t)rbuf.ptr(), rkey);
```
Can we pass rbuf.ptr() as void* as well?
I guess perform_nbx_with_fence could convert the void* to uint64_t itself, since it calls ucp_put_nbx directly.
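A minimal sketch of what moving the conversion into the helper could look like (the wrapper name and this exact signature are assumptions for illustration, not the PR's actual test code; ucp_put_nbx itself takes the remote address as uint64_t):

```c
#include <stdint.h>
#include <ucp/api/ucp.h>

/* Hypothetical helper: accept the remote buffer as void* and perform the
 * uint64_t conversion in one place before calling ucp_put_nbx(). */
static ucs_status_ptr_t put_nbx_voidp(ucp_ep_h ep, const void *sbuf,
                                      size_t size, void *rbuf,
                                      ucp_rkey_h rkey,
                                      const ucp_request_param_t *param)
{
    return ucp_put_nbx(ep, sbuf, size, (uint64_t)(uintptr_t)rbuf, rkey,
                       param);
}
```

This keeps the cast out of every call site in the test, so the gtest code can pass rbuf.ptr() directly.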
src/ucp/rma/put_offload.c (Outdated)

```c
req->send.ep->ext->unflushed_lanes |=
        UCS_BIT(spriv->super.lane) &
        -!!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED);
req->flags |= UCP_REQUEST_FLAG_PROTO_INITIALIZED;
```
How does the assembly compare? Seems more complicated IMO.
I used Godbolt to check:

```c
#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG) {
        *value |= LANE;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= LANE & -!!(flags & FLAG);
}
```

```asm
update_with_branch:
        and     esi, 8
        je      .L1
        or      QWORD PTR [rdi], 6
.L1:
        rep ret
update_branchless:
        sal     rsi, 60
        sar     rsi, 63
        and     esi, 6
        or      QWORD PTR [rdi], rsi
        ret
```

This shows that `update_with_branch` includes a conditional branch (`je .L1`), while `update_branchless` avoids branching entirely by using bitwise operations.
But actually we have 2 flags, can you check in Godbolt something like:

```c
#include <stdint.h>

#define FLAG1 0x8
#define FLAG2 0x20

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG1) {
        *value |= FLAG2;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= FLAG2 & -!!(flags & FLAG1);
}
```
I've updated my original comment @yosefe
BTW, maybe we could update unflushed_lanes unconditionally?
I think that's the best solution so far; pushed another commit.
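For reference, a minimal self-contained sketch of the unconditional update on a reduced model of the endpoint state (the struct and helper here are illustrative, not the actual UCP types). Setting a bit that is already set is idempotent, so no branch or flag-derived mask is needed:

```c
#include <stdint.h>

#define UCS_BIT(i) (1ul << (i)) /* same idiom as the UCS macro */

/* Hypothetical reduced model of the endpoint extension: a bitmap of lanes
 * that have operations which were not flushed yet. */
typedef struct {
    uint64_t unflushed_lanes;
} ep_ext_t;

/* Unconditional variant: always record the lane. |= on an already-set bit
 * is a no-op, so correctness does not depend on any request flag. */
static void mark_lane_unflushed(ep_ext_t *ext, unsigned lane)
{
    ext->unflushed_lanes |= UCS_BIT(lane);
}
```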
Already approved + minor CI fix.
@michal-shalev please squash.
What?
Ensure a Strong fence is used only in scenarios where it is genuinely required:
- Add unflushed_lanes to the EP struct
- Add fence_seq to the EP and worker structs
- Add a UCP_FENCE_MODE_EP_BASED fence mode

Why?
The current implementation of ucp_worker_fence always uses a Strong fence, regardless of whether multiple lanes were used for operations. This inefficiency occurs because the system lacks runtime information on which lanes were used, and it leads to suboptimal performance in Single-Rail scenarios even though a Weak fence would suffice.

How?
To dynamically select between Strong and Weak fence modes, this PR introduces a mechanism that tracks lane usage at runtime and applies the appropriate fencing at the EP level.
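As a hedged illustration of the selection logic (the helper name and exact policy are assumptions for this sketch, not the PR's internals): with a per-EP bitmap of unflushed lanes, the strong-vs-weak decision can reduce to checking whether more than one lane bit is set.

```c
#include <stdint.h>

/* Hypothetical decision helper: a Weak fence only orders operations within
 * a single lane, so it suffices when at most one lane has unflushed
 * operations; otherwise a Strong fence is required. */
static int fence_needs_strong(uint64_t unflushed_lanes)
{
    /* x & (x - 1) clears the lowest set bit; a nonzero result means at
     * least two bits were set, i.e. operations on multiple lanes. */
    return (unflushed_lanes & (unflushed_lanes - 1)) != 0;
}
```

This shape would also cover the earlier review suggestion: if unflushed_lanes is zero, no previously issued operations are pending on any lane, and the fence could be skipped entirely.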