UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence #10474

Contributor

@michal-shalev michal-shalev commented Feb 5, 2025

What?

Ensure Strong Fence is used only in scenarios where it is genuinely required.

  • Add unflushed_lanes to EP struct
  • Add fence_seq to EP and worker structs
  • Add UCP_FENCE_MODE_EP_BASED fence mode
  • Implement per-EP fences
  • Handle fence during one-sided operations (instead of worker fence)
  • Add a correctness test for EP-based fence mode
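The interplay of the two fence_seq fields listed above can be sketched as follows. This is only an illustration under stated assumptions: the struct and function names (sketch_worker_t, sketch_ep_t, sketch_worker_fence, sketch_ep_rma_start) are hypothetical and are not the UCX definitions, and the real fence is replaced by a counter. The assumption is that a worker fence only records a request, and each EP applies it lazily when its next one-sided operation starts.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the worker and EP structs. */
typedef struct {
    uint64_t fence_seq;          /* bumped by every worker-level fence request */
} sketch_worker_t;

typedef struct {
    sketch_worker_t *worker;
    uint64_t         fence_seq;      /* last worker fence this EP has honored */
    unsigned         fences_applied; /* illustration only: counts applied fences */
} sketch_ep_t;

/* The worker fence becomes cheap: it only records that a fence was requested. */
static void sketch_worker_fence(sketch_worker_t *worker)
{
    ++worker->fence_seq;
}

/* Called at the start of every one-sided operation on the EP. */
static void sketch_ep_rma_start(sketch_ep_t *ep)
{
    if (ep->fence_seq != ep->worker->fence_seq) {
        /* A fence was requested since this EP last sent: apply it now. */
        ++ep->fences_applied;    /* a real fence would be issued here */
        ep->fence_seq = ep->worker->fence_seq;
    }
}
```

With this shape, an EP that never sends after the fence pays nothing, and several fence requests in a row collapse into a single per-EP fence.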

Why?

The current implementation of ucp_worker_fence always uses a Strong Fence, regardless of whether operations were actually issued on more than one lane.
It has no alternative, because no runtime information is kept about which lanes were used.
As a result, Single-Rail scenarios pay the cost of a Strong Fence even though a Weak Fence would suffice.

How?

To dynamically select between Strong and Weak Fence modes, this PR introduces a mechanism that tracks lane usage at runtime and applies the appropriate fencing at the EP level.
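A minimal sketch of such a selection, assuming that a Weak Fence only orders operations within a single lane while a Strong Fence waits for all lanes. The enum and the function sketch_select_fence are hypothetical names used for illustration; the PR's actual code may differ.

```c
#include <stdint.h>

typedef enum {
    SKETCH_FENCE_NONE,   /* nothing was issued before, nothing to order */
    SKETCH_FENCE_WEAK,   /* all earlier operations share the next lane */
    SKETCH_FENCE_STRONG  /* earlier operations used some other lane */
} sketch_fence_t;

/*
 * unflushed_lanes: bit i is set if lane i carried operations since the
 *                  last flush (hypothetical bookkeeping).
 * next_lane:       lane the next one-sided operation will use.
 */
static sketch_fence_t sketch_select_fence(uint64_t unflushed_lanes,
                                          unsigned next_lane)
{
    uint64_t next_bit = UINT64_C(1) << next_lane;

    if (unflushed_lanes == 0) {
        return SKETCH_FENCE_NONE;
    }
    if ((unflushed_lanes & ~next_bit) == 0) {
        /* Single-rail case: ordering within the lane is enough. */
        return SKETCH_FENCE_WEAK;
    }
    return SKETCH_FENCE_STRONG;
}
```

For example, a mask of 0x4 with next_lane 2 selects the weak fence, while the same mask with next_lane 1 selects the strong one, because the earlier operations went out on a different lane.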

@michal-shalev michal-shalev added the WIP-DNM Work in progress / Do not review label Feb 5, 2025
@michal-shalev michal-shalev self-assigned this Feb 5, 2025
@michal-shalev michal-shalev marked this pull request as draft February 5, 2025 09:12
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Add unflushed_lanes to ucp_ep" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" Feb 5, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 3 times, most recently from 16299af to f438a28 Compare February 9, 2025 17:29
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 480b81a to e7a3f46 Compare February 9, 2025 18:06
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (POC)" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (PoC)" Feb 9, 2025
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 2 times, most recently from 089e5f7 to 9ce878a Compare February 10, 2025 09:14
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 9ce878a to 506f621 Compare February 10, 2025 10:05
Contributor

@brminich brminich left a comment

can we also avoid fence if all previously issued operations are already completed?
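One way to read this suggestion, sketched under assumptions: the EP keeps a count of operations that were issued but not yet completed, and the fence is skipped when that count is zero. The type sketch_ep_state_t and its helpers are hypothetical, and "completed" is assumed here to mean the operation is known to be finished at the target, not just handed to the transport.

```c
#include <stdint.h>

/* Hypothetical per-EP bookkeeping; not the UCX structures. */
typedef struct {
    uint64_t outstanding_ops;    /* issued but not yet completed */
} sketch_ep_state_t;

/* Called when a one-sided operation is posted. */
static void sketch_op_issued(sketch_ep_state_t *ep)
{
    ++ep->outstanding_ops;
}

/* Called from the completion path of that operation. */
static void sketch_op_completed(sketch_ep_state_t *ep)
{
    --ep->outstanding_ops;
}

/* A fence has nothing to order when no operation is still in flight. */
static int sketch_fence_needed(const sketch_ep_state_t *ep)
{
    return ep->outstanding_ops != 0;
}
```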

@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 5 times, most recently from 4e485d8 to 0a43c45 Compare February 23, 2025 14:54
@michal-shalev michal-shalev removed the WIP-DNM Work in progress / Do not review label Feb 23, 2025
@michal-shalev michal-shalev marked this pull request as ready for review February 23, 2025 15:11
@michal-shalev michal-shalev changed the title from "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence (PoC)" to "UCP/RMA/FLUSH: Dynamic Selection of Strong vs. Weak Fence" Feb 23, 2025
@michal-shalev michal-shalev added WIP-DNM Work in progress / Do not review and removed Ready for Review labels Feb 24, 2025

if (op == OP_ATOMIC) {
perform_nbx_with_fence(op, sbuf.ptr(), sizeof(uint32_t),
(uint64_t)rbuf.ptr(), rkey);

Contributor

can we pass rbuf.ptr() also as void*?

Contributor Author

AFAIU rbuf.ptr() cannot be passed as void* for remote_addr because ucp_put_nbx() explicitly requires a uint64_t remote address.

Contributor

i guess perform_nbx_with_fence could convert void* to uint64_t because it's calling ucp_put_nbx directly
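The conversion suggested here could look roughly like this. It is a sketch only: sketch_put is a hypothetical stand-in for a put call that takes a uint64_t remote address (as ucp_put_nbx does), and sketch_perform_with_fence is not the test helper's real signature. The point is that the helper accepts void* and performs the cast in one place, so callers no longer need it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a put call with a uint64_t remote address.
 * It just echoes the address so the conversion can be observed. */
static uint64_t sketch_put(const void *buffer, size_t length,
                           uint64_t remote_addr)
{
    (void)buffer;
    (void)length;
    return remote_addr;
}

/* Hypothetical helper: takes the target as void* and converts it
 * once, right before the call that needs the integer form. */
static uint64_t sketch_perform_with_fence(const void *sbuf, size_t size,
                                          void *target)
{
    /* Going through uintptr_t keeps the pointer-to-integer cast well defined. */
    return sketch_put(sbuf, size, (uint64_t)(uintptr_t)target);
}
```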

@gleon99 gleon99 requested a review from ofirfarjun7 March 9, 2025 14:30
@michal-shalev michal-shalev requested a review from yosefe March 9, 2025 20:29
Comment on lines 32 to 35
req->send.ep->ext->unflushed_lanes |=
UCS_BIT(spriv->super.lane) &
-!!(req->flags & UCP_REQUEST_FLAG_PROTO_INITIALIZED);
req->flags |= UCP_REQUEST_FLAG_PROTO_INITIALIZED;

Contributor

how does the assembly compare? seems more complicated IMO

Contributor Author

@michal-shalev michal-shalev Mar 10, 2025

I used Godbolt to check:

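#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

/* Sets the LANE bits only when FLAG is set, using a conditional branch. */
void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG) {
        *value |= LANE;
    }
}

/* Same result without a branch: -!!(x) is all ones when x != 0, else 0. */
void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= LANE & -!!(flags & FLAG);
}

Generated x86-64 assembly:

update_with_branch: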
        and     esi, 8
        je      .L1
        or      QWORD PTR [rdi], 6
.L1:
        rep ret
update_branchless:
        sal     rsi, 60
        sar     rsi, 63
        and     esi, 6
        or      QWORD PTR [rdi], rsi
        ret

This shows that update_with_branch includes a conditional branch (je .L1),
while update_branchless avoids branching entirely by using bitwise operations.
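The assembly shows the cost difference but not that the two variants agree. A small check along these lines could confirm it; sketch_variants_agree is a hypothetical helper added for illustration, and the two update functions are repeated only so the snippet compiles on its own. It assumes two's complement integers, where -1 has all bits set.

```c
#include <stdint.h>

#define FLAG 0x8
#define LANE 0x6

static void update_with_branch(uint64_t *value, uint64_t flags)
{
    if (flags & FLAG) {
        *value |= LANE;
    }
}

static void update_branchless(uint64_t *value, uint64_t flags)
{
    /* !!x is 0 or 1, so -!!x is 0 or -1 (all ones) and acts as a mask. */
    *value |= LANE & -!!(flags & FLAG);
}

/* Returns 1 if both variants produce the same result for every
 * 4-bit starting value combined with every 4-bit flags word. */
static int sketch_variants_agree(void)
{
    for (uint64_t start = 0; start < 16; ++start) {
        for (uint64_t flags = 0; flags < 16; ++flags) {
            uint64_t a = start;
            uint64_t b = start;

            update_with_branch(&a, flags);
            update_branchless(&b, flags);
            if (a != b) {
                return 0;
            }
        }
    }
    return 1;
}
```

Note that the branchless form always performs the read-modify-write on *value, even when the flag is clear and the value does not change.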

Contributor

but actually we have 2 flags, can you check in godbolt something like:

#define FLAG1 0x8
#define FLAG2 0x20

void update_with_branch(uint64_t *value, uint64_t flags) {
    if (flags & FLAG1) {
        *value |= FLAG2;
    }
}

void update_branchless(uint64_t *value, uint64_t flags) {
    *value |= FLAG2 & -!!(flags & FLAG1);
}

Contributor Author

I've updated my original comment @yosefe

Contributor

BTW, maybe we could update unflushed_lanes unconditionally?

Contributor Author

I think it's the best solution so far, pushed another commit
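For reference, an unconditional update could be as small as the sketch below. The struct sketch_ep_ext_t and the function sketch_mark_lane_used are hypothetical names, not the code that was pushed; the sketch only shows why dropping the condition is safe in principle: OR-ing in the same lane bit repeatedly leaves the mask unchanged.

```c
#include <stdint.h>

/* Hypothetical stand-in for the EP extension that holds the lane mask. */
typedef struct {
    uint64_t unflushed_lanes;    /* bit i set: lane i has unflushed operations */
} sketch_ep_ext_t;

/* Marks the lane as used on every call: no flag test, no branch. */
static void sketch_mark_lane_used(sketch_ep_ext_t *ext, unsigned lane)
{
    ext->unflushed_lanes |= UINT64_C(1) << lane;
}
```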

@michal-shalev michal-shalev requested a review from yosefe March 12, 2025 09:01
@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch 3 times, most recently from 0cbf6d0 to 16e7967 Compare March 13, 2025 12:12

@michal-shalev michal-shalev requested a review from yosefe March 16, 2025 14:11
@michal-shalev michal-shalev requested a review from yosefe March 16, 2025 15:42
yosefe previously approved these changes Mar 16, 2025
Contributor

@gleon99 gleon99 left a comment

Already approved + minor CI fix.
@michal-shalev please squash.

@michal-shalev michal-shalev force-pushed the dynamic-selection-of-strong-vs-weak-fence branch from 4f7cb74 to 0ccdfbe Compare March 16, 2025 16:50
@michal-shalev michal-shalev enabled auto-merge March 16, 2025 16:51
@michal-shalev michal-shalev merged commit f126bcb into openucx:master Mar 17, 2025
151 checks passed