Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/RNDV: Fixed assertion when scaled end offset > max_frag #10554

Merged

Conversation

iyastreb
Copy link
Contributor

What?

The max_frag assertion that was fixed with #10407 is still possible to happen with slightly different combination:
when ucp_proto_multi_scaled_length > lpriv->min_end_offset = max_frag

The root cause of this issue is weight calculation.
We calculate weights according to the lane BW/total BW. And we expect that max_frag ratio is the same, but it's not always the case. For example, there is a corner case when max_frag of the maximal BW lane is smaller than max_frag of the slow lane.

In theory we should calculate lane weight according to lane max_frag/total max_frag, but this requires quite some changes in proto_multi and might be risky (tried that).

virtual void init() override
{
/* Device with high BW and lower latency */
add_mock_iface("mock_1:1", [](uct_iface_attr_t &iface_attr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any generic way to abstract numbers and have a set of parameters that would allow:

add_mock_iface("moc1", iface_high_bw_low_lat));
add_mock_iface("moc1", iface_mid_bw_high_lat));
...
...

this would be used in all tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, good idea, we can think of something like that
But for this particular test what matters most is very specific iface_attr.cap.get.max_zcopy = 16384 setting, which I think we don't want to encapsulate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then below, or augment add_mock_iface to also support vector of lambda?

add_mock_iface("mock1", [](uct_iface_attr &iface_attr) {
    iface_high_bw_low_lat(iface_attr);
    iface_attr.cap.get_max_zcopy = 16384;
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done
But we don't save anything in terms of lines of code, because now we have 2 opposite cases:
test_ucp_proto_mock_rcx: 1 dev with high BW and high lat + 1 dev with low BW and low lat
test_ucp_proto_mock_rcx2: 1 dev with high BW and low lat + 1 dev with low BW and high lat

So currently we don't reuse much if we add these 4 new functions.
Maybe we can add those wrappers if we have more similar use cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regardless of line count of if we want to do it in this pr, it could be helpful if we can have "profiles" that we can use instead of direct numbers that will augment when we add tests. for instance we already have the numbers below plus the ones this PR is adding.

grep -r mock test/gtest/ucp/ | grep mock.*\} | cut -d: -f 2
        {0,     200,   "short",                "rc_mlx5/mock_1
        {201,   6650,  "copy-in",              "rc_mlx5/mock_1
        {6651,  8246,  "zero-copy",            "rc_mlx5/mock_1
        {8247,  21991, "multi-frag zero-copy", "rc_mlx5/mock_1
                       "rc_mlx5/mock_0
        {0,     200,   "short",                "rc_mlx5/mock_1
        {201,   6650,  "copy-in",              "rc_mlx5/mock_1
        {6651,  8246,  "zero-copy",            "rc_mlx5/mock_1
        {8247,  22502, "multi-frag zero-copy", "rc_mlx5/mock_1
                       "rc_mlx5/mock_0
        {0,     200,   "short",                "rc_mlx5/mock_1
        {201,   6650,  "copy-in",              "rc_mlx5/mock_1
        {6651,  8246,  "zero-copy",            "rc_mlx5/mock_1
        {8247,  19883, "multi-frag zero-copy", "rc_mlx5/mock_1
         "47% on rc_mlx5/mock_1
        {5346, INF,  "rendezvous zero-copy read from remote", "cma/mock"},
        {0,      8184,   "short",                                       "tcp/mock"},
        {8185,   65528,  "zero-copy",                                   "tcp/mock"},
        {65529,  366864, "multi-frag zero-copy",                        "tcp/mock"},
        {366865, INF,    "rendezvous zero-copy fenced write to remote", "tcp/mock"},
        {0,    2284, "short",                                     "self/mock"},
        {2285, INF,  "rendezvous copy from mapped remote memory", "self/mock"},
        {0, 0,    "short",   "rc_mlx5/mock"},
        {1, 8246, "copy-in", "rc_mlx5/mock"},
         "rc_mlx5/mock"},
         "rc_mlx5/mock"},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, so you are proposing to generalize not only config part, but also verification?
I don't mind to do this refactoring later, it might not be that straightforward - especially on the verification side.
And this PR I would prefer to keep focused on the bugfix

UCS_TEST_P(test_ucp_proto_mock_rcx2, rndv_send_recv_small_frag,
"IB_NUM_PATHS?=2", "MAX_RNDV_LANES=2", "RNDV_THRESH=0")
{
for (size_t i = 1024; i <= 65536; i += 1024) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i -> msg_size
also can use UCS_KBYTE constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment on lines 608 to 611
for (size_t msg_size = UCS_KBYTE; msg_size <= 64 * UCS_KBYTE;
msg_size += UCS_KBYTE) {
send_recv_am(msg_size);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add helper function for this loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, added

@iyastreb iyastreb force-pushed the ucp/proto/max-frag-scaled-payload-assertion branch from 9fe2922 to 1bc4bb5 Compare March 17, 2025 13:19
@brminich brminich merged commit c3b2492 into openucx:master Mar 19, 2025
151 checks passed
@iyastreb iyastreb deleted the ucp/proto/max-frag-scaled-payload-assertion branch March 19, 2025 12:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants