Multi-Object (MO) UCX backend implementation (rebased over code restructuring PR) #58

Merged: artpol84 merged 3 commits into main from origin/topic/multi-object/final on Mar 19, 2025

Contributor

@artpol84 artpol84 commented Mar 19, 2025

Replaces PR #30 (see unit test outputs there)

Adds a new UCX-based backend that allows associating NIXL
logical "devices" with different UCX workers.
The primary motivation is that UCX v1.18 doesn't
support more than one GPGPU per UCX context.

NOTE: this is expected to be fixed in UCX v1.19,
so this backend can be viewed as a workaround
unless other uses are found.

Limitations:

  • Doesn't support loopback transfers
  • Doesn't support VRAM

Both limitations will be removed in follow-up PRs.
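
As a rough illustration of the idea in the description, the sketch below binds logical device ids to worker slots so that each device is always served by the same worker. It is not the code from this PR: `WorkerSlot`, `DeviceWorkerMap`, `workerFor`, and the round-robin binding policy are all hypothetical, and no real UCX or NIXL calls are made. In an actual backend each slot would presumably own its own UCX context and worker.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a per-worker UCX object. A real backend would
// hold a UCX context/worker pair here instead of a plain index.
struct WorkerSlot {
    std::size_t index;
};

// Hypothetical helper: maps logical device ids to worker slots.
// The round-robin policy is an assumption made for this sketch.
class DeviceWorkerMap {
public:
    explicit DeviceWorkerMap(std::size_t num_workers) {
        // Always keep at least one worker so lookups cannot fail.
        if (num_workers == 0) num_workers = 1;
        for (std::size_t i = 0; i < num_workers; ++i) {
            workers_.push_back(WorkerSlot{i});
        }
    }

    // Return the worker bound to device_id, binding it on first use.
    const WorkerSlot& workerFor(int device_id) {
        auto it = binding_.find(device_id);
        if (it == binding_.end()) {
            it = binding_.emplace(device_id, next_).first;
            next_ = (next_ + 1) % workers_.size();
        }
        assert(it->second < workers_.size());
        return workers_[it->second];
    }

    std::size_t numWorkers() const { return workers_.size(); }

private:
    std::vector<WorkerSlot> workers_;
    std::unordered_map<int, std::size_t> binding_;  // device id -> worker index
    std::size_t next_ = 0;                          // next worker to hand out
};
```

With two workers, the first two devices seen land on different workers, and a repeated lookup for the same device returns the same worker, which is the property that keeps each device on its own worker.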

Contributor

@tstamler tstamler left a comment


I think it looks really good, just need supportsLocal to return false and a few extra comments.
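
A minimal sketch of what the requested change could look like, assuming `supportsLocal` is a capability query that callers check before posting a loopback (same-agent) transfer. Only the name `supportsLocal` comes from the comment above. The interface, the class names, the `canPost` helper, and the integer agent ids are hypothetical and are not NIXL's actual API.

```cpp
#include <cassert>

// Hypothetical backend interface for illustration only.
class BackendSketch {
public:
    virtual ~BackendSketch() = default;
    // Assumed meaning: true if the backend can transfer within one agent.
    virtual bool supportsLocal() const = 0;
};

// Hypothetical multi-object backend: loopback is listed as unsupported in
// the PR description, so the capability query reports false.
class MultiObjectBackendSketch : public BackendSketch {
public:
    bool supportsLocal() const override { return false; }
};

// Hypothetical caller-side guard: reject loopback transfers on backends
// that do not advertise local support.
bool canPost(const BackendSketch& backend, int src_agent, int dst_agent) {
    assert(src_agent >= 0 && dst_agent >= 0);
    const bool is_loopback = (src_agent == dst_agent);
    return !is_loopback || backend.supportsLocal();
}
```

Reporting the limitation through the capability query lets a caller fall back to another backend for local transfers instead of failing at transfer time.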

@tstamler
Contributor

> Replaces PR #30 (see unit test outputs there)

One other thing: PR #30 doesn't really describe the changes or why they are necessary. Can you add that to the description of this PR?

@artpol84
Contributor Author

I have updated the PR description and will address the review comments shortly.

@artpol84 artpol84 force-pushed the origin/topic/multi-object/final branch from 7b022ac to 9f00728 Compare March 19, 2025 15:01
@artpol84 artpol84 force-pushed the origin/topic/multi-object/final branch from 9f00728 to 297bdc7 Compare March 19, 2025 15:04
Isolate header files from compile-time dependencies
(may not be required if global config.h is distributed
along with header files)

Signed-off-by: Artem Y. Polyakov <[email protected]>
Progress may be required for both ucx1 and ucx2
in different stages

Signed-off-by: Artem Y. Polyakov <[email protected]>
@artpol84 artpol84 force-pushed the origin/topic/multi-object/final branch from 297bdc7 to e3db6f3 Compare March 19, 2025 15:07
Add new UCX-based backend that allows associating NIXL
logical "devices" with different UCX workers.
The primary motivation is that UCX v1.18 doesn't
support more than one GPGPU per UCX context.

NOTE: this is expected to be fixed in UCX v1.19
so this backend might be viewed as a workaround
unless other uses will be found.

Signed-off-by: Artem Y. Polyakov <[email protected]>
@artpol84 artpol84 merged commit b008515 into main Mar 19, 2025
6 checks passed
@artpol84 artpol84 deleted the origin/topic/multi-object/final branch March 19, 2025 19:05
4 participants