ucx-py perf test #891
Hi @TimZaman, welcome back! We do have something similar to that: the ucx-py send-recv benchmark. Could you also clarify what you mean by "< 10% of bandwidth"? Do you mean less than 10% of the theoretical expected bandwidth? Does the same apply to "90% line-rate" (i.e., 90% of theoretical bandwidth)?
Awesome! This benchmark suite was exactly what I needed! I'm testing with 64KB message sizes, and with …
Actually, even with default settings, ucx-py does not seem to get close to the theoretical bandwidth (approx. 12 GB/s here). Below I show two outputs: first the vanilla ucx_perftest run, then the ucx-py benchmark.
What kind of system do you have there? Given the rate you're achieving, I'm assuming you're running with InfiniBand. If InfiniBand is indeed the case, I think you're hitting one specific regression/corner case we see in https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html : if you go to NumPy async/RC (top row, third column) and select 1.12.0, there is a dip we never actually had the chance to investigate properly. Depending on the hardware you have available, we might be able to determine whether that is exactly what you're hitting. In general, I would expect UCX-Py performance to be very close to UCX (70-80%+) for large enough message sizes (4MB or 8MB).

Besides that, UCX-Py overrides a handful of UCX configurations, and undoing some of them may help for CPU cases. For example, here I have:

Async UCX-Py defaults
Reverted Async UCX-Py defaults
But I'm actually surprised how much worse CPU transfers perform with the async interface.

Reverted Sync UCX-Py defaults
Most of these overridden defaults are either there to work around UCX bugs/limitations or to target better performance for GPU workflows. We focus more on GPU (perhaps too much) and maybe neglect CPU (which needs to be improved), but in the GPU case we are much closer to UCX performance:

UCX CUDA
Sync UCX-Py CUDA
Async UCX-Py CUDA (now we see the async bottleneck)
We have a complete rewrite of UCX-Py in C++ coming up, which will also allow multi-threading. Would you mind telling us more about your expected use case (types of compute and interconnect devices, message sizes you expect to perform well, whether you're using the Python sync or async interfaces, etc.)? Anything you can tell us is useful.
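If it helps, you can inspect the configuration UCX-Py actually ends up using, including the overridden values; a quick sketch:

```python
import ucp

ucp.init()

# Print the full UCX configuration as seen by UCX-Py, which includes
# any defaults UCX-Py sets on top of plain UCX.
for key, value in sorted(ucp.get_config().items()):
    print(f"{key}={value}")
```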
I think this might be because we set some defaults differently from UCX, with the aim of getting slightly better performance for GPU-to-GPU messages; see details here. Can you try with some of those settings reverted? For example (an illustrative sketch only; the exact variables and values UCX-Py overrides are listed in the configuration docs linked above, and whether environment variables take precedence over UCX-Py's built-in defaults may depend on the version):
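```python
import os

# Illustrative only: reset a couple of rendezvous-related settings back to
# UCX's own defaults before UCX-Py initializes. The actual list of values
# UCX-Py overrides is documented in the configuration page linked above.
os.environ.setdefault("UCX_RNDV_THRESH", "auto")
os.environ.setdefault("UCX_RNDV_SCHEME", "auto")

import ucp  # imported after setting the environment

ucp.init()
```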
In addition, by default the ucx-py send-recv benchmark also measures the memory allocation cost as part of the message ping-pong time (I think ucx_perftest does not). You can get the behaviour with buffer reuse by passing the benchmark's reuse-allocation option.
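As a rough illustration of how much a fresh 64 KiB allocation per iteration can contribute at that message size (a standalone NumPy sketch, not the ucx-py benchmark itself):

```python
import time
import numpy as np

NBYTES = 64 * 1024   # 64 KiB, the message size discussed above
ITERS = 10_000

# Allocate a fresh buffer on every iteration, as the benchmark's default
# path does per the comment above.
start = time.perf_counter()
for _ in range(ITERS):
    buf = np.empty(NBYTES, dtype="u1")
per_iter_alloc = time.perf_counter() - start

# Allocate once and reuse the same buffer across iterations.
buf = np.empty(NBYTES, dtype="u1")
start = time.perf_counter()
for _ in range(ITERS):
    buf[0] = 0  # touch the reused buffer so the loop body is not empty
reuse = time.perf_counter() - start

print(f"fresh allocation each iteration: {per_iter_alloc:.4f} s")
print(f"single reused allocation:        {reuse:.4f} s")
```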
I think this is a consequence of not using the buffer-reuse option mentioned above.
Very good catch, this is less of a problem when using RMM because of the pool. Updated numbers (now equal to UCX for Python sync, and ~70% performance with Python async):

Reverted Async UCX-Py defaults

Reverted Sync UCX-Py defaults
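For the CUDA numbers, the pool mentioned above can be set up along these lines (a sketch; the exact RMM calls and the allocator helper's location vary between RMM versions):

```python
import cupy
import rmm

# Create a pooled device-memory allocator so repeated message allocations
# are served from the pool rather than by cudaMalloc every time.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

# Route CuPy allocations through the RMM pool (helper location differs
# between RMM versions).
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)
```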
Also, this discussion makes me realize the performance drop we see in https://raw.githack.com/pentschev/ucx-py-ci/test-results/assets/ucx-py-bandwidth.html during 1.12.0 is actually because we introduced the …
I have been using ucx_perftest to successfully confirm UCX performing close to line-rate. I then took ucx-py (import ucp) and used the send/receive example here: https://ucx-py.readthedocs.io/en/latest/quickstart.html#send-recv-numpy-arrays. I mutated this script in a few versions, and I am getting odd performance characteristics: … Is there something like ucx_perftest but for ucx-py?

(PS, I was a first engineer on the AI-Infra org at NVIDIA, hiiiii! 👋)
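(For reference, a minimal ping-pong timing variation of that quickstart example might look like the sketch below; the port, iteration count, and bandwidth arithmetic are illustrative assumptions, not the exact script used here.)

```python
import asyncio
import time

import numpy as np
import ucp

N_BYTES = 64 * 1024   # 64 KiB messages, as in the discussion
N_ITER = 1000
PORT = 13337          # arbitrary port for this sketch

async def serve(ep):
    # Echo server: receive each message and send it straight back.
    buf = np.empty(N_BYTES, dtype="u1")
    for _ in range(N_ITER):
        await ep.recv(buf)
        await ep.send(buf)

async def main():
    listener = ucp.create_listener(serve, PORT)
    ep = await ucp.create_endpoint(ucp.get_address(), PORT)

    msg = np.ones(N_BYTES, dtype="u1")
    out = np.empty(N_BYTES, dtype="u1")

    start = time.perf_counter()
    for _ in range(N_ITER):
        await ep.send(msg)
        await ep.recv(out)
    elapsed = time.perf_counter() - start

    # Two transfers (send + recv) per iteration.
    gbps = 2 * N_ITER * N_BYTES / elapsed / 1e9
    print(f"{gbps:.2f} GB/s")

    await ep.close()
    listener.close()

asyncio.get_event_loop().run_until_complete(main())
```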