Low performance when running over NVLink #3
Comments
Experiencing the same.
Ahh, yeah, this is real, and glad to see it working with 3090s. I have only tested on 4090s, where there's no NVLink to worry about. The driver is forcing P2P to go through PCIe; I'm sure there's a way to avoid that. Would merge a PR that fixes this, and I doubt it's too hard. However, we are only maintaining this driver for tinybox, so the fix would have to come from an external contributor.
@geohot if you connect two Tinyboxes together, will it allow for the GPU in one box to communicate P2P with a GPU from the second box if you connect them with Mellanox adapter cards in the OCP slots?
@zvorinji I thought about it before, but it seems impractical both economically and technically.
Wondering if anybody has found a workaround. I'm planning on using the driver with my NVLinked 3090s.
I'm also curious about this. Would be nice to be able to use this with NVLINK 3090s.
So imagine this. A server with 4-6x 3090s. The pairs have NVLink, but there is no P2P between the pairs. If the driver could respect both PCIe and NVLink access, you'd have a heck of a machine for training or for tools that support peer access.
NVIDIA Open GPU Kernel Modules Version
Comparing with NVIDIA commit 12933b2
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Ubuntu 22.04.4 LTS
Kernel Release
5.15.0-102-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
NVIDIA GeForce RTX 3090
Describe the bug
Thank you for this project! It seems to be working well on 3090s. However, NVLink seems to underperform with this fork.
In the results below, the variation in the performance of PCIe GPUs is caused by differing PCIe versions and lanes. GPUs 2 and 3 are connected via NVLink (4 lanes, 56.25GB/s theoretical unidirectional performance). They are also connected via PCIe Gen 4 x8 (25GB/s theoretical unidirectional performance).
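As a rough sanity check on the theoretical figures above, the per-lane NVLink rate can be backed out from the quoted total. This is only a sketch: it assumes the 56.25GB/s is split evenly across the 4 lanes, takes the 25GB/s PCIe figure as reported, and uses illustrative variable names.

```python
# Theoretical unidirectional figures as quoted in the report (GB/s).
NVLINK_LANES = 4
NVLINK_TOTAL_GBPS = 56.25
PCIE_TOTAL_GBPS = 25.0  # taken as reported, not re-derived here

# Assumption: NVLink bandwidth is split evenly across lanes.
per_lane = NVLINK_TOTAL_GBPS / NVLINK_LANES

# How much faster NVLink should be than the PCIe path, on paper.
ratio = NVLINK_TOTAL_GBPS / PCIE_TOTAL_GBPS

print(f"NVLink per lane: {per_lane} GB/s")
print(f"NVLink vs PCIe (theoretical): {ratio:.2f}x")
```

On paper, then, the NVLink path between GPUs 2 and 3 should be a bit more than twice as fast as their PCIe path.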
Running p2pBandwidthLatencyTest with this fork:
With the original open-source driver:
We can see that this fork improves P2P performance over PCIe as expected (e.g. 15.80 GB/s -> 50.20 GB/s). However, the NVLink performance (GPUs 2 and 3) decreases from ~100 GB/s to ~17 GB/s.
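To put the two changes side by side, here is a small sketch that turns the numbers quoted above into ratios. The NVLink values are the approximate ones from the text (~100 and ~17 GB/s), so the second ratio is only indicative.

```python
# Measured bandwidths quoted above (GB/s): original driver vs this fork.
pcie_before, pcie_after = 15.80, 50.20
nvlink_before, nvlink_after = 100.0, 17.0  # approximate values from the text

# Improvement on the PCIe pairs and regression on the NVLink pair.
pcie_speedup = pcie_after / pcie_before
nvlink_slowdown = nvlink_before / nvlink_after

print(f"PCIe pairs: {pcie_speedup:.2f}x faster with this fork")
print(f"NVLink pair: {nvlink_slowdown:.2f}x slower with this fork")
```

That is roughly a 3x gain on the PCIe pairs against a nearly 6x loss on the NVLink pair. The ~17 GB/s result also sits below the 25GB/s PCIe figure quoted earlier, which would be consistent with the NVLink pair's traffic being routed over PCIe, though the numbers alone do not prove it.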
To Reproduce
Run p2pBandwidthLatencyTest with this fork and compare the results with the original open-source driver.
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response