-
I have been experiencing problems with multi-GPU model loading in the releases after v0.0.11. My system has an RTX 4080 Laptop GPU and an RTX 3090, for a total of 36GB of VRAM. Up to v0.0.11, exllamav2 ran smoothly, using all available VRAM, and I was able to load models such as Llama-70B at 3.8 bpw. After updating to versions beyond v0.0.11, however, the program crashes when VRAM usage reaches approximately 28GB. Interestingly, loading models that need around 20GB of VRAM still works fine. I have tried updating the GPU driver, changing the GPU order, and resetting my system environment, but nothing changes. Has anyone else encountered this issue and found a solution?
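For context, a minimal sketch of the loading path I mean, assuming the standard exllamav2 Python autosplit flow; the model directory below is just a placeholder:

```python
# Sketch of the autosplit load that crashes for me once allocation passes ~28GB.
# The model directory is a placeholder; the calls follow the exllamav2 examples.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-70B-3.8bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can place it per GPU
model.load_autosplit(cache)               # fails around ~28GB of total VRAM use
```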
-
Do you have any more information? Which models are you trying to load, what is the error message, etc.?
-
I have similar issues on a dual-3090 system. I can't really tell when it started. When I load models close to the limit of what my VRAM can take, like Llama 3 70B at 5.0 bpw, I have to uncheck the 'Auto' setting and give it a manual split. Some late allocation on the first card appears to overrun its memory capacity. @LPCTSTR, try a manual split, e.g. 21,11 (or 10,23). Monitor the model loading by running nvidia-smi repeatedly. Also, I notice that if I use CUDA_DEVICE_ORDER, my Turing card (onboard Quadro RTX 3000) will remain unused. Or so it appears.
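If it helps, this is roughly what a manual split looks like through the exllamav2 Python API (exui's split field takes the same GB-per-GPU numbers, as far as I can tell); the model path is a placeholder and the values are the ones suggested above:

```python
# Rough sketch of a manual split via exllamav2's Python API, mirroring the
# "uncheck Auto, enter a split" step in exui. Numbers are GB of VRAM per GPU,
# in visible-device order; the model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-70B-5.0bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([21, 11])           # manual split instead of autosplit
cache = ExLlamaV2Cache(model)  # cache allocated after the weights are placed
```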
To expand on this: My GPUs are numbered:
0: Quadro RTX 3000 (6GB)
1: NVIDIA GeForce RTX 3090
2: NVIDIA GeForce RTX 3090
If I do:
export CUDA_VISIBLE_DEVICES=2,1,0
... before starting exui, and set up my split as '4,21,23', I can load Llama-3-70B-Instruct-exl2-5.0bpw successfully.
GPU VRAM is then loaded in the order 0,2,1, for some reason.
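Roughly the same thing from Python, assuming CUDA_VISIBLE_DEVICES is read before anything initializes CUDA (which is why it has to be set before starting exui); the model directory is a placeholder:

```python
# Sketch: reorder the GPUs via CUDA_VISIBLE_DEVICES before CUDA is initialized,
# then load with the 4,21,23 split. Assumes the standard exllamav2 API; the
# model directory is a placeholder.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,1,0"  # must be set before torch/CUDA is touched

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-70B-Instruct-exl2-5.0bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([4, 21, 23])        # GB per visible device, same numbers as the exui split
cache = ExLlamaV2Cache(model)
```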