-
I have been experiencing problems with multi-GPU model loading in the releases after v0.0.11. My system has an RTX 4080 Laptop GPU and an RTX 3090, for a total of 36GB of VRAM. Up to v0.0.11, exllamav2 ran smoothly, using all available VRAM, and I was able to load models such as Llama-70B at 3.8 bpw. After updating to versions beyond v0.0.11, however, the program crashes when VRAM usage reaches approximately 28GB. Interestingly, loading models that need around 20GB of VRAM still works fine. I have tried updating the GPU driver, changing the GPU order, and resetting my system environment, but nothing changes. Has anyone else encountered this issue and found a solution?
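For context, a minimal sketch of the loading path I mean, assuming the standard exllamav2 Python autosplit flow; the model directory below is just a placeholder:

```python
# Sketch of the autosplit load that crashes for me once allocation passes ~28GB.
# The model directory is a placeholder; the calls follow the exllamav2 examples.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-70B-3.8bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # lazy cache so autosplit can place it per GPU
model.load_autosplit(cache)               # fails around ~28GB of total VRAM use
```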
-
Do you have any more information? Which models are you trying to load, what is the error message, etc.?
-
I have similar issues on a dual-3090 system. I can't really tell when it started. When I load models close to the limit of what my VRAM can take, like Llama 3 70B at 5.0 bpw, I have to uncheck the 'Auto' setting and give it a manual split. Some late allocation on the first card appears to overrun its memory capacity. @LPCTSTR, try a manual split, e.g. 21,11 (or 10,23). Monitor the model loading by running nvidia-smi repeatedly. Also, I notice that if I use CUDA_DEVICE_ORDER, my Turing card (onboard Quadro RTX 3000) will remain unused. Or so it appears.
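If it helps, this is roughly what a manual split looks like through the exllamav2 Python API (exui's split field takes the same GB-per-GPU numbers, as far as I can tell); the model path is a placeholder and the values are the ones suggested above:

```python
# Rough sketch of a manual split via exllamav2's Python API, mirroring the
# "uncheck Auto, enter a split" step in exui. Numbers are GB of VRAM per GPU,
# in visible-device order; the model directory is a placeholder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-70B-5.0bpw-exl2"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([21, 11])           # manual split instead of autosplit
cache = ExLlamaV2Cache(model)  # cache allocated after the weights are placed
```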
To expand on this: My GPUs are numbered:
0: Quadro RTX 3000 (6GB)
1: NVIDIA GeForce RTX 3090
2: NVIDIA GeForce RTX 3090
If I do:
export CUDA_VISIBLE_DEVICES=2,1,0
... before starting exui, and set up my split as '4,21,23', I can load Llama-3-70B-Instruct-exl2-5.0bpw successfully.
GPU VRAM is then loaded in the order 0,2,1, for some reason.
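Roughly the same thing from Python, assuming CUDA_VISIBLE_DEVICES is read before anything initializes CUDA (which is why it has to be set before starting exui); the model directory is a placeholder:

```python
# Sketch: reorder the GPUs via CUDA_VISIBLE_DEVICES before CUDA is initialized,
# then load with the 4,21,23 split. Assumes the standard exllamav2 API; the
# model directory is a placeholder.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,1,0"  # must be set before torch/CUDA is touched

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/models/Llama-3-70B-Instruct-exl2-5.0bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load([4, 21, 23])        # GB per visible device, same numbers as the exui split
cache = ExLlamaV2Cache(model)
```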