[BUG] Transformer.from_folder() does not load the model on multiple GPU #197
Comments
Same issue on 4 NVIDIA A10s: only one A10 is used and the remaining GPUs stay empty, yet an "Out of Memory" error still occurs.
You might want to try the vLLM library. I could be wrong, but I think vLLM also has CPU-offload capability for single-GPU setups. It's slower than mistral-inference for obvious reasons, but it's better than nothing.
Hey @Cerrix, you need to load the model with pipeline parallelism enabled, e.g. see: https://github.com/mistralai/mistral-inference?tab=readme-ov-file#cli - specifically the multi-GPU instructions there. Also make sure to define pipeline parallelism when loading the model, as sketched below:
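For reference, here is a minimal sketch of the pipeline-parallel loading path. It assumes the `num_pipeline_ranks` argument described in the README and a launch via `torchrun`; the checkpoint path is a placeholder and exact argument names may differ between versions:

```python
# run_pipeline_parallel.py
# Launch with: torchrun --nproc-per-node 2 run_pipeline_parallel.py
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

MODEL_PATH = "/path/to/mistral-nemo-instruct"  # placeholder checkpoint folder

# num_pipeline_ranks > 1 splits the layer stack across the GPUs spawned by
# torchrun instead of loading every layer onto a single device.
model = Transformer.from_folder(MODEL_PATH, num_pipeline_ranks=2)

tokenizer = MistralTokenizer.from_file(f"{MODEL_PATH}/tekken.json")
request = ChatCompletionRequest(messages=[UserMessage(content="Hello!")])
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=64,
    temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```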
Python -VV
Pip Freeze
Reproduction Steps
Running the following code with a model such as Mistral Nemo Instruct (which cannot fit on a single GPU) leads to the following error: `OutOfMemoryError: CUDA out of memory. Tried to allocate 140.00 MiB. GPU` (a minimal reproduction is sketched below).
This happens because, as you can see in the attached screenshot, the model is loaded onto a single GPU. I looked for a parameter in the Transformer Python module but could not find anything that enables multi-GPU inference.
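(The original snippet was not captured above; a minimal reproduction along the lines of the README's default, single-device loading path, with a placeholder path, would be:)

```python
from mistral_inference.transformer import Transformer

MODEL_PATH = "/path/to/mistral-nemo-instruct"  # placeholder checkpoint folder

# Default load: the whole model is placed on a single CUDA device, so a
# checkpoint larger than one GPU's memory fails with OutOfMemoryError even
# though the other GPUs on the machine stay empty.
model = Transformer.from_folder(MODEL_PATH)
```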
Thank you so much
Expected Behavior
I would expect to see the model loaded onto multiple GPUs automatically as in the screenshot
Additional Context
No response
Suggested Solutions
I would recommend adding a parameter such as the device_map parameter of the Hugging Face Transformers library (https://huggingface.co/docs/transformers/main_classes/pipelines), or distributing the model across the available GPUs automatically.
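For comparison, a short sketch of the behaviour being requested, using the Hugging Face Transformers API (this is not something mistral-inference offers today; the model id is the public Nemo Instruct repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-Nemo-Instruct-2407"

# device_map="auto" lets Accelerate shard the weights across every visible
# GPU (falling back to CPU/disk offload if needed) instead of putting the
# whole model on cuda:0.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

Something equivalent in mistral-inference would resolve this issue.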