Loading multiple LoRAs to 1 pipeline in parallel, 1 LoRA to 2-pipelines on 2-GPUs #11932
-
Hi, do you have a specific reason to keep the LoRAs on the CPU? Are you VRAM constrained? Without knowing more context, the best answer here is to just load all the LoRAs in both pipelines with the scale set to 0, then use set_adapters to activate them at inference time when needed. But I suspect that you're getting the black images or noise because there's some error in your code; it should work. It's not clear to me how you are splitting the batches or whether you're making sure the pipelines aren't sharing anything between them.
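Something along these lines is what I mean (a rough sketch, assuming the LoRAs live in local files or Hub repos; `pipe_gpu0`, `pipe_gpu1`, `lora_paths`, and the adapter names are placeholders):

```python
# At startup: load every LoRA into both pipelines, each under its own adapter name,
# and start with everything disabled (scale 0).
for pipe in (pipe_gpu0, pipe_gpu1):
    for name, path in lora_paths.items():
        pipe.load_lora_weights(path, adapter_name=name)
    pipe.set_adapters(list(lora_paths), adapter_weights=[0.0] * len(lora_paths))

# Per request: activate only the LoRA that the request asks for.
def run(pipe, lora_name, prompt):
    pipe.set_adapters([lora_name], adapter_weights=[1.0])
    return pipe(prompt).images
```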
-
That is correct, I am VRAM constrained. As soon as I change it to something like:

and call this function in another thread pool (e.g. to parallelize this over the pipelines), the black images appear.
-
My mistake, usually the issues are posted with the code that doesn't work and not with the one that actually works, so that's why I thought that code had the issue. I won't have time to test your code, but if your code works with those pipelines without loading LoRAs, meaning that the LoRA loading is what breaks it, the best choice (without looking at your whole codebase) is to ping @sayakpaul and @a-r-r-o-w in case they have more insights, because I still haven't tested running multiple pipelines on multiple GPUs at the same time.
-
The LoRA loading is not the problem per se. When I load LoRAs to each pipeline sequentially (first code snippet), everything works.
-
I have only looked through the code snippets roughly and don't have the full context yet. I'm only responding because some things look fundamentally wrong to me. I'm occupied with some other things at the moment, but will try to take a better look over the weekend to understand what is expected.

Regarding the black images, I'm not quite sure what's happening. You will have to look at the profiles to identify the problem. The right programming model for a multi-GPU setup with PyTorch is to make use of `torch.distributed`:

```python
import torch.distributed as dist

dist.init_process_group("nccl")
world_size = dist.get_world_size()  # 2 GPUs in your case
rank = dist.get_rank()  # will be 0 for GPU:0, and 1 for GPU:1

pipe = ...
pipe.to(rank)

request_handler = RequestHandler()
while True:
    if request_handler.kill():
        break
    if request_handler.request_idx % world_size != rank:
        continue
    request = request_handler.poll()
    pipe.load_lora_weights(move_to_device(lora_dict[request.lora_id]))
    pipe(request.prompt)
```
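(A script structured like this would typically be launched with something like `torchrun --nproc_per_node=2 your_script.py`, which provides the rank/world-size environment that `init_process_group` picks up; the exact launch command is an assumption about your setup.)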
If you'd like to see a speedup in device transfer times, you could look into pinning the state dict tensors on CPU during the first load from disk, and combining that with non-blocking copies. Note that pinning assumes you have enough CPU RAM to hold all the LoRAs you want to keep in memory.
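A minimal sketch of that idea, assuming the LoRA weights live in `safetensors` files; `move_to_device` follows the name used in the snippet above, but its implementation here (and `load_lora_to_pinned_cpu`) is an assumption:

```python
import torch
from safetensors.torch import load_file

def load_lora_to_pinned_cpu(path: str) -> dict[str, torch.Tensor]:
    # Load once from disk, then pin each tensor so later host-to-device copies can be async.
    state_dict = load_file(path, device="cpu")
    return {name: tensor.pin_memory() for name, tensor in state_dict.items()}

def move_to_device(state_dict: dict[str, torch.Tensor], device) -> dict[str, torch.Tensor]:
    # non_blocking=True only overlaps with compute when the source tensor is pinned.
    return {name: tensor.to(device, non_blocking=True) for name, tensor in state_dict.items()}

# Once at startup, for every LoRA you want to keep resident in CPU RAM:
# lora_dict = {lora_id: load_lora_to_pinned_cpu(path) for lora_id, path in lora_paths.items()}
# Then per request:
# pipe.load_lora_weights(move_to_device(lora_dict[request.lora_id], rank))
```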
-
Hi everyone,
I have the following scenario.
I have a machine with 2 GPUs and a running service that keeps two pipelines loaded on their corresponding devices. I also have a list of LoRAs (say 10). On each request I split the batch into 2 parts (the request also carries the information about which LoRA to use), load the LoRAs, and run the forward pass.
The problem I encounter is that whatever parallelization method I have tried (threading, multiprocessing), the most I have achieved is pre-loading the LoRAs on the CPU, then moving them to GPU, and only after that calling load_lora_weights from the state_dict. Even if I attempt to parallelize by calling the loading chunk in threads, the pipe starts to produce either complete noise or a black image.
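Roughly, the sequential version that works looks like the following (a simplified sketch; the model id, helper names, and data structures are made up for illustration, not my actual service code):

```python
import torch
from diffusers import DiffusionPipeline

model_id = "..."  # the actual checkpoint is omitted here

# One long-lived pipeline per GPU.
pipes = [
    DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(f"cuda:{i}")
    for i in range(2)
]

# LoRA state dicts pre-loaded on CPU, keyed by the id that arrives with the request.
lora_cpu_state_dicts = {}  # {lora_id: {param_name: cpu_tensor, ...}, ...}

def run_half(pipe, device, lora_id, prompts):
    # Move the requested LoRA to this pipeline's GPU, then load it.
    state_dict = {k: v.to(device) for k, v in lora_cpu_state_dicts[lora_id].items()}
    pipe.load_lora_weights(state_dict)
    return pipe(prompts).images

# Calling run_half for each pipeline sequentially works; submitting the same calls
# to a thread pool is where the noise / black images appear.
```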
Where I would greatly appreciate help is: