Loading multiple LoRAs to 1 pipeline in parallel, 1 LoRA to 2-pipelines on 2-GPUs #11932
-
Hi, do you have a specific reason to keep the LoRAs on the CPU? Are you VRAM constrained? Without knowing more context, the best answer here is to just load all the LoRAs in both pipelines with the scale set to 0, then use set_adapters to activate them at inference time when needed. But I suspect that you're getting the black images or noise because there's some error in your code; it should work. It's not clear to me how you are splitting the batches or whether you're making sure the pipelines aren't sharing anything between them.
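Something along these lines is what I mean (a rough sketch, assuming the LoRAs live in local files or Hub repos; `pipe_gpu0`, `pipe_gpu1`, `lora_paths`, and the adapter names are placeholders):

```python
# At startup: load every LoRA into both pipelines, each under its own adapter name,
# and start with everything disabled (scale 0).
for pipe in (pipe_gpu0, pipe_gpu1):
    for name, path in lora_paths.items():
        pipe.load_lora_weights(path, adapter_name=name)
    pipe.set_adapters(list(lora_paths), adapter_weights=[0.0] * len(lora_paths))

# Per request: activate only the LoRA that the request asks for.
def run(pipe, lora_name, prompt):
    pipe.set_adapters([lora_name], adapter_weights=[1.0])
    return pipe(prompt).images
```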
-
That is correct, I am VRAM constrained. As soon as I change it to something like:

and call this function in another thread pool (e.g. to parallelize this over the pipelines), the black images appear.
-
My mistake, usually the issues are posted with the code that doesn't work and not with the one that actually works, so that's why I thought that code had the issue. I won't have time to test your code, but if your code works with those pipelines without loading LoRAs, meaning that the LoRA loading is what breaks it, the best choice (without looking at your whole codebase) is to ping @sayakpaul and @a-r-r-o-w in case they have more insights, because I still haven't tested running multiple pipelines on multiple GPUs at the same time.
-
The LoRA loading is not the problem per se. When I load LoRAs to each pipeline sequentially (first code snippet), everything works.
-
I have only looked through the code snippets roughly and don't have the full context yet. I'm only responding because some things look fundamentally wrong to me. I'm occupied with some other things at the moment, but will try to take a better look over the weekend to understand what is expected.

Regarding the black images, I'm not quite sure what's happening. You will have to look at the profiles to identify the problem. The right programming model for a multi-GPU setup with PyTorch is to make use of `torch.distributed`:

```python
import torch.distributed as dist

dist.init_process_group("nccl")
world_size = dist.get_world_size()  # 2 GPUs in your case
rank = dist.get_rank()  # will be 0 for GPU:0, and 1 for GPU:1

pipe = ...
pipe.to(rank)

request_handler = RequestHandler()
while True:
    if request_handler.kill():
        break
    if request_handler.request_idx % world_size != rank:
        continue
    request = request_handler.poll()
    pipe.load_lora_weights(move_to_device(lora_dict[request.lora_id]))
    pipe(request.prompt)
```
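(A script structured like this would typically be launched with something like `torchrun --nproc_per_node=2 your_script.py`, which provides the rank/world-size environment that `init_process_group` picks up; the exact launch command is an assumption about your setup.)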
If you'd like to see a speedup in device transfer times, you could look into pinning the state dict tensors on CPU during the first load from disk, and combining that with non-blocking copies. Note that pinning assumes you have enough CPU RAM to hold all the LoRAs you want to keep in memory.
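A minimal sketch of that idea, assuming the LoRA weights live in `safetensors` files; `move_to_device` follows the name used in the snippet above, but its implementation here (and `load_lora_to_pinned_cpu`) is an assumption:

```python
import torch
from safetensors.torch import load_file

def load_lora_to_pinned_cpu(path: str) -> dict[str, torch.Tensor]:
    # Load once from disk, then pin each tensor so later host-to-device copies can be async.
    state_dict = load_file(path, device="cpu")
    return {name: tensor.pin_memory() for name, tensor in state_dict.items()}

def move_to_device(state_dict: dict[str, torch.Tensor], device) -> dict[str, torch.Tensor]:
    # non_blocking=True only overlaps with compute when the source tensor is pinned.
    return {name: tensor.to(device, non_blocking=True) for name, tensor in state_dict.items()}

# Once at startup, for every LoRA you want to keep resident in CPU RAM:
# lora_dict = {lora_id: load_lora_to_pinned_cpu(path) for lora_id, path in lora_paths.items()}
# Then per request:
# pipe.load_lora_weights(move_to_device(lora_dict[request.lora_id], rank))
```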
-
Hi everyone,
I have the following scenario.
I have a machine with 2 GPUs and a running service that keeps two pipelines loaded on their corresponding devices. I also have a list of LoRAs (say 10). On each request I split the batch into 2 parts (the request also carries the information about which LoRA to use), load the LoRAs, and run the forward pass.
The problem I encounter is that whatever parallelization method I have tried (threading, multiprocessing), the most I have achieved is pre-loading the LoRAs on the CPU, then moving them to GPU, and only after that calling load_lora_weights from the state_dict. Even if I attempt to parallelize by calling the loading chunk in threads, the pipe starts to produce either complete noise or a black image.
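Roughly, the sequential version that works looks like the following (a simplified sketch; the model id, helper names, and data structures are made up for illustration, not my actual service code):

```python
import torch
from diffusers import DiffusionPipeline

model_id = "..."  # the actual checkpoint is omitted here

# One long-lived pipeline per GPU.
pipes = [
    DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to(f"cuda:{i}")
    for i in range(2)
]

# LoRA state dicts pre-loaded on CPU, keyed by the id that arrives with the request.
lora_cpu_state_dicts = {}  # {lora_id: {param_name: cpu_tensor, ...}, ...}

def run_half(pipe, device, lora_id, prompts):
    # Move the requested LoRA to this pipeline's GPU, then load it.
    state_dict = {k: v.to(device) for k, v in lora_cpu_state_dicts[lora_id].items()}
    pipe.load_lora_weights(state_dict)
    return pipe(prompts).images

# Calling run_half for each pipeline sequentially works; submitting the same calls
# to a thread pool is where the noise / black images appear.
```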
Where I would greatly appreciate help is: