patching lora load for faster loading #39
Conversation
force-pushed from 126730e to 34b4947
Great sleuthing! A couple nitpicks but not blockers.
@@ -101,6 +102,9 @@ def setup(self) -> None:  # pyright: ignore
    "FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")
dev_pipe.__class__.load_lora_into_transformer = classmethod(
This is fine, but it could probably also be done with a pattern similar to https://github.com/replicate/flux-fine-tuner/blob/main/submodule_patches.py to avoid duplicating this line.
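For what that pattern might look like, here is a minimal sketch. It assumes the patched loader lives in lora_loading_patch.py (as described in the PR summary below); the patch_lora_loading helper name is hypothetical, not code from this PR.

```python
# Hypothetical helper in the spirit of submodule_patches.py: patch each Flux pipeline
# class once at import time instead of repeating the classmethod assignment in setup().
from diffusers import FluxImg2ImgPipeline, FluxInpaintPipeline, FluxPipeline

from lora_loading_patch import load_lora_into_transformer  # the patched loader from this PR


def patch_lora_loading() -> None:
    for pipeline_cls in (FluxPipeline, FluxImg2ImgPipeline, FluxInpaintPipeline):
        # Shadow the stock loader with the version that passes low_cpu_mem_usage=True.
        pipeline_cls.load_lora_into_transformer = classmethod(load_lora_into_transformer)
```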
@@ -430,6 +440,7 @@ def load_single_lora(self, lora_url: str, model: str):
lora_path = self.weights_cache.ensure(lora_url)
pipe.load_lora_weights(lora_path, adapter_name="main")
self.loaded_lora_urls[model] = LoadedLoRAs(main=lora_url, extra=None)
pipe = pipe.to("cuda")
Is it not already on gpu at this point? How did it work before? Might be worth adding a comment here.
force-pushed from 0861945 to f53e34c
Did a test-only cog-safe-push; it passed: https://github.com/replicate/flux-fine-tuner/actions/runs/11072824989
Profiled LoRA loading in prod to see what was taking so long - it looks like our old nemesis kaiming_uniform, and other such functions that are used to randomly initialize empty tensors. Conveniently, someone just pushed functionality to peft to disable this, so we don't have to get too wild to turn it off; we just need to patch load_lora_into_transformer to take advantage of that functionality until it's integrated into diffusers. That's what this PR does. With this I saw LoRA load times (after download) drop from about 10 seconds to about 1.2 seconds. Tested locally with dev and schnell.
Important

Patch load_lora_into_transformer to use low_cpu_mem_usage=True, significantly reducing LoRA load times, and update dependencies accordingly.

- Patch load_lora_into_transformer in lora_loading_patch.py to use low_cpu_mem_usage=True, reducing LoRA load times from ~10s to ~1.2s.
- Update predict.py to use the patched load_lora_into_transformer for FluxPipeline, FluxImg2ImgPipeline, and FluxInpaintPipeline (see the sketch below).
- Bump the peft version from 0.12.0 to 0.13.0 in cog.yaml.
- Add ./weights-cache to .dockerignore.

This description was created by for f53e34c. It will automatically update as commits are pushed.
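As an illustration of the predict.py change listed above, here is a hedged end-to-end usage sketch. Pipeline construction, the local "FLUX.1-dev" path, and the adapter name mirror the diffs earlier in the thread; the LoRA path is a placeholder and everything else is assumed, not taken from this PR.

```python
import torch
from diffusers import FluxPipeline

from lora_loading_patch import load_lora_into_transformer  # module named in the summary above

# Install the patched loader on the pipeline class (mirrors the setup() diff above).
FluxPipeline.load_lora_into_transformer = classmethod(load_lora_into_transformer)

dev_pipe = FluxPipeline.from_pretrained(
    "FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# With low_cpu_mem_usage=True inside the patched loader, this call skips the
# kaiming_uniform_ initialization of empty LoRA tensors (~10s -> ~1.2s after download
# in the PR's tests).
dev_pipe.load_lora_weights("path/to/lora.safetensors", adapter_name="main")
dev_pipe = dev_pipe.to("cuda")  # as in the diff above: make sure everything ends up on the GPU
```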