MoE offloading and code simplification #21
base: main
Conversation
Hi @gabe56f, thank you for your contribution. I have checked the loading code and fixed a small bug in the loading of expert models from Civitai. But from my tests so far, the inference memory requirement for 4x2 SDXL models is still ~18 GB. Am I missing something here?
I don't have access to any higher-capacity GPUs, so I was only able to test on my 4080 (16 GB), and there I see both performance and VRAM savings - not total memory, just VRAM. Did you clear the CUDA cache (`torch.cuda.empty_cache()`) after loading?
Once again, this doesn't save total memory required, it just saves GPU device memory required.
Yeah, I have gotten the memory savings now. I have set the cache to empty after both model loading and inference in the class itself, so that it doesn't have to be cleared manually every time.
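The pattern is roughly the following (a simplified sketch, not the exact SegMoE code; the wrapper name is illustrative):

```python
import torch

class CacheClearingWrapper:
    """Illustrative sketch of the pattern described above: wrap a pipeline and
    clear the CUDA cache after construction and after every inference call."""

    def __init__(self, pipe):
        self.pipe = pipe
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # drop temporary allocations left over from loading

    def __call__(self, *args, **kwargs):
        out = self.pipe(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks once inference is done
        return out
```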
I also got inference running at <8 GB with 400 GPU layers, which is awesome and should enable a lot more users to run it locally. Thanks again for your contribution! I will update the README to highlight the optimized memory usage.
Although the GPU VRAM utilization has decreased, the peak memory requirement is still higher. When I was trying SegMoE-4x2 on this configuration - 16 GB A6000 and 64 GB RAM - I constantly got CUDA out-of-memory errors with both 800 and 600 on-device layers. How are you running it?
I think the bottleneck from this point on is gonna be the loading of the model itself. The other reason I can think of is the VAE, which is, once again, the fp32 one by default. If you load using the following, the OOM issues should hopefully be fixed:

```python
from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)

# set VAE to the FP16 one
pipe.pipe.vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda:0")

# no longer necessary, the pipeline clears the cache itself now
# torch.cuda.empty_cache()

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")
```

From my testing, 800 layers on GPU is around 11 GB and 1200 is around 15 GB, so for SDXL models it's a flat (roughly) 1 GB per 100 on-device layers.
I've also gone ahead and made it so that it's possible to change `on_device_layers` (and the scheduler) on an already-created pipeline:

```python
from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

# try without offload
pipe = seg("SegMoE-4x2-v0")

# set VAE to the FP16 one
pipe.pipe.vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda:0")

torch.cuda.empty_cache()  # clear the cache beforehand to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

# try with offload=400
pipe.on_device_layers = 400  # automatically offloads MoE blocks beyond the budget to CPU
torch.cuda.empty_cache()
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_400.png")

# try with offload=1200
pipe.on_device_layers = 1200  # automatically re-balances MoE blocks between GPU and CPU
torch.cuda.empty_cache()
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200.png")

# the scheduler can also be changed
pipe.scheduler_class = "EulerAncestralDiscreteScheduler"
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200_euler.png")
```
@gabe56f Hi, sorry for the delay, but I have been trying to get it to work on a machine with 16 GB VRAM and 64 GB RAM and it still doesn't work. Could you try to run the 2x1 model on Colab and share the notebook with me?
Not sure about 16 GB, free Colab has very little RAM, but 4x2 runs on a Tesla P40 with 22.5 GB VRAM.
Was about to say this, I've been having loads of issues even loading models on Colab due to the 12 GB of available RAM...
I am running it on a server with 64 GB RAM and 16 GB VRAM, but it doesn't load. I tried the exact code, but it always goes out of memory. Could you try with 2x1 on Colab?
Would this help with model creation? I have a 16 GB GPU and I'm unable to create any SDXL-based MoE models.
Make model creation a bit more readable and implement an LRU cache-based offload for MoE layers, which allows 4x2 SDXL models to run on 12 GB of VRAM.
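Conceptually, the offload is a small LRU cache over expert modules: keep at most the budgeted number of expert layers on the GPU, and move the least recently used ones back to CPU when the budget is exceeded. A minimal sketch of the idea (names and structure are illustrative, not the exact code in this PR):

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class LRUExpertOffloader:
    """Keeps at most `capacity` expert modules on the GPU; the least recently
    used experts are moved back to CPU whenever the budget is exceeded."""

    def __init__(self, capacity: int, device: str = "cuda:0"):
        self.capacity = capacity
        self.device = device
        self.resident: "OrderedDict[int, nn.Module]" = OrderedDict()

    def fetch(self, key: int, expert: nn.Module) -> nn.Module:
        if key in self.resident:
            self.resident.move_to_end(key)          # mark as most recently used
            return self.resident[key]
        while len(self.resident) >= self.capacity:  # evict least recently used experts
            _, evicted = self.resident.popitem(last=False)
            evicted.to("cpu")
        expert.to(self.device)                      # bring the requested expert onto the GPU
        self.resident[key] = expert
        return expert

# toy example: a budget of 2 resident experts out of 4
device = "cuda:0" if torch.cuda.is_available() else "cpu"
offloader = LRUExpertOffloader(capacity=2, device=device)
experts = [nn.Linear(8, 8) for _ in range(4)]
x = torch.randn(1, 8)
for i in (0, 1, 2, 0, 3):  # the least recently used expert is evicted on each miss over budget
    y = offloader.fetch(i, experts[i])(x.to(offloader.device))
```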
Example usage:
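A minimal sketch, mirroring the snippets earlier in the thread (layer counts are illustrative):

```python
from segmoe import SegMoEPipeline as seg

# keep 400 MoE layers on the GPU, offload the rest to CPU via the LRU cache
pipe = seg("SegMoE-4x2-v0", on_device_layers=400)
pipe("1girl, hatsune miku, beach, smiling").images[0].save("out.png")

# the budget can also be changed on an existing pipeline
pipe.on_device_layers = 800
pipe("1girl, hatsune miku, beach, smiling").images[0].save("out_800.png")
```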