
MoE offloading and code simplification #21

Open · wants to merge 16 commits into base: main

Conversation

@gabe56f commented Feb 22, 2024

Make model creation a bit more readable and implement an LRU cache-based offload for MoE layers, which lets 4x2 SDXL models run on 12 GB of VRAM.

Example usage:

from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

@Warlord-K (Contributor)

Hi @gabe56f, thank you for your contribution. I have checked the loading code and fixed a small bug in the loading of expert models from Civit.

But from my tests so far, the inference memory requirement for 4x2 SDXL models is still ~18 GB. Am I missing something here?

@gabe56f (Author) commented Feb 26, 2024

I don't have access to any higher-capacity GPUs, so I was only able to test on my 4080 (16 GB), and there I see both performance and VRAM savings; not total memory, just VRAM. Did you call torch.cuda.empty_cache() before generating?

[Screenshots: "VRAM full" vs. "VRAM isn't full"]

Once again, this doesn't reduce the total memory required; it only reduces the GPU device memory required.

@Warlord-K (Contributor)


Yeah, I have gotten the memory savings now. I have set the cache to empty after both model loading and inference in the class itself, so that it doesn't have to be cleared manually every time.
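
Roughly, the idea is something like this (illustrative sketch, not the actual class):

import torch

class CacheClearingPipeline:
    """Illustrative only: empty the CUDA cache after loading and after every inference call."""

    def __init__(self, load_fn):
        self.pipe = load_fn()                  # whatever builds the underlying pipeline
        if torch.cuda.is_available():
            torch.cuda.empty_cache()           # drop loading-time scratch allocations

    def __call__(self, *args, **kwargs):
        try:
            return self.pipe(*args, **kwargs)  # delegate to the wrapped pipeline
        finally:
            if torch.cuda.is_available():
                torch.cuda.empty_cache()       # drop transient allocations after inference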

@Warlord-K (Contributor) commented Feb 26, 2024

I also got inference running at <8 GB with 400 GPU layers, which is awesome and should enable a lot more users to run it locally. Thanks again for your contribution! I will update the README to highlight the optimized memory usage.

@Warlord-K (Contributor)

Although the GPU VRAM utilization has decreased, the peak memory requirement is still higher. When I was trying to run SegMoE-4x2 on this configuration (16 GB A6000 and 64 GB RAM), I constantly got CUDA out-of-memory errors with both 800 and 600 on-device layers. How are you running it?

@gabe56f (Author) commented Feb 26, 2024

I think the bottleneck from this point on is going to be model loading; loading with on_device_layers > 0 makes RAM and VRAM spike for whatever reason. I've run into this exact issue over at your old project repo, VoltaML fast stable diffusion, and I think it was this change that fixed it.

The other cause I can think of is the VAE, which is, once again, the diffusers library's problem, since the stock SDXL VAE needs to run at FP32 due to precision issues. That cost only shows up at the end of each generation.

If you load using the following, the OOM issues should hopefully be fixed:

from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)

# swap in the FP16-safe SDXL VAE so decoding doesn't need FP32
pipe.pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda:0")
# no longer necessary: the pipeline now clears the cache itself
# torch.cuda.empty_cache()

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

From my testing, 800 layers on the GPU uses around 11 GB and 1200 uses about 15 GB, so for SDXL models it's roughly a flat 1 GB per 100 layers.
I don't think this has much use for SD 1.5 models; maybe it'll prove useful there too if someone makes an extreme 10x3 merge.
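
As a rough rule of thumb based on those two numbers (not part of the library, just back-of-the-envelope):

def estimate_sdxl_4x2_vram_gb(on_device_layers: int) -> float:
    # Linear fit through the reported measurements: 800 layers ~ 11 GB, 1200 layers ~ 15 GB,
    # i.e. roughly 1 GB per 100 layers on top of a ~3 GB baseline.
    baseline_gb = 3.0
    return baseline_gb + on_device_layers / 100.0

# e.g. pick a layer count for a 12 GB card, leaving ~1 GB of headroom
budget_gb = 12.0
layers = int((budget_gb - 3.0 - 1.0) * 100)   # ~800 layers
print(layers, estimate_sdxl_4x2_vram_gb(layers))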

@gabe56f (Author) commented Feb 26, 2024

I've also gone ahead and made it possible to change on_device_layers and scheduler_class without having to re-instantiate the pipeline.

from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

# try without offload
pipe = seg("SegMoE-4x2-v0")

# swap in the FP16-safe SDXL VAE so decoding doesn't need FP32
pipe.pipe.vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda:0")
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

# try with offload=400
pipe.on_device_layers = 400  # automatically offloads every MoE block to CPU
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_400.png")

# try with offload=1200
pipe.on_device_layers = 1200  # automatically offloads every MoE block to CPU
torch.cuda.empty_cache()  # clear the CUDA cache before generating to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200.png")

# schedulers can also be changed on the fly
pipe.scheduler_class = "EulerAncestralDiscreteScheduler"
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200_euler.png")

@Warlord-K (Contributor)

@gabe56f Hi, sorry for the delay, but I have been trying to get it to work on a machine with 16 GB VRAM and 64 GB RAM and it still doesn't work. Could you try to run the 2x1 model on Colab and share the notebook with me?

@imba-pericia commented Mar 31, 2024

Not sure about 16 GB; free Colab has little RAM, but 4x2 runs on a Tesla P40 with 22.5 GB VRAM.
An SD 1.5 6x3 runs on 6 GB VRAM, slowly.

@gabe56f (Author) commented Mar 31, 2024


Was about to say this; I've been having loads of issues even loading models on Colab due to the 12 GB of available RAM...

unet = self.create_empty(cached_folder) seems like the part with the big RAM issues; I could look into that.
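
One possible direction (just an idea, not tested against this repo): build the merged UNet skeleton on the meta device with accelerate, so no real RAM is allocated until the expert weights are actually copied in, e.g.:

from accelerate import init_empty_weights
from diffusers import UNet2DConditionModel

# Instantiate the UNet without allocating real weight memory.
with init_empty_weights():
    config = UNet2DConditionModel.load_config(
        "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
    )
    empty_unet = UNet2DConditionModel.from_config(config)

# All parameters live on the "meta" device and take no RAM until they are materialized,
# e.g. by loading a state dict with load_state_dict(..., assign=True) on PyTorch >= 2.1.
print(next(empty_unet.parameters()).device)  # meta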

@Warlord-K (Contributor)

I am running it on a server with 64 GB RAM and 16 GB VRAM, but it doesn't load. I tried the exact code, but it always goes out of memory. Could you try with 2x1 on Colab?

@harakiru

Would this help with model creation? I have a 16 GB GPU and I'm unable to create any SDXL-based MoE models.
