MoE offloading and code simplification #21
base: main
Conversation
Hi @gabe56f, thank you for your contribution. I have checked the loading code and fixed a small bug in the loading of expert models from Civitai. But from my tests so far, the inference memory requirement for 4x2 SDXL models is still ~18 GB. Am I missing something here?
I don't have access to any higher-capacity GPUs, so I was only able to test on my 4080 (16 GB), and there I see both performance and VRAM savings - not total memory, just VRAM. Did you clear the CUDA cache (`torch.cuda.empty_cache()`) after loading?
Once again, this doesn't save total memory required, it just saves GPU device memory required.
Yeah, I have gotten the memory savings now. I have set the cache to empty after both model loading and inference in the class itself, so that it doesn't have to be cleared manually every time.
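The pattern is roughly the following (a simplified sketch, not the exact SegMoE code; the wrapper name is illustrative):

```python
import torch

class CacheClearingWrapper:
    """Illustrative sketch of the pattern described above: wrap a pipeline and
    clear the CUDA cache after construction and after every inference call."""

    def __init__(self, pipe):
        self.pipe = pipe
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # drop temporary allocations left over from loading

    def __call__(self, *args, **kwargs):
        out = self.pipe(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached blocks once inference is done
        return out
```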
I also got inference running at <8 GB with 400 GPU layers, which is awesome and should enable a lot more users to run it locally. Thanks again for your contribution! I will update the README to highlight the optimized memory usage.
Although the GPU VRAM utilization has decreased, the peak memory requirement is still higher. When I was trying SegMoE-4x2 on this configuration - 16 GB A6000 and 64 GB RAM - I constantly got CUDA out-of-memory errors with both 800 and 600 on-device layers. How are you running it?
I think the bottleneck from this point on is gonna be the loading of the model itself. The other reason I can think of is the VAE, which is, once again, the fp32 one by default. If you load using the following, the OOM issues should hopefully be fixed:

```python
from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

pipe = seg("SegMoE-4x2-v0", on_device_layers=800)

# set VAE to the FP16 one
pipe.pipe.vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda:0")

# no longer necessary, the pipeline clears the cache itself now
# torch.cuda.empty_cache()

pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")
```

From my testing, 800 layers on GPU is around 11 GB and 1200 is around 15 GB, so for SDXL models it's a flat (roughly) 1 GB per 100 on-device layers.
I've also gone ahead and made it so that it's possible to change `on_device_layers` (and the scheduler) on an already-created pipeline:

```python
from diffusers import AutoencoderKL
from segmoe import SegMoEPipeline as seg
import torch

# try without offload
pipe = seg("SegMoE-4x2-v0")

# set VAE to the FP16 one
pipe.pipe.vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
).to("cuda:0")

torch.cuda.empty_cache()  # clear the cache beforehand to see the VRAM savings
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test.png")

# try with offload=400
pipe.on_device_layers = 400  # automatically offloads MoE blocks beyond the budget to CPU
torch.cuda.empty_cache()
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_400.png")

# try with offload=1200
pipe.on_device_layers = 1200  # automatically re-balances MoE blocks between GPU and CPU
torch.cuda.empty_cache()
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200.png")

# the scheduler can also be changed
pipe.scheduler_class = "EulerAncestralDiscreteScheduler"
pipe("1girl, hatsune miku, beach, smiling").images[0].save("test_1200_euler.png")
```
@gabe56f Hi, sorry for the delay, but I have been trying to get it to work on a machine with 16 GB VRAM and 64 GB RAM and it still doesn't work. Could you try to run the 2x1 model on Colab and share the notebook with me?
Not sure about 16 GB, free Colab has very little RAM, but 4x2 runs on a Tesla P40 with 22.5 GB VRAM.
Was about to say this, I've been having loads of issues even loading models on Colab due to the 12 GB of available RAM...
I am running it on a server with 64 GB RAM and 16 GB VRAM, but it doesn't load. I tried the exact code, but it always goes out of memory. Could you try with 2x1 on Colab?
Would this help with model creation? I have a 16 GB GPU and I'm unable to create any SDXL-based MoE models.
Make model creation a bit more readable and implement an LRU cache-based offload for MoE layers, which allows 4x2 SDXL models to run on 12 GB of VRAM.
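Conceptually, the offload is a small LRU cache over expert modules: keep at most the budgeted number of expert layers on the GPU, and move the least recently used ones back to CPU when the budget is exceeded. A minimal sketch of the idea (names and structure are illustrative, not the exact code in this PR):

```python
from collections import OrderedDict
import torch
import torch.nn as nn

class LRUExpertOffloader:
    """Keeps at most `capacity` expert modules on the GPU; the least recently
    used experts are moved back to CPU whenever the budget is exceeded."""

    def __init__(self, capacity: int, device: str = "cuda:0"):
        self.capacity = capacity
        self.device = device
        self.resident: "OrderedDict[int, nn.Module]" = OrderedDict()

    def fetch(self, key: int, expert: nn.Module) -> nn.Module:
        if key in self.resident:
            self.resident.move_to_end(key)          # mark as most recently used
            return self.resident[key]
        while len(self.resident) >= self.capacity:  # evict least recently used experts
            _, evicted = self.resident.popitem(last=False)
            evicted.to("cpu")
        expert.to(self.device)                      # bring the requested expert onto the GPU
        self.resident[key] = expert
        return expert

# toy example: a budget of 2 resident experts out of 4
device = "cuda:0" if torch.cuda.is_available() else "cpu"
offloader = LRUExpertOffloader(capacity=2, device=device)
experts = [nn.Linear(8, 8) for _ in range(4)]
x = torch.randn(1, 8)
for i in (0, 1, 2, 0, 3):  # the least recently used expert is evicted on each miss over budget
    y = offloader.fetch(i, experts[i])(x.to(offloader.device))
```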
Example usage:
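A minimal sketch, mirroring the snippets earlier in the thread (layer counts are illustrative):

```python
from segmoe import SegMoEPipeline as seg

# keep 400 MoE layers on the GPU, offload the rest to CPU via the LRU cache
pipe = seg("SegMoE-4x2-v0", on_device_layers=400)
pipe("1girl, hatsune miku, beach, smiling").images[0].save("out.png")

# the budget can also be changed on an existing pipeline
pipe.on_device_layers = 800
pipe("1girl, hatsune miku, beach, smiling").images[0].save("out_800.png")
```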