[REQUEST] Is it possible and a lot of trouble to support flux? #631

Open
3 tasks done
Ph0rk0z opened this issue Sep 22, 2024 · 4 comments

Comments

@Ph0rk0z

Ph0rk0z commented Sep 22, 2024

Problem

Flux is a transformer-based image model. It's rather large and fills a whole 24 GB card. People have made GGUF, bitsandbytes and NF4 loaders for ComfyUI, all of which reuse those LLM quantizations, seemingly with little modification. I recently found a Marlin implementation too: https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ
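For a rough sense of scale, here is a back-of-the-envelope estimate of weight memory at different bit widths. The parameter count is an assumption (roughly 12 billion for the Flux transformer, a figure not stated in this thread), and the numbers cover weights only, ignoring activations, the text encoders, the VAE, and any per-group scale overhead that real 4-bit formats add.

```python
# Rough weight-memory estimate for a transformer at different bit widths.
# ASSUMPTION: ~12e9 parameters for the Flux transformer; not stated in this thread.
ASSUMED_PARAMS = 12_000_000_000


def weight_gib(num_params: int, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GiB (no activations, encoders, or VAE)."""
    return num_params * bits_per_weight / 8 / 2**30


for label, bits in [("fp16/bf16", 16), ("fp8", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_gib(ASSUMED_PARAMS, bits):6.1f} GiB")
```

Under that assumption, 16-bit weights alone already come close to filling a 24 GB card, which is consistent with the interest in 8-bit and 4-bit loaders.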

Solution

Even though it's not an LLM, the model has been shoehorned into several LLM-only quants. The ComfyUI nodes that load them don't seem super complicated, but I'm not familiar enough with the entire codebase to know whether it's a big ask architecture-wise, or even something you're interested in.

Alternatives

The other quants leave much to be desired: they either quantize too much or don't perform very well. GGUF is slower than the native torch FP8 quantization. While using exl2 has been suggested, nobody has asked.

Explanation

It would make flux fast and it's something new.

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@Downtown-Case
Contributor

+1

To add to this, it could potentially make exllama and tabbyAPI the "production" backend for Flux, right? There's no analogue to vLLM for Flux.

@Downtown-Case
Contributor

Downtown-Case commented Sep 23, 2024

...But another thing to note is that half of Flux's VRAM usage is the T5 encoder, which also quantizes fairly poorly, and I think supporting that alone would be a large endeavor for exllama. Most backends are going to just swap it in and out, and supporting easy swapping in exllama may also be tricky.

@turboderp
Member

Everything is possible, but it's definitely "a lot of trouble", yes. It would be a completely different pipeline and very much outside of what ExLlama currently does, which is language modeling.

You could possibly do something with the transformer component, but even then it'd be a different quantization objective than next-token prediction so this would really make more sense as a standalone project.

@Ph0rk0z
Author

Ph0rk0z commented Sep 23, 2024

The AWQ guy used the Marlin kernel's matmul (https://github.com/MinusZoneAI/ComfyUI-Flux1Quantize-MZ/blob/main/q_awq_marlin_loader.py). It would be a separate project in that it would be a Comfy node, not something in tabby or exui. The latter would be a huge ask.

I think the first hurdle is how to convert the model into the format at all, since the calibration dataset is text-based. The model is almost entirely transformer layers, though.
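To illustrate why the calibration data matters, here is a toy sketch in plain Python. It is not how exl2 or any of the loaders mentioned above work: the quantizer is a simple symmetric round-to-nearest scheme, and both calibration sets are synthetic and hypothetical. The point is only that the output error of one and the same quantized layer depends on which inputs it is measured against, so text-derived calibration may not reflect what an image model's layers actually see.

```python
import random


def quantize_row(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight row (toy scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def output_mse(weights, q_weights, inputs):
    """Mean squared error of the layer output (a dot product) over the inputs."""
    errs = []
    for x in inputs:
        ref = sum(w * xi for w, xi in zip(weights, x))
        out = sum(w * xi for w, xi in zip(q_weights, x))
        errs.append((ref - out) ** 2)
    return sum(errs) / len(errs)


rng = random.Random(0)
dim = 64
weights = [rng.gauss(0, 1) for _ in range(dim)]
q_weights = quantize_row(weights)

# Two hypothetical calibration sets with different activation statistics:
# set A only excites the first half of the channels, set B the second half.
calib_a = [[rng.gauss(0, 1) if i < dim // 2 else 0.0 for i in range(dim)]
           for _ in range(256)]
calib_b = [[0.0 if i < dim // 2 else rng.gauss(0, 1) for i in range(dim)]
           for _ in range(256)]

print("output MSE on set A:", output_mse(weights, q_weights, calib_a))
print("output MSE on set B:", output_mse(weights, q_weights, calib_b))
```

The two printed errors differ even though the quantized weights are identical, which is the reason a quantizer that allocates precision based on measured error would want calibration inputs resembling the model's real activations.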
