
Fix Flux: clip_l support (SD3/3.5 improvements included) #397

Open
wants to merge 11 commits into master

Conversation

@stduhpf (Contributor) commented Sep 4, 2024

Fixes #396

I made the CLIP backend skip the final text projection when the text_projection tensor doesn't exist. This is mathematically equivalent to replacing text_projection with the identity matrix (i.e. torch.eye()).
I also replaced the matrix multiplication with a bias-less linear layer when text_projection does exist (somehow this changes the outcome).

Flux.1 Schnell (q3_k):

[image comparison: master vs PR outputs, using clip-L and ViT-L-14 text encoders]

SD3 2B (q8_0):

[image comparison: master vs PR outputs]

SD3.5 Large Turbo (q4_1):

[image comparison: master vs PR outputs]

@stduhpf (Contributor, Author) commented Sep 5, 2024

If anyone knows how this is all supposed to work, feel free to improve the code.

@stduhpf stduhpf marked this pull request as draft September 13, 2024 10:05
@stduhpf (Contributor, Author) commented Sep 13, 2024

I'm pretty sure this is not the way it's supposed to work, so I'm converting this PR to a draft until I figure out how to make it work properly. Right now, I think this is only adding some "noise" to the t5 prompt.

@stduhpf (Contributor, Author) commented Sep 13, 2024

OK, I believe I got it now.

@stduhpf stduhpf marked this pull request as ready for review September 13, 2024 16:33
@stduhpf (Contributor, Author) commented Sep 13, 2024

Hard-coding the prompt for clip to always be "Painting, in the style of starry night by Van Gogh", while keeping the same example prompt for T5 ("a lovely cat holding a sign says 'flux.cpp'") now gives this result:
[test output image]
I'm 99.99% certain this PR is now ready for merge.

Patch for hardcoded Clip prompt
diff --git a/conditioner.hpp b/conditioner.hpp
index 8a710d1..75efb08 100644
--- a/conditioner.hpp
+++ b/conditioner.hpp
@@ -1038,9 +1038,10 @@ struct FluxCLIPEmbedder : public Conditioner {
         std::vector<float> t5_weights;
         for (const auto& item : parsed_attention) {
             const std::string& curr_text = item.first;
+            const std::string& curr_text_l = "Painting, in the style of starry night by Van Gogh";
             float curr_weight            = item.second;
 
-            std::vector<int> curr_tokens = clip_l_tokenizer.encode(curr_text, on_new_token_cb);
+            std::vector<int> curr_tokens = clip_l_tokenizer.encode(curr_text_l, on_new_token_cb);
             clip_l_tokens.insert(clip_l_tokens.end(), curr_tokens.begin(), curr_tokens.end());
             clip_l_weights.insert(clip_l_weights.end(), curr_tokens.size(), curr_weight);
 

Edit: I tried this test with my previous attempt at fixing clip (e6314d3) and I got a similar result. It was actually working-ish even though it was definitely not implemented like in the official Flux inference code.

Comparison

(all of those with the same hardcoded prompt for clip_l)

[image comparison: PR (af4f83f) vs previous attempt (d7679c9) vs master]

@stduhpf stduhpf changed the title Fix Flux: clip_l support Fix Flux: clip_l support (SD3 improvements included) Sep 13, 2024
@stduhpf (Contributor, Author) commented Sep 18, 2024

@leejet does this look good enough to merge?

conditioner.hpp Outdated
@@ -1073,7 +1065,7 @@ struct FluxCLIPEmbedder : public Conditioner {
return {{clip_l_tokens, clip_l_weights}, {t5_tokens, t5_weights}};
}

SDCondition get_learned_condition_common(ggml_context* work_ctx,
SDCondition get_learned_condition_common(ggml_context* work_ctx,
A Contributor commented: extra spaces?

clip.hpp Outdated
@@ -711,7 +711,11 @@ class CLIPTextModel : public GGMLBlock {
if (return_pooled) {
auto text_projection = params["text_projection"];
ggml_tensor* pooled = ggml_view_1d(ctx, x, hidden_size, x->nb[1] * max_token_idx);
pooled = ggml_mul_mat(ctx, ggml_cont(ctx, ggml_transpose(ctx, text_projection)), pooled);
if(text_projection != NULL){
A Contributor commented: try to fit the formatting in more with the surrounding code (spaces).

@stduhpf (Contributor, Author) commented Oct 24, 2024

This breaks SD3.5 support somehow!

@stduhpf stduhpf marked this pull request as draft October 24, 2024 14:24
@stduhpf stduhpf mentioned this pull request Oct 24, 2024
@stduhpf (Contributor, Author) commented Oct 24, 2024

OK, it works again now.

@stduhpf stduhpf marked this pull request as ready for review October 24, 2024 16:08
@stduhpf stduhpf changed the title Fix Flux: clip_l support (SD3 improvements included) Fix Flux: clip_l support (SD3/3.5 improvements included) Oct 24, 2024
Successfully merging this pull request may close these issues.

Bug: Clip-L does absolutely nothing with Flux models.