
[Feature Request]: Stable Diffusion "o1" #46

Open
iwr-redmond opened this issue Jan 9, 2025 · 0 comments
Labels
enhancement New feature or request


Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

Summary

Implement Flows combining the tried-and-tested Stable Diffusion 1.5 architecture with more recent advances in AI technology, a combination I have unimaginatively christened 'o1' (original-1).*

Description

Core Generation Nodes

Stable Diffusion 1.5 remains perennially popular thanks to its low resource use and large catalogue of tools. However, the architecture is aging and is ready to be revitalized with more recent AI advances:

  1. Tencent ELLA: boost comprehension and details with a multilingual Flan T5 LLM similar to those found in current DiT models like FLUX
  2. Megvii HiDiffusion: generate natively in full HD resolution without major VRAM usage increases
  3. Koishi-Star samplers: enhanced Euler-family samplers for better hand generation (user option, but recommended)

All of these technologies support ControlNet and inpainting - see notes for ELLA and HiDiffusion.

Lora & Embedding Support

ELLA supports concatenating the T5 and CLIP conditioning prior to generation, allowing embeddings and Loras to be integrated into workflows for both positive and negative prompts.
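As a rough illustration of that concatenation step (shapes here are hypothetical, and shown with NumPy for simplicity rather than the torch tensors a real Flow would pass around):

```python
import numpy as np

# Illustrative shapes only: the ELLA-projected T5 conditioning and the standard
# CLIP conditioning share a hidden dimension (768 for SD1.5's text encoder),
# so they can be joined along the token axis. The T5 token count (64) is a
# stand-in, not a value from the ELLA repo.
t5_cond = np.zeros((1, 64, 768))    # ELLA-projected T5 conditioning
clip_cond = np.zeros((1, 77, 768))  # CLIP conditioning, incl. Lora/embedding effects

# Concatenate token-wise before feeding the UNet's cross-attention
combined = np.concatenate([t5_cond, clip_cond], axis=1)
assert combined.shape == (1, 141, 768)
```

Because the CLIP half is produced by the ordinary text encoder, any Loras or textual-inversion embeddings applied there carry through unchanged.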

While the sample workflow suggests that the two prompts should be fundamentally separate, more recent work by Liu et al. proposes encoding the same prompt twice, once with T5 and once with CLIP, before concatenation.

This suggests that loading the same prompt into both the 'Conditioning' and 'CLIP Conditioning' fields, with the T5 field receiving text stripped of any unsupported prompt-weighting characters via a simple regex, would allow Loras and embeddings to be supported without requiring bifurcated prompts.
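A minimal sketch of such a stripping regex, assuming A1111-style weighting syntax like `(word:1.2)`, `(word)`, and `[word]` (the exact character set would need to match whatever prompt syntax Flow actually supports):

```python
import re

def strip_prompt_weighting(prompt: str) -> str:
    """Remove prompt-weighting syntax so the T5 branch receives plain text."""
    # Drop explicit weights: (word:1.2) -> word
    text = re.sub(r"\(([^()]*?):[0-9.]+\)", r"\1", prompt)
    # Drop remaining emphasis brackets: (word), [word]
    text = re.sub(r"[()\[\]]", "", text)
    # Collapse any doubled whitespace left behind
    return re.sub(r"\s{2,}", " ", text).strip()
```

For example, `strip_prompt_weighting("a (cat:1.2) on a [mat]")` yields `"a cat on a mat"`, which could then be fed to the 'Conditioning' (T5) field while the original weighted string goes to the CLIP field.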

Caption Upscaling

An LLM should be used for caption upscaling, as recommended by the ELLA authors.

I recommend the GPT4all Python client, which is not a custom node but can be called directly by Flow. Relying on an LLM node that requires manual configuration, e.g. installing Ollama, is unnecessary for such a simple single-task job. GPT4all is a simplified wrapper around llama.cpp that uses Vulkan and Metal by default for Q4_0 GGUF inference, with optional CUDA support.

A suitable abliterated model from FailSpy, such as Phi3-mini-128k-v3, could then be used for inference with the sample instructional prompt provided in the ELLA repository. Note that Phi-3 does not support a system prompt, meaning the instructions and user prompt would need to be concatenated into a single "user" message.
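A sketch of what that could look like with the GPT4all Python client. The model filename and the instruction text below are stand-ins, not the actual ELLA prompt or FailSpy release name; the only load-bearing detail is folding the instructions into the single "user" turn, since Phi-3 offers no system-prompt slot:

```python
def build_upscale_prompt(instruction: str, caption: str) -> str:
    """Phi-3 has no system-prompt slot, so fold the instruction and the
    user's caption into one 'user' message."""
    return f"{instruction}\n\n{caption}"

def upscale_caption(caption: str,
                    model_name: str = "phi3-mini-128k-abliterated.Q4_0.gguf") -> str:
    # Deferred import: gpt4all is only needed at generation time.
    from gpt4all import GPT4All  # pip install gpt4all

    # Stand-in instruction; a real Flow would use the sample prompt
    # from the ELLA repository instead.
    instruction = ("Rewrite the following image prompt as a richly detailed "
                   "caption, keeping the original subject and style.")
    model = GPT4All(model_name)  # loads (or downloads) the GGUF on first use
    with model.chat_session():
        return model.generate(build_upscale_prompt(instruction, caption),
                              max_tokens=200)
```

Since GPT4all handles Vulkan/Metal device selection itself, the Flow side stays a plain function call with no node configuration.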

Additional information

Limitations

  • Due to the addition of a ~3GB Flan-T5 model and increased computational resources required by the Euler SMEA Dy sampler, the VRAM required for inference will be higher than SD1.5, at a guess probably around the same as SDXL (see ELLA issue 15)
  • While ELLA+CLIP conditioning, necessary for using embeddings and Loras, is supported, the ELLA-encoded portion of the generation prompt does not support weighting (workaround per @YUHANG-Ma)
  • IP Adapters may not be supported (per issue 47)
  • While the SD1.5 Hires. fix would no longer be needed, a Fooocus-style image2image upscale would still be possible with these updated technologies
  • Flow would need to rely on a PR fork of ELLA until pull 68 is merged or forked
  • Flow would also need to rely on a (comparatively minor) PR fork of Euler-SMEA until pull 31 is merged or forked (fixed by @yoinked-h)

Due to the small but noteworthy differences between SD1.5 and the proposed hybrid, I reckon it would be best to use a new class name to avoid at least some confusion (although it may cause some too...). If anyone has a better idea than 'Stable Diffusion o1', I'll happily put on my Sunday-meeting clothes and sing hallelujah for it!

* with apologies to OpenAI
