
[Feature Request]: Stable Diffusion "o1" #46

Open
iwr-redmond opened this issue Jan 9, 2025 · 0 comments
Labels
enhancement New feature or request


Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

Summary

Implement Flows combining the tried-and-tested Stable Diffusion 1.5 architecture with more recent advances in AI technology, a combination I have unimaginatively christened 'o1' (original-1).*

Description

Core Generation Nodes

Stable Diffusion 1.5 remains perennially popular thanks to its low resource use and large catalogue of tools. However, the architecture is aging and is ready to be revitalized with more recent AI advances:

  1. Tencent ELLA: boost comprehension and details with a multilingual Flan T5 LLM similar to those found in current DiT models like FLUX
  2. Megvii HiDiffusion: generate natively in full HD resolution without major VRAM usage increases
  3. Koishi-Star samplers: enhanced Euler-family samplers for better hand generation (user option, but recommended)

All of these technologies support ControlNet and inpainting - see notes for ELLA and HiDiffusion.

Lora & Embedding Support

ELLA supports concatenating the T5 and CLIP conditioning prior to generation, allowing embeddings and Loras to be integrated into workflows for both positive and negative prompts.
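As a rough illustration of that concatenation step (shapes here are hypothetical, and shown with NumPy for simplicity rather than the torch tensors a real Flow would pass around):

```python
import numpy as np

# Illustrative shapes only: the ELLA-projected T5 conditioning and the standard
# CLIP conditioning share a hidden dimension (768 for SD1.5's text encoder),
# so they can be joined along the token axis. The T5 token count (64) is a
# stand-in, not a value from the ELLA repo.
t5_cond = np.zeros((1, 64, 768))    # ELLA-projected T5 conditioning
clip_cond = np.zeros((1, 77, 768))  # CLIP conditioning, incl. Lora/embedding effects

# Concatenate token-wise before feeding the UNet's cross-attention
combined = np.concatenate([t5_cond, clip_cond], axis=1)
assert combined.shape == (1, 141, 768)
```

Because the CLIP half is produced by the ordinary text encoder, any Loras or textual-inversion embeddings applied there carry through unchanged.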

While the sample workflow suggests that the two prompts should be fundamentally separate, more recent work by Liu et al. proposes encoding the same prompt twice, once with T5 and once with CLIP, before concatenation.

This suggests that loading the same prompt into both the 'Conditioning' and 'CLIP Conditioning' fields, with the T5 field receiving text stripped of any unsupported prompt-weighting characters via a simple regex, would allow Loras and embeddings to be supported without requiring bifurcated prompts.
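A minimal sketch of such a stripping regex, assuming A1111-style weighting syntax like `(word:1.2)`, `(word)`, and `[word]` (the exact character set would need to match whatever prompt syntax Flow actually supports):

```python
import re

def strip_prompt_weighting(prompt: str) -> str:
    """Remove prompt-weighting syntax so the T5 branch receives plain text."""
    # Drop explicit weights: (word:1.2) -> word
    text = re.sub(r"\(([^()]*?):[0-9.]+\)", r"\1", prompt)
    # Drop remaining emphasis brackets: (word), [word]
    text = re.sub(r"[()\[\]]", "", text)
    # Collapse any doubled whitespace left behind
    return re.sub(r"\s{2,}", " ", text).strip()
```

For example, `strip_prompt_weighting("a (cat:1.2) on a [mat]")` yields `"a cat on a mat"`, which could then be fed to the 'Conditioning' (T5) field while the original weighted string goes to the CLIP field.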

Caption Upscaling

An LLM should be used for caption upscaling, as recommended by the ELLA authors.

I recommend the GPT4all Python client, which is not a custom node but can be called directly by Flow. Relying on an LLM node that requires manual configuration, e.g. installing Ollama, is unnecessary for such a simple single-task job. GPT4all is a simplified wrapper around llama.cpp that uses Vulkan and Metal by default for Q4_0 GGUF inference, with optional CUDA support.

A suitable abliterated model from FailSpy, such as Phi3-mini-128k-v3, could then be used for inference with the sample instructional prompt provided in the ELLA repository. Note that Phi-3 does not support a system prompt, meaning the instructions and user prompt would need to be concatenated into a single "user" message.
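A sketch of what that could look like with the GPT4all Python client. The model filename and the instruction text below are stand-ins, not the actual ELLA prompt or FailSpy release name; the only load-bearing detail is folding the instructions into the single "user" turn, since Phi-3 offers no system-prompt slot:

```python
def build_upscale_prompt(instruction: str, caption: str) -> str:
    """Phi-3 has no system-prompt slot, so fold the instruction and the
    user's caption into one 'user' message."""
    return f"{instruction}\n\n{caption}"

def upscale_caption(caption: str,
                    model_name: str = "phi3-mini-128k-abliterated.Q4_0.gguf") -> str:
    # Deferred import: gpt4all is only needed at generation time.
    from gpt4all import GPT4All  # pip install gpt4all

    # Stand-in instruction; a real Flow would use the sample prompt
    # from the ELLA repository instead.
    instruction = ("Rewrite the following image prompt as a richly detailed "
                   "caption, keeping the original subject and style.")
    model = GPT4All(model_name)  # loads (or downloads) the GGUF on first use
    with model.chat_session():
        return model.generate(build_upscale_prompt(instruction, caption),
                              max_tokens=200)
```

Since GPT4all handles Vulkan/Metal device selection itself, the Flow side stays a plain function call with no node configuration.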

Additional information

Limitations

  • Due to the addition of a ~3GB Flan-T5 model and increased computational resources required by the Euler SMEA Dy sampler, the VRAM required for inference will be higher than SD1.5, at a guess probably around the same as SDXL (see ELLA issue 15)
  • While ELLA+CLIP conditioning, necessary for using embeddings and Loras, is supported, the ELLA-encoded portion of the generation prompt does not support weighting (workaround per @YUHANG-Ma)
  • IP Adapters may not be supported (per issue 47)
  • While the SD1.5 Hires. fix would no longer be needed, a Fooocus-style image2image upscale would still be possible with these updated technologies
  • Flow would need to rely on a PR fork of ELLA until pull 68 is merged or forked
  • Flow would also need to rely on a (comparatively minor) PR fork of Euler-SMEA until pull 31 is merged or forked (fixed by @yoinked-h)

Due to the small but noteworthy differences between SD1.5 and the proposed hybrid, I reckon it would be best to use a new class name to avoid at least some confusion (although it may cause some too...). If anyone has a better idea than 'Stable Diffusion o1', I'll happily put on my Sunday-meeting clothes and sing hallelujah for it!

* with apologies to OpenAI
