Is there an existing issue for this?
I have searched the existing issues and checked the recent builds/commits
Summary
Implement Flows combining the tried-and-tested Stable Diffusion 1.5 architecture with more recent advances in AI technology, a combination I have unimaginatively christened 'o1' (original-1)*
Description
Core Generation Nodes
Stable Diffusion 1.5 remains perennially popular due to its low resource use and large catalogue of tools. However, the architecture is now aging and is ready to be revitalized with more recent advances in AI:
Tencent ELLA: boosts prompt comprehension and detail with a multilingual Flan-T5 LLM, similar to those found in current DiT models like FLUX
Megvii HiDiffusion: generates natively at full-HD resolution without a major increase in VRAM usage (a minimal usage sketch follows the note below)
All of these technologies support ControlNet and inpainting - see notes for ELLA and HiDiffusion.
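To give a sense of how little plumbing HiDiffusion needs on top of a stock SD1.5 pipeline, here is a rough sketch using diffusers. The helper name apply_hidiffusion follows the HiDiffusion README, but treat the details (checkpoint id, resolution, dtype) as assumptions rather than the proposed Flow node code:

```python
# Minimal sketch: native ~1024px generation from an SD1.5 checkpoint with HiDiffusion.
# Assumes the 'hidiffusion' package exposes apply_hidiffusion() as described in its
# README; this illustrates the idea rather than the proposed Flow node itself.
import torch
from diffusers import StableDiffusionPipeline
from hidiffusion import apply_hidiffusion

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
apply_hidiffusion(pipe)  # patch the UNet so high-res output stays close to SD1.5 VRAM use

image = pipe(
    "a watercolour painting of a lighthouse at dusk",
    height=1024, width=1024, guidance_scale=7.5,
).images[0]
image.save("lighthouse_1024.png")
```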
Lora & Embedding Support
ELLA supports concatenating the T5 and CLIP conditioning prior to generation, allowing embeddings and Loras to be integrated into workflows for both positive and negative prompts.
While the sample workflow suggests that the two prompts should be kept fundamentally separate, more recent work by Liu et al. proposes encoding the same prompt twice, once with T5 and once with CLIP, before concatenation.
This suggests that loading the same prompt into both the 'Conditioning' and 'Clip Conditioning' fields, with the T5 field receiving the text stripped of any unsupported prompt-weighting characters by a simple regex, would allow Loras and embeddings to be easily supported without requiring bifurcated prompts (a sketch of such a regex pass is given below).
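As a loose sketch of what that stripping step might look like: the weighting syntax handled here (A1111-style '(token:1.2)' weights and bracket emphasis) and the helper name strip_prompt_weights are my assumptions, not existing Flow code:

```python
import re

def strip_prompt_weights(prompt: str) -> str:
    """Strip A1111-style emphasis/weighting so the bare text can go to the ELLA/T5 field."""
    # Replace '(token:1.2)'-style weighted groups with just the token text.
    prompt = re.sub(r"\(([^()]*?):[0-9]*\.?[0-9]+\)", r"\1", prompt)
    # Drop any remaining emphasis parentheses/brackets while keeping their contents.
    prompt = re.sub(r"[()\[\]]", "", prompt)
    # Tidy up doubled whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", prompt).strip()

# Example: the CLIP field keeps the weighted prompt, the T5 field gets the stripped one.
weighted = "a (red:1.3) fox, [detailed] fur, (masterpiece)"
print(strip_prompt_weights(weighted))  # -> "a red fox, detailed fur, masterpiece"
```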
Caption Upscaling
An LLM should be used for caption upscaling, as recommended by the ELLA authors.
I recommend the GPT4all Python client, which is not a custom node but can be called directly by Flow. Relying on an LLM node that requires manual configuration (e.g. installing Ollama) is unnecessary for such a simple single-task job. GPT4all is a simplified version of llama.cpp that uses Vulkan and Metal by default for Q4_0 GGUF inference, with optional CUDA support.
A suitable abliterated model from FailSpy, such as Phi-3-mini-128k-v3, could then be used for inference with the sample instructional prompt provided in the ELLA repository. Note that Phi-3 does not support system prompts, so the instructions and the user prompt would need to be concatenated into a single "user" message (see the sketch below).
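A rough sketch of that call via the GPT4all Python client; the model filename and the instruction text are placeholders (the real instructional prompt is the sample in the ELLA repository), not verified values:

```python
# Rough sketch of caption upscaling via the GPT4all Python client (pip install gpt4all).
from gpt4all import GPT4All

# Stand-in for the sample instructional prompt from the ELLA repository.
INSTRUCTIONS = (
    "Expand the following image prompt into a single, richly detailed caption "
    "describing subjects, setting, lighting and style."
)

def upscale_caption(user_prompt: str) -> str:
    model = GPT4All("Phi-3-mini-128k-abliterated-v3.Q4_0.gguf")  # hypothetical filename
    # Phi-3 has no system role, so the instructions are prepended to the user prompt.
    full_prompt = f"{INSTRUCTIONS}\n\nPrompt: {user_prompt}"
    return model.generate(full_prompt, max_tokens=256, temp=0.7)

print(upscale_caption("a fox in a snowy forest"))
```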
Additional information
Limitations
Due to the addition of a ~3GB Flan-T5 model and the extra compute required by the Euler SMEA Dy sampler, the VRAM required for inference will be higher than for SD1.5; at a guess, roughly on par with SDXL (see ELLA issue 15)
While ELLA+CLIP conditioning, necessary for using embeddings and Loras, is supported, the ELLA-encoded portion of the generation prompt does not support weighting (workaround per @YUHANG-Ma)
While the SD1.5 Hires. fix would no longer be needed, a Fooocus-style image2image upscale would still be possible with these updated technologies
Flow would need to rely on a PR fork of ELLA until pull 68 is merged or forked
Flow would also need to rely on a (comparatively minor) PR fork of Euler-SMEA until pull 31 is merged or forked (fixed by @yoinked-h)
Due to the small but noteworthy differences between SD1.5 and the proposed hybrid, I reckon it would be best to use a new class-name to avoid at least some confusion (although it may cause some too...). If anyone has a better idea than 'Stable Diffusion o1', I'll happily put on my Sunday-meeting clothes and sing hallelujah for it!