Is there an existing issue for this?
I have searched the existing issues and checked the recent builds/commits.
Summary
Implement Flows combining the tried and tested Stable Diffusion 1.5 architecture with more recent advances in AI technology, a combination I have unimaginatively christened 'o1' (original-1)*
Description
Core Generation Nodes
Stable Diffusion 1.5 remains perennially popular due to its low resource use and large catalogue of tools. However, the architecture is aging and ready to be revitalized with more recent AI advances:
Tencent ELLA: boosts prompt comprehension and detail with a Flan-T5 LLM, similar to the text encoders found in current DiT models like FLUX
Megvii HiDiffusion: generates natively at full-HD resolution without a major increase in VRAM usage
All of these technologies support ControlNet and inpainting; see the respective notes for ELLA and HiDiffusion.
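As a rough sketch of how the pieces might be wired together: the `hidiffusion` package from the Megvii repo exposes `apply_hidiffusion`, which patches a diffusers pipeline in place. The ELLA step is omitted here because ELLA ships as research code rather than a pip package, and any loader for it would be hypothetical. The snap-to-multiple helper simply keeps requested dimensions UNet-compatible.

```python
def snap_to_multiple(value: int, multiple: int = 8) -> int:
    """Round a target dimension down to the nearest multiple the SD1.5 UNet accepts."""
    return (value // multiple) * multiple

def build_o1_pipeline(checkpoint: str = "runwayml/stable-diffusion-v1-5"):
    """Sketch: SD1.5 + HiDiffusion for native ~1080p generation.

    ELLA conditioning would need to be injected separately; it has no
    packaged API, so it is deliberately left out of this sketch.
    """
    # Lazy imports so the helper above stays usable without the heavy deps installed.
    from diffusers import StableDiffusionPipeline
    from hidiffusion import apply_hidiffusion  # from the Megvii HiDiffusion repo

    pipe = StableDiffusionPipeline.from_pretrained(checkpoint)
    apply_hidiffusion(pipe)  # patches the UNet for high-resolution generation
    return pipe

if __name__ == "__main__":
    # Requires a GPU and a model download; not run at import time.
    pipe = build_o1_pipeline()
    width, height = snap_to_multiple(1920), snap_to_multiple(1080)
    image = pipe("a photo of an astronaut", width=width, height=height).images[0]
```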
Caption Upscaling
An LLM should be used for caption upscaling, as recommended by the ELLA authors.
I recommend the GPT4All Python client, which is not a custom node but can be called directly by Flow. Relying on an LLM node that requires manual configuration (e.g. installing Ollama) is unnecessary for such a simple, single-task job. GPT4All is a simplified wrapper around llama.cpp that uses Vulkan and Metal by default for Q4_0 GGUF inference, with optional CUDA support.
A suitable abliterated model from FailSpy, such as Phi3-mini-128k-v3, could then be used for inference with the sample instructional prompt provided in the ELLA repository. Note that Phi-3 does not support a system prompt, so the instructions and user prompt would need to be concatenated into a single "user" message.
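Since Phi-3 has no system role, the instructions and the caption have to be folded into one user message. A minimal sketch using the GPT4All Python client follows; the instruction string and model filename are placeholders (the real instructional prompt lives in the ELLA repository), and only the prompt-building step is concrete.

```python
def build_phi3_prompt(instructions: str, user_prompt: str) -> str:
    """Concatenate instructions and caption into a single user message,
    since Phi-3 does not support a separate system prompt."""
    return f"{instructions.strip()}\n\nCaption: {user_prompt.strip()}"

def upsample_caption(caption: str,
                     model_file: str = "Phi-3-mini-128k.Q4_0.gguf") -> str:
    """Sketch of caption upscaling via GPT4All; the model filename is a placeholder."""
    from gpt4all import GPT4All  # lazy import; pip install gpt4all

    # Stand-in for the ELLA sample instructional prompt:
    instructions = "Rewrite the caption as a detailed image description."
    model = GPT4All(model_file)
    with model.chat_session():
        return model.generate(build_phi3_prompt(instructions, caption),
                              max_tokens=256)
```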
Additional information
Limitations
Due to the added ~3GB Flan-T5 model and the increased compute required by the Euler SMEA Dy sampler, inference VRAM will be higher than plain SD1.5; at a guess, probably around the same as SDXL (see ELLA issue 15)
While ELLA+CLIP conditioning, necessary for using embeddings and LoRAs, is supported, the ELLA-encoded portion of the generation prompt does not support weighting
While the SD1.5 Hires fix would no longer be needed, a Fooocus-style image-to-image upscale would still be possible with these updated technologies
Flow would need to rely on a PR fork of ELLA until pull 68 is merged or forked
Flow would also need to rely on a (comparatively minor) PR fork of Euler-SMEA until pull 31 is merged or forked (fixed by @yoinked-h)
Due to the small but noteworthy differences between SD1.5 and the proposed hybrid, I reckon it would be best to use a new class-name to avoid at least some confusion (although it may cause some too...). If anyone has a better idea than 'Stable Diffusion o1', I'll happily put on my Sunday-meeting clothes and sing hallelujah for it!