[Feature Request]: Stable Diffusion "o1" #46

Open
1 task done
iwr-redmond opened this issue Jan 9, 2025 · 0 comments
Labels
enhancement New feature or request

Comments

iwr-redmond commented Jan 9, 2025

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

Summary

Implement Flows combining the tried-and-tested Stable Diffusion 1.5 architecture with more recent advances in AI technology, a combination I have unimaginatively christened 'o1' (original-1).*

Description

Core Generation Nodes

Stable Diffusion 1.5 remains perennially popular due to its low resource use and large catalogue of compatible tools. However, the architecture is now aging and is ready to be revitalized with more recent AI advances:

  1. Tencent ELLA: boosts prompt comprehension and detail by conditioning on a Flan-T5 LLM, similar to the text encoders found in current DiT models like FLUX
  2. Megvii HiDiffusion: generates natively at full-HD resolution without a major increase in VRAM usage
  3. Koishi-Star Euler SMEA Dy samplers: enhanced Euler-family samplers for better hand generation (a user option, but recommended)

All of these technologies support ControlNet and inpainting (see the notes in the ELLA and HiDiffusion repositories); a rough sketch of how they might be combined is shown below.
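To make the combination concrete, here is a minimal sketch using diffusers. The `apply_hidiffusion` helper is the API documented in the HiDiffusion repository; the ELLA step is commented out and hypothetical, since ELLA ships as repository code rather than a pip package, and `inject_ella` is a made-up name for whatever adapter hook Flow would implement:

```python
# Hedged sketch: SD1.5 + HiDiffusion via diffusers.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from hidiffusion import apply_hidiffusion  # pip install hidiffusion

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# HiDiffusion patches the UNet so SD1.5 can generate at ~1024px+ natively,
# rather than at its trained 512px base resolution.
apply_hidiffusion(pipe)

# Hypothetical ELLA hook: replace CLIP-only conditioning with the
# Flan-T5-based ELLA adapter.
# inject_ella(pipe, checkpoint="ella-sd15-tsc-t5xl.safetensors")

image = pipe(
    "a photograph of an astronaut riding a horse",
    height=1024, width=1024, guidance_scale=7.5,
).images[0]
image.save("astronaut_1024.png")
```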

Caption Upscaling

An LLM should be used for caption upscaling, as recommended by the ELLA authors.

I recommend the GPT4All Python client, which is not a custom node but can be called directly by Flow. Relying on an LLM node that requires manual configuration (e.g. installing Ollama) is unnecessary for such a simple single-task job. GPT4All is a simplified wrapper around llama.cpp that uses Vulkan and Metal by default for Q4_0 GGUF inference, with optional CUDA support.

A suitable abliterated model from FailSpy, such as Phi-3-mini-128k-v3, could then be used for inference with the sample instructional prompt provided in the ELLA repository. Note that Phi-3 does not support a system prompt, so the instructions and the user prompt would need to be concatenated into a single "user" message, as sketched below.
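As a rough illustration of that wiring, assuming the gpt4all Python package, a placeholder GGUF filename, and a stand-in for the ELLA repository's actual example prompt:

```python
# Hedged sketch: caption upscaling via the GPT4All Python client.
from gpt4all import GPT4All

# Placeholder filename; substitute the actual FailSpy abliterated GGUF.
MODEL_FILE = "Phi-3-mini-128k-instruct-abliterated-v3.Q4_0.gguf"

# Stand-in text; the real instructional prompt lives in the ELLA repository.
INSTRUCTION = (
    "Rewrite the following image-generation prompt as a long, richly "
    "detailed caption, keeping the original subject and style:"
)

def upscale_caption(short_prompt: str) -> str:
    model = GPT4All(MODEL_FILE)  # Vulkan/Metal by default; CUDA optional
    # Phi-3 has no system-prompt slot, so the instruction and the user
    # prompt are concatenated into a single "user" message.
    return model.generate(f"{INSTRUCTION}\n\n{short_prompt}", max_tokens=256)

print(upscale_caption("a cat sitting on a windowsill at dusk"))
```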

Additional information

Limitations

  • Due to the addition of a ~3GB Flan-T5 model and the extra computation required by the Euler SMEA Dy sampler, the VRAM required for inference will be higher than for plain SD1.5, at a guess probably around the same as SDXL (see ELLA issue 15)
  • While ELLA+CLIP conditioning, which is necessary for using embeddings and LoRAs, is supported, the ELLA-encoded portion of the generation prompt does not support prompt weighting
  • IP-Adapters may not be supported (per issue 47)
  • While the SD1.5 Hires fix would no longer be needed, the Fooocus equivalent of image2image upscaling would still be possible with these updated technologies
  • Flow would need to rely on a PR fork of ELLA until pull 68 is merged or forked
  • Flow would also need to rely on a (comparatively minor) PR fork of Euler-SMEA until pull 31 is merged or forked (fixed by @yoinked-h)

Due to the small but noteworthy differences between SD1.5 and the proposed hybrid, I reckon it would be best to use a new class name to avoid at least some confusion (although it may cause some too...). If anyone has a better idea than 'Stable Diffusion o1', I'll happily put on my Sunday-meeting clothes and sing hallelujah for it!

* with apologies to OpenAI

iwr-redmond added the enhancement label Jan 9, 2025