FLUX
FLUX.1 family consists of 3 variations:
- Pro
  Model weights are NOT released, model is available only via Black Forest Labs
- Dev
  Open-weight, guidance-distilled from Pro variation, available for non-commercial applications
- Schnell
  Open-weight, timestep-distilled from Dev variation, available under Apache2.0 license
Additionally, SD.Next includes pre-quantized variations of the FLUX.1 Dev variation: qint8, qint4 and nf4
To use any of the variations or quantizations, select it from Networks -> Reference and the model will be auto-downloaded on first use
Notes:
- FLUX.1 Dev variant is a gated model, you need to accept the terms and conditions to use it
- Do not download any of the base models manually, use the built-in downloader!
Tip
- Pick variant that uses less memory as model in original form has very high requirements
- Set appropriate offloading setting before loading the model to avoid out-of-memory errors
- Check compatibility of different quantizations with your platform and GPU!
There are already many unofficial FLUX.1 variations available
Any Diffusers-based variation can be downloaded and loaded into SD.Next using Models -> Huggingface -> Download
For example, an interesting variation is a merge of the Dev and Schnell variations by sayakpaul: sayakpaul/FLUX.1-merged
SD.Next includes support for FLUX.1 LoRAs
Since LoRA keys vary significantly between tools used to train LoRA as well as LoRA types,
support for additional LoRAs will be added as needed - please report any non-functional LoRAs!
Also note that LoRA compatibility depends on the quantization type! If you have issues loading a LoRA, try switching your FLUX.1 base model to a different quantization type
Note: Loading of all-in-one single-file safetensors requires SD.Next.dev branch
Typical all-in-one safetensors file is over 20GB in size and contains full model with transformer, both text-encoders and VAE
Since the text encoders and VAE are the same across all FLUX.1 models, using all-in-one safetensors is not recommended
Unet/Transformer component of FLUX.1 is a typical model fine-tune and is around 11GB in size
To load a Unet/Transformer safetensors file:
- Download safetensors or gguf file from desired source and place it in models/UNET folder
  example: FastFlux Unchained
- Load FLUX.1 model as usual and then
- Replace transformer with one in desired safetensors file using:
  Settings -> Execution & Models -> UNet
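The first step above (placing the downloaded file in the models/UNET folder) can be sketched as a small script. This is only an illustration: `install_unet` is a hypothetical helper, not part of SD.Next, and it assumes the models folder sits directly under your SD.Next install directory.

```python
import shutil
from pathlib import Path

# File types named in the steps above
ALLOWED_SUFFIXES = {".safetensors", ".gguf"}


def install_unet(src, sdnext_root):
    """Copy a transformer file into <sdnext_root>/models/UNET (hypothetical helper)."""
    source = Path(src)
    if source.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {source.suffix!r}")
    # Assumption: models/UNET lives directly under the SD.Next root folder
    target_dir = Path(sdnext_root) / "models" / "UNET"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)
    return target
```

After copying, the file still has to be selected in the UI as described in the last step.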
SD.Next allows changing optional text encoder on-the-fly
Go to Settings -> Models -> Text encoder and select the desired text encoder
T5 enhances text rendering and some details, but it's otherwise very lightly used and optional
Loading lighter T5 will greatly decrease model resource usage, but may not be compatible with all offloading modes
Tip
To use prompt attention syntax with FLUX.1, set Settings -> Execution -> Prompt attention to xhinker
Example image with different encoder quantization options
SD.Next allows changing VAE model used by FLUX.1 on-the-fly
There are no alternative VAE models released, so this setting is mostly for future use
Tip
To enable image previews during generation, set Settings -> Live Preview -> Method to TAESD
To further speed up generation, you can disable "full quality" which triggers use of TAESD instead of full VAE to decode final image
As mentioned, FLUX.1 at the moment supports only Euler FlowMatch scheduler, additional schedulers will be added in the future
Due to the specifics of flow-matching methods, the number of steps also has a strong influence on image composition, not just on how it is resolved
Example image at different steps
Additionally, the sampler can be tuned with the shift parameter, which roughly modifies how long the model spends on composition vs actual diffusion
Example image with different sampler shift values
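To get a feel for what shift does, here is a minimal sketch of a static time-shift formula commonly used with flow-matching schedulers. Whether SD.Next applies exactly this formula is an assumption, and the function names are illustrative only.

```python
def shift_sigma(sigma, shift):
    """Apply a static time shift to a noise level in [0, 1]."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def schedule(steps, shift):
    """Linearly spaced noise levels from 1.0 downwards, then shifted."""
    return [shift_sigma(1.0 - i / steps, shift) for i in range(steps)]


# Compare an unshifted schedule with a shifted one
for shift in (1.0, 3.0):
    print(shift, [round(s, 3) for s in schedule(4, shift)])
```

With shift above 1, noise levels stay high for more of the steps, so more of the step budget goes to the early, composition-defining part of the process.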
Note: Support for FLUX.1 ControlNet requires SD.Next.dev branch
Support for all InstantX/Shakker-Labs models including Union-Pro
FLUX.1 ControlNets are large at over 6GB on top of the already very large FLUX.1 model
As such, you may need to use sequential offloading, which is not as fast but uses far less memory
When using union model, you must also select control mode in the control unit
FLUX.1 does not yet support img2img so to use ControlNet, you need to set input image via control unit override, not via main panel input image
Tip
For convenience, you can add settings that allow quick replacement of model components to your quicksettings by adding Settings -> User Interface -> Quicksettings list -> sd_unet, sd_vae, sd_text_encoder
Quantization can significantly reduce memory requirements, but it can also slightly reduce quality of outputs
Also, different quantization options are very platform and GPU dependent and are not supported on all platforms
- qint8 and qint4 quantization require optimum-quanto which will be auto-installed on first use
- nf4 quantization requires bitsandbytes which will be auto-installed on first use
- gguf quantization requires gguf which will be auto-installed on first use
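As a rough illustration of why quantization reduces memory requirements, the sketch below estimates weight storage at different bit widths. The 12 billion parameter count is an assumption (it is not stated on this page), and the result covers transformer weights only, so it will not match measured memory usage, which also includes text encoders, VAE and runtime buffers.

```python
# Assumed parameter count for the FLUX.1 transformer (approximate, not from this page)
TRANSFORMER_PARAMS = 12e9

# Nominal bits per weight; real formats add overhead for scales and unquantized layers
BITS_PER_WEIGHT = {"bf16": 16, "qint8": 8, "qint4": 4, "nf4": 4}


def weight_size_gb(params, bits):
    """Approximate weight storage in GB (10^9 bytes)."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_size_gb(TRANSFORMER_PARAMS, bits):.0f} GB")
```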
Another option is NNCF which performs quantization during model load (instead of having a pre-quantized model)
Tip
The advantage of NNCF is that it works on any platform: if you're having issues with optimum-quanto or bitsandbytes, try it out!
Example image with both dev and schnell variations and different transformer quantization options
Optimum-Quanto:
- requires torch==2.4.0
  if you're running older torch, you can try upgrading it or running sdnext with --reinstall flag
- not compatible with balanced offload
- not compatible with most LoRAs
- not supported on Intel Arc/IPEX since IPEX is still based on Torch 2.1
- not supported with Zluda since Zluda does not support torch 2.4
- not supported on AMD ROCm on Linux using official package due to explicit CUDA checks
  works with AMD ROCm on Linux using fork of: https://github.com/Disty0/optimum-quanto/
BitsAndBytes:
- default bitsandbytes package only supports nVidia GPUs
- only supported on Linux due to reliance on torch.compile
- some quantization types require newer GPU with supported CUDA ops
  e.g. nVidia Turing GPUs or newer
- instructions for AMD/ROCm support
- for Intel/IPEX support
GGUF:
- gguf supports wide range of quantization types and is not platform or GPU dependent
- gguf does not provide native GPU kernels which means that gguf is purely a storage optimization
- gguf reduces model size and memory usage, but it does slow down model inference since all quantized weights are de-quantized on-the-fly
- gguf is not compatible with model offloading as it would trigger de-quantization
- note: only supported component in gguf binary format is UNET/Transformer
  you cannot load single-file gguf model, you must load standard FLUX.1 model with gguf UNET/Transformer which can be loaded before or after model has been loaded
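The "storage optimization" point can be illustrated with a toy example: weights are stored as small integers plus a scale, and converted back to floats every time they are used. This is a simplified sketch of block-wise quantization in general, not the actual gguf format or its quantization types, and all names are illustrative.

```python
def quantize_block(values, bits=8):
    """Quantize one block of floats to signed integers plus a single scale."""
    qmax = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 when the block is all zeros
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale


def dequantize_block(qvalues, scale):
    """Restore approximate floats; this cost is paid on every use of the weights."""
    return [q * scale for q in qvalues]


def dot_with_quantized(qvalues, scale, x):
    """Compute a dot product by de-quantizing first, then using normal float math."""
    weights = dequantize_block(qvalues, scale)
    return sum(w * xi for w, xi in zip(weights, x))
```

The stored integers take less space than the original floats, but the math itself still runs on floats, which is why this approach saves memory without speeding up inference.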
NNCF:
- broad platform and GPU support
- enable in Settings -> Compute -> Compress model weights with NNCF
- see NNCF Wiki for more details
FLUX.1 is a massive model at ~32GB and as such it is recommended to use offloading
To set offloading, see Settings -> Diffusers -> Model offload mode:
- Balanced
  Recommended for compatible high VRAM GPUs
  Faster but requires compatible platform and sufficient VRAM
  Not compatible with Quanto qint quantization
- Sequential
  Recommended for low VRAM GPUs
  Much slower but allows FLUX.1 to run on GPUs with 6GB VRAM
  Not compatible with Quanto qint or BitsAndBytes nf4 quantization
- Model
  Higher compatibility than either balanced or sequential, but lesser savings
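The incompatibilities listed above can be collected into a small lookup, which may be handy when scripting or documenting a setup. This is a sketch only: `is_supported` is a hypothetical helper, and it encodes just the qint and nf4 combinations named in the offload list (gguf is left out because the page does not say which offload modes are affected).

```python
# Incompatible (quantization, offload mode) pairs as stated on this page
INCOMPATIBLE = {
    ("qint8", "balanced"),
    ("qint4", "balanced"),
    ("qint8", "sequential"),
    ("qint4", "sequential"),
    ("nf4", "sequential"),
}


def is_supported(quantization, offload):
    """Return False for combinations this page lists as incompatible."""
    return (quantization.lower(), offload.lower()) not in INCOMPATIBLE


for quant in ("nf4", "qint8", "qint4"):
    modes = [m for m in ("none", "balanced", "sequential") if is_supported(quant, m)]
    print(quant, "->", ", ".join(modes))
```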
Performance and memory usage of different FLUX.1 variations:
dtype | time (sec) | performance | memory | offload | note |
---|---|---|---|---|---|
bf16 | | | >32 GB | none | *1 |
bf16 | 50.47 | 0.40 it/s | | balanced | *2 |
bf16 | 94.28 | 0.21 it/s | 1.89 GB | sequential | |
nf4 | 14.69 | 1.36 it/s | 17.92 GB | none | |
nf4 | 21.02 | 0.95 it/s | | balanced | *2 |
nf4 | | | | sequential | *3 |
qint8 | 15.42 | 1.30 it/s | 18.85 GB | none | |
qint8 | | | | balanced | *4 |
qint8 | | | | sequential | *5 |
qint4 | 18.37 | 1.09 it/s | 11.38 GB | none | |
qint4 | | | | balanced | *4 |
qint4 | | | | sequential | *5 |
Notes:
- *1: Memory usage exceeds 32GB and is not recommended
- *2: Balanced offload VRAM usage is not included since it depends on desired threshold
- *3: BitsAndBytes nf4 quantization is not compatible with sequential offload
Error: Blockwise quantization only supports 16/32-bit floats
- *4: Quanto qint quantization is not compatible with balanced offload
Error: QBytesTensor.new() missing 5 required positional arguments
- *5: Quanto qint quantization is not compatible with sequential offload
Error: Expected all tensors to be on the same device
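As a quick sanity check on the benchmark table, multiplying time by iteration rate gives the implied number of steps per run. The step count is not stated on this page, so the result is an inference from the published numbers, not a documented setting.

```python
# (dtype, offload, time in seconds, iterations per second) taken from the table
ROWS = [
    ("bf16", "balanced", 50.47, 0.40),
    ("bf16", "sequential", 94.28, 0.21),
    ("nf4", "none", 14.69, 1.36),
    ("nf4", "balanced", 21.02, 0.95),
    ("qint8", "none", 15.42, 1.30),
    ("qint4", "none", 18.37, 1.09),
]


def implied_steps(seconds, its_per_sec):
    """Steps per run implied by total time and iteration rate."""
    return seconds * its_per_sec


for dtype, offload, seconds, rate in ROWS:
    print(f"{dtype:6} {offload:10} ~{implied_steps(seconds, rate):.1f} steps")
```

All rows come out close to 20, which suggests the measurements were taken at roughly the same step count and are comparable with each other.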
- ControlNet XLabs-AI models: https://github.com/huggingface/diffusers/issues/9301
- Diffusers generic quantization: https://github.com/huggingface/diffusers/issues/9174
- Differential diffusion: https://github.com/huggingface/diffusers/pull/9268
- IP-Adapter: https://huggingface.co/XLabs-AI/flux-ip-adapter