-
The models are designed for 1024x1024 generation, so generate at at least 768x768 if possible. Use https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_base_1.0_0.9vae.safetensors and avoid the other models in the repo.
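For reference, grabbing that exact file from a command prompt could look like the sketch below; the /resolve/ download URL and the models\Stable-diffusion target folder are assumptions based on the usual Hugging Face and webui layouts, and the install path is a placeholder.

```bat
rem Placeholder install path - adjust to your own webui folder.
cd /d C:\path\to\stable-diffusion-webui-directml\models\Stable-diffusion
rem Assumed direct-download URL (blob/ swapped for resolve/).
curl -L -O https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0_0.9vae.safetensors
```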
-
OK, making some progress. Per the readme on Hugging Face, the diffusers, invisible-watermark, transformers, accelerate, and safetensors packages need to be updated. Using a similar method to the above, use activate.bat in ...venv\Scripts and pip install as follows (this can be saved as a .bat or just run step by step in a command prompt):
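The script itself didn't make it into this copy of the post; a minimal sketch of what such a .bat could look like (placeholder path, package names taken from the Hugging Face readme) is:

```bat
@echo off
rem Placeholder path - point this at your own stable-diffusion-webui-directml install.
cd /d C:\path\to\stable-diffusion-webui-directml
rem Activate the venv so pip targets the webui's own Python environment.
call venv\Scripts\activate.bat
pip install --upgrade diffusers invisible-watermark transformers accelerate safetensors
pause
```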
Next, I used the model with the baked VAE from @ClashSAN listed here: Finally, using --upcast-sampling in the command line args and Scaled Dot Product Attention set in the UI Optimization options, performance is decent enough (still getting the high-frequency, lower-power problem). Note that RDNA 1 (RX 5000 series) and/or RDNA 2 (RX 6000 series) cards may need different settings.
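For illustration, the relevant line in webui-user.bat might look something like the sketch below; only --upcast-sampling is confirmed in this comment, and the other flags (--no-half, --medvram) come from later comments in the thread, so treat the exact combination as an assumption.

```bat
rem Sketch only: --upcast-sampling is from this comment; --no-half and --medvram
rem appear elsewhere in the thread and may not match the exact setup used here.
set COMMANDLINE_ARGS=--upcast-sampling --no-half --medvram
```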
Example prompt: Negative: DPM++ 2M Karras, 32 steps, 768x1024. Time taken: 51.7 seconds. Command prompt reports ~1.5 seconds per iteration.
-
Did some testing removing --no-half: it results in the following messages being echoed:
Creating model from config: C:\Users\MYUSERNAME\STABLEDIFFUSION\WEBUI\stable-diffusion-webui-directml\repositories\generative-models\configs\inference\sd_xl_base.yaml
A tensor with all NaNs was produced in VAE.
-
Trying with ONLY --medvram results in the same: A tensor with all NaNs was produced in VAE. With the original arguments the message is not produced; however, generation time is a tiny bit slower, at about 1.18 seconds per iteration. (PS: I noticed that the units of performance echoed switch between s/it and it/s depending on the speed. It's like quoting miles per gallon for vehicle fuel efficiency, but instead of saying "0.5 mpg" saying "2 gpm".)
-
New test, same command line:
Regarding the sub-quadratic optimization & token merging:
Using the same prompt/settings/sampler from above ("...Man with fiery sword..."), with Token Merge Ratio = 0.5 and Cross Attention Optimization set to sub-quadratic (both in Settings -> Optimizations within the UI). Results seem similar to using SDP Attention: about 1.2 seconds/iteration. Generating two 768x1024 images, 32 steps, DPM++ 2M Karras gives ~1.3 seconds/iteration, or a total time of 1 min 44 seconds. Pretty happy overall with performance and results so far! Can't wait for the community to start making LoRAs, models, ControlNet, etc.!
-
A few updates after having some time to play with DreamShaper XL and test a bit more: the Refiner model is tricky and doesn't seem to work quite right. I use SDP Attention in the menu and a token merge ratio of 0.5. I've only had success using the refiner with:
You can also add --precision full:
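As a sketch (only --precision full is named here; the surrounding flags are assumptions carried over from the earlier comments):

```bat
rem Example only - flags other than --precision full are assumptions.
set COMMANDLINE_ARGS=--upcast-sampling --no-half --medvram --precision full
```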
However, I notice that --precision full only seems to increase the GPU temp. Junction temps on the 7900 XTX easily pass 95 °C, and I never see temps like that in any other workload. Will need more testing to see if --precision full has any benefit. "Success" using the refiner is quite loosely defined here; I feel like it actually makes the image worse. Aside from that, everything seems to be running quite well!
-
Is anybody here running SD XL with the DirectML deployment of Automatic1111? I downloaded the base SD XL model, the Refiner model, and the SD XL Offset Example LoRA from Hugging Face and put them in the appropriate folders.
I ran a git pull of the WebUI folder and also upgraded the Python packages from requirements.txt (see below for the script).
Testing a few basic prompts such as:
"full body, anatomical photorealistic digital painting portrait of high elf wizard, fabric with intricate pattern, casting a spell of lightning, in a fantasy atmosphere, highest quality"
Negative:
"jpeg artifacts, low quality, lowres, doll, plastic, blur"
30 steps, DPM++ 2M Karras, 512x768.
Results are nothing particularly noteworthy thus far:
So I'm thinking perhaps more needs to be done to run SD XL correctly?
Using a 7900 XTX, I'm seeing very low utilization of GPU resources. The image above took 74 seconds to generate. Webui-user startup args:
PS: To upgrade the WebUI repo & the Python requirements.txt in the venv in one step, just edit the following and save it as a .bat (I keep mine in the same folder as webui-user.bat):
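The script didn't survive the copy here, but a minimal sketch of the idea (placeholder path, standard git and pip commands) would be:

```bat
@echo off
rem Placeholder path - change to your own webui folder.
cd /d C:\path\to\stable-diffusion-webui-directml
rem Step 1: pull the latest WebUI code.
git pull
rem Step 2: update the Python packages inside the venv.
call venv\Scripts\activate.bat
pip install --upgrade -r requirements.txt
pause
```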