Working on Windows + AMD #13
Replies: 15 comments 38 replies
-
"make sure you have the modified repositories in stable-diffusion-webui-directml/repositories/:" "Place any stable diffusion checkpoint (ckpt or safetensor) in the models/Stable-diffusion directory"
-
The error is at line 26 in modules\paths.py, and it shows a file path that isn't valid: all those double "\" should be single "\", and it should end with "\stable-diffusion-stability-ai" instead of "/stable-diffusion-stability-ai".
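For anyone comparing paths like this, here is a minimal Python sketch of how a mixed-separator Windows path can be normalized. The path is hypothetical (the real one from the error is not shown here), and `normalize_windows_path` is an illustrative helper, not something from the webui code. Note that Python error messages often print the repr of a string, where a single backslash is displayed as a doubled one, so doubled backslashes in a traceback may not mean the path itself is wrong.

```python
import ntpath  # Windows path rules; importable on any OS

# Hypothetical path for illustration only: backslashes mixed with one "/".
mixed = "C:\\sd-webui\\repositories/stable-diffusion-stability-ai"


def normalize_windows_path(path: str) -> str:
    """Return the path with every separator converted to a backslash."""
    return ntpath.normpath(path)


normalized = normalize_windows_path(mixed)

# repr() shows each backslash doubled, the way tracebacks often do.
print(repr(normalized))
# print() shows the actual characters: single backslashes only.
print(normalized)
```

If the printed (non-repr) path still looks wrong, the problem is more likely a missing folder than the separators.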
-
I have a 6600 XT, but it's trying to use the CPU. Any ideas how to fix that? venv "C:\stable-diffusion-webui-directml-master\venv\Scripts\Python.exe"
-
How do I solve this problem?
-
Any fix for "interrogate will be fallen back to cpu"?
-
The inpaint function isn't working; it just creates this blur.
-
Is there any tutorial for setting up this environment? When I start it, it shows this error: import torch; assert torch.cuda.is_available(), 'Torch is not able to use GPU; add --skip-torch-cuda-test to COMMANDLINE_ARGS variable to disable this check'") I have no clue how to set this up with an AMD GPU.
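The error message itself names the flag to add. As a rough sketch, assuming webui-user.bat contains a `set COMMANDLINE_ARGS=` line (the sample file contents below are an assumption, and `add_flag` is a hypothetical helper, not part of the webui), the change could be made like this in Python. Editing the line by hand in a text editor does the same thing.

```python
def add_flag(bat_text: str, flag: str) -> str:
    """Append `flag` to the `set COMMANDLINE_ARGS=` line unless it is already there."""
    out = []
    for line in bat_text.splitlines():
        if line.strip().lower().startswith("set commandline_args="):
            prefix, _, args = line.partition("=")
            if flag not in args.split():
                args = (args.strip() + " " + flag).strip()
            line = f"{prefix}={args}"
        out.append(line)
    return "\n".join(out)


# Assumed example contents of webui-user.bat, for illustration only.
example = "@echo off\nset PYTHON=\nset COMMANDLINE_ARGS=--medvram\ncall webui.bat"

patched = add_flag(example, "--skip-torch-cuda-test")
print(patched)
```

Going by the wording of the message, the flag only disables the check; it does not by itself make the GPU get used.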
-
Works fine, but I've been spoiled by the Colab cards, so I went back to using Colab. It's a slight inconvenience having to swap accounts, but better than waiting 5x longer for the same result. Wish AMD did something more for compute on consumer cards.
-
What do I need to do when I get this error message?
-
How do I solve this problem: Device type PRIVATEUSEONE is not supported for torch.Generator() api.
-
Hello, for me the WebUI has actually worked quite well over the last few weeks, except for the known bugs. Win11 Pro, up to date, 16 GB RAM, RX 5700 XT, latest driver.
set COMMANDLINE_ARGS= --medvram --precision full --no-half --no-half-vae --opt-split-attention --disable-nan-check
-
I've got my RX 590 8GB under Windows 10 doing batches of at least 10x 512x512 images, and 4x 768x768 (Euler a, 20 steps), with zero NaNs. I didn't try batch sizes higher than 10, and couldn't consistently do higher-res batches. I used the following settings (not sure that clip does anything tbh, but I did read something on HF about not processing it in VRAM because it was only used at the start, so I tried it and guessed at the module name):

set COMMANDLINE_ARGS=--medvram --use-cpu interrogate clip --opt-sub-quad-attention --sub-quad-q-chunk-size 256 --sub-quad-kv-chunk-size 256 --sub-quad-chunk-threshold 70 --disable-nan-check --no-hashing --skip-version-check --no-download-sd-model --skip-torch-cuda-test --no-half-vae

I have patched in the Negative Guidance minimum sigma (AUTOMATIC1111#9177) myself, though it was set at 0 when I tested. "Upcast cross attention layer to float32" and "Always discard next-to-last sigma" were both checked. I haven't done enough testing to know if those improve anything with this particular config.

I did dig up some documentation and notes on how opt-sub-quad-attention works, though. If you put only that main flag, according to the command line arguments file, it defaults to q-chunk-size 1024, kv-chunk-size none (0?), and a threshold of 0 (none). From my experimentation, those defaults don't seem that great, to me at least. I think the threshold should be about 70 (it's a percentage of your VRAM), possibly lower (60 works too), but not higher imo. The two chunk sizes will affect your speed/RAM usage.

Some notes I scrounged up from the code author in the PR, iirc:

I also had a "conversation" with ChatGPT after spitting the sub-quad code at it and doing a Q&A:

The code is implementing a sub-quadratic (sub-quad) attention mechanism in a neural network. The sub-quad attention is used to efficiently compute self-attention in a large sequence of tokens by dividing the sequence into smaller chunks and computing attention chunk by chunk, instead of materializing the attention scores for the entire sequence at once.

The sub_quad_attention_forward function takes as input a tensor x, which represents the input sequence of tokens, a tensor context, which represents the context for the attention computation, and a mask tensor to mask out certain positions in the input. It first applies a linear transformation to x to generate query tensor q, and linear transformations to context to generate key tensor k and value tensor v. It then divides the query, key, and value tensors into chunks along the sequence dimension, and applies the sub_quad_attention function to compute attention chunk by chunk. The result is a tensor that represents the output of the attention mechanism.

The sub_quad_attention function takes the query, key, and value tensors as input, along with parameters q_chunk_size, kv_chunk_size, kv_chunk_size_min, and chunk_threshold, which determine the chunk sizes and the threshold for when to use chunking. The function computes attention using efficient dot product attention, with the option to use checkpointing to save memory.

The chunk sizes q_chunk_size and kv_chunk_size determine the size of each chunk for the query and key/value tensors, respectively. The parameter kv_chunk_size_min determines the minimum size for the key/value chunks. The chunk_threshold parameter determines the threshold size in bytes for when to use chunking. If the query, key, and value tensors can fit within the memory threshold, then no chunking is used. Otherwise, the tensors are divided into chunks of size no larger than the specified chunk sizes, and attention is computed chunk by chunk.

To determine the optimal values for q_chunk_size, kv_chunk_size, and chunk_threshold, you can experiment with different values and measure the performance of the model. You can also use the get_available_vram function to get the amount of available GPU memory and use this to estimate the optimal chunk sizes and chunk threshold.

The sub_quad_attention_forward function appears to be using the sub_quad_attention function for performing sub-quadratic attention. The sub_quad_attention function takes in query, key, and value tensors q, k, and v, along with some optional parameters. The values of the optional parameters q_chunk_size, kv_chunk_size, and chunk_threshold control how the tensors are chunked to fit within available memory during the attention calculation.

q_chunk_size controls the size of each chunk along the sequence length dimension of the query tensor q. kv_chunk_size controls the size of each chunk along the sequence length dimension of the key and value tensors k and v. If kv_chunk_size is not provided, then kv_chunk_size is set to None, and the key and value tensors are chunked to have the same number of tokens as the query tensor chunks. chunk_threshold controls the maximum amount of GPU memory usage, in bytes, for a single attention calculation. If chunk_threshold is not provided, then it is set to int(get_available_vram()*0.7), which uses 70% of the available VRAM.

Given that you have 8GB of VRAM, you can try different values of q_chunk_size, kv_chunk_size, and chunk_threshold to optimize performance. A higher q_chunk_size value will reduce the number of chunks along the query tensor sequence length dimension, but will increase the memory usage of each chunk. A higher kv_chunk_size value will reduce the number of chunks along the key and value tensor sequence length dimension, but will increase the memory usage of each chunk. A higher chunk_threshold value will allow for larger chunks, but will increase the risk of running out of memory.

To get started, you could try setting q_chunk_size and kv_chunk_size to 1024, and chunk_threshold to None or a value between 2GB and 4GB, depending on the size of your input tensors. You can then experiment with adjusting these values to find the optimal settings for your specific use case.

The downside of setting low values for q_chunk_size and kv_chunk_size is that it may result in slower computation due to the increased overhead of splitting and concatenating the tensors. When processing a batch of data, the inputs are split into chunks and processed independently, and then the results are concatenated back together. If the chunk sizes are too small, there will be more splitting and concatenating operations, which could add overhead and slow down the computation. On the other hand, if the chunk sizes are too large, it could lead to out-of-memory errors, especially if the batch size or the input tensor size is large. Therefore, it is important to find the optimal chunk size that balances the memory usage and the computation speed.

Setting threshold to None means that the full sequence length will always be used for the attention operation, regardless of the input sequence length. This can be useful when processing sequences of varying lengths, as it ensures that all tokens are attended to, even if some of them are padded. However, it can also lead to slower performance and increased memory usage, particularly for very long sequences.

This is just my base config for now. However, I'm betting you could optimize a particular startup batch file with different settings based on what you wanted to do, such as smaller 512x512 images in larger batches, or fewer 768x768 or 1024x1024 images.

Oh, I almost forgot. I have force-activated (legacy) AMD SAM (Smart Access Memory) under Adrenalin/Performance/Tuning. I had to apply a registry file to do so on my card because of its age. Not sure if that's contributing to anything, as I have not turned it back off again. However, I will say that it straight up caused my system to bluescreen several times after I first activated it with the at-the-time newest (23.3.2 iirc) drivers; I had to roll back to 23.3.1. Haven't tried 23.3.4 yet.

I'm no expert, but I hope this helps, and hopefully it'll work for you guys too.
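To make the chunking idea above more concrete, here is a minimal pure-Python sketch for a single query vector. It is a simplified illustration of memory-efficient attention in general, not the webui's sub_quad_attention code; the names `full_attention` and `chunked_attention` and all the sample data are made up for the example. The keys and values are visited `kv_chunk_size` at a time with a running softmax, and the result matches the unchunked computation up to floating-point rounding.

```python
import math


def full_attention(q, k, v):
    """Reference: softmax over all q.k scores, then a weighted sum of the values."""
    scores = [sum(a * b for a, b in zip(q, kj)) for kj in k]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))]


def chunked_attention(q, k, v, kv_chunk_size):
    """Same attention, but only kv_chunk_size scores exist at any one time."""
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * len(v[0])   # softmax numerator, relative to running_max
    for start in range(0, len(k), kv_chunk_size):
        k_chunk = k[start:start + kv_chunk_size]
        v_chunk = v[start:start + kv_chunk_size]
        scores = [sum(a * b for a, b in zip(q, kj)) for kj in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, vj in zip(scores, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, vj)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Small deterministic sample data: 10 tokens, key dim 4, value dim 3.
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(3)] for i in range(10)]
query = [0.5, -1.0, 2.0, 0.25]

reference = full_attention(query, keys, values)
chunked = chunked_attention(query, keys, values, kv_chunk_size=3)
print(max(abs(a - b) for a, b in zip(reference, chunked)))
```

A smaller `kv_chunk_size` means fewer scores held in memory at once but more loop iterations, which is the same speed/memory trade-off the chunk-size flags expose.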
-
I installed everything according to the instructions, disabled the CUDA check, but the calculations are done using the CPU, not the GPU (RX 578). Maybe I did something wrong?
-
What are your command-line parameters in webui-user.bat?
On Tue, Jan 9, 2024 at 21:45, aJIekc3D wrote:
… I installed everything according to the instructions, disabled the CUDA check, but the calculations are done using the CPU, not the GPU (RX 578). Maybe I did something wrong?
-
Hey, thanks for this awesome web UI. Got this thing to work with AMD (tested so far on txt2img & img2img).
Thank you very much! 👍