Training results on A100 significantly worse than T4? #172
-
If it's weaker, it might be because of the number of steps. If you have fewer images you often need to increase the repeats value; I'd say the number of images times the repeats should land somewhere between 100 and 400. It's also important to consider the dim/alpha and the learning rate. The default values of the XL trainer should work well with the numbers I just mentioned. If you increase the dim/alpha you need to lower the learning rate, and if you lower the learning rate you need to train for longer. Other than that, the captioning is important, even more so in XL than in 1.5. It can affect the way you need to prompt while using your lora, or whether it learns properly at all.
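A quick arithmetic sketch of that rule of thumb (hypothetical numbers, not taken from this thread):

```python
# Rough sanity check of dataset size vs. repeats (hypothetical values).
num_images = 40
repeats = 5
effective_size = num_images * repeats      # 200 -> inside the suggested 100-400 range

# Steps per epoch scale with effective_size / batch_size, which is why
# lowering the learning rate usually also means training for more epochs.
batch_size = 2
epochs = 10
steps_per_epoch = effective_size // batch_size
total_steps = steps_per_epoch * epochs
print(effective_size, steps_per_epoch, total_steps)   # 200 100 1000
```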
-
Hmmm, thanks for the swift reply, and sorry, I do tend to ramble a bit. I'll try to structure my issue a little better: I made one dataset for an SDXL/Pony lora, 343 images, captioned with WD14 and manually weeded.
I've also (painfully) run the training locally on my 3060; after many hours I got an epoch 10 lora that was comparable to the one trained on the T4. Could using diffusers vs checkpoints make such a big difference?
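For anyone unsure what the diffusers-vs-checkpoint toggle actually refers to, here is a minimal sketch of the two loading paths (assuming a diffusers-based trainer; the repo ID and the local path are only examples, not what the colab uses internally):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Diffusers-format model: a repo/folder of separate component weights.
pipe_a = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Single-file .safetensors checkpoint (hypothetical local path).
pipe_b = StableDiffusionXLPipeline.from_single_file(
    "/content/pony_diffusion_v6.safetensors", torch_dtype=torch.float16
)
```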
-
Unfortunately I've got to agree. I've used it for a few months now on an A100, but a couple of days ago I retrained a lora that worked perfectly before. I only changed a couple of words in the captions and it came out horrible, not just a bit off but completely ruined: instead of photo-looking pictures I got neon drawings on black. Then I tried to train a completely new lora with another dataset and all I got was completely black images on generation. I then retrained the lora on the T4 without changing anything except turning on diffusers and switching from bf16 to fp16, and it came out fine, fully working. This all happened right after some colab update last week.
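If anyone else hits the black-image issue, a quick check like this (just a sketch, not part of the colab) at least confirms what GPU you were assigned and whether it reports bf16 support before you spend credits on a run:

```python
# Sketch: confirm the assigned GPU and whether it reports bf16 support
# before picking a mixed-precision setting for the run.
import torch

print(torch.cuda.get_device_name(0))
print("bf16 supported:", torch.cuda.is_bf16_supported())
# If this prints False, or training still produces NaNs / black images,
# fp16 is the usual fallback, as described above.
```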
-
It's been a while since I had time to prepare a dataset, but now I've trained a couple more loras. Here are my observations that work with the A100 (and L4): And this one I found in my old LoRA training journals from 1.5; it seems to be true for Pony as well:
-
So I love this colab, it's very easy to use. I tried it out on the free T4 allotment I got from Google, and the standard Style LoCon settings produced amazing results without clogging up my local GPU. I didn't quite get to epoch 10, since I ran out of compute time after epoch 8 (to be fair, it's a dataset with almost 400 images), but I was more than happy with the results and decided to shell out for some time on the A100, just to get the good stuff quicker.
However: I trained a LoCon on a different dataset with nearly the same settings (aside from not using diffusers and a batch size of 16 instead of 1), and 10 epochs produced a terribly weak, low-quality network. So I retrained the one I never got to finish: 10 epochs, 1 repeat, Prodigy optimizer, sdpa attention, etc., the same settings as before except for diffusers and batch size, and the resulting epoch 10 LoCon was weaker than the epoch 5 LoCon trained on the T4.
At this point I just feel like I'm wasting compute credits trying to figure out what's going wrong. Am I missing something? Should I always train on diffusers instead of the safetensors checkpoint? Or should I avoid large batch sizes? Any hints?
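One thing worth checking (just a back-of-the-envelope sketch using the ~400-image, 1-repeat, 10-epoch numbers above, not anything confirmed from the trainer itself): a batch size of 16 cuts the number of optimizer steps per epoch by 16x, so the same epoch count means far fewer weight updates unless the learning rate or epoch count is scaled to compensate.

```python
# Optimizer steps for the same dataset and epoch count at two batch sizes
# (hypothetical illustration; image count and repeats taken from the post above).
images, repeats, epochs = 400, 1, 10

steps_bs1 = (images * repeats // 1) * epochs    # ~4000 steps at batch size 1
steps_bs16 = (images * repeats // 16) * epochs  # ~250 steps at batch size 16
print(steps_bs1, steps_bs16)
```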