I fine-tuned Stable Diffusion on YouTube thumbnails paired with their video titles. I was hoping to create a tool that helps YouTubers brainstorm with diffusion models to create eye-catching thumbnails. The results for kids' channels and cartoonish thumbnails were particularly cool. However, I found that Stable Diffusion had a hard time generating photorealistic thumbnails.
Here are some of the thumbnails generated with the fine-tuned model. In every figure, the first row shows the images generated by the original Stable Diffusion (not fine-tuned) for the video title. The remaining rows are outputs of the model fine-tuned on that channel's data:
Title: animal figurines from fruits:
Title: baby in a space suit:
Title: baby is in a spaceship:
Title: blowing candles on birthday cake:
Title: playing with balloons:
Title: playing with other kids:
Title: running in a green garden:
Title: sad cocomelon baby:
Title: swimming in the pool:
Title: Birthday at the farm song with cocomelon:
Another interesting observation: while Stable Diffusion is being fine-tuned, the loss usually does not decrease quickly, and when it does decrease, the model may be overfitting to the data. Here is a figure that shows how the generated images change over the course of fine-tuning:
Title: Birthday at the farm song with cocomelon:
The first row is without fine-tuning, and the following rows show the outputs as fine-tuning progresses. As seen above, the images in row 3 and after are identical to the training images for the exact same title, which is consistent with the model overfitting to the data.
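I only checked for memorization by eye, but the same comparison could be automated. Below is a minimal sketch, not something used in this project: it computes a simple average hash for each image and flags a generated image whose hash is within a few bits of a training image. The function names (`average_hash`, `nearest_training_match`, `looks_memorized`), the 8x8 hash size, and the `max_bits=5` threshold are all illustrative assumptions, and images are represented as plain grayscale pixel grids so the example has no dependencies.

```python
from typing import List, Sequence, Tuple

# Assumption: an image is a grayscale pixel grid (rows of intensities).
Image = Sequence[Sequence[float]]


def average_hash(img: Image, size: int = 8) -> List[int]:
    """Block-average the image down to size x size cells, then threshold at the mean."""
    h, w = len(img), len(img[0])
    cells = []
    for i in range(size):
        r0 = i * h // size
        r1 = max((i + 1) * h // size, r0 + 1)  # at least one row per cell
        for j in range(size):
            c0 = j * w // size
            c1 = max((j + 1) * w // size, c0 + 1)  # at least one column per cell
            block = [img[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if v > mean else 0 for v in cells]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_training_match(generated: Image, training: Sequence[Image]) -> Tuple[int, int]:
    """Return (index, hash distance) of the training image closest to `generated`."""
    g = average_hash(generated)
    dists = [hamming(g, average_hash(t)) for t in training]
    best = min(range(len(dists)), key=dists.__getitem__)
    return best, dists[best]


def looks_memorized(generated: Image, training: Sequence[Image], max_bits: int = 5) -> bool:
    """Flag a generated image as a likely near-copy of some training image.

    max_bits is a hypothetical threshold; it would need tuning on real data.
    """
    _, dist = nearest_training_match(generated, training)
    return dist <= max_bits
```

In practice the pixel grids would come from loading each thumbnail with an image library and converting it to grayscale. Running `looks_memorized` on samples from each saved checkpoint would give a rough signal for when to stop fine-tuning, although an average hash only catches near-exact copies and would miss crops or recolored versions.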