Enhance personalizing stable diffusion for human via momentum textual inversion

Summary

Fine-tune identifier embedding vector for human images based on Google's Dreambooth with Stable Diffusion! Momentum Textual Inversion is the enhanced optimizing methodology-in terms of fidelity and similarity compared to original personal data-originated from Textual Inversion in human-image domain. In addition, I added some features including prompt engineering and facial loss to get high-fidelity generation. The follow overview is how I tackled enhancing fine-tuning embedding vector for human-images during internship on Generation Team at NCSOFT Vision AI Lab. This code repository is based on that of Textual Inversion.

Environment

Our code builds on, and shares requirements with Latent Diffusion Models (LDM). To set up their environment, please run:

conda env create -f environment.yaml
conda activate ldm

or build docker image with my Dockerfile:

docker build -t <username>/<imagename> .
docker run -d -it -P --name <image_name> <username>/<imagename>

Train

python main.py  
    --base configs/stable-diffusion/v1-finetune.yaml 
    -t --actual_resume ./models/ldm/dreambooth/yerinbaek_gs_3000.ckpt 
    -n yerinbaek 
    --gpus 0, 
    --data_root ./inputs/female_2/ 
    --gamma 0.1 
    --descriptive_p 0.1 
    --use_random_prompt 
    --class_word woman

Sampling

python scripts/stable_txt2img.py 
    --ddim_eta 0.0 
    --n_samples 8 
    --n_iter 2 
    --scale 7 
    --ddim_steps 100 
    --ckpt ./models/ldm/dreambooth/yerinbaek_gs_3000.ckpt
    --embedding_path ./<embedding_path>/embeddings_ep-2.pt 
    --prompt "photo of a * woman" 
    --outdir ./outputs/yerinbaek/ep2/random_facial

Momentum Textual Inversion

Background
I observed that using Textual Inversion optimization on Dreambooth model did not improve the similarity. Training embedding vector with TI converges but is farfrom original ID. What I expected using TI on dreambooth is described at the left side picture below, while how TI exactly worked is described at right side picture below. It converges somewhere but not the global one.

Principal
Hence I needed to apply new method based on momentum to optimize embedding vector. It helps embedding vector not jump far away from where it begins.
Textual Inversion optimizes embedding vector itself, whereas momentum Textual Inversion optmizes training vector that updates with ratio from the previous embedding vector. Follow picture shows comparison between TI and momentum TI.

Results
As a result, there was certain enhancement on similarity and fidelity when using additional momentum TI optimization. Using random prompt template provided from Textual Inversion slightly improves similarity.

##Prompt Engineering Regularization
I use "a photo of *" instead of "a photo of * woman" as training prompt. It brings regularization effect via omitting class word(woman) in training prompt.
As you can see, including class word in prompt works pretty as good as no class word prompt on standard inference prompt("a photo of a * woman"). However, class word prompt training brings worsen results when it comes to different inference prompt, in this case "a photo of a full body of * woman".

Descriptive words
I observed that concatenating descriptive words at inference prompt brings biased results(e.g. blue eyes, mixed-race) as below. These descriptive words for human were manually extracted from PromptHero. However, when concatenating descriptive words at training prompt, generated samples get more realistic and more natural. Each word is selected with probability from descriptive words list then concatenated to the traning prompt. The selecting probability is a hyperparameter which I found 0.1 as an approximately optimal value. For p=0.1, the sampled images are almost same but face feature and pose becomes more natural.

##Facial Loss I add facial loss term to optimize. The facial loss is same form of the original loss but only different thing is training image. It is crop-image of original training image. I used pretrained human face detection model from dlib library. As a result, it improves both face similarity and fidelity on some samples. Enough high quality samples maintain their qualities despite change of loss term.

##Results After learning new concept human on stable diffusion by dreambooth and momentum TI, we can generate creative image that we want to get from the target human. As you can see, the generated images were improved than vanilla dreambooth's results. Some prompts(e.g. a photo of a purple hair * woman) do not generate image with the new concept properly when using vanilla dreambooth, however, momentum TI makes new concept compatible with various prompt as above.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
configs		configs
debug		debug
evaluation		evaluation
inputs		inputs
latent_diffusion.egg-info		latent_diffusion.egg-info
ldm		ldm
models/ldm/dreambooth		models/ldm/dreambooth
output		output
pretrained_models		pretrained_models
scripts		scripts
taming/modules/autoencoder/lpips		taming/modules/autoencoder/lpips
util		util
.DS_Store		.DS_Store
Dockerfile		Dockerfile
README.md		README.md
environment.yaml		environment.yaml
main.py		main.py
merge_embeddings.py		merge_embeddings.py
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Enhance personalizing stable diffusion for human via momentum textual inversion

Summary

Environment

Train

Sampling

Momentum Textual Inversion

About

Releases

Packages

Languages

yeonsumia/momentum-textual-inversion

Folders and files

Latest commit

History

Repository files navigation

Enhance personalizing stable diffusion for human via momentum textual inversion

Summary

Environment

Train

Sampling

Momentum Textual Inversion

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages