
add boft support in stable-diffusion #1295

Open: wants to merge 4 commits into main
Conversation

sywangyi
Collaborator

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@sywangyi requested a review from regisss as a code owner on August 28, 2024 08:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: Wang, Yi A <[email protected]>
@imangohari1
Contributor

Hi @sywangyi
Thanks for this PR.
Could you rebase this on OH main and make sure make style is applied?
Please share the results of the CI tests for test_diffusers.py with and without these changes.
Thanks.

@sywangyi
Collaborator Author

Same with the latest main; 1 case fails:

FAILED tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler - AssertionError: 0.21911774845123289 not less than 0.01
================================ 1 failed, 108 passed, 46 skipped, 274 warnings in 776.40s (0:12:56) =================================

Contributor

@imangohari1 left a comment


Hi @sywangyi
I spent some time on this PR and did some testing:

  • I've reworked the README file for a better read. Please apply the changes with the attached patch using git am < 000* (don't copy-paste the changes, apply the patch please).
    0001-fea-dreambooth-reworked-the-readme.patch

  • I've tested the PEFT example with both lora and boft. The lora example finishes in about 6 min (5m47.993s), but the boft one has been running for ~80 min and has only completed 24% (Steps: 24%|██▎ | 188/800 [1:21:18<3:53:56, 22.94s/it, loss=0.0225, lr=0.0001]). Any thoughts on why boft is so significantly slower than lora? Is this because of the lack of HPU graphs? Let's investigate this a bit more.

    • I've provided the tested commands below.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="out"

logfile=pr1295.$(date -u +%Y%m%d%H%M).$(hostname).log

time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --class_data_dir=$CLASS_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --num_class_images=200 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=800 \
  --mixed_precision=bf16 \
  --use_hpu_graphs_for_training \
  --use_hpu_graphs_for_inference \
  --gaudi_config_name Habana/stable-diffusion \
  lora --unet_r 8 --unet_alpha 8 2>&1 | tee $logfile

time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --class_data_dir=$CLASS_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --num_class_images=200 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=800 \
  --mixed_precision=bf16 \
  --gaudi_config_name Habana/stable-diffusion \
  boft 2>&1 | tee $logfile
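Editor's note: for reference, a minimal sketch of what the lora and boft choices above presumably correspond to on the PEFT side, i.e. applying a LoraConfig or BOFTConfig adapter to the Stable Diffusion UNet. This is an illustration only; the target modules and BOFT hyperparameters are assumptions, not values taken from train_dreambooth.py.

from diffusers import UNet2DConditionModel
from peft import BOFTConfig, LoraConfig, get_peft_model

# Load the UNet that DreamBooth fine-tunes.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

# LoRA: additive low-rank updates (r/alpha mirror --unet_r 8 --unet_alpha 8 above).
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections (illustrative)
)

# BOFT: butterfly orthogonal fine-tuning; its orthogonal factors are assembled from
# block-diagonal matrices, which is where the HPU slowdown discussed in this thread shows up.
boft_config = BOFTConfig(
    boft_block_size=4,            # illustrative block size
    boft_n_butterfly_factor=1,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

peft_unet = get_peft_model(unet, lora_config)  # or boft_config
peft_unet.print_trainable_parameters()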

@sywangyi
Collaborator Author

Any thoughts on why boft is so significantly slower than lora? Is this because of the lack of HPU graphs? Let's investigate this a bit more.

Yes. I have filed a bug with the PyTorch training team about the perf issue; will cc you in the Jira.

@imangohari1
Contributor

Overall LGTM, although there is a performance issue with boft. Thanks @sywangyi.
@regisss how do you suggest we proceed?

@libinta
Collaborator

libinta commented Sep 18, 2024

@sywangyi please test with the latest Synapse SW. If there is still an issue, we don't need to merge this change for the next Synapse release as it's not functional.

@sywangyi
Collaborator Author

sywangyi commented Sep 18, 2024

@sywangyi please test with the latest Synapse SW. If there is still an issue, we don't need to merge this change for the next Synapse release as it's not functional.

Which version do you mean? I think the Habana PyTorch training team is still working on it.

@imangohari1
Contributor

Which version do you mean? I think the Habana PyTorch training team is still working on it.

@sywangyi 1.18.0 release, build id 410.

@libinta
Collaborator

libinta commented Sep 24, 2024

@sywangyi do you have test results?

@kaixuanliu
Contributor

@sywangyi do you have test results?

Hi @libinta, we used a docker image with build 438 for the 1.18.0 release from @yeonsily and ran a test. The performance of boft is still significantly slower than lora: for single-card fine-tuning, lora takes ~4 min 30 s, while boft needs more than 2 hours.

@imangohari1
Contributor

Hi @libinta, we used a docker image with build 438 for the 1.18.0 release from @yeonsily and ran a test. The performance of boft is still significantly slower than lora: for single-card fine-tuning, lora takes ~4 min 30 s, while boft needs more than 2 hours.

I have tested this with driver 1.18.0-460 and the corresponding docker. It still shows the same behavior.
Our team is looking at rewriting the code that is causing recompilation.

@yao-matrix
Contributor

I have tested this with driver 1.18.0-460 and the corresponding docker. It still shows the same behavior. Our team is looking at rewriting the code that is causing recompilation.

@imangohari1, do we have an update on this?

@imangohari1
Contributor

@imangohari1, do we have an update on this?

I am not sure if the issue is resolved or not.
@sywangyi WDYT?

@sywangyi
Collaborator Author

According to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet.

@Luca-Calabria
Contributor

Luca-Calabria commented Dec 3, 2024

According to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet.

Just to update: an R&D engineer found the low-level issue that produces the slow compilation when the "torch.block_diag" operation runs. You can find all the details in the ticket.
The development time to support operations like "torch.block_diag" is estimated at around 2-3 weeks.
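Editor's note: for context on the operation named above, a minimal sketch (not the PEFT or Synapse code) of why "torch.block_diag" is awkward here and what a rewrite could look like. BOFT assembles block-diagonal orthogonal factors; torch.block_diag(*blocks) unrolls a Python list of per-block tensors, whereas a single strided write into a preallocated tensor produces the same matrix with a fixed number of ops. Whether this avoids the recompilation on HPU is an assumption, and the function names are hypothetical.

import torch


def block_diag_unrolled(blocks: torch.Tensor) -> torch.Tensor:
    # blocks: (num_blocks, b, b) -> (num_blocks*b, num_blocks*b)
    # The reported slow path: torch.block_diag over an unpacked Python list.
    return torch.block_diag(*blocks.unbind(0))


def block_diag_static(blocks: torch.Tensor) -> torch.Tensor:
    # Same result, built with one advanced-indexing write so the op count
    # does not depend on the number of blocks.
    n, b, _ = blocks.shape
    out = blocks.new_zeros(n * b, n * b)
    out_view = out.view(n, b, n, b)          # (row block, row, col block, col)
    idx = torch.arange(n, device=blocks.device)
    out_view[idx, :, idx, :] = blocks        # fill only the diagonal blocks
    return out


if __name__ == "__main__":
    blocks = torch.randn(4, 8, 8)
    assert torch.equal(block_diag_unrolled(blocks), block_diag_static(blocks))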

7 participants