add boft support in stable-diffusion #1295
base: main
Conversation
Signed-off-by: Wang, Yi A <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Signed-off-by: Wang, Yi A <[email protected]>
Hi @sywangyi
Same with the latest main: one test case fails:
FAILED tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler - AssertionError: 0.21911774845123289 not less than 0.01
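As a side note, here is a minimal sketch of re-running only that failing test through pytest's Python entry point. The test path is copied from the log above; it assumes the repository's test dependencies are installed and a Gaudi device is available.

# Minimal sketch: select and run only the failing test via pytest.main().
import pytest

pytest.main(
    [
        "tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler",
        "-v",
    ]
)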
Hi @sywangyi
I spent some time on this PR and did some testing:
- I've reworked the README file for a better read. Please apply the changes with the attached patch using git am < 000* (don't copy-paste the changes, apply the patch please): 0001-fea-dreambooth-reworked-the-readme.patch
- I've tested the PEFT example with both lora and boft. The lora example finishes in about 6 min (5m47.993s), but the boft one has been running for ~80 min and has only completed 24% (Steps: 24%|██▎ | 188/800 [1:21:18<3:53:56, 22.94s/it, loss=0.0225, lr=0.0001]). Any thoughts on why boft is so significantly slower than lora? Is this because of the lack of HPU graphs? Let's investigate this a bit more.
- I've provided the tested cmd below.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="out"
logfile=pr1295.$(date -u +%Y%m%d%H%M).$(hostname).log
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR --with_prior_preservation --prior_loss_weight=1.0 --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" --resolution=512 --train_batch_size=1 --num_class_images=200 --gradient_accumulation_steps=1 --learning_rate=1e-4 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 --mixed_precision=bf16 --use_hpu_graphs_for_training --use_hpu_graphs_for_inference --gaudi_config_name Habana/stable-diffusion lora --unet_r 8 --unet_alpha 8 2>&1 | tee $logfile
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR --with_prior_preservation --prior_loss_weight=1.0 --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" --resolution=512 --train_batch_size=1 --num_class_images=200 --gradient_accumulation_steps=1 --learning_rate=1e-4 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 --mixed_precision=bf16 --gaudi_config_name Habana/stable-diffusion boft 2>&1 | tee $logfile
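To make the comparison above concrete, here is a hedged sketch of the two adapter types being benchmarked, built directly with the peft library. The target_modules list and the BOFT block settings are illustrative assumptions, not necessarily what train_dreambooth.py passes; the script selects them from its lora/boft sub-commands and flags.

# Hedged sketch (not the script's actual code): the two PEFT adapter configs compared above.
# The target_modules and BOFT block settings below are assumptions for illustration only.
from peft import BOFTConfig, LoraConfig

# Roughly what "lora --unet_r 8 --unet_alpha 8" selects: a rank-8 low-rank update per layer.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed UNet attention projections
)

# Roughly what "boft" selects: butterfly-factorized orthogonal updates. Each adapted layer
# multiplies its weight by a product of sparse block-diagonal orthogonal factors, which is
# where torch.block_diag enters the training graph (see the discussion further down).
boft_config = BOFTConfig(
    boft_block_size=4,          # assumed block size
    boft_n_butterfly_factor=2,  # assumed number of butterfly factors
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)

If that reading is right, the slowdown is less about parameter count and more about how those block-diagonal factors are assembled at every step, which matches the compilation issue discussed below.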
Yes. I have filed a bug with the PyTorch training team about the perf issue and will cc you in the Jira ticket.
@sywangyi please test with the latest Synapse SW. If there is still an issue, we don't need to merge this change for the next Synapse release, as it's not functional.
Which version do you mean? I think the Habana PyTorch training team is still working on it.
@sywangyi do you have test results?
I have tested this with driver 1.18.0-460 and the corresponding Docker image. It still shows the same behavior.
@imangohari1, do we have an update on this?
I am not sure if the issue is resolved or not.
According to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet.
Just to update: the R&D team found the low-level issue that produces the slow compilation when the torch.block_diag operation runs. You can find all the details in the ticket.
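For context on that root cause, here is a minimal, self-contained illustration of torch.block_diag, the operation identified in the ticket. The shapes are made up for illustration; as discussed above, BOFT assembles its orthogonal factors from small blocks like this during training.

# Minimal illustration of the operation implicated above; shapes are illustrative only.
import torch

blocks = [torch.eye(4) for _ in range(8)]  # eight small 4x4 orthogonal blocks
factor = torch.block_diag(*blocks)         # one 32x32 block-diagonal butterfly-style factor
print(factor.shape)                        # torch.Size([32, 32])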