
[bugfix] Fix bug in LoRA checkpoint saving step #10252

Closed
wants to merge 14 commits

Conversation

@leisuzz (Contributor) commented Dec 17, 2024

What does this PR do?

Fix two bugs in the checkpoint-saving step:

  1. `UnboundLocalError` (free variable referenced before assignment) in the original code when only the `elif isinstance(model, type(unwrap_model(text_encoder)))` branch is taken
  2. `IndexError: pop from empty list` when popping from the `weights` list
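
The two failures can be sketched with a minimal save-hook. This is an illustration, not the actual script code: `Transformer`, `TextEncoder`, and `extract_lora` are placeholder names standing in for the real diffusers/PEFT objects (e.g. `get_peft_model_state_dict`).

```python
class Transformer:
    pass

class TextEncoder:
    pass

def extract_lora(model):
    # Stand-in for get_peft_model_state_dict(model).
    return {"layers": type(model).__name__}

def save_model_hook(models, weights, output_dir):
    # Fix 1: bind both names up front so they exist even when only the
    # text-encoder branch runs (avoids the unbound-variable error).
    transformer_lora_layers = None
    text_encoder_lora_layers = None

    for model in models:
        if isinstance(model, Transformer):
            transformer_lora_layers = extract_lora(model)
        elif isinstance(model, TextEncoder):
            text_encoder_lora_layers = extract_lora(model)
        # Fix 2: the `weights` list can arrive empty (as reported under
        # DeepSpeed), so an unconditional weights.pop() raises
        # "IndexError: pop from empty list"; guard it.
        if weights:
            weights.pop()

    return transformer_lora_layers, text_encoder_lora_layers

# Reproduce the reported scenario: only a text encoder, empty weights list.
result = save_model_hook([TextEncoder()], [], "checkpoint-100")
```

Initializing both variables up front and guarding the `pop` covers both the case where `models` contains a single component and the case where the hook receives an empty `weights` list.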

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@leisuzz (Contributor, Author) commented Dec 17, 2024

@sayakpaul Please take a look at this PR; I was having trouble with DreamBooth LoRA checkpoint saving. After these changes, the issue is resolved.

@sayakpaul (Member) commented

Just so I understand, this is applicable only when we're using DeepSpeed?

@leisuzz (Contributor, Author) commented Dec 17, 2024

@sayakpaul I'm using DeepSpeed to train the model; I'm not sure whether the issue occurs without it.

@sayakpaul (Member) commented

Same comments from your other PR that you closed apply here. We're missing:

  • Incorporating LoRA/model loading support when using DeepSpeed.
  • Other necessary checks when using DeepSpeed.

As mentioned in the other PR of yours, if you go here and search for DistributedType.DEEPSPEED, the outline of the changes should become evident and they will help you understand what we're missing in this PR. Without those changes, it'd be hard to merge this PR.
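
For context, the guard pattern being referenced can be sketched as follows. This is a simplified illustration, with stand-ins for `accelerate.utils.DistributedType` and the real model class, not the actual training-script code:

```python
from enum import Enum

class DistributedType(Enum):  # stand-in for accelerate.utils.DistributedType
    NO = "NO"
    DEEPSPEED = "DEEPSPEED"

class Transformer:
    @classmethod
    def from_config(cls):
        # Stand-in for re-instantiating the model from its saved config
        # (the real code would use from_pretrained / from_config).
        return cls()

def load_model_hook(models, input_dir, distributed_type):
    transformer_ = None
    if distributed_type != DistributedType.DEEPSPEED:
        # Outside DeepSpeed, Accelerate hands the hook the live models to pop.
        while len(models) > 0:
            model = models.pop()
            if isinstance(model, Transformer):
                transformer_ = model
    else:
        # Under DeepSpeed, `models` arrives empty, so the model must be
        # re-instantiated before the LoRA state dict is loaded into it.
        transformer_ = Transformer.from_config()
    return transformer_
```

Branching on the distributed type in both the save and load hooks is the outline of the missing changes described above.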

@leisuzz (Contributor, Author) commented Dec 17, 2024

@sayakpaul I don't have any issue with loading under DeepSpeed; the issue only occurs in the saving step.

@sayakpaul (Member) commented

We have extensively tested the changes I mentioned in my comment. Hence, I kindly ask you to propagate them.

@leisuzz (Contributor, Author) commented Dec 17, 2024

Thanks, let me think about how to modify them.

@leisuzz (Contributor, Author) commented Dec 19, 2024

@sayakpaul I've updated the saving logic to the correct version; please take a look. This should work properly for the loading function as well.

@sayakpaul (Member) left a review comment

Let's keep the changes limited to the Flux script, perhaps?

@leisuzz (Contributor, Author) commented Dec 19, 2024

@sayakpaul The issue actually happens in both the SD3 and Flux scripts. I've also fixed the distributed-training issue in the `checkpoints_total_limit is not None` branch.
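
A hedged sketch of the kind of fix being described for `checkpoints_total_limit`: in a multi-process run, only one process should prune old `checkpoint-<step>` directories, or the ranks race to delete the same folders. `rotate_checkpoints` and the `is_main_process` parameter are illustrative names (mirroring Accelerate's main-process flag), not the script's actual ones.

```python
import os
import shutil

def rotate_checkpoints(output_dir, checkpoints_total_limit, is_main_process):
    # Only the main process should prune; other ranks must skip entirely.
    if checkpoints_total_limit is None or not is_main_process:
        return
    checkpoints = [d for d in os.listdir(output_dir) if d.startswith("checkpoint")]
    checkpoints = sorted(checkpoints, key=lambda x: int(x.split("-")[1]))
    # Remove enough old checkpoints to leave room for the one about to be saved.
    if len(checkpoints) >= checkpoints_total_limit:
        num_to_remove = len(checkpoints) - checkpoints_total_limit + 1
        for folder in checkpoints[:num_to_remove]:
            shutil.rmtree(os.path.join(output_dir, folder))
```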

@leisuzz (Contributor, Author) commented Dec 23, 2024

@sayakpaul I've fixed both the loading and saving steps; please take a look.

@sayakpaul (Member) commented

We'd prefer to tackle this one script at a time, so please keep the changes to a single script only.

@leisuzz (Contributor, Author) commented Dec 23, 2024

@sayakpaul I've edited it so the changes are now limited to train_dreambooth_lora_sd3.py (the SD3 and Flux scripts are similar, and I tested both); please take a look. Let me know if it's okay for me to edit the Flux script as well.

@sayakpaul (Member) left a review comment

Just a few more changes.

@leisuzz (Contributor, Author) commented Dec 23, 2024

@sayakpaul Please take a look at this newest version, thanks

@leisuzz leisuzz requested a review from sayakpaul December 24, 2024 07:33
@sayakpaul (Member) commented

@leisuzz were you not using a non-LoRA version of the script? I think I got confused because of the repeated reviews.

Maybe we could close this PR and I could instead open a PR with you as a co-author?

@leisuzz (Contributor, Author) commented Dec 25, 2024

@sayakpaul Sorry about this; I will close this PR, then.

@leisuzz leisuzz closed this Dec 25, 2024
@sayakpaul (Member) commented

@leisuzz thanks for your understanding. Could you please provide your GitHub commit email address so that I can add you as a co-author?

@leisuzz (Contributor, Author) commented Dec 25, 2024

@sayakpaul My email address is: [email protected]. Thanks for your help!
