
Implement saving FSDP with LoRA #295

Merged: 3 commits merged into instructlab:main on Nov 13, 2024

Conversation

@RobotSail (Member) commented Oct 23, 2024

Currently we cannot save LoRA models when training with FSDP. This PR addresses that limitation by instantiating a copy of the model on CPU, loading in the LoRA settings, loading the state dict after it has been gathered from the distributed model, and finally performing the same save we do elsewhere throughout the codebase.

Resolves #241
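For readers unfamiliar with the pattern, here is a minimal sketch of the gather-then-save approach described above. It is not the code merged in this PR; `build_cpu_lora_model` is a hypothetical helper, and the exact FSDP/PEFT calls shown are assumptions.

```python
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)


def save_fsdp_lora_model(fsdp_model, build_cpu_lora_model, output_dir):
    """Gather the sharded weights and save them through a CPU copy of the model."""
    # Pull the full (unsharded) state dict, materialized on rank 0 and offloaded to CPU.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = fsdp_model.state_dict()

    if dist.get_rank() == 0:
        # build_cpu_lora_model is a hypothetical callable that re-creates the
        # LoRA-wrapped model on CPU with the same adapter configuration.
        cpu_model = build_cpu_lora_model()
        cpu_model.load_state_dict(state_dict)   # load the gathered weights
        cpu_model.save_pretrained(output_dir)   # same save path used elsewhere
        del cpu_model                            # discard the copy; training continues
    dist.barrier()
```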


mergify bot commented Oct 25, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @RobotSail please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

```diff
 limit_all_gathers=True,
 mixed_precision_policy=MixedPrecision(
     param_dtype=torch.bfloat16,
     reduce_dtype=torch.bfloat16,
     buffer_dtype=torch.bfloat16,
 ),
-backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
+backward_prefetch=BackwardPrefetch.BACKWARD_POST,
```
Contributor

what is the impact of making this change for non-lora usage?

Member Author

This is a performance/memory tradeoff. We should make it configurable if possible, but I can limit this option to the case where LoRA is used.

Contributor

Backward prefetch vs. postfetch shouldn't impact the correctness of LoRA, but not using prefetch could hurt default training times. I think prefetch should be the default for non-LoRA cases.

Member Author

+1 James, I'll create a follow-up issue to have this as a configurable setting.
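For reference, a rough sketch of how such a setting could be threaded through. The `mixed_precision_policy` keyword in the diff above suggests these options are set on Accelerate's FSDP plugin, but that is an assumption, and `make_fsdp_plugin` / `lora_enabled` are placeholder names, not code from this PR.

```python
import torch
from accelerate import Accelerator
from accelerate.utils import FullyShardedDataParallelPlugin
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision


def make_fsdp_plugin(lora_enabled: bool) -> FullyShardedDataParallelPlugin:
    return FullyShardedDataParallelPlugin(
        limit_all_gathers=True,
        mixed_precision_policy=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        # BACKWARD_PRE issues the next all-gather before the current gradient
        # computation (more overlap, higher peak memory); BACKWARD_POST issues
        # it afterwards (less overlap, lower peak memory).
        backward_prefetch=(
            BackwardPrefetch.BACKWARD_POST
            if lora_enabled
            else BackwardPrefetch.BACKWARD_PRE
        ),
    )


accelerator = Accelerator(fsdp_plugin=make_fsdp_plugin(lora_enabled=True))
```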

```diff
@@ -2,7 +2,7 @@
 set -eux -o pipefail

 # ############### Read-only parameters ###############
-MODEL_NAME="instructlab/granite-7b-lab"
+MODEL_NAME="/home/ec2-user/.cache/huggingface/hub/models--instructlab--granite-7b-lab/snapshots/4fb6a018d68ab813b95c7f470e424a70f2f7e561"
```
Contributor

won't always be on ec2

Member Author

I removed it


mergify bot commented Nov 7, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @RobotSail please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 7, 2024
This commit implements saving LoRA models when training with FSDP as the distributed backend.
This is accomplished by creating a copy of the LoRA model on the CPU,
loading in the state dict after gathering it from the distributed model,
and saving after merging the adapters back into the original model.
Afterwards, the CPU copy is discarded and training continues.

Signed-off-by: Oleg S <[email protected]>
This commit adds a smoketest for testing LoRA + FSDP.

Signed-off-by: Oleg S <[email protected]>
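For context on the adapter-merge step mentioned in the first commit message above, a minimal sketch using PEFT's merge_and_unload; the function and argument names here are assumptions, not the PR's actual code.

```python
from peft import PeftModel  # the CPU copy is assumed to be a PEFT-wrapped LoRA model


def merge_and_save(cpu_lora_model: PeftModel, output_dir: str) -> None:
    # merge_and_unload folds the LoRA adapter weights back into the base model
    # and returns a plain model that can be saved through the usual save path.
    merged = cpu_lora_model.merge_and_unload()
    merged.save_pretrained(output_dir)
```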

@JamesKunstle (Contributor) left a comment


LGTM, thanks for adding the test!

Additionally, introduce a max_seq_len parameter to support testing on lower-end hardware.

Signed-off-by: Oleg S <[email protected]>
@mergify mergify bot removed the one-approval label Nov 13, 2024
@nathan-weinberg nathan-weinberg removed the request for review from aldopareja November 13, 2024 20:52
@mergify mergify bot merged commit 8a49747 into instructlab:main Nov 13, 2024
13 checks passed
@RobotSail (Member Author)

#345

Labels: CI/CD (Affects CI/CD configuration), testing (Relates to testing)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Get LoRA working with FSDP

4 participants