Implement saving FSDP with LoRA #295
Conversation
102e94c to 340326f
This pull request has merge conflicts that must be resolved before it can be merged.
      limit_all_gathers=True,
      mixed_precision_policy=MixedPrecision(
          param_dtype=torch.bfloat16,
          reduce_dtype=torch.bfloat16,
          buffer_dtype=torch.bfloat16,
      ),
-     backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
+     backward_prefetch=BackwardPrefetch.BACKWARD_POST,
What is the impact of making this change for non-LoRA usage?
This is a performance/memory tradeoff. We should have it be configurable if possible, but I can limit it to only be this option when LoRA is used.
Backward prefetch vs. postfetch shouldn't impact the correctness of LoRA, but not using prefetch could hurt default training times. I think prefetch should be the default for non-LoRA cases.
+1 James, I'll create a follow-up issue to have this as a configurable setting.
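For illustration, here is a minimal sketch of how the prefetch mode could be made conditional on LoRA, assuming the accelerate `FullyShardedDataParallelPlugin` that the diff above appears to configure; the `lora_enabled` flag and the `make_fsdp_plugin` helper are hypothetical names, not this PR's actual code.

```python
# Hedged sketch: pick the backward prefetch mode based on whether LoRA is enabled.
# Assumes accelerate's FullyShardedDataParallelPlugin; `lora_enabled` and
# `make_fsdp_plugin` are illustrative, not part of this PR.
import torch
from accelerate.utils import FullyShardedDataParallelPlugin
from torch.distributed.fsdp import BackwardPrefetch, MixedPrecision

def make_fsdp_plugin(lora_enabled: bool) -> FullyShardedDataParallelPlugin:
    # BACKWARD_PRE overlaps the next parameter all-gather with the current
    # gradient computation (faster, higher peak memory); BACKWARD_POST issues
    # it afterwards (slower, lower peak memory), which helps LoRA runs fit on
    # smaller GPUs.
    prefetch = (
        BackwardPrefetch.BACKWARD_POST
        if lora_enabled
        else BackwardPrefetch.BACKWARD_PRE
    )
    return FullyShardedDataParallelPlugin(
        limit_all_gathers=True,
        mixed_precision_policy=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        backward_prefetch=prefetch,
    )
```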
tests/smoketest.sh
Outdated
@@ -2,7 +2,7 @@
 set -eux -o pipefail

 # ############### Read-only parameters ###############
-MODEL_NAME="instructlab/granite-7b-lab"
+MODEL_NAME="/home/ec2-user/.cache/huggingface/hub/models--instructlab--granite-7b-lab/snapshots/4fb6a018d68ab813b95c7f470e424a70f2f7e561"
This won't always be on EC2.
I removed it
This commit adds support for saving LoRA models when training with FSDP as the distributed backend. This is accomplished by creating a copy of the LoRA model on the CPU, loading in the state dict after gathering it from the distributed model, and saving after merging the adapters back into the original model. Afterwards, the CPU copy is discarded and training continues. Signed-off-by: Oleg S <[email protected]>
This commit adds a smoketest for testing LoRA + FSDP. Signed-off-by: Oleg S <[email protected]>
LGTM, thanks for adding the test!
Additionally, introduce a max_seq_len parameter to support testing on lower-end hardware. Signed-off-by: Oleg S <[email protected]>
Currently we cannot save LoRA models with FSDP. This PR addresses the limitation by instantiating a copy of the model on CPU, loading in the LoRA settings, loading the state dict after it has been gathered, and finally performing the same save as we do elsewhere throughout the codebase.
Resolves #241
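For readers unfamiliar with the flow described above, a minimal sketch of gathering an FSDP-sharded LoRA model onto a CPU copy, merging the adapters, and saving could look like the following. It assumes a PEFT-wrapped model and the torch FSDP full-state-dict API; the function name, its arguments, and `cpu_template` are illustrative assumptions, not this PR's actual implementation.

```python
# Hedged sketch of the FSDP + LoRA save flow described in the PR summary.
# Assumes a PEFT LoRA model; `save_fsdp_lora_model` and `cpu_template` are
# illustrative names, not part of this PR.
import copy

import torch.distributed as dist
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    StateDictType,
)

def save_fsdp_lora_model(fsdp_model, cpu_template, output_dir: str):
    # Gather the full (unsharded) state dict, offloaded to CPU on rank 0 only.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = fsdp_model.state_dict()

    if dist.get_rank() == 0:
        # Load the gathered weights into a CPU copy of the LoRA model, merge
        # the adapters into the base weights, then save as usual.
        cpu_model = copy.deepcopy(cpu_template)
        cpu_model.load_state_dict(state_dict)
        merged = cpu_model.merge_and_unload()  # PEFT: fold LoRA into base weights
        merged.save_pretrained(output_dir, safe_serialization=True)
        del cpu_model, merged  # discard the CPU copy; training continues

    dist.barrier()
```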