
Support FSDP #149

Merged
tocean merged 18 commits into main from yuxiang/fsdp_opt on Jan 16, 2024
Conversation

@tocean tocean (Contributor) commented Jan 8, 2024

Description
Support FSDP with FP8. A minimal usage sketch follows the list below.

Major Revision

  • Add fsdp package
  • Add mnist example
  • Add FSDPAdam and FSDPAdamW optimizers
  • Add documentation
  • Add unit tests
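
The sketch below is a hedged illustration of how the new pieces might fit together; it is not code taken from this PR. The import path for FSDPAdamW, the use of PyTorch's stock FullyShardedDataParallel wrapper, and the build helper are all assumptions.

```python
# Hypothetical sketch only: module paths and wrapper usage are assumptions;
# FSDPAdamW is the optimizer named in this PR, but its exact API may differ.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from msamp.optim import FSDPAdamW  # import path assumed


def build(model: torch.nn.Module):
    """Shard a model with FSDP and pair it with the FSDP-aware FP8 optimizer."""
    dist.init_process_group("nccl")  # assumes a torchrun-style launch
    fsdp_model = FSDP(model.cuda(), use_orig_params=True)
    # Per the discussion below, the optimizer step also synchronizes the FP8
    # scaling factors (amax/scale) across ranks.
    optimizer = FSDPAdamW(fsdp_model.parameters(), lr=1e-3)
    return fsdp_model, optimizer
```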

@tocean tocean (Contributor, Author) commented Jan 8, 2024

Log for mnist_fsdp.py:
Train Epoch: 1 Loss: 0.536632
Test set: Average loss: 0.1431, Accuracy: 9581/10000 (95.81%)

Train Epoch: 2 Loss: 0.176441
Test set: Average loss: 0.0877, Accuracy: 9729/10000 (97.29%)

Train Epoch: 3 Loss: 0.127415
Test set: Average loss: 0.0687, Accuracy: 9793/10000 (97.93%)

Train Epoch: 4 Loss: 0.105661
Test set: Average loss: 0.0608, Accuracy: 9813/10000 (98.13%)

Train Epoch: 5 Loss: 0.096828
Test set: Average loss: 0.0551, Accuracy: 9826/10000 (98.26%)

Train Epoch: 6 Loss: 0.090231
Test set: Average loss: 0.0527, Accuracy: 9829/10000 (98.29%)

Train Epoch: 7 Loss: 0.083397
Test set: Average loss: 0.0506, Accuracy: 9833/10000 (98.33%)

Train Epoch: 8 Loss: 0.081701
Test set: Average loss: 0.0497, Accuracy: 9833/10000 (98.33%)

Train Epoch: 9 Loss: 0.081912
Test set: Average loss: 0.0488, Accuracy: 9839/10000 (98.39%)

Train Epoch: 10 Loss: 0.079299
Test set: Average loss: 0.0487, Accuracy: 9841/10000 (98.41%)

Train Epoch: 11 Loss: 0.078325
Test set: Average loss: 0.0482, Accuracy: 9837/10000 (98.37%)

Train Epoch: 12 Loss: 0.077337
Test set: Average loss: 0.0481, Accuracy: 9838/10000 (98.38%)

Train Epoch: 13 Loss: 0.077516
Test set: Average loss: 0.0479, Accuracy: 9836/10000 (98.36%)

Train Epoch: 14 Loss: 0.076482
Test set: Average loss: 0.0479, Accuracy: 9837/10000 (98.37%)

@wkcn wkcn (Contributor) left a comment

Good job!

I have the following questions:

  1. Which line of the code synchronizes the scaling factor (meta.scale or meta.scale_inv) across GPUs?
  2. Gradient accumulation does not seem to be supported yet. We could raise a NotImplementedError when the number of gradient accumulation steps is larger than 1.
  3. Is there any comparison of the memory footprint between BF16-FSDP and FP8-FSDP?

@tocean tocean (Contributor, Author) commented Jan 9, 2024


For these 3 questions:

  1. See FSDPAdamW::step. In this function, all_reduce of amax is called (a conceptual sketch follows this list).
  2. Add checking logic in _fp8_post_backward_hook.
  3. I checked the memory saving using the T5 example here. The model I used is t5-3b. The memory footprints for BF16, FP32 and MS-AMP are 28GB, 40GB and 34GB respectively. It is a bit strange that MS-AMP uses more memory than BF16.
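
To make answers 1 and 2 concrete, below is a hedged, self-contained sketch. The names (the meta dictionary, FP8_E4M3_MAX, sync_scale, and the hook signature) are illustrative and are not the actual bodies of FSDPAdamW::step or _fp8_post_backward_hook.

```python
# Illustrative sketch of (1) synchronizing the scaling factor across GPUs and
# (2) rejecting gradient accumulation; names and the scaling rule are assumptions.
import torch
import torch.distributed as dist

FP8_E4M3_MAX = 448.0  # largest representable magnitude of the E4M3 format


def sync_scale(meta: dict) -> None:
    """MAX-reduce the local amax across ranks, then refresh scale and scale_inv."""
    dist.all_reduce(meta["amax"], op=dist.ReduceOp.MAX)
    # One common rule: map the global amax onto the FP8 dynamic range.
    meta["scale"] = FP8_E4M3_MAX / torch.clamp(meta["amax"], min=1e-12)
    meta["scale_inv"] = 1.0 / meta["scale"]


def fp8_post_backward_hook(grad_accumulation_steps: int = 1) -> None:
    """Guard mirroring answer 2: gradient accumulation is not supported yet."""
    if grad_accumulation_steps > 1:
        raise NotImplementedError(
            "Gradient accumulation (more than 1 step) is not supported with FP8 FSDP yet."
        )
    # ... the reduce-scatter of FP8 gradients would happen here ...
```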

@wkcn wkcn (Contributor) commented Jan 10, 2024


Regarding answer 3, it may be related to the mixed_precision argument of FSDP.

When mixed_precision is not None, FSDP creates an FP32 master weight for FP8Linear. This leads to duplicated master weights in FSDP and FP8Optimizer.
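
A hedged configuration sketch of this point follows. Whether the T5 run above actually passed a bfloat16 MixedPrecision policy is an assumption inferred from the discussion, not something this PR states; the wrap helper is illustrative.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision


def wrap(model: torch.nn.Module, use_fsdp_mixed_precision: bool) -> FSDP:
    # Assumes the default process group is already initialized (e.g. via torchrun).
    if use_fsdp_mixed_precision:
        # FSDP keeps full-precision parameters and casts to bf16 for compute;
        # combined with the FP8 optimizer's own FP32 master weights this
        # duplicates state, which could explain the 34GB vs 28GB gap above.
        policy = MixedPrecision(param_dtype=torch.bfloat16,
                                reduce_dtype=torch.bfloat16,
                                buffer_dtype=torch.bfloat16)
    else:
        # Leave mixed precision to the FP8 optimizer so there is a single
        # copy of the master weights.
        policy = None
    return FSDP(model.cuda(), mixed_precision=policy)
```

Under that reading, leaving mixed_precision as None and letting the FP8 optimizer hold the only master copy should narrow the MS-AMP vs BF16 memory gap reported above.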

@wkcn wkcn (Contributor) left a comment

LGTM. Thanks!

@tocean tocean requested a review from penghouwen January 11, 2024 06:02
@penghouwen penghouwen left a comment

An initial version of FSDP in MS-AMP.

@penghouwen

In the future, we will do:

  • gradient accumulation
  • speed optimization
  • accuracy calibration

@penghouwen penghouwen closed this Jan 16, 2024
@penghouwen penghouwen reopened this Jan 16, 2024
@tocean tocean merged commit 2fbe898 into main Jan 16, 2024
17 checks passed
@tocean tocean deleted the yuxiang/fsdp_opt branch January 16, 2024 02:59
@tocean tocean (Contributor, Author) commented Jan 16, 2024


Sure. Let's improve it in the next iteration.
