Checkpointing support for transformer type models #247
Conversation
@hariharan-devarajan ready for you to review again. I added two other features since last time we talked:
Almost there.
```python
if self.args.hidden_size <= 0:
    return 0
head_size = self.args.hidden_size // self.args.num_attention_heads
dim_kv = head_size * self.args.num_kv_heads
```
Missed this one. Please spell out the full form of `dim_kv`.
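For context, a minimal self-contained sketch of what this quantity computes; the spelled-out function name and comments are assumptions, not from the PR:

```python
def key_value_projection_dim(hidden_size: int, num_attention_heads: int,
                             num_kv_heads: int) -> int:
    # Each attention head operates on an equal slice of the hidden state.
    head_size = hidden_size // num_attention_heads
    # Under grouped-query attention, num_kv_heads <= num_attention_heads,
    # so the key/value projections can be narrower than hidden_size.
    return head_size * num_kv_heads
```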
```python
mlp_4h_to_h = self.args.ffn_hidden_size * self.args.hidden_size
weight = self.args.hidden_size
lm_head = embedding
return embedding + (input_norm + qkv + dense + layer_norm + mlp_h_to_4h + mlp_4h_to_h) * self.args.num_layers + weight + lm_head
```
What are `qkv` and `mlp_h_to_4h`? If the full form of `mlp_h_to_4h` is too long, then at least add a line comment explaining what it is where it is defined.
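A self-contained sketch of the per-term accounting this review asks to document. The term names mirror the PR; the exact formulas and comments are assumptions based on a standard pre-norm transformer block, not the PR's definitions:

```python
def transformer_parameter_count(hidden_size: int, ffn_hidden_size: int,
                                num_attention_heads: int, num_kv_heads: int,
                                num_layers: int, vocab_size: int) -> int:
    head_size = hidden_size // num_attention_heads
    dim_kv = head_size * num_kv_heads            # total key/value projection width
    embedding = vocab_size * hidden_size         # token embedding table
    input_norm = hidden_size                     # pre-attention norm weight
    # qkv: fused query/key/value projection; the query is hidden_size wide,
    # the key and value are each dim_kv wide under grouped-query attention.
    qkv = hidden_size * (hidden_size + 2 * dim_kv)
    dense = hidden_size * hidden_size            # attention output projection
    layer_norm = hidden_size                     # pre-MLP norm weight
    mlp_h_to_4h = hidden_size * ffn_hidden_size  # MLP up-projection (h -> 4h)
    mlp_4h_to_h = ffn_hidden_size * hidden_size  # MLP down-projection (4h -> h)
    weight = hidden_size                         # final norm weight
    lm_head = embedding                          # output head, tied to the embedding
    per_layer = input_norm + qkv + dense + layer_norm + mlp_h_to_4h + mlp_4h_to_h
    return embedding + per_layer * num_layers + weight + lm_head
```

Calling it with an illustrative shape, e.g. `transformer_parameter_count(4096, 11008, 32, 32, 32, 32000)`, returns roughly 5.3 billion under these assumed formulas.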
```python
def get_layer_parameters(self, layer_index):
    head_size = self.args.hidden_size // self.args.num_attention_heads
    dim_kv = head_size * self.args.num_kv_heads
```
Please use the full form of `dim_kv` here as well.
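A sketch of how a line comment at the definition site could address this; the comment wording is an assumption:

```python
def get_layer_parameters(self, layer_index):
    head_size = self.args.hidden_size // self.args.num_attention_heads
    # dim_kv: total width of the key/value projections
    # (num_kv_heads heads, each of size head_size)
    dim_kv = head_size * self.args.num_kv_heads
```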
Looks good. Thank you for all the changes.
In this PR, we addressed the issue that users previously had to manually enter layer parameters and optimization groups for checkpointing. #248