Load state dicts to CPU #328

Merged: 3 commits merged into main from petew/load-state-dict on Oct 12, 2023
Conversation

epwalsh (Member) commented Oct 12, 2023

It turns out we can load (legacy) sharded and unsharded checkpoints to CPU via torch.load() since FSDP.load_state_dict() will copy tensors to the right device anyway.
This should save some GPU memory.
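
In other words, a minimal sketch of the pattern this PR relies on (illustrative only, with placeholder names like fsdp_model and checkpoint_path rather than the trainer's actual code):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def load_unsharded_checkpoint(fsdp_model: FSDP, checkpoint_path: str) -> None:
    """Illustrative helper: load a checkpoint onto CPU and let FSDP place tensors."""
    # map_location="cpu" keeps the full state dict in host memory instead of
    # materializing a second copy of the weights on GPU.
    state_dict = torch.load(checkpoint_path, map_location="cpu")

    # FSDP's load_state_dict() copies each tensor to the destination module's
    # device as it loads, so no explicit .cuda() call is needed here.
    fsdp_model.load_state_dict(state_dict)
```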

load_fsdp_optim_state(self.fsdp_model, self.optim, optim_state_dict)

# Load other state.
try:
-    train_state_dict = torch.load(resource_path(load_path, "train.pt", local_cache=local_cache))
+    train_state_dict = load_state_dict(load_path, "train.pt", local_cache=local_cache)
Contributor

why no map_location="cpu" here?

Member Author

There's only one tiny tensor in the trainer state: the GPU RNG state, which needs to go on GPU anyway.
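
Roughly what that reply refers to (a sketch with assumed key names, not the repo's actual train.pt schema):

```python
import torch

# Illustrative only: the key names are assumptions. The point is that the
# trainer state is tiny, and the CUDA RNG state is restored through
# torch.cuda.set_rng_state() regardless of how the dict was loaded.
train_state = torch.load("train.pt")

torch.set_rng_state(train_state["rng_state"])            # CPU RNG state
torch.cuda.set_rng_state(train_state["cuda_rng_state"])  # CUDA RNG state
```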

@@ -218,7 +218,13 @@ def save_state_dict(
upload(target_path, upload_target, save_overwrite=save_overwrite)
Contributor

does this need a unit test?

Member Author

Our trainer unit tests are severely lacking at the moment, but I just tested this on spare MosaicML nodes and we recover exactly the same loss after these changes:
[screenshot: training loss matches exactly after the change]
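
If a unit test were added later, a minimal round-trip check might look something like this (a sketch only; it uses plain torch.save/torch.load rather than the repo's save_state_dict/load_state_dict helpers, and the test name and keys are illustrative):

```python
import torch


def test_state_dict_round_trip(tmp_path):
    """Check that what gets saved comes back bit-for-bit after a reload.

    Stands in for the heavier end-to-end check done on the MosaicML nodes
    (training resumes with exactly the same loss after the change).
    """
    state = {"step": 123, "loss": torch.tensor(2.5), "weights": torch.randn(4, 4)}
    path = tmp_path / "train.pt"

    torch.save(state, path)
    restored = torch.load(path, map_location="cpu")

    assert restored["step"] == state["step"]
    torch.testing.assert_close(restored["loss"], state["loss"])
    torch.testing.assert_close(restored["weights"], state["weights"])
```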

epwalsh (Member Author) commented Oct 12, 2023

@ibeltagy if this doesn't avoid OOM with our LUMI runs, we will have to set --fsdp.wrapping_strategy=by_block.
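
For reference, a by-block wrapping strategy roughly corresponds to an FSDP auto-wrap policy over the model's transformer block class (a sketch; nn.TransformerEncoderLayer is a stand-in for OLMo's actual block class):

```python
import functools

import torch.nn as nn
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# "By block" wrapping: each transformer block becomes its own FSDP unit, so
# only one block's full parameters are gathered at a time, which lowers peak
# GPU memory compared to wrapping the whole model as a single unit.
by_block_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={nn.TransformerEncoderLayer},  # placeholder block class
)

# Usage (needs an initialized process group):
#   fsdp_model = FullyShardedDataParallel(model, auto_wrap_policy=by_block_policy)
```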

epwalsh merged commit 809fe9d into main on Oct 12, 2023 (10 checks passed)
epwalsh deleted the petew/load-state-dict branch on October 12, 2023 at 17:11