make checkpointing thread safe #245


Merged: 1 commit merged into pytorch:main on Aug 5, 2025

Conversation

@tushar00jain (Contributor) commented Jul 26, 2025

Summary:

  • checkpointing wasn't thread safe for the HTTP transport, so lock the model in the pre-step hook and when we want to transfer the checkpoint

@d4l3k (Member) left a comment

Is there some other way we can handle this?

Two main concerns:

  • cloning has significant memory impact
  • send_checkpoint may race with the first inner optimizer step, so cloning in _to_cpu may not even be safe

I thought with the DiLoCo implementation we had a second copy of the weights to compute the pseudo gradient. Can we not reuse those for the state_dict transfer?

out.append(v)
if isinstance(v, DTensor):
    clone = distribute_tensor(
        v.to_local().clone(), v.device_mesh, v.placements
    )
@d4l3k (Member):

Is v.clone() not sufficient? What does that do? Do we also need special logic for CUDA DTensors?

@tushar00jain (Contributor, author) replied Jul 29, 2025

> Is v.clone() not sufficient?

Tried it and it didn't work; it made an empty DTensor. Took a long time to figure out how to clone DTensors.
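
For context, a minimal sketch of deep-copying a tensor while preserving DTensor sharding. This is not the PR's exact code: the diff in this PR uses distribute_tensor (see the snippet above), whereas this sketch uses DTensor.from_local to rewrap a cloned local shard with the original mesh and placements; the helper name is illustrative only.

```python
import torch
from torch.distributed.tensor import DTensor


def clone_tensor(v: torch.Tensor) -> torch.Tensor:
    """Clone a (possibly distributed) tensor so later in-place updates
    to the original cannot leak into a checkpoint that is in flight."""
    if isinstance(v, DTensor):
        # Clone only the local shard, then rewrap it with the same
        # device mesh and placements so the copy is still a DTensor.
        return DTensor.from_local(
            v.to_local().clone(),
            v.device_mesh,
            v.placements,
        )
    return v.clone()
```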

@@ -278,7 +279,13 @@ def _to_cpu(values: List[T], pin_memory: bool) -> List[T]:
            else:
                out.append(v.cpu())
        else:
            out.append(v)
            if isinstance(v, DTensor):
@d4l3k (Member):

should we rename _to_cpu to _clone_cpu?

@tushar00jain (Contributor, author) commented Jul 29, 2025

@d4l3k

> I thought with the DiLoCo implementation we had a second copy of the weights to compute the pseudo gradient. Can we not reuse those for the state_dict transfer?

The model transfer isn't even controlled by the wrapper in general.

> Is there some other way we can handle this?

Thought about blocking everything until the checkpoint transfer is complete, but that's probably also complicated since the node may never need to transfer a checkpoint.

Or use locks: the inner step locks the model, and we lock while actually transferring the state dict.
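
A minimal sketch of that idea, using a plain threading.Lock for illustration (the class and method names here are hypothetical; the actual PR uses a reader/writer lock, as the r_lock() call further down shows):

```python
import threading
from typing import Dict

import torch
from torch import nn


class LockedCheckpointer:
    """Hypothetical sketch: serialize parameter updates and checkpoint
    reads with one lock so a transfer never sees a half-updated model."""

    def __init__(self, model: nn.Module) -> None:
        self._model = model
        self._lock = threading.Lock()

    def step(self, optimizer: torch.optim.Optimizer) -> None:
        # The inner training step holds the lock while it mutates the
        # parameters (the PR does this via a step hook).
        with self._lock:
            optimizer.step()
            optimizer.zero_grad()

    def state_dict_for_transfer(self) -> Dict[str, torch.Tensor]:
        # The checkpoint transfer holds the same lock while it copies,
        # so the snapshot corresponds to some fully completed step.
        with self._lock:
            return {
                k: v.detach().clone() for k, v in self._model.state_dict().items()
            }
```

A reader/writer lock improves on this by letting multiple checkpoint readers proceed concurrently while still excluding the writer (the optimizer step).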

@tushar00jain force-pushed the pr245 branch 3 times, most recently from 0b91490 to 49a161f on July 29, 2025 03:19
@tushar00jain (Contributor, author) commented Jul 29, 2025

@d4l3k for the GPU case, we're cloning the tensor into CPU memory anyway. The thing is, we don't control at which inner step the checkpoint will be sent, so the regression test ends up being non-deterministic: it changes which model parameters are used for syncing, even if we make it thread safe.

Also, on thread safety: the cloning seems to be a blocking call in the post-step hook, so it shouldn't race with the inner optimizer step? I added some locking anyway.

@tushar00jain force-pushed the pr245 branch 3 times, most recently from 636a86a to 2b7defd on July 29, 2025 03:37
@tushar00jain changed the title from "deep copy state dict for checkpoint" to "make checkpointing thread safe" on Jul 30, 2025
@tushar00jain force-pushed the pr245 branch 6 times, most recently from cc0a37a to 76f2f60 on July 30, 2025 22:05
@tushar00jain changed the title from "make checkpointing thread safe" to "make checkpointing thread safe and deterministic" on Jul 30, 2025
@tushar00jain force-pushed the pr245 branch 2 times, most recently from 2388879 to fc1fa08 on July 30, 2025 22:52
@tushar00jain force-pushed the pr245 branch 2 times, most recently from f99b8dd to acd9ede on July 31, 2025 02:47
"user": {key: value() for key, value in self._user_state_dicts.items()},
"torchft": self.state_dict(),
}
with self._state_dict_lock.r_lock():
@d4l3k (Member):

we already have allow_checkpoint and disallow_checkpoint in HTTPTransport -- can we reuse those instead?

@tushar00jain (Contributor, author):

  • that also requires copying over the state dict to the HTTP transport
  • or keeping some tracking of which step was transferred last in the HTTP transport

With a separate lock, we can decouple checkpoint-specific logic from training logic.

@tushar00jain force-pushed the pr245 branch 4 times, most recently from 0283856 to 1f5854d on August 1, 2025 19:17
@tushar00jain changed the title from "make checkpointing thread safe and deterministic" to "make checkpointing thread safe" on Aug 1, 2025
@tushar00jain force-pushed the pr245 branch 2 times, most recently from 800c48f to 577bacd on August 1, 2025 20:02
@d4l3k (Member) left a comment

Seems reasonable to me -- just want to add a unit test specifically on the lock

@@ -324,6 +328,21 @@ def __init__(
        # first step is 1
        self._participating_replica_rank: Optional[int] = None
        self._participating_replica_world_size: int = 0
        self._is_state_dict_read_allowed = True

    def allow_state_dict_read(self) -> None:
@d4l3k (Member):

Can we add specific tests for these methods? Lock logic is pretty risk-prone, so it would be nice to have unit test coverage for these.
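
A rough sketch of the kind of test being asked for, written against a stand-in gate class rather than torchft's Manager (the stand-in and its blocking behavior are assumptions for illustration; the actual methods in the PR may behave differently):

```python
import threading
import unittest


class _StateDictGate:
    """Stand-in for the allow/disallow behavior under test."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._allowed = True

    def allow_state_dict_read(self) -> None:
        with self._cond:
            self._allowed = True
            self._cond.notify_all()

    def disallow_state_dict_read(self) -> None:
        with self._cond:
            self._allowed = False

    def read_state_dict(self) -> dict:
        # Block until reads are allowed, then return a dummy state dict.
        with self._cond:
            self._cond.wait_for(lambda: self._allowed)
            return {"step": 1}


class StateDictGateTest(unittest.TestCase):
    def test_read_blocks_until_allowed(self) -> None:
        gate = _StateDictGate()
        gate.disallow_state_dict_read()

        result: list = []
        reader = threading.Thread(target=lambda: result.append(gate.read_state_dict()))
        reader.start()

        # While reads are disallowed, the reader thread stays blocked.
        reader.join(timeout=0.2)
        self.assertTrue(reader.is_alive())
        self.assertEqual(result, [])

        # Once reads are allowed again, the reader completes.
        gate.allow_state_dict_read()
        reader.join(timeout=2.0)
        self.assertFalse(reader.is_alive())
        self.assertEqual(result, [{"step": 1}])


if __name__ == "__main__":
    unittest.main()
```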

@tushar00jain force-pushed the pr245 branch 3 times, most recently from 495ab9a to 595e7e9 on August 5, 2025 18:32
Summary:
- checkpointing wasn't thread safe for the HTTP transport, so lock the model in the pre-step hook and when we want to transfer the checkpoint
@tushar00jain merged commit ee2b322 into pytorch:main on Aug 5, 2025
13 of 14 checks passed
@tushar00jain deleted the pr245 branch on August 5, 2025 22:39
Labels: CLA Signed