
Add tests for out of order with checkpointing #1428

Merged · 3 commits · Jan 30, 2025

Conversation

michael-diggin (Contributor)

Follow-up to #1423

Changes

  • Add tests for state management of both index datasets and iterable datasets when in_order=False
  • Remove warning logs about in_order=False

State management with in_order=False actually just works, which is really nice.

I'm happy to keep the warning logs in for longer if we'd like to err on the side of caution. And please let me know if there are any other test cases I should add.
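As a rough illustration of the test pattern this PR adds - iterate partway, capture a checkpoint, and resume from a fresh loader - here is a minimal pure-Python sketch. `TinyStatefulLoader` is a hypothetical stand-in, not torchdata's `StatefulDataLoader`; it only models the "fast-forward by items yielded" idea discussed below.

```python
# Hypothetical stand-in for a stateful dataloader: it tracks how many
# items it has yielded and fast-forwards a fresh iterator on resume.
class TinyStatefulLoader:
    def __init__(self, dataset):
        self.dataset = dataset        # any deterministic iterable
        self.num_yielded = 0          # items produced so far

    def __iter__(self):
        it = iter(self.dataset)
        for _ in range(self.num_yielded):  # fast-forward on resume
            next(it)
        for item in it:
            self.num_yielded += 1
            yield item

    def state_dict(self):
        return {"num_yielded": self.num_yielded}

    def load_state_dict(self, state):
        self.num_yielded = state["num_yielded"]

# Iterate partway, checkpoint, then resume from a fresh loader.
loader = TinyStatefulLoader(range(10))
output = []
for i, x in enumerate(loader):
    output.append(x)
    if i == 3:
        state = loader.state_dict()
        break

resumed = TinyStatefulLoader(range(10))
resumed.load_state_dict(state)
output.extend(resumed)

assert sorted(output) == list(range(10))  # no samples lost or duplicated
```

The real tests exercise the same shape of scenario, but through the actual dataloader with multiple workers and a deliberately slow sample.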

@facebook-github-bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Jan 28, 2025
andrewkho (Contributor) commented Jan 28, 2025

How can this work for out-of-order iteration? eg let's say the first sample blocks until the end of the epoch, but you take a checkpoint in the middle, what happens then? Without replaying all of history, how will sample 0 be returned later?

pytorch-bot bot commented Jan 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1428

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cea6bdb with merge base e25df94:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

michael-diggin (Contributor, Author)

> How can this work for out-of-order iteration? e.g. let's say the first sample blocks until the end of the epoch, but you take a checkpoint in the middle - what happens then? Without replaying all of history, how will sample 0 be returned later?

That's exactly what the tests I've added are checking - the first sample for one of the workers is slow.
For index datasets it looks like it uses the fact that it has progressed a certain number of steps since the last checkpoint (which is part of the state dict) to skip forward, and for the iterable dataset it looks to be using the num_yielded field from the state dict.
If this isn't currently expected to work, I can add back in the warning logs while we fully determine whether it should work as is?
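The map-style (index dataset) mechanism described above can be sketched in a few lines: the checkpoint records how many sampler steps were consumed, and on resume the deterministic sampler order is replayed with that many steps skipped. The names here are illustrative, not torchdata's actual internals.

```python
# Sketch of "skip forward by steps since the last checkpoint" for a
# map-style dataset with a fixed, deterministic sampler order.
def indices_after_resume(order, steps_done):
    """Replay a fixed sampler order, skipping already-consumed steps."""
    it = iter(order)
    for _ in range(steps_done):
        next(it)            # fast-forward past consumed indices
    return list(it)

order = list(range(10))     # deterministic sampler order
consumed = order[:4]        # indices handed out before the checkpoint
remaining = indices_after_resume(order, steps_done=4)

# Consumed plus remaining covers the full epoch with no duplicates.
assert consumed + remaining == order
```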

andrewkho (Contributor)

@michael-diggin let's leave the warnings in for now just in case, while we think through this a bit more. Thanks again!

michael-diggin (Contributor, Author)

> @michael-diggin let's leave the warnings in for now just in case, while we think through this a bit more. Thanks again!

@andrewkho no problem - added back in just now.

```python
    self.assertEqual(sorted(output), list(range(10)))

def test_out_of_order_iterable_ds(self):
    dataset = _TestSlowIterableDataset(start=0, end=10)
```
Contributor

For an iterable dataset the slow worker will just lead to a straggler; on resume, the individual workers will resume and continue, though they'll be limited to single-worker performance. I think I can see why this might "just work" for iterable datasets.

Contributor Author

I'm not entirely sure - I've added a second case that breaks before either worker finishes, so both workers resume after the restart, which gives the same correct results.
I think the fast-forwarding part of resuming is what allows this to work: since the dataset is deterministic (i.e. the slow samples don't change), fast-forwarding by X samples brings it back to the same point.
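The determinism argument above can be made concrete with a small sketch: if the stream replays identically, recreating the iterator and skipping num_yielded items lands exactly where the checkpoint left off. This is an illustration of the reasoning, not torchdata code; `stream` is a made-up deterministic dataset.

```python
import random

def stream(seed):
    """A deterministic (seeded) shuffled stream of 10 samples."""
    rng = random.Random(seed)
    data = list(range(10))
    rng.shuffle(data)
    yield from data

def resume(seed, num_yielded):
    """Recreate the stream and fast-forward past already-yielded items."""
    it = stream(seed)
    for _ in range(num_yielded):
        next(it)
    return list(it)

first_pass = list(stream(seed=0))
resumed = resume(seed=0, num_yielded=4)

# With the same seed, the resumed stream continues exactly where the
# checkpoint left off. If the stream were non-deterministic (e.g. an
# unseeded shuffle), this alignment would not hold.
assert first_pass[4:] == resumed
```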

```python
output = []
for i, data in enumerate(dataloader):
    output.append(data)
    if i == 5:
```
Contributor

Can we try to set this lower than 5? I think for end=10, slow_index=0, and num_workers=2, i == 5 will be index 0, since indices 0 through 4 will come from worker 1, and worker 0 will be blocked until index 0 is released.

Contributor Author

I'm not sure that's true since we added the re-distribution of work - I checked this test's output and 0 was returned at index 8, so after the break point.
I've dropped the break point down to 3, however, and also added an assertion that 0 isn't in the output at that point.
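The interleaving discussed in this thread can be simulated with a toy version of out-of-order workers: two workers pull indices from a shared queue, so a slow item (index 0) doesn't block the others and simply shows up late in the output. This is an illustration of the scheduling behaviour, not torchdata's actual worker implementation.

```python
import queue
import threading
import time

task_q = queue.Queue()
out_q = queue.Queue()
for idx in range(10):
    task_q.put(idx)

def worker():
    # Pull indices until the shared queue is drained (work redistribution:
    # whichever worker is free takes the next index).
    while True:
        try:
            idx = task_q.get_nowait()
        except queue.Empty:
            return
        if idx == 0:
            time.sleep(0.5)      # the deliberately slow sample
        out_q.put(idx)

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

output = []
while not out_q.empty():
    output.append(out_q.get())

assert sorted(output) == list(range(10))   # everything still arrives...
assert output[-1] == 0                     # ...but index 0 arrives last
```

The ordering of the fast items is nondeterministic, which is exactly why the real tests compare sorted output rather than exact order.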

andrewkho (Contributor) left a comment

Let's land this once CI finishes and figure out details for the next release

@ramanishsingh ramanishsingh merged commit fcdc8b9 into pytorch:main Jan 30, 2025
39 checks passed