iteration resuming #37

Open · pluiez opened this issue Oct 30, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@pluiez commented Oct 30, 2024

Hi, I'm wondering whether each backend and each dataset support saving the iteration state and resuming later, so that iteration can continue from where it previously stopped.

This feature is required for resuming from a checkpoint during model training.

For example:

dataloader = jdl.DataLoader(ds, backend, shuffle=True)
for i, batch in enumerate(dataloader):
    if i == 100:
        # save the iteration state and stop here
        state = dataloader.state_dict()
        break

# re-init the dataloader, and then try to resume from the saved state
dataloader = jdl.DataLoader(ds, backend, shuffle=True)
dataloader.load_state_dict(state)

for batch in dataloader:
    ...
@BirkhoffG (Owner) commented

Hi @pluiez, none of the dataloader backends assume statefulness. For now, you should manually store the relevant state (e.g., batch index, arguments, seeds) and restore it by re-initializing the dataloader and iterating to the target batch index.

I am not planning to support load_state_dict, as it is not part of the API of the (official) PyTorch DataLoader. I might consider supporting StatefulDataLoader in the future, but it is still an experimental feature from the PyTorch team.
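
A minimal sketch of this manual approach (illustrative only, not part of the jax_dataloader API; it assumes the shuffle order can be reproduced across runs, e.g. via a fixed seed, and the helper names below are made up):

import itertools
import jax_dataloader as jdl

def save_state(batch_idx):
    # Persist whatever is needed to rebuild the iteration:
    # the batch index, the DataLoader arguments, and any RNG seed used for shuffling.
    return {"batch_idx": batch_idx}

def resume(ds, backend, state):
    # Re-create the dataloader with identical arguments, then fast-forward
    # past the batches that were already consumed before the checkpoint.
    dataloader = jdl.DataLoader(ds, backend, shuffle=True)
    return itertools.islice(iter(dataloader), state["batch_idx"], None)

# Usage after restoring `state` from disk:
#   for batch in resume(ds, backend, state):
#       ...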

@BirkhoffG added the enhancement (New feature or request) label Oct 30, 2024
@pluiez (Author) commented Nov 11, 2024

I have checked the StatefulDataLoader implementation; it does support resuming, simply by skipping the same number of batches that had been yielded before checkpointing. That is a workable but somewhat inefficient solution. I'm looking for a good practice for memory-efficient, fast dataset iteration that supports both shuffling and resumable iteration. Thank you for the information.
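
For what it's worth, one common pattern that avoids re-reading the skipped data is to derive each epoch's permutation from (seed, epoch), so a checkpoint only needs to store the epoch and batch index, and resuming can jump straight to the right offset. A rough sketch (not jax_dataloader-specific; names and arguments are illustrative):

import numpy as np

def shuffled_batches(dataset_len, batch_size, seed, epoch, start_batch=0):
    # Rebuild this epoch's permutation deterministically from (seed, epoch),
    # then yield index batches starting from the resume position.
    perm = np.random.default_rng((seed, epoch)).permutation(dataset_len)
    num_batches = dataset_len // batch_size  # drops the last partial batch
    for b in range(start_batch, num_batches):
        yield perm[b * batch_size : (b + 1) * batch_size]

# Resuming from a checkpoint taken at epoch=3, batch_idx=100:
#   for idx in shuffled_batches(len(ds), 32, seed=0, epoch=3, start_batch=100):
#       batch = ds[idx]  # gather the batch from the dataset by index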
