Issue with Multi-Epoch Training and max_train_steps Limitation #530

Open

NuanBaobao opened this issue Nov 7, 2024 · 1 comment
Description

I have encountered an issue when trying to train for multiple epochs. When I set max_train_steps to a value greater than the number of steps in a single epoch, the training process exits unexpectedly.

Problem

It seems that the training logic calls the train_one_epoch function only once, so the model does not iterate through multiple epochs as expected. Instead, training terminates after the first epoch, before max_train_steps is reached, rather than continuing for the specified number of epochs.

Current Code Snippet

The current implementation appears as follows:

def train_one_epoch(prof_=None):
    # Returning True signals that the global step cap has been reached
    if progress_info.global_step >= args.max_train_steps:
        return True
    for step, data_item in enumerate(train_dataloader):
        if train_one_step(step, data_item, prof_):
            break
        # On Ascend NPU hardware, free cached memory after the warm-up steps
        if step >= 2 and torch_npu is not None and npu_config is not None:
            npu_config.free_mm()

Proposed Solution

To address this limitation, I suggest modifying the training logic to allow for multiple epochs explicitly. Below is my proposed change:

def train_multi_epoch(prof_=None):
    progress_info.train_loss = 0.0
    for epoch in range(first_epoch, args.num_train_epochs):
        if train_one_epoch(prof_=prof_):
            break

I would appreciate any insights or suggestions regarding this proposed change. If there are additional considerations or implications for implementing this, please let me know.
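To make the intended control flow concrete, here is a minimal, self-contained sketch of how the outer epoch loop would interact with the global-step cap. The constants and helper functions (MAX_TRAIN_STEPS, STEPS_PER_EPOCH, NUM_TRAIN_EPOCHS, and the simplified train_one_step) are illustrative stand-ins for args.max_train_steps, the dataloader length, and the real training step, not the repository's actual code:

```python
# Illustrative sketch of the proposed multi-epoch loop; all names below are
# stand-ins, not the repository's actual objects.
MAX_TRAIN_STEPS = 25   # cap larger than one epoch's worth of steps
STEPS_PER_EPOCH = 10
NUM_TRAIN_EPOCHS = 5

global_step = 0

def train_one_step():
    # Stand-in for a real optimizer step; reports whether the cap was hit.
    global global_step
    global_step += 1
    return global_step >= MAX_TRAIN_STEPS

def train_one_epoch():
    # Mirrors the current snippet: returns True once max_train_steps is reached.
    if global_step >= MAX_TRAIN_STEPS:
        return True
    for _ in range(STEPS_PER_EPOCH):
        if train_one_step():
            return True
    return False

def train_multi_epoch():
    # The proposed outer loop: keep calling train_one_epoch until either the
    # epoch budget or the global-step cap is exhausted.
    for _epoch in range(NUM_TRAIN_EPOCHS):
        if train_one_epoch():
            break
    return global_step

print(train_multi_epoch())  # 25: 10 + 10 + 5 steps across three epochs
```

Without the outer loop, training would stop after the first 10 steps; with it, the run spans three epochs and honors the 25-step cap.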

LinB203 (Member) commented Nov 8, 2024

Refer to #508.
