Issue with Multi-Epoch Training and max_train_steps Limitation #530

Open

NuanBaobao opened this issue Nov 7, 2024 · 1 comment
Description

I have encountered an issue when trying to train for multiple epochs. When I set max_train_steps to a value greater than the number of steps in a single epoch, the training process exits unexpectedly.

Problem

It seems that the training logic calls the train_one_epoch function only once, so the model does not iterate through multiple epochs as expected. Instead, training terminates after the first epoch, before max_train_steps is reached, rather than continuing for the specified number of epochs.

Current Code Snippet

The current implementation appears as follows:

def train_one_epoch(prof_=None):
    # Returning True signals that the global step cap has been reached
    if progress_info.global_step >= args.max_train_steps:
        return True
    for step, data_item in enumerate(train_dataloader):
        if train_one_step(step, data_item, prof_):
            break
        # On Ascend NPU hardware, free cached memory after the warm-up steps
        if step >= 2 and torch_npu is not None and npu_config is not None:
            npu_config.free_mm()

Proposed Solution

To address this limitation, I suggest modifying the training logic to allow for multiple epochs explicitly. Below is my proposed change:

def train_multi_epoch(prof_=None):
    progress_info.train_loss = 0.0
    for epoch in range(first_epoch, args.num_train_epochs):
        if train_one_epoch(prof_=prof_):
            break

I would appreciate any insights or suggestions regarding this proposed change. If there are additional considerations or implications for implementing this, please let me know.
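To make the intended control flow concrete, here is a minimal, self-contained sketch of how the outer epoch loop would interact with the global-step cap. The constants and helper functions (MAX_TRAIN_STEPS, STEPS_PER_EPOCH, NUM_TRAIN_EPOCHS, and the simplified train_one_step) are illustrative stand-ins for args.max_train_steps, the dataloader length, and the real training step, not the repository's actual code:

```python
# Illustrative sketch of the proposed multi-epoch loop; all names below are
# stand-ins, not the repository's actual objects.
MAX_TRAIN_STEPS = 25   # cap larger than one epoch's worth of steps
STEPS_PER_EPOCH = 10
NUM_TRAIN_EPOCHS = 5

global_step = 0

def train_one_step():
    # Stand-in for a real optimizer step; reports whether the cap was hit.
    global global_step
    global_step += 1
    return global_step >= MAX_TRAIN_STEPS

def train_one_epoch():
    # Mirrors the current snippet: returns True once max_train_steps is reached.
    if global_step >= MAX_TRAIN_STEPS:
        return True
    for _ in range(STEPS_PER_EPOCH):
        if train_one_step():
            return True
    return False

def train_multi_epoch():
    # The proposed outer loop: keep calling train_one_epoch until either the
    # epoch budget or the global-step cap is exhausted.
    for _epoch in range(NUM_TRAIN_EPOCHS):
        if train_one_epoch():
            break
    return global_step

print(train_multi_epoch())  # 25: 10 + 10 + 5 steps across three epochs
```

Without the outer loop, training would stop after the first 10 steps; with it, the run spans three epochs and honors the 25-step cap.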

LinB203 (Member) commented Nov 8, 2024

Refer to #508.
