
Cannot find the best model after training #31734

Open

aladinggit opened this issue Jul 1, 2024 · 1 comment

Comments


aladinggit commented Jul 1, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0
  • Python version: 3.10.0
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: One node with 8 A100 40G GPUs

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the SFTTrainer to fully fine-tune the meta-llama/Meta-Llama-3-8B model. My SFT config and training arguments are below.

import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz")
model_name = "meta-llama/Meta-Llama-3-8B"
# Keep ~1% of the shard for training, then carve validation/test out of the rest.
train_testvalid = dataset["train"].train_test_split(test_size=0.99, seed=42)
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # alternative: torch.float32
    device_map="auto",
)

model.train()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token
tokenizer.padding_side = "left"

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']})

repository_id = "llama3-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    max_seq_length=1024,
    # fp16_full_eval=True,  # overflows with fp16
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    # logging & evaluation strategies
    # (step values in (0, 1) are interpreted as a ratio of total training steps)
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=0.1,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=0.1,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=sft_config,
)

trainer.train()
model.save_pretrained(repository_id)
tokenizer.save_pretrained(repository_id)

Expected behavior

At the end of training, I assume the trainer should load the best model and save it in the output directory. However, a message always pops up saying:

Could not locate the best model at checkpoint-207/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.

I am only using one node for training. I am not sure whether the best model was ever saved or loaded, or whether the script just saved the model from the end of the run. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
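
For reference, a minimal sketch of how one might check what the trainer actually wrote to disk (the checkpoint path is taken from the warning above together with output_dir, and trainer is the SFTTrainer instance from the script; both are assumptions about this particular run):

import os

ckpt = "llama3-tune/checkpoint-207"  # path from the warning above (hypothetical for your run)
# If model.safetensors is listed here but pytorch_model.bin is not, the
# checkpoint was written in safetensors format.
print(sorted(os.listdir(ckpt)))

# After trainer.train(), the trainer records which checkpoint it scored best:
print(trainer.state.best_model_checkpoint)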

@amyeroberts
Collaborator

cc @muellerzr @SunMarc
