
Cannot find the best model after training #31734

Open

aladinggit opened this issue Jul 1, 2024 · 1 comment

Comments


aladinggit commented Jul 1, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0
  • Python version: 3.10.0
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: One node with 8 A100 40G GPUs

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the SFTTrainer to fully fine-tune the meta-llama/Meta-Llama-3-8B model. My SFT config and training arguments are below.

import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz")
model_name = "meta-llama/Meta-Llama-3-8B"
# Keep ~1% of the shard for training, then carve validation/test out of the rest.
train_testvalid = dataset["train"].train_test_split(test_size=0.99, seed=42)
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # alternative: torch.float32
    device_map="auto",
)

model.train()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token
tokenizer.padding_side = "left"

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']})

repository_id = "llama3-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    max_seq_length=1024,
    # fp16_full_eval=True,  # overflows with fp16
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    # logging & evaluation strategies
    # (step values in (0, 1) are interpreted as a ratio of total training steps)
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=0.1,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=0.1,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=sft_config,
)

trainer.train()
model.save_pretrained(repository_id)
tokenizer.save_pretrained(repository_id)

Expected behavior

At the end of training, I assume the trainer should load the best model and save it in the output directory. However, a message always pops up saying:

Could not locate the best model at checkpoint-207/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.

I am only using one node for training. I am not sure whether the best model was ever saved or loaded, or whether the script just saved the model from the end of the run. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
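
For reference, a minimal sketch of how one might check what the trainer actually wrote to disk (the checkpoint path is taken from the warning above together with output_dir, and trainer is the SFTTrainer instance from the script; both are assumptions about this particular run):

import os

ckpt = "llama3-tune/checkpoint-207"  # path from the warning above (hypothetical for your run)
# If model.safetensors is listed here but pytorch_model.bin is not, the
# checkpoint was written in safetensors format.
print(sorted(os.listdir(ckpt)))

# After trainer.train(), the trainer records which checkpoint it scored best:
print(trainer.state.best_model_checkpoint)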

@amyeroberts
Collaborator

cc @muellerzr @SunMarc
