At the end of training, I expect the trainer to load the best model and save it to the output directory. However, this message always appears:
Could not locate the best model at checkpoint-207/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.
I am only using one node for training. I am not sure whether the best model was saved and loaded, or whether the saved model is simply the one from the end of the final iteration. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
System Info
transformers version: 4.40.2
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am using the SFTTrainer to fully fine-tune the Meta-Llama-3-8B model. My SFT config and training arguments are as below.
Expected behavior
At the end of training, the best model should be loaded and saved to the output directory. Instead, the warning quoted above is logged, and it is unclear whether the best model was actually saved or loaded.
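One way to narrow this down is to check which weight files the best checkpoint directory actually contains, since the warning only names `pytorch_model.bin`. The following is a minimal sketch, not part of the original report: it assumes the standard Hugging Face weight file names (single-file and sharded, PyTorch and safetensors), and the helper name `find_weight_files` and the example path are hypothetical.

```python
from pathlib import Path

# Standard weight file names used by Hugging Face save routines:
# single-file and sharded (index) variants for both formats.
KNOWN_WEIGHT_FILES = (
    "pytorch_model.bin",
    "pytorch_model.bin.index.json",
    "model.safetensors",
    "model.safetensors.index.json",
)


def find_weight_files(checkpoint_dir):
    """Return the known weight files present in checkpoint_dir.

    Hypothetical helper for diagnosis only; it does not load any weights.
    """
    root = Path(checkpoint_dir)
    return [name for name in KNOWN_WEIGHT_FILES if (root / name).is_file()]


if __name__ == "__main__":
    # Placeholder path: prefix it with the actual training output directory.
    print(find_weight_files("checkpoint-207"))
```

If the list contains only safetensors files, the checkpoint was written, and the remaining question is whether the trainer looked for it under the right file name.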