I am getting eval metrics that are very off. I am using trl's SFTTrainer and unsloth_train (to avoid the gradient accumulation bug). I have isolated this to Unsloth versions from 2025.2.6 onwards.
I ran the code yesterday and the models were training well, but today the evaluation metrics were all off. On further investigation I found that four new versions were released today, 2025.2.6 to 2025.2.9. My code was giving correct metrics up to version 2025.2.5. I initially suspected the new releases were driven by the trl update from 0.14.0 to 0.15.0, which does seem to be the case, since trl 0.15.0 is not compatible with the older Unsloth version (2025.2.5) and is only supported from 2025.2.6 onwards.
I therefore suspected the issue was caused by trl rather than Unsloth. However, on further investigation the problem still appears even when using Unsloth 2025.2.6 with trl 0.14.0, so the error is probably due to some internal change in how logits are handled when they are passed on to the metric pre-processing or computation in unsloth_train.
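(For completeness, here is a quick way to confirm which versions are actually installed in a given environment; this is just standard packaging metadata, nothing Unsloth-specific:)

from importlib.metadata import version

# Print the installed package versions to confirm which release is being tested.
print("unsloth:", version("unsloth"))
print("trl:", version("trl"))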
I am still not sure of the root cause and am investigating it myself.
Here are some example metrics (consistent across multiple runs):
On the older version (first 2-4 steps): {'bleu': 0.16595201266441403, 'chrf': 49.40949835765277, 'chrf++': 46.3269478539698, 'wer': 0.8411869358708604, 'cer': 0.7208625155549179}
On the newer versions (first 2-4 steps): {'bleu': 0.003462179056113745, 'chrf': 1.5924566575819799, 'chrf++': 1.597188643497329, 'wer': 0.9975353779359198, 'cer': 0.9274244024808702}
Even the loss is off: on the older version I was getting an eval_loss of ~1.42, but here I am getting a loss of ~9.9.
This does not even decrease during training. However, the training loss is being calculated accurately (it matches the older-version runs), so I think it must be an issue with how the logits are passed on to pre-processing or metrics computation.
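To narrow this down, here is a small debugging variant of preprocess_logits_for_metrics (just a diagnostic sketch, not part of my original training code) that logs what the trainer actually hands over; if the newer versions pass a tuple or already-processed tensors instead of raw vocabulary logits, it should show up here:

import torch

def preprocess_logits_for_metrics_debug(logits, labels):
    # Hypothetical check: some trainer versions hand over a tuple (logits, *extras)
    # rather than a plain tensor; unwrap defensively before taking the argmax.
    if isinstance(logits, tuple):
        print("logits arrived as a tuple of:", [type(x).__name__ for x in logits])
        logits = logits[0]
    print("logits shape:", tuple(logits.shape), "labels shape:", tuple(labels.shape))
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels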
Here is the whole code:
# Imports for this snippet; model, tokenizer, train_dataset, eval_dataset and
# max_seq_length are defined earlier in the notebook (Unsloth model setup, not shown).
import torch
import numpy as np
from evaluate import load
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported, unsloth_train

# Load the metrics from Hugging Face's evaluate library
bleu = load("bleu")
chrf = load("chrf")
wer = load("wer")
cer = load("cer")

def preprocess_logits_for_metrics(logits, labels):
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels

def compute_metrics(p):
    print("=== In compute_metrics ===")
    (preds, labels), _ = p
    del _

    # Replace the -100 padding used for the loss with the tokenizer's pad token.
    labels[labels == -100] = tokenizer.pad_token_id
    preds[preds == -100] = tokenizer.pad_token_id

    try:
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding predictions:", e)
        raise e
    try:
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding labels:", e)
        raise e

    # For BLEU/CHRF, references should be a list of lists (one inner list per example).
    decoded_labels_bleu = [[label] for label in decoded_labels]

    # Compute metrics.
    bleu_score = bleu.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrf_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrfpp_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu, word_order=2)  # CHRF++ (bigram)
    wer_score = wer.compute(predictions=decoded_preds, references=decoded_labels)
    cer_score = cer.compute(predictions=decoded_preds, references=decoded_labels)

    metrics = {
        "bleu": bleu_score["bleu"],
        "chrf": chrf_score["score"],
        "chrf++": chrfpp_score["score"],
        "wer": wer_score,
        "cer": cer_score,
    }
    print(metrics)
    return metrics

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,  # Add an evaluation dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length,  # 2048, since all sentences are shorter
    dataset_num_proc = 2,  # parallel processes for CPU (2 for Colab)
    packing = False,  # Can make training 5x faster for short sequences (combines short values)
    compute_metrics = compute_metrics,  # eval function
    preprocess_logits_for_metrics = preprocess_logits_for_metrics,  # required for saving logits memory in Unsloth (found on GitHub)
    args = TrainingArguments(
        per_device_train_batch_size = 24,
        gradient_accumulation_steps = 4,
        # num_train_epochs = 7,  # 10 epochs, cause why not?
        max_steps = 30,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        learning_rate = 1.6e-3,
        warmup_steps = 8,
        optim = "adamw_8bit",
        weight_decay = 0.005,
        # lr_scheduler_type = "linear",
        lr_scheduler_type = "cosine_with_restarts",
        # lr_scheduler_type = "cosine",
        seed = 3407,
        logging_steps = 1,
        output_dir = "outputs",
        report_to = "wandb",  # Use this for WandB etc.
        run_name = "sarvam_training_run_003",
        # per_device_eval_batch_size = 1,
        # eval_accumulation_steps = 2,
        eval_steps = 2,  # Set how frequently to evaluate
        # batch_eval_metrics = True,
        # batch_eval_metrics = False,
        # evaluation_strategy = "steps",  # Enable evaluation during training
        eval_strategy = "steps",  # Enable evaluation during training
        dataloader_pin_memory = True,  # fast GPU data transfer
        max_grad_norm = 0.7,  # clipping grads (prevents exploding grads)
        load_best_model_at_end = True,
        save_strategy = "steps",
        save_steps = 2,  # doubles as eval steps (to avoid extra time spent saving)
        greater_is_better = False,
        metric_for_best_model = "eval_loss",
    ),
)

trainer_stats = unsloth_train(trainer)
This is a sample comparison between my older run (brown, yesterday with v2025.2.5) and the initial steps of my current run (orange, with v2025.2.9; the same behavior reproduces from v2025.2.6 onwards).
I have a temporary solution for this: you will have to downgrade both unsloth and trl, since the newer version of trl (0.15.0) is not supported by unsloth v2025.2.5.
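Assuming a standard pip setup, the downgrade would look roughly like this (version pins taken from the versions mentioned above; adjust to your environment):

pip install "unsloth==2025.2.5" "trl==0.14.0"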
I also see this behavior across Llama and Qwen models. I am not sure whether it is just a miscalculation or whether the model weights are not being updated correctly, causing the persistently high error rate. I have not tested this systematically, but my observations show an actual, significant performance reduction since this started happening.
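A rough way to check whether the weights are actually being updated (a sketch that assumes the model and trainer objects from the code above) is to snapshot the trainable parameters before a short run and compare them afterwards:

import torch

# Snapshot all trainable (e.g. LoRA) parameters before training a few steps.
# Assumes `model` and `trainer` are the objects from the setup above.
before = {n: p.detach().clone() for n, p in model.named_parameters() if p.requires_grad}

unsloth_train(trainer)  # or trainer.train() limited to a handful of steps

# Count how many trainable tensors actually changed after training.
changed = [n for n, p in model.named_parameters()
           if p.requires_grad and not torch.equal(before[n], p.detach())]
print(f"{len(changed)} / {len(before)} trainable tensors changed after training")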
Downgrading seems to fix this. Thank you very much.