[URGENT] Wrong eval metrics (probably a logits error?) #1711

Open
Tasmay-Tibrewal opened this issue Feb 14, 2025 · 3 comments

@Tasmay-Tibrewal commented Feb 14, 2025

I am getting eval metrics that are badly off. I am using trl's SFTTrainer together with unsloth_train (to avoid the gradient accumulation bug), and I have isolated the regression to unsloth versions 2025.2.6 onwards.

I ran the same code yesterday and the models trained fine, but today the evaluation metrics were all wrong. On further investigation I found that four new versions (2025.2.6 through 2025.2.9) were released today; my code produced correct metrics up to version 2025.2.5. I initially suspected the releases were driven by the trl update from 0.14.0 to 0.15.0, which turned out to be the case: trl 0.15.0 is not compatible with the older unsloth (2025.2.5) and is only supported from 2025.2.6 onwards.

I therefore suspected the issue was caused by trl rather than unsloth. However, the wrong metrics reproduce even with unsloth 2025.2.6 and trl 0.14.0, so the error is probably in some internal change to how logits are handled when they are passed on for metrics computation or pre-processing in unsloth_train.

I am still not sure of the root cause, and I am investigating it myself.

Here are some example metrics (consistent across multiple runs).
On the older version (first 2-4 steps):
{'bleu': 0.16595201266441403, 'chrf': 49.40949835765277, 'chrf++': 46.3269478539698, 'wer': 0.8411869358708604, 'cer': 0.7208625155549179}

On the newer versions (first 2-4 steps):
{'bleu': 0.003462179056113745, 'chrf': 1.5924566575819799, 'chrf++': 1.597188643497329, 'wer': 0.9975353779359198, 'cer': 0.9274244024808702}

Even the loss is off: on the older version I was getting an eval_loss of ~1.42, but here I am getting ~9.9.

The eval loss is not even decreasing with training. The training loss, however, is computed correctly (it matches the older-version runs), so I think it must be an issue with how the logits are passed on to pre-processing or metrics computation.
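One diagnostic that could narrow this down (a sketch of what I plan to check, not a confirmed fix; shift_check is a hypothetical helper): causal-LM logits at position i predict the token at position i+1, so if a version change altered whether the logits arrive pre-shifted, argmax predictions aligned against shifted labels should score much higher than the unshifted ones:

import torch

def shift_check(logits, labels):
    # Token-level accuracy of argmax predictions, with and without a one-position shift.
    preds = torch.argmax(logits, dim=-1)
    mask = labels != -100
    acc = (preds[mask] == labels[mask]).float().mean().item()
    # Shifted: the logits at position i are scored against the label at position i+1.
    mask_s = labels[:, 1:] != -100
    acc_s = (preds[:, :-1][mask_s] == labels[:, 1:][mask_s]).float().mean().item()
    print(f"unshifted acc: {acc:.3f}, shifted acc: {acc_s:.3f}")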

Here is the whole code:

import torch
from evaluate import load
import numpy as np

# Load the metrics from Hugging Face's evaluate library
bleu = load("bleu")
chrf = load("chrf")
wer = load("wer")
cer = load("cer")

def preprocess_logits_for_metrics(logits, labels):
    # Keep only the argmax token ids so the trainer does not accumulate full logits.
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels

def compute_metrics(p):
    print("=== In compute_metrics ===")

    # p.predictions is the (pred_ids, labels) tuple returned by
    # preprocess_logits_for_metrics; p.label_ids is discarded.
    (preds, labels), _ = p
    del _

    # Replace the ignore index (-100) with the pad token so sequences can be decoded.
    labels[labels == -100] = tokenizer.pad_token_id
    preds[preds == -100] = tokenizer.pad_token_id

    try:
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding predictions:", e)
        raise e
    try:
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding labels:", e)
        raise e

    # For BLEU/CHRF, references should be a list of lists (one inner list per example).
    decoded_labels_bleu = [[label] for label in decoded_labels]

    # Compute metrics.
    bleu_score = bleu.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrf_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrfpp_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu, word_order=2)  # CHRF++ (bigram)
    wer_score = wer.compute(predictions=decoded_preds, references=decoded_labels)
    cer_score = cer.compute(predictions=decoded_preds, references=decoded_labels)

    # print("Computed BLEU score:", bleu_score)
    metrics = {
        "bleu": bleu_score["bleu"],
        "chrf": chrf_score["score"],
        "chrf++": chrfpp_score["score"],
        "wer": wer_score,
        "cer": cer_score,
    }

    print(metrics)

    return metrics

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,  # Add an evaluation dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length, # 2048; all sentences are shorter
    dataset_num_proc = 2, # parallel CPU processes (2 for Colab)
    packing = False, # can make training up to 5x faster for short sequences (packs several short examples into one)
    compute_metrics=compute_metrics,  # eval metrics function
    preprocess_logits_for_metrics=preprocess_logits_for_metrics, # reduces logits memory during eval (workaround found on GitHub)
    args = TrainingArguments(
        per_device_train_batch_size = 24,
        gradient_accumulation_steps = 4,
        # num_train_epochs = 7,
        max_steps = 30,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),

        learning_rate = 1.6e-3,
        warmup_steps = 8,
        optim = "adamw_8bit",
        weight_decay = 0.005,
        # lr_scheduler_type = "linear",
        lr_scheduler_type="cosine_with_restarts",
        # lr_scheduler_type = "cosine
        seed = 3407,

        logging_steps = 1,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
        run_name = "sarvam_training_run_003",

        # per_device_eval_batch_size=1,
        # eval_accumulation_steps=2,
        eval_steps=2,  # Set how frequently to evaluate
        # batch_eval_metrics=True,
        # batch_eval_metrics=False,
        # evaluation_strategy="steps",  # Enable evaluation during training
        eval_strategy="steps",  # Enable evaluation during training

        dataloader_pin_memory=True, # faster host-to-GPU data transfer
        max_grad_norm=0.7, # gradient clipping (prevents exploding gradients)

        load_best_model_at_end=True,
        save_strategy="steps",
        save_steps=2, # matches eval_steps so checkpoints line up with evaluations (needed for load_best_model_at_end)
        greater_is_better=False,
        metric_for_best_model="eval_loss",
    ),
)

trainer_stats = unsloth_train(trainer)
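
For what it's worth, the bad metrics should be reproducible without a full training run; a single evaluation pass on the freshly constructed trainer is enough to compare versions:

# Runs one evaluation pass, which triggers preprocess_logits_for_metrics and compute_metrics.
metrics = trainer.evaluate()
print(metrics)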

This is a sample comparison between my older run, brown (yesterday, with v2025.2.5), and the initial steps of my current run, orange (with v2025.2.9; the same behavior reproduces from v2025.2.6 onwards).
[Image: metric comparison between the two runs]

@Tasmay-Tibrewal (Author) commented Feb 14, 2025

I have a temporary workaround for this: you will have to downgrade both unsloth and trl, since the newer trl (0.15.0) is not supported by unsloth v2025.2.5.

simply run:

!pip install trl==0.14.0
!pip install unsloth==2025.2.5

if this does not work try running:

!pip install --force-reinstall trl==0.14.0
!pip install --force-reinstall unsloth==2025.2.5
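
After reinstalling, it is worth confirming that the downgrade actually took effect (importlib.metadata is in the Python standard library):

from importlib.metadata import version
# Both should report the pinned versions: trl 0.14.0 and unsloth 2025.2.5.
print("trl:", version("trl"))
print("unsloth:", version("unsloth"))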

@Tasmay-Tibrewal changed the title from "Wrong eval metrics (probably a logits error?)" to "[URGENT] Wrong eval metrics (probably a logits error?)" on Feb 15, 2025
@TheSittingCat commented

I also see this behavior across Llama and Qwen models. I am not sure whether it is just a miscalculation or whether the model weights are not being updated correctly, causing the persistent error rate. I have not tested it systematically, but my observations show a real, significant performance reduction since this started happening.
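
A cheap way to separate those two hypotheses (a sketch, assuming a standard PyTorch model object; all names are illustrative) is to checksum the trainable parameters before and after a few steps. If the checksum changes, the weights are being updated and the problem is more likely in the logits/metrics path:

import torch

def param_checksum(model):
    # Sum of absolute values over trainable parameters; a cheap change detector.
    with torch.no_grad():
        return sum(p.abs().sum().item() for p in model.parameters() if p.requires_grad)

before = param_checksum(model)
# ... run a few training steps here ...
after = param_checksum(model)
print("weights changed:", after != before)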

Downgrading seems to fix this. Thank you very much.

@Tasmay-Tibrewal (Author) commented
No issues, you are welcome.
