[URGENT] Wrong eval metrics (probably a logits error?) #1711

Open
Tasmay-Tibrewal opened this issue Feb 14, 2025 · 3 comments

@Tasmay-Tibrewal commented Feb 14, 2025

I am getting eval metrics that are badly off. I am using trl's SFTTrainer together with unsloth_train (to avoid the gradient accumulation bug), and I have isolated the regression to unsloth versions 2025.2.6 onwards.

I ran the same code yesterday and the models trained fine, but today the evaluation metrics were all wrong. On further investigation I found that four new versions (2025.2.6 through 2025.2.9) were released today; my code produced correct metrics up to version 2025.2.5. I initially suspected the releases were driven by the trl update from 0.14.0 to 0.15.0, which turned out to be the case: trl 0.15.0 is not compatible with the older unsloth (2025.2.5) and is only supported from 2025.2.6 onwards.

I therefore suspected the issue was caused by trl rather than unsloth. However, the wrong metrics reproduce even with unsloth 2025.2.6 and trl 0.14.0, so the error is probably in some internal change to how logits are handled when they are passed on for metrics computation or pre-processing in unsloth_train.

I am still not sure of the root cause, and I am investigating it myself.

Here are some example metrics (consistent across multiple runs).
On the older version (first 2-4 steps):
{'bleu': 0.16595201266441403, 'chrf': 49.40949835765277, 'chrf++': 46.3269478539698, 'wer': 0.8411869358708604, 'cer': 0.7208625155549179}

On the newer versions (first 2-4 steps):
{'bleu': 0.003462179056113745, 'chrf': 1.5924566575819799, 'chrf++': 1.597188643497329, 'wer': 0.9975353779359198, 'cer': 0.9274244024808702}

Even the loss is off: on the older version I was getting an eval_loss of ~1.42, but here I am getting ~9.9.

The eval loss is not even decreasing with training. The training loss, however, is computed correctly (it matches the older-version runs), so I think it must be an issue with how the logits are passed on to pre-processing or metrics computation.
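One diagnostic that could narrow this down (a sketch of what I plan to check, not a confirmed fix; shift_check is a hypothetical helper): causal-LM logits at position i predict the token at position i+1, so if a version change altered whether the logits arrive pre-shifted, argmax predictions aligned against shifted labels should score much higher than the unshifted ones:

import torch

def shift_check(logits, labels):
    # Token-level accuracy of argmax predictions, with and without a one-position shift.
    preds = torch.argmax(logits, dim=-1)
    mask = labels != -100
    acc = (preds[mask] == labels[mask]).float().mean().item()
    # Shifted: the logits at position i are scored against the label at position i+1.
    mask_s = labels[:, 1:] != -100
    acc_s = (preds[:, :-1][mask_s] == labels[:, 1:][mask_s]).float().mean().item()
    print(f"unshifted acc: {acc:.3f}, shifted acc: {acc_s:.3f}")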

Here is the whole code:

import torch
from evaluate import load
import numpy as np

# Load the metrics from Hugging Face's evaluate library
bleu = load("bleu")
chrf = load("chrf")
wer = load("wer")
cer = load("cer")

def preprocess_logits_for_metrics(logits, labels):
    # Keep only the argmax token ids so the trainer does not accumulate full logits.
    pred_ids = torch.argmax(logits, dim=-1)
    return pred_ids, labels

def compute_metrics(p):
    print("=== In compute_metrics ===")

    # p.predictions is the (pred_ids, labels) tuple returned by
    # preprocess_logits_for_metrics; p.label_ids is discarded.
    (preds, labels), _ = p
    del _

    # Replace the ignore index (-100) with the pad token so sequences can be decoded.
    labels[labels == -100] = tokenizer.pad_token_id
    preds[preds == -100] = tokenizer.pad_token_id

    try:
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding predictions:", e)
        raise e
    try:
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    except Exception as e:
        print("Error during decoding labels:", e)
        raise e

    # For BLEU/CHRF, references should be a list of lists (one inner list per example).
    decoded_labels_bleu = [[label] for label in decoded_labels]

    # Compute metrics.
    bleu_score = bleu.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrf_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu)
    chrfpp_score = chrf.compute(predictions=decoded_preds, references=decoded_labels_bleu, word_order=2)  # CHRF++ (bigram)
    wer_score = wer.compute(predictions=decoded_preds, references=decoded_labels)
    cer_score = cer.compute(predictions=decoded_preds, references=decoded_labels)

    # print("Computed BLEU score:", bleu_score)
    metrics = {
        "bleu": bleu_score["bleu"],
        "chrf": chrf_score["score"],
        "chrf++": chrfpp_score["score"],
        "wer": wer_score,
        "cer": cer_score,
    }

    print(metrics)

    return metrics

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,  # Add an evaluation dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length, # 2048; all sentences are shorter
    dataset_num_proc = 2, # parallel CPU processes (2 for Colab)
    packing = False, # can make training up to 5x faster for short sequences (packs several short examples into one)
    compute_metrics=compute_metrics,  # eval metrics function
    preprocess_logits_for_metrics=preprocess_logits_for_metrics, # reduces logits memory during eval (workaround found on GitHub)
    args = TrainingArguments(
        per_device_train_batch_size = 24,
        gradient_accumulation_steps = 4,
        # num_train_epochs = 7,
        max_steps = 30,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),

        learning_rate = 1.6e-3,
        warmup_steps = 8,
        optim = "adamw_8bit",
        weight_decay = 0.005,
        # lr_scheduler_type = "linear",
        lr_scheduler_type="cosine_with_restarts",
        # lr_scheduler_type = "cosine
        seed = 3407,

        logging_steps = 1,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
        run_name = "sarvam_training_run_003",

        # per_device_eval_batch_size=1,
        # eval_accumulation_steps=2,
        eval_steps=2,  # Set how frequently to evaluate
        # batch_eval_metrics=True,
        # batch_eval_metrics=False,
        # evaluation_strategy="steps",  # Enable evaluation during training
        eval_strategy="steps",  # Enable evaluation during training

        dataloader_pin_memory=True, # faster host-to-GPU data transfer
        max_grad_norm=0.7, # gradient clipping (prevents exploding gradients)

        load_best_model_at_end=True,
        save_strategy="steps",
        save_steps=2, # matches eval_steps so checkpoints line up with evaluations (needed for load_best_model_at_end)
        greater_is_better=False,
        metric_for_best_model="eval_loss",
    ),
)

trainer_stats = unsloth_train(trainer)
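
For what it's worth, the bad metrics should be reproducible without a full training run; a single evaluation pass on the freshly constructed trainer is enough to compare versions:

# Runs one evaluation pass, which triggers preprocess_logits_for_metrics and compute_metrics.
metrics = trainer.evaluate()
print(metrics)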

This is a sample comparison between my older run, brown (yesterday, with v2025.2.5), and the initial steps of my current run, orange (with v2025.2.9; the same behavior reproduces from v2025.2.6 onwards).
[Image: metric comparison between the two runs]

@Tasmay-Tibrewal (Author) commented Feb 14, 2025

I have a temporary workaround for this: you will have to downgrade both unsloth and trl, since the newer trl (0.15.0) is not supported by unsloth v2025.2.5.

simply run:

!pip install trl==0.14.0
!pip install unsloth==2025.2.5

if this does not work try running:

!pip install --force-reinstall trl==0.14.0
!pip install --force-reinstall unsloth==2025.2.5
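
After reinstalling, it is worth confirming that the downgrade actually took effect (importlib.metadata is in the Python standard library):

from importlib.metadata import version
# Both should report the pinned versions: trl 0.14.0 and unsloth 2025.2.5.
print("trl:", version("trl"))
print("unsloth:", version("unsloth"))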

@Tasmay-Tibrewal changed the title from "Wrong eval metrics (probably a logits error?)" to "[URGENT] Wrong eval metrics (probably a logits error?)" on Feb 15, 2025
@TheSittingCat commented

I also see this behavior across Llama and Qwen models. I am not sure whether it is just a miscalculation or whether the model weights are not being updated correctly, causing the persistent error rate. I have not tested it systematically, but my observations show a real, significant performance reduction since this started happening.
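
A cheap way to separate those two hypotheses (a sketch, assuming a standard PyTorch model object; all names are illustrative) is to checksum the trainable parameters before and after a few steps. If the checksum changes, the weights are being updated and the problem is more likely in the logits/metrics path:

import torch

def param_checksum(model):
    # Sum of absolute values over trainable parameters; a cheap change detector.
    with torch.no_grad():
        return sum(p.abs().sum().item() for p in model.parameters() if p.requires_grad)

before = param_checksum(model)
# ... run a few training steps here ...
after = param_checksum(model)
print("weights changed:", after != before)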

Downgrading seems to fix this. Thank you very much.

@Tasmay-Tibrewal (Author) commented
No issues, you are welcome.
