
grad_norm 0.0 while finetuning using group_by_label batch sampler #3130

Open
AmoghM opened this issue Dec 10, 2024 · 0 comments
AmoghM commented Dec 10, 2024

I am fine-tuning a Sentence Transformer on my dataset with a triplet-style loss, but the gradient norm (grad_norm) reported during training is 0.0 on almost every step. The issue occurs specifically with the group_by_label batch sampler, which is the sampler recommended for this kind of loss.

Details

  • Current Setup:
    • Model: Alibaba-NLP/gte-base-en-v1.5
    • Loss Function: Triplet Loss
    • Batch Sampler: group_by_label (recommended for triplet loss)

Observations

  • When I switch the batch sampler to either batch_sampler or no_duplicates, the training logs look healthy and the grad_norm values become non-zero (the switch is a one-argument change, see the sketch after this list).
  • However, I would like to keep using the group_by_label sampler, since it is the one suggested for triplet loss, and I need help understanding why this specific configuration is causing issues.
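
For reference, this is roughly how I toggle the sampler between runs. In my actual script I pass the plain string (as in the snippet below); the BatchSamplers constants from sentence_transformers.training_args should be the equivalent enum values, if I read the docs correctly:

from sentence_transformers.training_args import BatchSamplers

# Only this one argument changes between the three runs compared above.
batch_sampler = BatchSamplers.GROUP_BY_LABEL    # loss flat at ~0.6931, grad_norm mostly 0.0
# batch_sampler = BatchSamplers.NO_DUPLICATES   # grad_norm non-zero
# batch_sampler = BatchSamplers.BATCH_SAMPLER   # grad_norm non-zero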

Below is the sample code:

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments, losses

training_args = SentenceTransformerTrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=4,
    warmup_steps=200,
    weight_decay=0.01,
    learning_rate=2e-5,
    max_grad_norm=1.0,
    dataloader_drop_last=True,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    batch_sampler="group_by_label",  # the sampler that produces grad_norm 0.0
    evaluation_strategy="steps",
    eval_steps=50,
    logging_strategy="steps",
    logging_steps=1,
)

trainer = SentenceTransformerTrainer(
    model=model,  # Alibaba-NLP/gte-base-en-v1.5, loaded elsewhere
    loss=losses.BatchHardSoftMarginTripletLoss(model),
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

TensorBoard plots of the training runs with the different batch samplers. Orange line is no_duplicates, blue line is group_by_label, red line is batch_sampler.

[TensorBoard screenshots]

Training logs for the no_duplicates batch sampler:

{'loss': 6.174, 'grad_norm': 38.07846450805664, 'learning_rate': 1.0000000000000001e-07, 'epoch': 0.08}                                                                                                                             
{'loss': 6.8544, 'grad_norm': 44.4666748046875, 'learning_rate': 2.0000000000000002e-07, 'epoch': 0.15}                                                                                                                             
{'loss': 5.7911, 'grad_norm': 37.91443634033203, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.23}                                                                                                                            
{'loss': 5.8593, 'grad_norm': 41.3128662109375, 'learning_rate': 4.0000000000000003e-07, 'epoch': 0.31}                                                                                                                             
{'loss': 6.1478, 'grad_norm': 40.226253509521484, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.38}                                                                                                                            
{'loss': 6.2663, 'grad_norm': 37.63628005981445, 'learning_rate': 6.000000000000001e-07, 'epoch': 0.46}                                                                                                                             
{'loss': 6.5116, 'grad_norm': 45.362548828125, 'learning_rate': 7.000000000000001e-07, 'epoch': 0.54}                                                                                                                               
{'loss': 6.0732, 'grad_norm': 39.056190490722656, 'learning_rate': 8.000000000000001e-07, 'epoch': 0.62}                                                                                                                            
{'loss': 6.1131, 'grad_norm': 37.20143508911133, 'learning_rate': 9.000000000000001e-07, 'epoch': 0.69}                                                                                                                             
{'loss': 6.2785, 'grad_norm': 42.78799057006836, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.77}                                                                                                                            
{'loss': 6.2814, 'grad_norm': 38.738624572753906, 'learning_rate': 1.1e-06, 'epoch': 0.85}                                                                                                                                          
{'loss': 6.2216, 'grad_norm': 40.94490051269531, 'learning_rate': 1.2000000000000002e-06, 'epoch': 0.92}                                                                                                                            
{'loss': 5.776, 'grad_norm': 38.426063537597656, 'learning_rate': 1.3e-06, 'epoch': 1.0}

versus the training logs for the group_by_label batch sampler:

{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 1.0000000000000001e-07, 'epoch': 0.08}                                                                                                                                          
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 2.0000000000000002e-07, 'epoch': 0.15}                                                                                                                                          
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.23}                                                                                                                                          
{'loss': 1.3355, 'grad_norm': 22.390493392944336, 'learning_rate': 4.0000000000000003e-07, 'epoch': 0.31}                                                                                                                           
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 5.000000000000001e-07, 'epoch': 0.38}                                                                                                                                           
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 6.000000000000001e-07, 'epoch': 0.46}                                                                                                                                           
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 7.000000000000001e-07, 'epoch': 0.54}                                                                                                                                           
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 8.000000000000001e-07, 'epoch': 0.62}                                                                                                                                           
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 9.000000000000001e-07, 'epoch': 0.69}                                                                                                                                           
{'loss': 2.8415, 'grad_norm': 35.88393783569336, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.77}                                                                                                                            
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 1.1e-06, 'epoch': 0.85}                                                                                                                                                         
{'loss': 0.6931, 'grad_norm': 0.0, 'learning_rate': 1.2000000000000002e-06, 'epoch': 0.92} 
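
One observation that may be relevant: the flat 0.6931 loss is ln(2), i.e. softplus(0). If I read BatchHardSoftMarginTripletLoss correctly, it computes softplus(hardest_positive_distance - hardest_negative_distance) per anchor, so a loss of exactly ln(2) would mean those two distances coincide on (almost) every batch, which would line up with the zero gradient norm. A tiny sanity check of that arithmetic in plain Python (not the library code, just the formula as I understand it):

import math

def soft_margin_triplet(d_pos: float, d_neg: float) -> float:
    # softplus(d_pos - d_neg) = log(1 + exp(d_pos - d_neg))
    return math.log1p(math.exp(d_pos - d_neg))

print(soft_margin_triplet(0.5, 0.5))  # 0.6931... = ln(2), matching the flat loss above
print(soft_margin_triplet(0.2, 0.9))  # < ln(2): a batch where the hardest negative is farther away than the hardest positive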

Questions

  1. What could be causing the grad_norm to be 0.0 when using the group_by_label sampler?
  2. Are there any adjustments or configurations you would recommend to resolve this issue while still using the recommended batch sampler?

Thank you!