
Attention can be None in ModernBertForSequenceClassification #35917

Open
ashmikuz opened this issue Jan 27, 2025 · 5 comments · May be fixed by #35991
Comments

@ashmikuz

In the ModernBertForSequenceClassification class, the attention mask is never constructed outside of self.model (which is a ModernBertModel). Therefore, when no attention mask is passed to the model, the .unsqueeze() here fails.
I worked around this by assigning torch.ones(batch_size, seq_len) to attention_mask, but I am not sure whether this is correct.
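
For context, the workaround described above looks roughly like this (a sketch only; the checkpoint name is just an example, mean pooling is the configuration where the crash happens, and an all-ones mask treats any padding positions as real tokens):

```python
import torch
from transformers import AutoTokenizer, ModernBertForSequenceClassification

# Example checkpoint; classifier_pooling="mean" is the setting where the missing mask crashes.
model = ModernBertForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base", classifier_pooling="mean"
)
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

input_ids = tokenizer(["an example sentence"], return_tensors="pt")["input_ids"]
batch_size, seq_len = input_ids.shape

# Workaround: pass an explicit all-ones mask so the mean-pooling .unsqueeze() has something to work on.
# Caveat: this counts every position, including padding, which is why it may not be correct.
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```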

@Rocketknight1
Member

Rocketknight1 commented Jan 28, 2025

Hi @ashmikuz, when no attention mask is passed, we can't really work out which positions are masked! Although we could add code to estimate this (like attention_mask = input_ids != self.config.pad_token_id), this is error-prone, and I think a better solution is just to raise a clear error in this case, telling the user they have to pass an attention mask if they want to use self.config.classifier_pooling == "mean".

Would you be interested in making a PR for that?
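
For illustration, the pad-token-based estimation mentioned above would look roughly like this (a sketch only, not actual transformers code, and the helper name is hypothetical; the comments note why it is error-prone):

```python
import torch


def estimate_attention_mask(input_ids: torch.Tensor, pad_token_id) -> torch.Tensor:
    # Hypothetical helper: treat every non-pad token as attended.
    # Error-prone: the config may not define a pad_token_id, and sequences that
    # were never padded (or that legitimately contain the pad id) are mishandled.
    if pad_token_id is None:
        raise ValueError("Cannot estimate an attention mask without a pad_token_id.")
    return (input_ids != pad_token_id).long()
```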

@tom13878

Hi @Rocketknight1, @ashmikuz,

I had the same issue. Is anyone working on this? Otherwise I will raise a PR and add this RuntimeError:

if self.config.classifier_pooling == "mean" and attention_mask is None:
    raise RuntimeError(
        "Mean pooling requires an attention mask to properly compute the pooled output. "
        "Please provide an attention mask to indicate which tokens should be considered "
        "in the mean pooling calculation."
    )

@ashmikuz
Author

Sorry, I was quite busy over the last few days. Shouldn't this match how other models behave? As far as I understand, other models just print a warning and then create an attention mask with torch.ones, right?

@tom13878

Yes, you are right: I see the torch.ones fallback for e.g. DeBERTa here.

I don't see the warning, but I may have missed it...

@ashmikuz
Author

I'm working on a quick PR; I'll send it in a moment. Hopefully it fixes the issue and is in line with how other models behave.
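
For reference, the fallback pattern being discussed would look roughly like this (a sketch of the general approach only, not the actual PR; the helper name is hypothetical, and as noted above, DeBERTa appears to do the torch.ones part without an explicit warning):

```python
import logging
from typing import Optional

import torch

logger = logging.getLogger(__name__)


def resolve_attention_mask(input_ids: torch.Tensor, attention_mask: Optional[torch.Tensor]) -> torch.Tensor:
    # Hypothetical helper mirroring the behaviour described above: if no mask is
    # given, fall back to attending to every position, optionally with a warning.
    if attention_mask is None:
        logger.warning(
            "No attention_mask was provided; defaulting to all ones, which treats "
            "padding tokens as real tokens during mean pooling."
        )
        attention_mask = torch.ones_like(input_ids)
    return attention_mask
```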
