contextual embeddings of individual words #191

Open
Jinsong-Chen opened this issue Jan 27, 2025 · 2 comments

Jinsong-Chen commented Jan 27, 2025

I need contextual embeddings of individual words for an analysis. However, it turns out that ModernBERT performs much worse than BERT. I use the transformers version mentioned on the Hugging Face model card and the following code (the same code works well with BERT) to obtain the word-level embeddings. Could there be a problem?

import torch
from transformers import AutoTokenizer, AutoModel

llm = 'ModernBERT-large'  # presumably a local path; the Hugging Face hub id is answerdotai/ModernBERT-large
tokenizer = AutoTokenizer.from_pretrained(llm)
model = AutoModel.from_pretrained(llm)

def embedding(text, tokenizer, model):
    # Tokenize and run a forward pass without gradients
    inputs = tokenizer(text, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state

# textA, word, textB, doc are defined elsewhere; the target word sits between textA and textB
pos_adj = 1  # offset for the leading special token
text = f"{textA}{word}{textB}{doc}"
start_pos = len(tokenizer.tokenize(textA)) + pos_adj
end_pos = len(tokenizer.tokenize(textA + word)) + pos_adj
all_embeddings = embedding(text, tokenizer, model)
# Mean-pool the hidden states of the word's tokens
word_embedding = torch.mean(all_embeddings[0, start_pos:end_pos], dim=0)
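
As an aside, computing start_pos and end_pos by re-tokenizing textA can drift from the token positions in the full encoding (for example when leading whitespace is merged into the following token). A character-offset lookup via the fast tokenizer is usually more robust. A minimal sketch, where word_token_span is a hypothetical helper and textA, word, and text are the same variables as above:

def word_token_span(tokenizer, text, char_start, char_end):
    # Map the word's character span to token indices via the fast tokenizer's offsets
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=512)
    span = [i for i, (s, e) in enumerate(enc["offset_mapping"])
            if s < char_end and e > char_start]  # special tokens have (0, 0) offsets and are skipped
    return span[0], span[-1] + 1

char_start = len(textA)
char_end = char_start + len(word)
start_pos, end_pos = word_token_span(tokenizer, text, char_start, char_end)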

mattlam commented Jan 30, 2025

Might be related to this? #149

NohTow (Collaborator) commented Jan 30, 2025

Yeah, I don't know what you are evaluating exactly, but if you have per-word metrics, it might be related to the issue mentioned.

The tokenizer fix should be included in the latest version of transformers. Could you try installing it (maybe from source, just to be sure), passing is_split_into_words=True when calling the tokenizer, and perhaps also setting add_prefix_space=True when creating the tokenizer?
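
A minimal sketch of what that could look like (assuming the hub id answerdotai/ModernBERT-large and a placeholder word list):

from transformers import AutoTokenizer

# add_prefix_space=True when creating the tokenizer, is_split_into_words=True when calling it
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large", add_prefix_space=True)
words = ["contextual", "embeddings", "of", "individual", "words"]
enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc.word_ids())  # maps each token back to its word index (None for special tokens)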

Could you try checking the tokenization with BERT and ModernBERT to make sure they are similar?
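
Something along these lines could be used for that check (assuming bert-large-uncased as the BERT baseline; the sentence is just an example):

from transformers import AutoTokenizer

sentence = "contextual embeddings of individual words"
for name in ["bert-large-uncased", "answerdotai/ModernBERT-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(sentence))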
