contextual embeddings of individual words #191

Open
Jinsong-Chen opened this issue Jan 27, 2025 · 2 comments

Jinsong-Chen commented Jan 27, 2025

I need contextual embeddings of individual words for an analysis. However, it turns out that ModernBERT performs much worse than BERT. I use the transformers version mentioned on the Hugging Face model card and the following code (the same code works well with BERT) to obtain the word-level embeddings. Could there be a problem?

import torch
from transformers import AutoTokenizer, AutoModel

llm = 'ModernBERT-large'  # presumably a local path; the Hugging Face hub id is answerdotai/ModernBERT-large
tokenizer = AutoTokenizer.from_pretrained(llm)
model = AutoModel.from_pretrained(llm)

def embedding(text, tokenizer, model):
    # Tokenize and run a forward pass without gradients
    inputs = tokenizer(text, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state

# textA, word, textB, doc are defined elsewhere; the target word sits between textA and textB
pos_adj = 1  # offset for the leading special token
text = f"{textA}{word}{textB}{doc}"
start_pos = len(tokenizer.tokenize(textA)) + pos_adj
end_pos = len(tokenizer.tokenize(textA + word)) + pos_adj
all_embeddings = embedding(text, tokenizer, model)
# Mean-pool the hidden states of the word's tokens
word_embedding = torch.mean(all_embeddings[0, start_pos:end_pos], dim=0)
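
As an aside, computing start_pos and end_pos by re-tokenizing textA can drift from the token positions in the full encoding (for example when leading whitespace is merged into the following token). A character-offset lookup via the fast tokenizer is usually more robust. A minimal sketch, where word_token_span is a hypothetical helper and textA, word, and text are the same variables as above:

def word_token_span(tokenizer, text, char_start, char_end):
    # Map the word's character span to token indices via the fast tokenizer's offsets
    enc = tokenizer(text, return_offsets_mapping=True, truncation=True, max_length=512)
    span = [i for i, (s, e) in enumerate(enc["offset_mapping"])
            if s < char_end and e > char_start]  # special tokens have (0, 0) offsets and are skipped
    return span[0], span[-1] + 1

char_start = len(textA)
char_end = char_start + len(word)
start_pos, end_pos = word_token_span(tokenizer, text, char_start, char_end)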

mattlam commented Jan 30, 2025

Might be related to this? #149

NohTow (Collaborator) commented Jan 30, 2025

Yeah, I don't know what you are evaluating exactly, but if you have per-word metrics, it might be related to the issue mentioned.

The tokenizer fix should be included in the latest version of transformers. Could you try installing it (maybe from source, just to be sure), passing is_split_into_words=True when calling the tokenizer, and perhaps also setting add_prefix_space=True when creating the tokenizer?
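
A minimal sketch of what that could look like (assuming the hub id answerdotai/ModernBERT-large and a placeholder word list):

from transformers import AutoTokenizer

# add_prefix_space=True when creating the tokenizer, is_split_into_words=True when calling it
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large", add_prefix_space=True)
words = ["contextual", "embeddings", "of", "individual", "words"]
enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc.word_ids())  # maps each token back to its word index (None for special tokens)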

Could you try checking the tokenization with BERT and ModernBERT to make sure they are similar?
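
Something along these lines could be used for that check (assuming bert-large-uncased as the BERT baseline; the sentence is just an example):

from transformers import AutoTokenizer

sentence = "contextual embeddings of individual words"
for name in ["bert-large-uncased", "answerdotai/ModernBERT-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize(sentence))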
