- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)

Reproduction
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM2-135M"  # same behavior with gpt2
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

outputs = tokenizer("<|endoftext|><|im_start|>y", return_tensors="pt", return_special_tokens_mask=True)
```
The tokenizer code is written to return that mask correctly only for tokens that it added itself (i.e. when add_special_tokens=True). I'm not sure it's designed to handle cases where the user has encoded special tokens in the prompt text itself; see e.g. here
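If that is the intended behaviour, a possible workaround is to build the mask manually by checking each token id against the tokenizer's special ids (with a real tokenizer, `tokenizer.all_special_ids` and the ids returned by `tokenizer(...)`). The sketch below stands in for the tokenizer with plain lists, so the ids shown are illustrative only:

```python
# Sketch of a manual special-tokens mask, independent of
# return_special_tokens_mask. The ids here are illustrative;
# with a real tokenizer you would pass tokenizer.all_special_ids
# and the input_ids from tokenizer(...).
def special_tokens_mask(token_ids, special_ids):
    """Return 1 for ids that are special tokens, 0 otherwise."""
    special = set(special_ids)
    return [1 if tid in special else 0 for tid in token_ids]

# e.g. 0 = <|endoftext|>, 1 = <|im_start|>, 88 = "y"
print(special_tokens_mask([0, 1, 88], [0, 1]))  # [1, 1, 0]
```

This flags special tokens wherever they occur in the input, including ones the user wrote into the prompt themselves.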
cc @ArthurZucker for tokenizers - this is intended behaviour and not a bug, right?
System Info
transformers version: 4.47.1

Who can help?
@ArthurZucker @itazap
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
outputs:
Expected behavior
outputs:
given that: