Hi, I have a question. Looking at some other tutorials, I noticed that they don't set `add_special_tokens` when tokenizing. Is there a reason for that?
Hi, the default value of `add_special_tokens` in the tokenizer is already False; we set it explicitly only to make things clearer for readers. In practice it behaves the same as not setting it.
Hi, I just tested this. Without passing that flag, the llama3 tokenizer prepends the special `<|begin_of_text|>` token, as shown below. Here is the code I used:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM-Research/Meta-Llama-3-8B-Instruct')
text = 'You are so cute!'
print(tokenizer([text]))
print(tokenizer([text], add_special_tokens=False))
```

The results:

```
{'input_ids': [[128000, 2675, 527, 779, 19369, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
{'input_ids': [[2675, 527, 779, 19369, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}
```

I also noticed that open-source training frameworks such as Firefly and llama-factory do prepend `begin_of_text`: https://github.com/hiyouga/LLaMA-Factory/blob/f8d8690bf4c2981f3151b4ccf07daeb4f3cd38a9/src/llamafactory/data/template.py#L724 . Does adding or omitting it make a big difference? In a discussion group, people said it can be omitted when training a base model, but that omitting this special token has a significant impact when training an SFT model. Thanks for your reply!