
Does the LLaMA3 tokenizer need to add a BOS token for English or Chinese text? #140

Open
sugarandgugu opened this issue Jun 6, 2024 · 2 comments

@sugarandgugu

Hi, I have a question: looking through some other tutorials, I noticed that they don't set add_spec_tokens at all when tokenizing. Is there a reason for that?

@logan-zou
Contributor

Hi, the default value of add_spec_tokens in the tokenizer is already False; we set it explicitly only to make it easier for readers to follow. In practice it behaves exactly the same as leaving it unset.

@sugarandgugu
Author

Hi, I just tested this. Without adding any special tokens myself, the LLaMA3 tokenizer prepends the special marker <|begin_of_text|> by default, as shown below.
Here is the code I used:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM-Research/MetaLlama38BInstruct')
text = 'You are so cute!'
print(tokenizer([text]))
print(tokenizer([text], add_special_tokens=False))
```
The results are as follows:
```
{'input_ids': [[128000, 2675, 527, 779, 19369, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
{'input_ids': [[2675, 527, 779, 19369, 0]], 'attention_mask': [[1, 1, 1, 1, 1]]}
```

I also noticed that open-source training frameworks such as FireFly and llama-factory prepend begin_of_text: https://github.com/hiyouga/LLaMA-Factory/blob/f8d8690bf4c2981f3151b4ccf07daeb4f3cd38a9/src/llamafactory/data/template.py#L724
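For reference, a minimal sketch to confirm which token the prepended id 128000 maps to (same model path as above; the expected values in the comments are my assumption based on the Llama 3 vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM-Research/MetaLlama38BInstruct')

# Map the prepended id back to its token string.
print(tokenizer.convert_ids_to_tokens(128000))  # expected: '<|begin_of_text|>'

# The tokenizer also exposes its BOS token directly.
print(tokenizer.bos_token, tokenizer.bos_token_id)  # expected: <|begin_of_text|> 128000
```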
Does adding this token or leaving it out make a big difference? From a discussion with others in a group chat, the view was that for training a base model it can be omitted, but for training an SFT model, leaving out this special token has a fairly large impact (see the sketch below for one way to add it manually). Thanks for the reply!
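As one way to make this explicit, here is a minimal sketch of prepending BOS by hand when tokenizing with add_special_tokens=False — encode_with_bos is a hypothetical helper for illustration, not an API of transformers or LLaMA-Factory:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('LLM-Research/MetaLlama38BInstruct')

def encode_with_bos(text: str) -> list[int]:
    # Hypothetical helper: tokenize without special tokens, then prepend
    # the BOS id (128000, <|begin_of_text|>, for Llama 3) ourselves.
    ids = tokenizer(text, add_special_tokens=False)['input_ids']
    return [tokenizer.bos_token_id] + ids

print(encode_with_bos('You are so cute!'))
# Expected, per the outputs above: [128000, 2675, 527, 779, 19369, 0]
```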
