
Does BeautifulPrompt support Chinese? #348

Open
Boomprogrammar opened this issue Jan 24, 2024 · 6 comments

Comments

@Boomprogrammar

Does BeautifulPrompt support Chinese?

@NicholasCao
Contributor

BeautifulPrompt was trained on a purely English dataset, so it does not support Chinese for now; you can fine-tune it on Chinese data using the code we have open-sourced.

@oldwangggggg

oldwangggggg commented Nov 8, 2024

> BeautifulPrompt was trained on a purely English dataset, so it does not support Chinese for now; you can fine-tune it on Chinese data using the code we have open-sourced.

Hello, the current BeautifulPrompt no longer seems to run in today's environments. Specifically, the `_make_causal_mask` and `_expand_mask` functions have been removed from the `modeling_bloom` module in recent versions of the Transformers library. I have tried several versions and all of them fail. Could you please take a look? T T @NicholasCao
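Whether an installed Transformers release still exposes the two helpers can be probed before running the repo's scripts. A minimal sketch (the thread does not state which release removed them; the check itself is illustrative and is demonstrated on a stand-in object):

```python
import types

def has_legacy_mask_helpers(modeling_bloom) -> bool:
    """Return True if the module still exposes the private helpers that
    BeautifulPrompt's code imports from transformers' BLOOM implementation."""
    return all(
        hasattr(modeling_bloom, name)
        for name in ("_make_causal_mask", "_expand_mask")
    )

# With transformers installed you would pass the real module:
#   from transformers.models.bloom import modeling_bloom
#   has_legacy_mask_helpers(modeling_bloom)
# Here a stand-in object shows the check itself:
stub = types.SimpleNamespace(_make_causal_mask=object(), _expand_mask=object())
print(has_legacy_mask_helpers(stub))  # True
```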

@NicholasCao
Contributor

NicholasCao commented Nov 8, 2024

Try 4.27.4 or 4.30.0 @oldwangggggg
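The suggested pins can also be enforced at runtime before training starts, so a mismatched environment fails fast instead of deep inside the library. A minimal sketch (only the two version numbers come from this thread; the guard itself is illustrative):

```python
# Versions suggested by the maintainer in this thread.
SUGGESTED = ("4.27.4", "4.30.0")

def check_transformers_version(installed: str, suggested=SUGGESTED) -> None:
    """Raise early if the installed transformers version is untested,
    rather than failing later inside modeling_bloom."""
    if installed not in suggested:
        raise RuntimeError(
            f"transformers=={installed} is untested with this repo; "
            f"try one of {suggested}, e.g. pip install transformers=={suggested[-1]}"
        )

# In a real script you would pass transformers.__version__ here.
check_transformers_version("4.30.0")  # passes silently
```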

@oldwangggggg

> Try 4.27.4 or 4.30.0 @oldwangggggg

@NicholasCao
However, after downgrading to either 4.27.4 or 4.30.0, I get the following tokenizer error in both cases:

Traceback (most recent call last):
  File "/home/wangchongyu/Promptist_MY/BeautifulPrompt123/train_ppo.py", line 213, in <module>
    main(args)
  File "/home/wangchongyu/Promptist_MY/BeautifulPrompt123/train_ppo.py", line 180, in main
    reward_fn = create_reward_fn(args)
  File "/home/wangchongyu/Promptist_MY/BeautifulPrompt123/train_ppo.py", line 41, in create_reward_fn
    aes_tokenizer = AutoTokenizer.from_pretrained(args.aes_model_path)
  File "/home/wangchongyu/anaconda3/envs/PPO/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 679, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/wangchongyu/anaconda3/envs/PPO/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1804, in from_pretrained
    return cls._from_pretrained(
  File "/home/wangchongyu/anaconda3/envs/PPO/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1958, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/wangchongyu/anaconda3/envs/PPO/lib/python3.10/site-packages/transformers/models/bloom/tokenization_bloom_fast.py", line 118, in __init__
    super().__init__(
  File "/home/wangchongyu/anaconda3/envs/PPO/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum ModelWrapper at line 1252509 column 3
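The "untagged enum ModelWrapper" error typically means the checkpoint's `tokenizer.json` was written by a newer `tokenizers` library than the one installed alongside the downgraded Transformers. One commonly suggested workaround is falling back to the slow (pure-Python) tokenizer, which does not parse `tokenizer.json`. A hedged sketch; `loader` stands in for `AutoTokenizer.from_pretrained` (whose `use_fast` parameter is real) so the fallback logic is testable without downloading a model:

```python
def load_tokenizer(path, loader):
    """Try the fast tokenizer first; on the version-mismatch error,
    retry with the slow tokenizer (use_fast=False)."""
    try:
        return loader(path)
    except Exception as exc:
        if "untagged enum ModelWrapper" in str(exc):
            # tokenizer.json was written by a newer `tokenizers`; the slow
            # tokenizer reads the legacy vocab files instead.
            return loader(path, use_fast=False)
        raise
```

In a real script you would call `load_tokenizer(args.aes_model_path, AutoTokenizer.from_pretrained)`; whether the slow tokenizer exists for this particular checkpoint is not confirmed in the thread.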

@NicholasCao
Contributor

It looks like a problem in the tokenizer's low-level dependency. Sorry, I did not save a requirements.txt at the time; you could seek help in the transformers repository.
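Once a working combination of versions is found, the lost pins can be reconstructed from the environment with the standard library alone (equivalent to `pip freeze`). A minimal sketch:

```python
from importlib import metadata

def freeze() -> list:
    """Return pip-freeze-style 'name==version' pins for every installed
    distribution, suitable for writing out as requirements.txt."""
    pins = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken distributions missing metadata
            pins.append(f"{name}=={dist.version}")
    return sorted(pins)

# Saving the result preserves the environment for future users:
# open("requirements.txt", "w").write("\n".join(freeze()) + "\n")
```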

@oldwangggggg

> It looks like a problem in the tokenizer's low-level dependency. Sorry, I did not save a requirements.txt at the time; you could seek help in the transformers repository.

OK, thanks 🙏
