Hi, I'd like to ask what purpose the expanded vocabulary actually serves. https://github.com/pengxiao-song/LaWGPT/blob/main/resources/legal_vocab.txt contains duplicate tokens (for example, 公正审判 appears on both line 968 and line 4137), so the file has to be deduplicated before it can be merged with the chinese-llama vocabulary:
```python
# Load custom vocabulary
new_tokens = open(VOC_PATH, "r").read().split("\n")
new_tokens = list(set(new_tokens))  # deduplicate
for token in new_tokens:
    if token not in llama_spm_tokens_set:
        new_token = model.ModelProto().SentencePiece()
        new_token.piece = token
        new_token.score = 0
        llama_spm.pieces.append(new_token)
print(f"Size of merged llama's vocabulary: {len(llama_spm.pieces)}")
```
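For context, here is a minimal sketch of the surrounding merge step, assuming the snippet above comes from a Chinese-LLaMA-style tokenizer-merge script; the paths and the `OUTPUT_PATH` name are placeholders introduced for illustration, not from the original issue:

```python
# Sketch of how the names used above could be set up and how the merged
# vocabulary could be written back to a tokenizer.model (paths are hypothetical).
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as model  # provides ModelProto

LLAMA_MODEL_PATH = "chinese-llama/tokenizer.model"  # hypothetical path
VOC_PATH = "resources/legal_vocab.txt"
OUTPUT_PATH = "merged/tokenizer.model"              # hypothetical path

# Load the base SentencePiece model and collect its existing pieces
sp = spm.SentencePieceProcessor()
sp.Load(LLAMA_MODEL_PATH)
llama_spm = model.ModelProto()
llama_spm.ParseFromString(sp.serialized_model_proto())
llama_spm_tokens_set = set(p.piece for p in llama_spm.pieces)

# ... append the deduplicated new tokens as in the snippet above ...

# Serialize the merged proto so it can be used as a tokenizer.model
with open(OUTPUT_PATH, "wb") as f:
    f.write(llama_spm.SerializeToString())
```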
However, when the merged tokenizer.model is used to replace the original chinese-llama tokenizer.model for inference, errors can still occur, whereas the original chinese-llama tokenizer.model works fine. So is expanding the vocabulary actually necessary?
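For reference, one common cause of this kind of inference error (my assumption, not confirmed in this thread) is that the expanded vocabulary no longer matches the model's embedding size, so swapping only tokenizer.model is not enough. A quick check along these lines, with hypothetical paths:

```python
# Hypothetical sanity check: compare the expanded vocabulary size with the
# model's embedding rows (paths below are placeholders, not from this issue).
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "path/to/chinese-llama"                # hypothetical path
MERGED_TOKENIZER_DIR = "path/to/merged_tokenizer"   # hypothetical path

tokenizer = LlamaTokenizer.from_pretrained(MERGED_TOKENIZER_DIR)
llm = LlamaForCausalLM.from_pretrained(BASE_MODEL)

print(len(tokenizer), llm.get_input_embeddings().weight.shape[0])
# If the two numbers differ, inference with the merged tokenizer will break;
# the embedding matrix has to be resized (and the new rows trained) as well.
llm.resize_token_embeddings(len(tokenizer))
```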
Another option: go to the GitHub website, download the zip of the code for the corresponding version, enter the code directory, and then, in the conda environment where you need these packages installed, run pip install .