
merge_vocabulary question #126

Open
232525 opened this issue Jun 29, 2023 · 1 comment

Comments

@232525 commented Jun 29, 2023

Hello, I'd like to ask what purpose the expanded vocabulary actually serves.
https://github.com/pengxiao-song/LaWGPT/blob/main/resources/legal_vocab.txt contains duplicate tokens (for example, 公正审判 appears on both line 968 and line 4137), so the vocabulary first has to be deduplicated before it can be merged with chinese-llama correctly.

    # Load custom vocabulary
    with open(VOC_PATH, "r", encoding="utf-8") as f:
        new_tokens = f.read().split("\n")
    new_tokens = list(set(t for t in new_tokens if t))  # deduplicate and drop empty lines
    for token in new_tokens:
        if token not in llama_spm_tokens_set:
            new_token = model.ModelProto().SentencePiece()
            new_token.piece = token
            new_token.score = 0
            llama_spm.pieces.append(new_token)
    print(f"Size of merged llama's vocabulary: {len(llama_spm.pieces)}")

However, when the merged tokenizer.model is used to replace the original chinese-llama tokenizer.model for inference, errors can still occur, whereas the original chinese-llama tokenizer.model works fine.
So is vocabulary expansion actually necessary?
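A quick way to sanity-check the merged file before swapping it in for inference is to load it directly with transformers' LlamaTokenizer and tokenize a sample sentence (the path below is a placeholder). Inference errors after replacing the tokenizer are often caused by the model's embedding matrix no longer matching the enlarged vocabulary size, rather than by the tokenizer file itself.

    # Rough sanity check of the merged tokenizer (path is a placeholder)
    from transformers import LlamaTokenizer

    tokenizer = LlamaTokenizer(vocab_file="merged_tokenizer/tokenizer.model")
    text = "公正审判是基本原则"
    ids = tokenizer.encode(text)
    print(ids)
    print(tokenizer.convert_ids_to_tokens(ids))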

@cqwyj2000

Another option: go to the GitHub website, download the zip of the corresponding version of the code, then enter the code directory and run pip install . in the conda environment where you need these packages installed.
