int4 inference


Li Yudong (李煜东) edited this page May 22, 2023 · 4 revisions

Running int4 inference with llama.cpp takes two preparation steps: format conversion and model quantization.

Convert to llama format

python3 scripts/convert_tencentpretrain_to_llama.py --input_model_path chatflow_7b.bin \
                                                    --output_model_path consolidated.00.pth \
                                                    --layers 32

Convert to ggml

git clone https://github.com/ggerganov/llama.cpp

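Copy the converted model into the models/ directory and create the corresponding configuration file (params.json). The expected layout is: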

├── models
│   ├── chatflow_7b
│   │   ├── consolidated.00.pth
│   │   └── params.json
│   └── tokenizer.model
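
The page does not show the contents of params.json. The sketch below writes one using the standard LLaMA-7B hyperparameters, which is an assumption: only n_layers = 32 is confirmed by the --layers 32 flag above, so check the other values (especially vocab_size) against your own checkpoint. The helper name write_params is hypothetical.

```python
import json
from pathlib import Path

# ASSUMPTION: standard LLaMA-7B hyperparameters. Only n_layers is confirmed
# by the conversion command on this page; verify the rest for chatflow_7b.
PARAMS_7B = {
    "dim": 4096,
    "multiple_of": 256,
    "n_heads": 32,
    "n_layers": 32,
    "norm_eps": 1e-06,
    "vocab_size": -1,  # -1 as in the original LLaMA params.json
}


def write_params(model_dir, params=PARAMS_7B):
    """Create model_dir if needed and write params.json into it."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "params.json"
    path.write_text(json.dumps(params, indent=2) + "\n", encoding="utf-8")
    return path
```

Calling write_params("models/chatflow_7b") from the llama.cpp checkout would place the file next to consolidated.00.pth.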

Convert the model (the trailing 1 selects f16 output):

python3 convert-pth-to-ggml.py models/chatflow_7b 1

Quantize the model

./quantize ./models/chatflow_7b/ggml-model-f16.bin ./models/chatflow_7b/ggml-model-q4_0.bin 2
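
To give a feel for what 4-bit quantization does to the weights, here is a minimal sketch of symmetric block quantization: each block of 32 floats is stored as one scale plus 32 small integers. It is a simplified illustration and not ggml's actual q4_0 bit layout or rounding rule; the block size and the function names are assumptions made for the example.

```python
BLOCK_SIZE = 32  # ASSUMPTION: weights per block, chosen for illustration


def quantize_block(values):
    """Map one block of floats to 4-bit signed integers plus one scale."""
    amax = max(abs(v) for v in values)
    # Scale so the largest magnitude maps to 7; avoid dividing by zero.
    scale = amax / 7.0 if amax > 0 else 1.0
    # Clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the scale and the 4-bit integers."""
    return [scale * x for x in q]
```

With this scheme the reconstruction error of each weight is at most half the block's scale, which is why blocks with one large outlier lose more precision on their small values.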

Run

./main -m ./models/chatflow_7b/ggml-model-q4_0.bin -p "北京有什么好玩的地方？\n" -n 256
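
The ggml conversion, quantization, and run steps above can be chained from one script. The following is a sketch that only reuses the commands shown on this page; the helper names build_commands and run_pipeline are hypothetical, and it assumes the script is started from the llama.cpp checkout after quantize and main have been built.

```python
import subprocess


def build_commands(model_dir="models/chatflow_7b",
                   prompt="北京有什么好玩的地方？\\n",
                   n_tokens=256):
    """Return the three commands from this page as argument lists."""
    f16 = f"./{model_dir}/ggml-model-f16.bin"
    q4 = f"./{model_dir}/ggml-model-q4_0.bin"
    return [
        ["python3", "convert-pth-to-ggml.py", model_dir, "1"],
        ["./quantize", f16, q4, "2"],
        ["./main", "-m", q4, "-p", prompt, "-n", str(n_tokens)],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
```

Running run_pipeline(build_commands()) prints the commands without executing them; pass dry_run=False to actually run each step.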