- Linux
- macOS
Clone the llama.cpp repository via Git:

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Navigate into the llama.cpp directory and compile it:

```bash
cd llama.cpp
make
```
- **Create the Model Storage Path**

  ```bash
  cd llama.cpp/models
  mkdir Minicpm
  ```
- **Download the MiniCPM PyTorch Model**

  Download all files of the MiniCPM PyTorch model and save them to the `llama.cpp/models/Minicpm` directory.
- **Modify the Conversion Script**

  Check the `_reverse_hf_permute` function in the `llama.cpp/convert-hf-to-gguf.py` file. If you find the following code:

  ```python
  def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
      if n_kv_head is not None and n_head != n_kv_head:
          n_head //= n_kv_head
  ```

  Replace it with the code below (a short sketch after this list illustrates what the permutation does):

  ```python
  @staticmethod
  def permute(weights: Tensor, n_head: int, n_head_kv: int | None):
      if n_head_kv is not None and n_head != n_head_kv:
          n_head = n_head_kv
      return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
              .swapaxes(1, 2)
              .reshape(weights.shape))

  def _reverse_hf_permute(self, weights: Tensor, n_head: int, n_kv_head: int | None = None) -> Tensor:
      if n_kv_head is not None and n_head != n_kv_head:
          n_head //= n_kv_head
  ```
- **Install Dependencies and Convert the Model**

  ```bash
  python3 -m pip install -r requirements.txt
  python3 convert-hf-to-gguf.py models/Minicpm/
  ```
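For intuition, here is a standalone numpy sketch of the `permute` helper introduced above (the real function operates on PyTorch tensors inside `convert-hf-to-gguf.py`). It shows that the reshape/swapaxes pair leaves the weight shape unchanged and only reorders rows, interleaving the two half-blocks of each attention head, which matches the function's role of reversing the permutation applied by the Hugging Face conversion:

```python
import numpy as np

# Standalone numpy version of the permute helper above (the real function
# operates on PyTorch tensors inside convert-hf-to-gguf.py).
def permute(weights, n_head, n_head_kv=None):
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))

# Toy projection matrix: one head with head_dim 6, so rows 0..5.
w = np.arange(6)[:, None] * np.ones((1, 4))  # shape (6, 4); row i holds the value i
out = permute(w, n_head=1)
print(out[:, 0])  # [0. 3. 1. 4. 2. 5.] -- the rows of the two half-blocks are interleaved
```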
After completing these steps, there will be a model file named `ggml-model-f16.gguf` in the `llama.cpp/models/Minicpm` directory.
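To sanity-check the conversion, you can inspect the file with the `gguf` Python package that `requirements.txt` pulls in; the reader API used below is an assumption and may differ between package versions:

```python
# Minimal sanity check of the converted file (assumes the `gguf` package;
# its reader API may differ across versions).
from gguf import GGUFReader

reader = GGUFReader("models/Minicpm/ggml-model-f16.gguf")
print(f"{len(reader.tensors)} tensors found")
for t in reader.tensors[:5]:
    # each entry carries the tensor name, shape, and quantization type
    print(t.name, t.shape, t.tensor_type)
```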
Quantize the converted model (skip this step if the downloaded model is already in quantized format):

```bash
./llama-quantize ./models/Minicpm/ggml-model-f16.gguf ./models/Minicpm/ggml-model-Q4_K_M.gguf Q4_K_M
```
If you cannot find `llama-quantize`, try recompiling:

```bash
cd llama.cpp
make llama-quantize
```
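Q4_K_M stores weights at roughly 4 to 5 bits each, so the quantized file should come out at around a third of the f16 size. A quick way to check, using the paths from the commands above:

```python
import os

# Compare the f16 and quantized model sizes (paths as used in the commands above).
f16 = os.path.getsize("models/Minicpm/ggml-model-f16.gguf")
q4 = os.path.getsize("models/Minicpm/ggml-model-Q4_K_M.gguf")
print(f"f16: {f16 / 2**30:.2f} GiB  Q4_K_M: {q4 / 2**30:.2f} GiB  ratio: {f16 / q4:.1f}x")
```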
Perform inference using the quantized model:

```bash
./llama-cli -m ./models/Minicpm/ggml-model-Q4_K_M.gguf -n 128 --prompt "<User>Do you know openbmb?<AI>"
```
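As an alternative to the CLI, the model can also be driven from Python through the third-party llama-cpp-python binding (a separate `pip install llama-cpp-python`, not produced by the build above); a minimal sketch:

```python
# Minimal sketch using the llama-cpp-python binding (installed separately
# via pip; not part of the make build above).
from llama_cpp import Llama

llm = Llama(model_path="./models/Minicpm/ggml-model-Q4_K_M.gguf")
out = llm("<User>Do you know openbmb?<AI>", max_tokens=128)
print(out["choices"][0]["text"])
```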