This document provides detailed instructions on how to compile various inference engines for efficient deployment of large language models (LLMs) on the Huawei Kunpeng 920 platform.
- Introduction
- Prerequisites
- ChatGLM.cpp Compilation
- Qwen.cpp Compilation
- llama.cpp Compilation
- Optimizing Compilation for Kunpeng 920
- Troubleshooting Common Issues
Compiling inference engines is a crucial step in deploying LLMs efficiently. Custom-built inference engines can take advantage of specific hardware features and optimizations, leading to better performance on platforms like the Huawei Kunpeng 920.
Before proceeding, ensure you have the following installed:
- GCC/G++ (version 7.5.0 or higher recommended)
- CMake (version 3.18 or higher)
- Make
- Git
- Python 3.8+ with development headers
You can install these on Ubuntu-based systems with:
sudo apt-get update
sudo apt-get install build-essential cmake git python3-dev
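You can verify that the installed versions meet the requirements listed above:
gcc --version
cmake --version
python3 --version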
If your distribution ships a CMake older than 3.18, install a newer prebuilt aarch64 release:
wget https://github.com/Kitware/CMake/releases/download/v3.28.1/cmake-3.28.1-linux-aarch64.sh
chmod +x cmake-3.28.1-linux-aarch64.sh
sudo mkdir -p /opt/cmake
sudo ./cmake-3.28.1-linux-aarch64.sh --prefix=/opt/cmake --skip-license
export PATH=/opt/cmake/bin:$PATH
echo "export PATH=/opt/cmake/bin:$PATH" >> ~/.bashrc
To compile ChatGLM.cpp:
git clone --recursive https://github.com/li-plus/chatglm.cpp.git
cd chatglm.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
If you encounter any issues with submodules, you can update them manually:
git submodule update --init --recursive
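Once the build finishes, a quick smoke test looks like the following. This is a sketch: it assumes you have already converted a ChatGLM model to GGML format with the project's conversion script, and chatglm-ggml.bin is a placeholder file name. Run it from the build directory:
./bin/main -m chatglm-ggml.bin -p "Hello"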
For Qwen.cpp:
git clone --recursive https://github.com/QwenLM/qwen.cpp.git
cd qwen.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
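On the Kunpeng 920's many-core CPUs, all three projects build considerably faster if you parallelize the build:
cmake --build . --config Release -j "$(nproc)"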
For llama.cpp:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
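As a quick smoke test, assuming you already have a quantized GGUF model on disk (the path below is a placeholder), run the main binary from the build directory. Depending on the llama.cpp version, it is named bin/main or bin/llama-cli:
./bin/main -m /path/to/model.gguf -p "Hello" -n 64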
To optimize for the Kunpeng 920's ARM architecture, you can add specific flags to the CMake command:
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-march=armv8.2-a+fp16" -DCMAKE_CXX_FLAGS="-march=armv8.2-a+fp16" ..
This enables Armv8.2-A features and hardware FP16 arithmetic, which can significantly improve performance on the Kunpeng 920.
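Before using these flags, you can confirm that the CPU reports half-precision support; on aarch64, the relevant feature flags in /proc/cpuinfo are fphp and asimdhp:
grep -m1 Features /proc/cpuinfo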
You can also enable specific optimizations in the CMake configuration. For example, for llama.cpp:
cmake -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DCMAKE_C_FLAGS="-march=armv8.2-a+fp16" -DCMAKE_CXX_FLAGS="-march=armv8.2-a+fp16" ..
This enables BLAS support using OpenBLAS, which can accelerate matrix operations.
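OpenBLAS must be installed before you configure the build, and afterwards you can check whether the binary was actually linked against it (the binary name may vary between versions):
sudo apt-get install libopenblas-dev
ldd bin/main | grep -i openblas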
- CMake version issues: If you encounter CMake version errors, make sure you've installed a recent release as described in the prerequisites.
- Missing dependencies: If the build fails due to missing dependencies, you may need to install additional libraries. For example:
sudo apt-get install libopenblas-dev
- Compilation errors: If you face compilation errors, try cleaning the build directory and recompiling:
rm -rf build
mkdir build && cd build
cmake ..
cmake --build . --config Release
- Performance issues: If the compiled engine isn't performing as expected, try different optimization flags or use profiling tools to identify bottlenecks; a sketch follows this list.
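For example, a minimal profiling run with Linux perf (on Ubuntu; the binary and model paths are placeholders) might look like:
sudo apt-get install linux-tools-common linux-tools-$(uname -r)
perf stat ./bin/main -m /path/to/model.gguf -p "Hello" -n 32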
Remember to run source ~/.bashrc after modifying your PATH or environment variables.
After successful compilation, you can find the executables in the build directory (typically under build/bin). Run them to start inference with your converted and quantized models.
Note: Always refer to the official documentation and README files of these projects for the most up-to-date compilation instructions, as they may change over time.