I was trying to install Llama.CPP with CUDA support on my system as an LLM inference server to run my multi-agent environment. I had already tried a few other options but for various reasons, they came up a cropper:
- Ollama: Easy to use but the server is constrained on the types of roles you can use. They only allow , and roles. However, fine-tuned models like NousResearch's Theta and Pro are fine-tuned specifically for function calling using a new role . Hence, a lot of wrangling and manipulating of the user prompts were required to get the right output. This also increased my token usage. I couldn't get a solution for this even on their Discord server.
On a side note, though, if the functionality of Ollama is enough for you, it is a brilliant inference server and I can't stop recommending it.
- llama-cpp-python. Pretty brilliant again, but there were some issues about it being slower than the bare-bones Llama.cpp. Since, I am GPU-poor and wanted to maximize my inference speed, I decided to install Llama.cpp on my Windows laptop.
Oh boy!
- Initially, tried building Llama.cpp using w64devkit and OpenBLAS for Windows. CPU version worked but not CUDA.
- Visual Studio would not detect CUDA while making the executable. I traversed multiple discussions that on NVidia groups and VS forums that were complaining of similar errors.
- Tried installing stand-alone versions of CMake and the Windows SDK.
- Even tried editing the MAKE file as shown here, but to no avail. Honestly, I am not a C++ guy so I had no idea what I was doing.
I finally found the key to my problem here . More specifically, in the screenshot below:
Basically, the only Community version of Visual Studio that was available for download from Microsoft was incompatible even with the latest version of cuda (As of writing this post, the latest version of Nvidia is CUDA 12.5). Hence, all my errors were fundamentally derived from there. I also saw a lot of questions on forums and issues on Github repos of how various pieces of libraries just weren't working together. Hence, I wrote down this post to explain in detail, all the steps I took to ensure a smooth installation and running of the Llama.CPP server on Windows with CUDA.
To be fair, the README file of Llama.cpp is pretty well written and the steps are easy to follow. The problems are with getting CUDA and the C++ Desktop environment of VS to talk to each other.
- Download and install CUDA from here: Cuda Toolkit 12.5 downloads . If you are worried about Pytorch compatibility, currently CUDA 12.1 is supported by Pytorch.
-
Download and install Visual C++ as follows:
- Download and install the Visual Studio 2109 software from here. Unless you have a Professional or an Enterprise license, Microsoft does not give you access to Visual Studio 2019 versions i.e. there is no official download of Visual Studio 2019 Community available.
- Run Visual Studio Installer from the Start Menu. This software is the gateway to download all the libraries that you need to work within Visual Studio
- Once, the application has opened, click on the Modify option:
- Select the Desktop Development with C+++
- Make sure the following components are selected on the right side of your window:
- Click on the "Install while downloading" link:
-
There are 4 files that will be present in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions (Replace "v12.5" in the path to your CUDA version). These four files are:
- CUDA 11.8.props
- CUDA 11.8.targets
- CUDA 11.8.xml
- Nvda.Build.CudaTasks.v11.8.dll
Copy and paste all these files into the relevant Visual Studio directory: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations
- Set the CMAKE_ARGS environment variable (Ensure your Windows account has administrative rights to perform these functions) as follows:
- Click on the Start icon on the bottom left and type: environment
- Click on "edit environment variables for your account
- In the system variables section in the pop up window, click on "New"
- Set the variable name as "CMAKE_ARGS" and the Variable value as "-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" as shown below and click "OK":
- Set the CUDA_PATH variable in a similar way:
- Set the LLAMA_CUDA variable:
- Ensure that the PATH variable for CUDA is set correctly. On installation of CUDA in step 1, the CUDA directory should have been set in PATH.
- Once all the variables are configured, restart Windows.
-
Clone the Llama.cpp repo. You will need Python (version 3.8+ just to be safe), pip and git installed.
- Run the following command in your command prompt:
git clone https://github.com/ggerganov/llama.cpp.git
- Navigate to the location where this folder "llama.cpp" is downloaded
cd llama.cpp
- Run the following command in your command prompt:
-
Build the executable for usage
- The Release version
cmake -B build -DLLAMA_CUDA=ON cmake --build --config Release -j 8
- The Debug version: For some reason, I was getting a few weird artifacts in the LLM response when I was using the Release version. I avoided these by switching to the Debug version of the build. If you face the same issues, you can re-perform step 9 and instead of step 10, you can build the executable as follows:
cmake -B build -DLLAMA_CUDA=ON cmake --build build -j 8
-
If you plan to deploy Llama.cpp as a server, you can build it in the following way:
- For the Release version:
cmake --build build --config Release -t llama-server
- For the Debug version:
cmake --build build --config Release -t llama-server
- Setup and run the server as per your build (Release or Debug)
- Release version
An example is as follows:
<path to server.exe within llama.cpp repo> -m <path to gguf model> -c <context length> --n-gpu-layers <no. of layers to be loaded onto the GPU> --host <ip address of host - typically 0.0.0.0 or the ip address itself> --port <port that you want the server to listen on>
"llama.cpp\build\bin\Release\server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080
- Debug version
The syntax is similar to the Release version. The only difference is the location of server.exe"llama.cpp\build\bin\Debug\llama-server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080`
- Release version
- You will first need to install the OpenAI library. This is because Llama.CPP uses the openAI API to inference local models:
pip install openai
- Create a python file e.g. test.py and enter the following:
An example of this would be:
import openai client = openai.OpenAI( base_url="<the ip address of the server. This should be the address that you entered in Step 12>:<port that you entered in step 12 e.g. 8080/v1", # "http://<Your api-server IP>:port" api_key = "sk-no-key-required" )
client = openai.OpenAI( base_url="http://192.168.0.1:8080/v1", api_key = "sk-no-key-required" )
- You can now use this "client" object to run your queries:
client.chat.completions.create( model="gpt-3.5-turbo", # <this parameter needs to be provided but is otherwise, irrelevant since the model loaded in the server is the one that will be used for inference> messages=messages, stream=True )