LlamaCpp Install Procedure in Windows

Introduction

I was trying to install Llama.CPP with CUDA support on my system as an LLM inference server to run my multi-agent environment. I had already tried a few other options but for various reasons, they came up a cropper:

Ollama: Easy to use but the server is constrained on the types of roles you can use. They only allow , and roles. However, fine-tuned models like NousResearch's Theta and Pro are fine-tuned specifically for function calling using a new role . Hence, a lot of wrangling and manipulating of the user prompts were required to get the right output. This also increased my token usage. I couldn't get a solution for this even on their Discord server. On a side note, though, if the functionality of Ollama is enough for you, it is a brilliant inference server and I can't stop recommending it.
llama-cpp-python. Pretty brilliant again, but there were some issues about it being slower than the bare-bones Llama.cpp. Since, I am GPU-poor and wanted to maximize my inference speed, I decided to install Llama.cpp on my Windows laptop.

Oh boy!

Issues and attempts:

Initially, tried building Llama.cpp using w64devkit and OpenBLAS for Windows. CPU version worked but not CUDA.
Visual Studio would not detect CUDA while making the executable. I traversed multiple discussions that on NVidia groups and VS forums that were complaining of similar errors.
Tried installing stand-alone versions of CMake and the Windows SDK.
Even tried editing the MAKE file as shown here, but to no avail. Honestly, I am not a C++ guy so I had no idea what I was doing.

Solution:

I finally found the key to my problem here . More specifically, in the screenshot below:

Basically, the only Community version of Visual Studio that was available for download from Microsoft was incompatible even with the latest version of cuda (As of writing this post, the latest version of Nvidia is CUDA 12.5). Hence, all my errors were fundamentally derived from there. I also saw a lot of questions on forums and issues on Github repos of how various pieces of libraries just weren't working together. Hence, I wrote down this post to explain in detail, all the steps I took to ensure a smooth installation and running of the Llama.CPP server on Windows with CUDA.

Steps (All the way from the basics):

To be fair, the README file of Llama.cpp is pretty well written and the steps are easy to follow. The problems are with getting CUDA and the C++ Desktop environment of VS to talk to each other.

CUDA:

Download and install CUDA from here: Cuda Toolkit 12.5 downloads . If you are worried about Pytorch compatibility, currently CUDA 12.1 is supported by Pytorch.

VISUAL STUDIO 2019:

Download and install Visual C++ as follows:
- Download and install the Visual Studio 2109 software from here. Unless you have a Professional or an Enterprise license, Microsoft does not give you access to Visual Studio 2019 versions i.e. there is no official download of Visual Studio 2019 Community available.
- Run Visual Studio Installer from the Start Menu. This software is the gateway to download all the libraries that you need to work within Visual Studio
- Once, the application has opened, click on the Modify option:
- Select the Desktop Development with C+++
- Make sure the following components are selected on the right side of your window:
- Click on the "Install while downloading" link:
There are 4 files that will be present in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions (Replace "v12.5" in the path to your CUDA version). These four files are:
1. CUDA 11.8.props
2. CUDA 11.8.targets
3. CUDA 11.8.xml
4. Nvda.Build.CudaTasks.v11.8.dll
Copy and paste all these files into the relevant Visual Studio directory: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations

ENVIRONMENT VARIABLES IN WINDOWS:

Set the CMAKE_ARGS environment variable (Ensure your Windows account has administrative rights to perform these functions) as follows:
- Click on the Start icon on the bottom left and type: environment
- Click on "edit environment variables for your account
- In the system variables section in the pop up window, click on "New"
- Set the variable name as "CMAKE_ARGS" and the Variable value as "-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" as shown below and click "OK":
Set the CUDA_PATH variable in a similar way:
- Similarly, create a second system variable. Set the variable name as CUDA_PATH. The Variable value should be the path to your CUDA library. Examples as below:
Set the LLAMA_CUDA variable:
- Create a third system variable. Set the variable name as LLAMA_CUDA and its value to "on" as shown below and click "OK":
Ensure that the PATH variable for CUDA is set correctly. On installation of CUDA in step 1, the CUDA directory should have been set in PATH.
- Go to the environment variables as explained in step 3.
- Scroll through the system variables until you see a system variable named PATH or path
- Select the variable and click on "Edit".
- Ensure the CUDA path is configured in the list of entries provided:
Once all the variables are configured, restart Windows.

INSTALLATION OF LLAMA-CPP

Clone the Llama.cpp repo. You will need Python (version 3.8+ just to be safe), pip and git installed.
- Run the following command in your command prompt:
  git clone https://github.com/ggerganov/llama.cpp.git
- Navigate to the location where this folder "llama.cpp" is downloaded
  cd llama.cpp
Build the executable for usage
1. The Release version
```
cmake -B build -DLLAMA_CUDA=ON
cmake --build  --config Release -j 8
```
1. The Debug version: For some reason, I was getting a few weird artifacts in the LLM response when I was using the Release version. I avoided these by switching to the Debug version of the build. If you face the same issues, you can re-perform step 9 and instead of step 10, you can build the executable as follows:
```
cmake -B build -DLLAMA_CUDA=ON
cmake --build build -j 8
```
NOTE: The "-j 8" is optional. 'j' defines the number of workers that work in parallel to build the executable. The more the faster, but it is still optional
If you plan to deploy Llama.cpp as a server, you can build it in the following way:
1. For the Release version:
```
cmake --build build --config Release -t llama-server
```
1. For the Debug version:
```
cmake --build build --config Release -t llama-server
```

RUN THE SERVER

Setup and run the server as per your build (Release or Debug)

Release version

<path to server.exe within llama.cpp repo> -m <path to gguf model> -c <context length> --n-gpu-layers <no. of layers to be loaded onto the GPU> --host <ip address of host - typically 0.0.0.0 or the ip address itself> --port <port that you want the server to listen on>

An example is as follows:

"llama.cpp\build\bin\Release\server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080

Debug version
The syntax is similar to the Release version. The only difference is the location of server.exe

"llama.cpp\build\bin\Debug\llama-server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080`

RUNNING INFERENCE ON THE MODEL HOSTED ON LLAMA.CPP

You will first need to install the OpenAI library. This is because Llama.CPP uses the openAI API to inference local models:
```
pip install openai
```

Create a python file e.g. test.py and enter the following:

import openai

client = openai.OpenAI(
base_url="<the ip address of the server. This should be the address that you entered in Step 12>:<port that you entered in step 12 e.g. 8080/v1", # "http://<Your api-server IP>:port"
api_key = "sk-no-key-required"
)

An example of this would be:

client = openai.OpenAI(
base_url="http://192.168.0.1:8080/v1",
api_key = "sk-no-key-required"
)

You can now use this "client" object to run your queries:

client.chat.completions.create(
    model="gpt-3.5-turbo", # <this parameter needs to be provided but is otherwise, irrelevant since the model loaded in the server is the one that will be used for inference>
    messages=messages,
    stream=True
)

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
images		images
.gitattributes		.gitattributes
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LlamaCpp Install Procedure in Windows

Introduction

Issues and attempts:

Solution:

Steps (All the way from the basics):

CUDA:

VISUAL STUDIO 2019:

ENVIRONMENT VARIABLES IN WINDOWS:

INSTALLATION OF LLAMA-CPP

NOTE: The "-j 8" is optional. 'j' defines the number of workers that work in parallel to build the executable. The more the faster, but it is still optional

RUN THE SERVER

RUNNING INFERENCE ON THE MODEL HOSTED ON LLAMA.CPP

About

Releases

Packages

SwamiKannan/LlamaCpp-Install-Procedure-in-Windows-and-CUDA

Folders and files

Latest commit

History

Repository files navigation

LlamaCpp Install Procedure in Windows

Introduction

Issues and attempts:

Solution:

Steps (All the way from the basics):

CUDA:

VISUAL STUDIO 2019:

ENVIRONMENT VARIABLES IN WINDOWS:

INSTALLATION OF LLAMA-CPP

NOTE: The "-j 8" is optional. 'j' defines the number of workers that work in parallel to build the executable. The more the faster, but it is still optional

RUN THE SERVER

RUNNING INFERENCE ON THE MODEL HOSTED ON LLAMA.CPP

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages