LlamaCpp Install Procedure in Windows

Introduction

I was trying to install Llama.cpp with CUDA support on my system as an LLM inference server to run my multi-agent environment. I had already tried a few other options, but for various reasons they came a cropper:

  1. Ollama: Easy to use, but the server is constrained in the types of roles you can use. They only allow the system, user and assistant roles. However, fine-tuned models like NousResearch's Hermes 2 Theta and Hermes 2 Pro are fine-tuned specifically for function calling using a new role, tool. Hence, a lot of wrangling and manipulating of the user prompts was required to get the right output. This also increased my token usage. I couldn't get a solution for this even on their Discord server. On a side note, though, if the functionality of Ollama is enough for you, it is a brilliant inference server and I can't stop recommending it.
  2. llama-cpp-python: Pretty brilliant again, but there were some reports of it being slower than the bare-bones Llama.cpp. Since I am GPU-poor and wanted to maximize my inference speed, I decided to install Llama.cpp directly on my Windows laptop.

    Oh boy!

Issues and attempts:

  • Initially, I tried building Llama.cpp using w64devkit and OpenBLAS for Windows. The CPU version worked, but not CUDA.
  • Visual Studio would not detect CUDA while making the executable. I traversed multiple discussions on NVIDIA groups and VS forums that were complaining of similar errors.
  • Tried installing stand-alone versions of CMake and the Windows SDK.
  • Even tried editing the Makefile as shown here, but to no avail. Honestly, I am not a C++ guy, so I had no idea what I was doing.

Solution:

I finally found the key to my problem here. More specifically, in the screenshot below:

Basically, the only Community edition of Visual Studio available for download from Microsoft was incompatible even with the latest version of CUDA (as of writing this post, the latest CUDA release is 12.5). Hence, all my errors fundamentally derived from there. I also saw a lot of questions on forums, and issues on GitHub repos, about how various pieces of libraries just weren't working together. Hence, I wrote this post to explain in detail all the steps I took to ensure a smooth installation and running of the Llama.cpp server on Windows with CUDA.

Steps (All the way from the basics):

To be fair, the README file of Llama.cpp is pretty well written and the steps are easy to follow. The problems are with getting CUDA and the C++ Desktop environment of VS to talk to each other.

CUDA:

  1. Download and install CUDA from here: CUDA Toolkit 12.5 downloads. If you are worried about PyTorch compatibility, CUDA 12.1 is currently supported by PyTorch.
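Once the installer finishes, a quick sanity check from a new Command Prompt confirms the toolkit is visible on your PATH (nvcc ships with the CUDA Toolkit, nvidia-smi with the NVIDIA driver; the version strings should match whatever you installed):

    nvcc --version
    nvidia-smi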

VISUAL STUDIO 2019:

  1. Download and install Visual C++ as follows:

    • Download and install the Visual Studio 2019 software from here. Unless you have a Professional or an Enterprise license, Microsoft does not give you access to Visual Studio 2019 releases, i.e. there is no official download of Visual Studio 2019 Community available.
    • Run Visual Studio Installer from the Start Menu. This software is the gateway to download all the libraries that you need to work within Visual Studio.
      screenshot of the Visual Studio Installer in the Start Menu
    • Once the application has opened, click on the Modify option:
      screenshot of the Visual Studio Installer launch screen
    • Select the "Desktop development with C++" workload:
      screenshot of the Visual Studio Installer launch screen
    • Make sure the required components are selected on the right side of your window.
    • Click on the "Install while downloading" option:
      screenshot of the Visual Studio Installer launch screen
  2. There are four files present in C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions (replace "v12.5" in the path with your CUDA version). The version number in the file names will match your installed CUDA version; for CUDA 12.5 they are:

    1. CUDA 12.5.props
    2. CUDA 12.5.targets
    3. CUDA 12.5.xml
    4. Nvda.Build.CudaTasks.v12.5.dll

    Copy and paste all these files into the relevant Visual Studio directory: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations
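    If you prefer the command line, the same copy can be done from an elevated Command Prompt. This is just a sketch assuming the default install locations used above (CUDA 12.5 and Visual Studio 2019 Community); adjust both paths to match your setup:

    copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions\*" "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations"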

ENVIRONMENT VARIABLES IN WINDOWS:

  1. Set the CMAKE_ARGS environment variable (ensure your Windows account has administrative rights to make these changes) as follows:
    • Click on the Start icon on the bottom left and type: environment
    • Click on "Edit environment variables for your account"
      screenshot of the Start Menu search
    • In the System variables section of the pop-up window, click on "New"
    • Set the variable name as "CMAKE_ARGS" and the variable value as "-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" as shown below and click "OK":
  2. Set the CUDA_PATH variable in a similar way:
    • Create a second system variable. Set the variable name as CUDA_PATH. The variable value should be the path to your CUDA installation, e.g. C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5
  3. Set the LLAMA_CUDA variable:
    • Create a third system variable. Set the variable name as LLAMA_CUDA and its value to "on" as shown below and click "OK":
  4. Ensure that the PATH variable for CUDA is set correctly. On installation of CUDA in the CUDA section above, the CUDA directory should have been added to PATH.
    • Go to the environment variables window as explained in step 1 above.
    • Scroll through the system variables until you see a variable named PATH or Path
    • Select the variable and click on "Edit".
    • Ensure the CUDA path is present in the list of entries:
  5. Once all the variables are configured, restart Windows. (A command-line alternative to the GUI steps above is sketched below.)
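As an alternative to the GUI, the three variables can also be created from an elevated Command Prompt with setx (a sketch assuming the CUDA 12.5 path used above; the /M switch writes them as system-wide variables, and a restart is still needed for every application to pick them up):

    setx CMAKE_ARGS "-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" /M
    setx CUDA_PATH "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" /M
    setx LLAMA_CUDA "on" /M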

INSTALLATION OF LLAMA-CPP

  1. Clone the Llama.cpp repo. You will need Python (version 3.8+ just to be safe), pip, git and CMake installed.

    • Run the following command in your command prompt:
      git clone https://github.com/ggerganov/llama.cpp.git
    • Navigate into the cloned "llama.cpp" folder:
      cd llama.cpp
  2. Build the executable for usage

    1. The Release version:
    cmake -B build -DLLAMA_CUDA=ON
    cmake --build build --config Release -j 8
    
    2. The Debug version: For some reason, I was getting a few weird artifacts in the LLM response when I was using the Release version. I avoided these by switching to the Debug version of the build. If you face the same issue, keep the same configure command but drop the Release configuration from the build command, as follows:
    cmake -B build -DLLAMA_CUDA=ON
    cmake --build build -j 8
    
    NOTE: The "-j 8" is optional. 'j' defines the number of workers that build the executable in parallel; more workers means a faster build, but the flag can be left out.
  3. If you plan to deploy Llama.cpp as a server, you can build just the server target in the following way:

    1. For the Release version:
    cmake --build build --config Release -t llama-server
    
    2. For the Debug version:
    cmake --build build --config Debug -t llama-server
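After a successful build, the executables land under build\bin\<configuration> inside the llama.cpp folder (the same paths used in the run examples below). A quick way to check which binaries you now have:

    dir build\bin\Release
    dir build\bin\Debug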
    

RUN THE SERVER

  1. Set up and run the server as per your build (Release or Debug)
    1. Release version
      <path to llama-server.exe within the llama.cpp repo> -m <path to gguf model> -c <context length> --n-gpu-layers <no. of layers to be loaded onto the GPU> --host <IP address of the host - typically 0.0.0.0 or the IP address itself> --port <port that you want the server to listen on>
      
      An example is as follows:
      "llama.cpp\build\bin\Release\llama-server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080
      
    2. Debug version
      The syntax is identical to the Release version. The only difference is the location of llama-server.exe
      "llama.cpp\build\bin\Debug\llama-server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080
      

RUNNING INFERENCE ON THE MODEL HOSTED ON LLAMA.CPP

  1. You will first need to install the OpenAI library. This is because the Llama.cpp server exposes an OpenAI-compatible API for running inference on local models:
    pip install openai
    
  2. Create a Python file e.g. test.py and enter the following:
    import openai
    
    client = openai.OpenAI(
        base_url="http://<IP address of the server - the --host value used in the previous section>:<port used in the previous section, e.g. 8080>/v1",
        api_key="sk-no-key-required"
    )
    
    An example of this would be:
    client = openai.OpenAI(
        base_url="http://192.168.0.1:8080/v1",
        api_key="sk-no-key-required"
    )
    
  3. You can now use this "client" object to run your queries. With stream=True, the response comes back as a sequence of chunks that you can print as they arrive:
    messages = [{"role": "user", "content": "Hello! Who are you?"}]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", # this parameter needs to be provided but is otherwise irrelevant, since the model loaded in the server is the one used for inference
        messages=messages,
        stream=True
    )
    for chunk in response:
        print(chunk.choices[0].delta.content or "", end="")
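With the server from the previous section still running, the script runs like any other Python file:

    python test.py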
    
