Welcome to llama-cpp-delphi, the Delphi bindings for llama.cpp! This project allows you to integrate the power of Llama-based Large Language Models (LLMs) into your Delphi applications, enabling efficient and versatile local inference.
- Delphi Integration: Harness Llama models directly in your Delphi projects.
- Local Inference: No external servers or APIs required—your data stays local.
- Cross-Platform Support: Compatible with Windows, Linux, and Mac.
- 🖥️ Mac Silicon: GPU (Metal) and CPU inference supported.
- 💻 Windows: CPU inference supported, with options for CUDA, Vulkan, Kompute, and OpenBLAS.
- 🌏 Linux: CPU inference supported, with options for CUDA, Vulkan, Kompute, and OpenBLAS.
- 🚀 Android and iOS support coming soon!
- Pre-Built Libraries: Simplified setup with pre-compiled libraries.
- Customizable Sampling: Fine-tune your AI’s behavior with easy-to-configure samplers.
- Delphi IDE installed.
- Git installed (required for cloning model repositories).
- A basic understanding of Delphi development.
- Clone the llama-cpp-delphi repository:
git clone https://github.com/Embarcadero/llama-cpp-delphi.git
- Open the project in the Delphi IDE.
- Build the project for your desired platform(s):
- Windows
- Linux
- Mac Silicon
The necessary llama.cpp libraries are distributed as part of the releases of this repository. You can find them under the "Release" section in the repository. Here's an explanation of the libraries available:
CPU-only builds for Windows, Linux, and macOS. Inference runs slowly on CPU, so consider a GPU-based library where one is available for your hardware.
Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). Using BLAS doesn't affect the generation performance. There are several different BLAS implementations available for build and use:
- Accelerate Framework: Available on macOS, enabled by default.
- OpenBLAS: Provides CPU-based BLAS acceleration. Ensure OpenBLAS is installed on your machine.
- BLIS: A high-performance portable BLAS framework. Learn more.
- Intel oneMKL: Optimized for Intel processors, supporting advanced instruction sets like avx_vnni.
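To make the batch-size remark above concrete, here is a small arithmetic sketch. It assumes the prompt is split into fixed-size chunks of the configured batch size, and it treats the threshold of 32 quoted above as a simple cutoff. The helper names `prompt_batches` and `blas_helps` are hypothetical and are not part of llama.cpp or llama-cpp-delphi.

```shell
#!/bin/sh
# Number of batches needed to process a prompt of $1 tokens with batch size $2.
# Ceiling division; assumes the prompt is split into fixed-size chunks.
prompt_batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Succeeds when batch size $1 is above the threshold of 32 quoted in the text.
blas_helps() {
    [ "$1" -gt 32 ]
}

# A 2000-token prompt at the default batch size of 512.
prompt_batches 2000 512    # prints 4
```

With the default batch size of 512, prompt processing is already above the threshold, so a BLAS build may help there, while token-by-token generation is unaffected.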
SYCL is a higher-level programming model to improve programming productivity on various hardware accelerators.
The SYCL backend of llama.cpp supports Intel GPUs (Data Center Max series, Flex series, Arc series, and integrated GPUs).
For detailed info, please refer to [llama.cpp for SYCL](https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md).
On macOS, Metal is enabled by default, so computation runs on the GPU.
When built with Metal support, you can explicitly disable GPU inference by setting the --n-gpu-layers 0 option in the Llama settings.
Provides GPU acceleration using an NVIDIA GPU. Refer to the CUDA guide for Fedora setup.
Vulkan provides GPU acceleration through a modern, low-overhead API. To use Vulkan:
- Ensure Vulkan is installed and supported by your GPU drivers.
Learn more at the official Vulkan site.
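One way to check the Vulkan requirement above is the `vulkaninfo` utility, which ships with the Vulkan SDK and with the `vulkan-tools` package on many Linux distributions. The sketch below assumes that tool is the one you have available. The function names `have_tool` and `vulkan_status` are hypothetical helpers, not part of this project.

```shell
#!/bin/sh
# Succeeds when the named command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print a one-line Vulkan status and return non-zero if no usable driver is found.
vulkan_status() {
    if ! have_tool vulkaninfo; then
        echo "vulkaninfo not found"
        return 1
    fi
    # vulkaninfo exits with a non-zero status when it cannot create a Vulkan instance.
    if vulkaninfo --summary >/dev/null 2>&1; then
        echo "Vulkan driver detected"
    else
        echo "vulkaninfo failed"
        return 1
    fi
}

vulkan_status || true
```

If the script reports that `vulkaninfo` is missing or failing, install or update the GPU driver and Vulkan runtime before trying the Vulkan library.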
Kompute offers efficient compute operations for GPU workloads. It's designed for AI inference tasks and works seamlessly with Vulkan.
Provides NPU acceleration using the AI cores of Ascend NPUs. Learn more about CANN.
SYCL enables GPU acceleration on Intel GPUs. Refer to the SYCL documentation for setup details.
Supports GPU acceleration on AMD GPUs compatible with HIP.
Provides GPU acceleration using the MUSA cores of Moore Threads MTT GPUs.
- Llama: Delphi-friendly IDE component.
- Explore the samples directory for available examples, like SimpleChatWithDownload.
- Follow the README provided in each sample folder for detailed instructions.
You can use any model compatible with llama.cpp (e.g., GGUF format). Popular options include:
- Llama-2: A robust and general-purpose model.
- Llama-3: A lightweight alternative with excellent performance.
- Mistral: A compact and efficient model.
- DeepSeek: An innovative model designed for exploratory tasks.
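Before loading a downloaded model, it can help to confirm that the file really is GGUF. GGUF files begin with the four ASCII bytes `GGUF`, so a quick check is possible from the shell. This is a minimal sketch: `is_gguf` is a hypothetical helper name, and it only inspects the magic bytes, not the rest of the file.

```shell
#!/bin/sh
# Succeeds when the file at $1 starts with the 4-byte GGUF magic ("GGUF").
is_gguf() {
    [ -f "$1" ] || return 1
    # Read the first four bytes; strip NULs so the shell can hold the value.
    magic=$(head -c 4 "$1" | tr -d '\0')
    [ "$magic" = "GGUF" ]
}
```

For example, `is_gguf path/to/model.gguf && echo "looks like GGUF"` prints the message only for files with the expected header; the path here is a placeholder.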
- Mac Silicon:
- GPU inference (via Metal) is recommended for optimal performance.
- CPU inference is available but slower.
- Windows:
- CPU inference supported, with additional support for CUDA, Vulkan, Kompute, HIP, and OpenBLAS.
- Linux:
- CPU inference supported, with additional support for CUDA, Vulkan, HIP, and MUSA.
We welcome contributions to improve llama-cpp-delphi! Feel free to:
- Report issues.
- Submit pull requests.
- Suggest enhancements.
This project is licensed under the MIT License. See the LICENSE file for details.
Get started with llama-cpp-delphi and bring advanced AI capabilities to your Delphi projects. If you encounter any issues or have suggestions, let us know—we’re here to help! Happy coding! 🎉