Showing 1 changed file with 6 additions and 50 deletions.
@@ -1,54 +1,10 @@
-# docker-llamacpp
+# docker-cudaml
 
-Repository which creates a llama.cpp server in a docker container, for amd64 and arm64,
-the latter of which is missing from the "official" repository.
+Repository which has some base images for running CUDA and cuDNN on Intel and ARM architectures.
 
-## Usage
+## CUDA Images
 
-If you want to use an NVIDIA GPU, then install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) first.
+If you want to use an NVIDIA GPU, then install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) first. You can use the following two images as the basis for your own images:
 
-You should put your `.gguf` model files in a directory called `/data`. Then use the following command
-to start the Llama server:
-
-```bash
-docker run \
-  --runtime nvidia --gpus all \
-  -v /data:/models -p 8080:8080 \
-  ghcr.io/mutablelogic/llamacpp-linux-arm64:0.0.3 \
-  --host 0.0.0.0 \
-  --model /models/mistral-7b-v0.1.Q4_K_M.gguf -ngl 32 --ctx-size 4096 --temp 0.7 --repeat_penalty 1.1 \
-  --in-prefix "<|im_start|>" --in-suffix "<|im_end|>"
-```
-
-You can then access the Llama server on port 8080.
-
-## Building
-
-To build either the llama.cpp library or the onnxruntime library:
-
-```bash
-CUDA_HOME=/usr/local/cuda make llamacpp onnxruntime
-```
-
-You can omit the CUDA_HOME environment variable if you don't want to build with CUDA support.
-The following will build a docker image and push to the repository:
-
-```bash
-git clone git@github.com:mutablelogic/docker-llamacpp.git
-cd docker-llamacpp
-make docker && make docker-push
-```
-
-Set the environment variable DOCKER_REGISTRY to the name of the registry to push to, e.g.:
-
-```bash
-git clone git@github.com:mutablelogic/docker-llamacpp.git
-cd docker-llamacpp
-DOCKER_REGISTRY=docker.io/user make docker && make docker-push
-```
-
-## Status
-
-Requires the ability to update the llama.cpp submodule to the master branch.
-Currently the github action uses a self-hosted runner to build the arm64 image. The runner
-seems to need about 12GB of memory to build the image.
+* `ghcr.io/mutablelogic/cuda-dev:1.0.2` - This image is based on Ubuntu 22.04 and includes the 12.6 CUDA toolkit and compiler build tools
+* `ghcr.io/mutablelogic/cuda-rt:1.0.2` - This image is based on Ubuntu 22.04 and includes the 12.6 CUDA runtime libraries.
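
For reference, a minimal multi-stage Dockerfile built on these images might look like the sketch below. This is a hypothetical example: the `hello.cu` source file, the output paths, and the assumption that the dev image puts `nvcc` on the `PATH` are not taken from the repository.

```dockerfile
# Build stage: compile a CUDA program using the toolkit in the dev image
# (assumes nvcc is available on the PATH in this image).
FROM ghcr.io/mutablelogic/cuda-dev:1.0.2 AS build
WORKDIR /src
COPY hello.cu .
RUN nvcc -O2 -o /src/hello hello.cu

# Runtime stage: ship only the binary on top of the CUDA runtime libraries,
# keeping the final image smaller than the dev image.
FROM ghcr.io/mutablelogic/cuda-rt:1.0.2
COPY --from=build /src/hello /usr/local/bin/hello
ENTRYPOINT ["/usr/local/bin/hello"]
```

With the NVIDIA Container Toolkit installed on the host, such an image would be built with `docker build -t hello-cuda .` and run with `docker run --runtime nvidia --gpus all hello-cuda`.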