An offline, containerized LLM interface. Plug in any llama.cpp-supported model.
- Docker Desktop (https://www.docker.com/products/docker-desktop/)
- Docker configured to share sufficient RAM for your selected model (10 GB for the stock build)
- Docker configured to share `THREAD_LIMIT` CPU cores (5 used on the test machine); see the check below
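To confirm what the Docker engine can actually use, you can inspect its reported limits (a quick sanity check; the exact field names may vary slightly between Docker versions):

```sh
# Print the CPU count and memory currently available to the Docker engine
docker info | grep -E 'CPUs|Total Memory'
```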
Download and build the containers with:

```sh
docker compose build
```
This could take a long while to complete depending on your selected model and internet connection (stock build downloads ~5GB). Maybe turn off your monitor and go for a walk.
Once the build is complete, start the containers with:

```sh
docker compose up
```
Now, open your browser and point it to: http://localhost:8080
You can also directly query the backend with: http://localhost:8088/prompt?p=...
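For example, with curl (a sketch; the prompt text is arbitrary, and `--get` plus `--data-urlencode` take care of URL-encoding it):

```sh
# Query the backend with a URL-encoded prompt
curl --get "http://localhost:8088/prompt" --data-urlencode "p=Tell me a joke about containers"
```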
The `/.env` file contains configuration options.
The build comes with a stock 7B LLM, but you can plug in any model supported by llama.cpp (https://github.com/ggerganov/llama.cpp).
`MODEL_HOST_URL` is the URL curl uses to download the model you select.
By default, the configuration uses 5 threads. The number of threads and your CPU performance will affect how long results take to generate. `THREAD_LIMIT` should be set to match the number of CPU cores shared with Docker.
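A minimal sketch of the relevant `/.env` entries (both variable names come from this README; the URL and values below are placeholders to adapt to your setup):

```
# URL curl downloads the selected model from (placeholder; point at any llama.cpp-supported model file)
MODEL_HOST_URL=https://example.com/models/some-model.q4_0.gguf

# Number of CPU threads to use; match the number of cores shared with Docker
THREAD_LIMIT=5
```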
This software is being made freely available under the MIT license and I take no responsibility for your use of it. Stay safe and do the right thing.