This is similar to the CUDA-based rLLM but built on top of llama.cpp.
If you're not using the supplied docker container, follow the build setup instructions.
To compile and start aicirt, followed by the rllm server, run:
```bash
./server.sh phi2
```
Run `./server.sh --help` for more options.
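Once the server is up, you can send it a quick test request. The port and endpoint below are assumptions (an OpenAI-style completions API on a local port); adjust them to whatever `server.sh` prints at startup:

```bash
# Hypothetical smoke test: assumes an OpenAI-compatible completions endpoint
# on 127.0.0.1:4242 -- replace host, port, and path with what the server reports.
curl -s http://127.0.0.1:4242/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "phi2", "prompt": "Hello", "max_tokens": 16}'
```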
You can also try passing `--cuda` before `phi2`, which will enable cuBLAS in llama.cpp. Note that this is not the same as rllm-cuda, which may still give you better performance for batched inference.
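For example, to build llama.cpp with cuBLAS support and then serve the phi2 model:

```bash
# The --cuda flag goes before the model name, as noted above.
./server.sh --cuda phi2
```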