
fix: revert llama cpp python server to 0.2.79 to enable gpu #44

Merged
merged 1 commit into containers:main from revertLlamaCpp on Aug 12, 2024

Conversation

lstocchi
Contributor

@lstocchi lstocchi commented Aug 8, 2024

What does this PR do?

It just reverts the llama-cpp-python server to 0.2.79, because that is the last version that actually works fine with Vulkan.
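For context, the change boils down to pinning the server package back to the older release. A minimal sketch of that pin, assuming the playground image installs the server via pip (the actual Containerfile/requirements layout in the recipes repo may differ, and any Vulkan build flags the image already sets stay untouched):

pip install "llama-cpp-python[server]==0.2.79"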

Screenshot / video of UI

N/A

What issues does this PR fix or reference?

It resolves #40.

How to test this PR?

  1. Run the latest version of the Vulkan image ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e and verify that it only uses the CPU; GPU detection is completely skipped.
    You can use this command (update the model path for your setup):
podman run --device /dev/dri --mount type=bind,src=/Users/luca/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M/,target=/models/ -e MODEL_PATH=/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -e GPU_LAYERS=-1 ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e

In the logs you should only see:

...
llm_load_tensors:    CPU buffer size = 4165.37 MiB
...
  2. Build a new image using llama_cpp 0.2.79 and run it. Now you should see logs showing that the GPU is actually being used:
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =  0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:    CPU buffer size =  70.31 MiB
llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
.................................................................................................

2-b. If you do not want to build your own image, you can use the images below to test the different versions of llama_cpp (an example invocation follows the list):
quay.io/lstocchi/vulkan:v4_279 -> llama_cpp 0.2.79
quay.io/lstocchi/vulkan:v4_280 -> llama_cpp 0.2.80
quay.io/lstocchi/vulkan:v4_284 -> llama_cpp 0.2.84
ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e -> llama_cpp 0.2.85
quay.io/lstocchi/vulkan:v4_287 -> llama_cpp 0.2.87
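For example, to check the 0.2.79 behaviour without building anything, you can reuse the same podman invocation from step 1 and only swap the image (again, update the model path for your machine):

podman run --device /dev/dri --mount type=bind,src=/Users/luca/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M/,target=/models/ -e MODEL_PATH=/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -e GPU_LAYERS=-1 quay.io/lstocchi/vulkan:v4_279

With :v4_279 you should get the Vulkan/offload lines from step 2, while the 0.2.80+ images should only show the CPU buffer line from step 1.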

@axel7083
Contributor

axel7083 commented Aug 8, 2024

Is there an upstream issue to link?

@lstocchi
Contributor Author

lstocchi commented Aug 8, 2024

I was just opening it -> containers/ai-lab-recipes#742

@lstocchi lstocchi merged commit 3ab12d0 into containers:main Aug 12, 2024
5 checks passed
@lstocchi lstocchi deleted the revertLlamaCpp branch August 12, 2024 10:04
Development

Successfully merging this pull request may close these issues.

Gpu not working with llama cpp python server > 0.2.79