
fix: revert llama cpp python server to 0.2.79 to enable gpu #44

Merged
merged 1 commit into containers:main from revertLlamaCpp on Aug 12, 2024

Conversation

lstocchi
Contributor

@lstocchi lstocchi commented Aug 8, 2024

What does this PR do?

It just reverts the llama-cpp-python server to 0.2.79, because that is the last version that actually works fine with Vulkan.
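For context, the change boils down to pinning the server package back to the older release. A minimal sketch of that pin, assuming the playground image installs the server via pip (the actual Containerfile/requirements layout in the recipes repo may differ, and any Vulkan build flags the image already sets stay untouched):

pip install "llama-cpp-python[server]==0.2.79"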

Screenshot / video of UI

N/A

What issues does this PR fix or reference?

It resolves #40.

How to test this PR?

  1. Run the latest version of the Vulkan image ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e and verify that it only uses the CPU; GPU detection is completely skipped.
    You can use this command (update the model path for your setup):
podman run --device /dev/dri --mount type=bind,src=/Users/luca/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M/,target=/models/ -e MODEL_PATH=/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -e GPU_LAYERS=-1 ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e

In the logs you should only see:

...
llm_load_tensors:    CPU buffer size = 4165.37 MiB
...
  2. Build a new image using llama_cpp 0.2.79 and run it. Now you should see logs showing that the GPU is actually being used:
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =  0.30 MiB
warning: failed to mlock 73732096-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:    CPU buffer size =  70.31 MiB
llm_load_tensors:  Vulkan0 buffer size = 4095.05 MiB
.................................................................................................

2-b. If you do not want to build your own image, you can use the images below to test the different versions of llama_cpp (an example invocation follows the list):
quay.io/lstocchi/vulkan:v4_279 -> llama_cpp 0.2.79
quay.io/lstocchi/vulkan:v4_280 -> llama_cpp 0.2.80
quay.io/lstocchi/vulkan:v4_284 -> llama_cpp 0.2.84
ghcr.io/containers/podman-desktop-extension-ai-lab-playground-images/ai-lab-playground-chat-vulkan:62b6f628ed77cf3f1518c32746e2e89d27072f0e -> llama_cpp 0.2.85
quay.io/lstocchi/vulkan:v4_287 -> llama_cpp 0.2.87
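For example, to check the 0.2.79 behaviour without building anything, you can reuse the same podman invocation from step 1 and only swap the image (again, update the model path for your machine):

podman run --device /dev/dri --mount type=bind,src=/Users/luca/.local/share/containers/podman-desktop/extensions-storage/redhat.ai-lab/models/hf.TheBloke.mistral-7b-instruct-v0.2.Q4_K_M/,target=/models/ -e MODEL_PATH=/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf -e GPU_LAYERS=-1 quay.io/lstocchi/vulkan:v4_279

With :v4_279 you should get the Vulkan/offload lines from step 2, while the 0.2.80+ images should only show the CPU buffer line from step 1.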

@axel7083
Contributor

axel7083 commented Aug 8, 2024

Is there an upstream issue to link?

@lstocchi
Contributor Author

lstocchi commented Aug 8, 2024

I was just opening it -> containers/ai-lab-recipes#742

@lstocchi lstocchi merged commit 3ab12d0 into containers:main Aug 12, 2024
5 checks passed
@lstocchi lstocchi deleted the revertLlamaCpp branch August 12, 2024 10:04
Development

Successfully merging this pull request may close these issues.

Gpu not working with llama cpp python server > 0.2.79