Unable to get completion with llama cpp #81

Open
a-rbts opened this issue Oct 2, 2024 · 1 comment

Comments


a-rbts commented Oct 2, 2024

Hello, I am trying to configure lsp-ai to get Copilot-like completion in Helix, using only models running locally.
Ideally I would like to serve them through an OpenAI-compatible API; llama.cpp provides this, and so does the mlx_lm server that I would also like to use.

The problem is that I end up with a 400 error from the server, which seems to receive a request in the wrong format. This happens with both llama.cpp and the mlx server, so it doesn't seem to be server-related but rather a problem with lsp-ai. Is there a way to monitor the requests sent back and forth?
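
One way to watch the traffic is to put a small logging proxy between the editor and the server; below is a minimal sketch (the upstream address and the listening port are placeholders, not anything lsp-ai provides):

```python
# Sketch of a tiny logging proxy (not part of lsp-ai): point the
# completions_endpoint at http://127.0.0.1:9999/... and it prints every
# request/response body while forwarding to the real server.
import http.server
import urllib.error
import urllib.request

UPSTREAM = "http://127.0.0.1:8080"  # assumed llama.cpp / mlx_lm server address


class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and log the incoming request body.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        print(">>", self.path, body.decode("utf-8", "replace"))

        # Forward it unchanged to the real server.
        req = urllib.request.Request(
            UPSTREAM + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req) as resp:
                status, data = resp.status, resp.read()
        except urllib.error.HTTPError as err:  # e.g. the 400 seen above
            status, data = err.code, err.read()

        # Log and relay the response.
        print("<<", status, data.decode("utf-8", "replace"))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


if __name__ == "__main__":
    http.server.HTTPServer(("127.0.0.1", 9999), LoggingProxy).serve_forever()
```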

There is a second option that I have tried, which is to use the direct llama_cpp feature of lsp-ai (how does it work? Does it spawn its own separate instance of the server? What if one is already running on the same ports, and what about the memory used by an additional server when the models are big, compared to just linking to a running one through the OpenAI-compatible API?).
Using the internal llama_cpp feature, requests seem to be sent properly, at least the Helix logs don't show any error, but no completion is displayed at all.
Instead, here is what I am getting (example on the right; note that the configuration shows on the left):

[screenshot: configuration on the left, editor on the right with a single "ai - text" completion entry]

The "ai - text" entry does nothing when selected.
Has anybody had any luck with this kind of configuration?

Edit Oct 12th
I have tried with Ollama, following this discussion, and got the same issue as with llama.cpp. I also tried Visual Studio with a similar configuration and got the same behavior.
It seems to be the way the request is built that is problematic: although it follows the standard completions format, it doesn't really seem designed for completion (in that there is not really any next word to predict for the given prompt) but rather for chat completion.
For example, enabling verbose mode on llama.cpp, I can see the request being sent:

request: POST /completions 127.0.0.1 200
request: {"echo":false,"frequency_penalty":0.0,"max_tokens":64,"model":"Qwen2.5-Coder-7B-Instruct-Q8_0","n":1,"presence_penalty":0.0,"prompt":"<fim_prefix>\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len<fim_suffix>\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n<fim_middle>","temperature":0.10000000149011612,"top_p":0.949999988079071}

and here is the response (note the empty content):

response: {"content":"","id_slot":0,"stop":true,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","tokens_predicted":1,"tokens_evaluated":77,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","seed":4294967295,"seed_cur":979658403,"temperature":0.10000000149011612,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":64,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"<fim_prefix>\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len<fim_suffix>\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n<fim_middle>","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":77,"timings":{"prompt_n":77,"prompt_ms":378.821,"prompt_per_token_ms":4.919753246753247,"prompt_per_second":203.2622267508929,"predicted_n":1,"predicted_ms":0.009,"predicted_per_token_ms":0.009,"predicted_per_second":111111.11111111112},"index":0}

The same request also returns an empty content with curl, whereas sending a request with the same prompt in the chat completion format with curl:

request: POST /v1/chat/completions 127.0.0.1 200
request: {
"messages": [{"role": "user", "content": "<fim_prefix>\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len<fim_suffix>\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n<fim_middle>"}],
"temperature": 0.7
}

The response seems to be spot on:

response: {"choices":[{"finish_reason":"stop","index":0,"message":{"content":"python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2] # Choose the middle element as the pivot\n left = [x for x in arr if x < pivot] # Elements less than the pivot\n middle = [x for x in arr if x == pivot] # Elements equal to the pivot\n right = [x for x in arr if x > pivot] # Elements greater than the pivot\n return quicksort(left) + middle + quicksort(right)\n","role":"assistant"}}],"created":1728736275,"model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":123,"prompt_tokens":96,"total_tokens":219},"id":"chatcmpl-QXL7OiScMeo2SyWwAMdjR5DrWTXcqGBg","_verbose":{"content":"python\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len(arr) // 2] # Choose the middle element as the pivot\n left = [x for x in arr if x < pivot] # Elements less than the pivot\n middle = [x for x in arr if x == pivot] # Elements equal to the pivot\n right = [x for x in arr if x > pivot] # Elements greater than the pivot\n return quicksort(left) + middle + quicksort(right)\n","id_slot":0,"stop":true,"model":"gpt-3.5-turbo-0613","tokens_predicted":123,"tokens_evaluated":96,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"../models/gguf/Qwen2.5-Coder-7B-Instruct-Q8_0.gguf","seed":4294967295,"seed_cur":2759634986,"temperature":0.699999988079071,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},"prompt":"<|im_start|>user\n<fim_prefix>\n\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[len<fim_suffix>\n\n \n middle = [x for x in arr if x == pivot]\n right = [x for x in arr if x > pivot]\n return quicksort(left) + middle + quicksort(right)\n<fim_middle><|im_end|>\n<|im_start|>assistant\n","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":218,"timings":{"prompt_n":96,"prompt_ms":382.361,"prompt_per_token_ms":3.9829270833333332,"prompt_per_second":251.07163125946423,"predicted_n":123,"predicted_ms":3248.555,"predicted_per_token_ms":26.4110162601626,"predicted_per_second":37.86298831326543},"index":0,"oaicompat_token_ctr":123}}
srv  add_waiting: add task 142 to waiting list. current waiting = 0 (before add)

This could explain why it also fails with mlx and other OpenAI-compatible API servers, which all follow the same format. I haven't been able to investigate why the direct llama.cpp type fails too, since I cannot control its logs or launch it in verbose mode, but I suspect the issue is the same.


sayap commented Nov 18, 2024

I can get completion to work with llama.cpp by using the open_ai backend.

First, for completion, you can use a Qwen2.5 Coder base model instead of an instruct model. The models from https://huggingface.co/mradermacher/Qwen2.5-32B-GGUF should be alright.

With a Qwen2.5 Coder base model, the FIM tokens should be <|fim_prefix|>, <|fim_suffix|>, and <|fim_middle|>.
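
For reference, a minimal sketch of such a FIM request against llama.cpp's /v1/completions endpoint (not lsp-ai code; the server address, model name, and code snippet are placeholders):

```python
# Sketch: build a Qwen2.5 Coder FIM prompt and send it to llama.cpp's
# /v1/completions endpoint. Address, model name and snippet are assumptions.
import json
import urllib.request

prefix = "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[len"
suffix = "\n    middle = [x for x in arr if x == pivot]\n    return quicksort(left) + middle + quicksort(right)\n"

payload = {
    "model": "Qwen/Qwen2.5-Coder-32B",  # base model, not -Instruct
    "prompt": f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>",
    "max_tokens": 64,
    "temperature": 0.1,
    "top_p": 0.95,
}

req = urllib.request.Request(
    "http://1.2.3.4:8080/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# llama.cpp puts the generated text in a top-level "content" field rather than
# the OpenAI-style choices[0].text, which is what the patch below accounts for.
print(body["content"])
```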

Then, due to a compatibility issue with llama.cpp's /v1/completions endpoint (see ggerganov/llama.cpp#9219), we need a small patch for lsp-ai:

diff --git a/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs b/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
index c75b580..61b5298 100644
--- a/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
+++ b/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
@@ -57,11 +57,6 @@ pub(crate) struct OpenAI {
     configuration: config::OpenAI,
 }
 
-#[derive(Deserialize, Serialize)]
-pub(crate) struct OpenAICompletionsChoice {
-    text: String,
-}
-
 #[derive(Deserialize, Serialize)]
 pub(crate) struct OpenAIError {
     pub(crate) error: Value,
@@ -69,7 +64,7 @@ pub(crate) struct OpenAIError {
 
 #[derive(Deserialize, Serialize)]
 pub(crate) struct OpenAIValidCompletionsResponse {
-    pub(crate) choices: Vec<OpenAICompletionsChoice>,
+    pub(crate) content: String,
 }
 
 #[derive(Deserialize, Serialize)]
@@ -163,7 +158,7 @@ impl OpenAI {
         );
         match res {
             OpenAICompletionsResponse::Success(mut resp) => {
-                Ok(std::mem::take(&mut resp.choices[0].text))
+                Ok(std::mem::take(&mut resp.content))
             }
             OpenAICompletionsResponse::Error(error) => {
                 anyhow::bail!(
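
The patch just switches which field gets deserialized: the OpenAI-style choices[0].text versus llama.cpp's top-level content. A quick illustration of the two response shapes (a sketch, not lsp-ai code):

```python
# The two response shapes involved; a helper like this would accept either.
def completion_text(resp: dict) -> str:
    if "choices" in resp:
        # Standard OpenAI /v1/completions shape.
        return resp["choices"][0]["text"]
    # llama.cpp's shape, as in the verbose log above: top-level "content".
    return resp["content"]

assert completion_text({"choices": [{"text": "(arr) // 2]"}]}) == "(arr) // 2]"
assert completion_text({"content": "(arr) // 2]"}) == "(arr) // 2]"
```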

I use micro instead of helix, and these are the relevant settings:

    "lsp.python": "{\"memory\":{\"file_store\":{}},\"models\":{\"model1\":{\"type\":\"open_ai\",\"completions_endpoint\":\"http://1.2.3.4:8080/v1/completions\",\"model\":\"Qwen/Qwen2.5-Coder-32B\",\"auth_token\":\"\"}},\"completion\":{\"model\":\"model1\",\"parameters\":{\"max_context\":4096,\"max_tokens\":512,\"top_p\":0.01,\"fim\":{\"start\":\"\u003c|fim_prefix|\u003e\",\"middle\":\"\u003c|fim_suffix|\u003e\",\"end\":\"\u003c|fim_middle|\u003e\"}}}}",
    "lsp.server": "python=lsp-ai --use-seperate-log-file,go=gopls,typescript=deno lsp,rust=rls",

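For readability, the escaped "lsp.python" value can be decoded and pretty-printed from micro's settings.json (a small sketch; the file path is an assumption). Note that in the fim mapping, "middle" is set to the <|fim_suffix|> token and "end" to <|fim_middle|>, which appears to match the prompt layout in the verbose log above: prefix, then suffix, with the middle marker last, where generation starts.

```python
# Sketch: decode the escaped "lsp.python" value from micro's settings.json
# (the \u003c/\u003e escapes are just < and >) and pretty-print the lsp-ai config.
import json

with open("settings.json") as f:  # assumption: micro's settings file path
    settings = json.load(f)

print(json.dumps(json.loads(settings["lsp.python"]), indent=2))
```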