Unable to get completion with llama cpp #81
Comments
I can get completion to work with llama.cpp by using the setup below.

Firstly, for completion, you can use a Qwen2.5 Coder base model instead of an instruct model. The models from https://huggingface.co/mradermacher/Qwen2.5-32B-GGUF should be alright. With a Qwen2.5 Coder base model, the FIM tokens should be `<|fim_prefix|>`, `<|fim_suffix|>` and `<|fim_middle|>` (as used in the settings further down).

Then, due to a compatibility issue with the llama.cpp /v1/completions endpoint (see ggerganov/llama.cpp#9219), we will need a small patch for lsp-ai:

```diff
diff --git a/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs b/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
index c75b580..61b5298 100644
--- a/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
+++ b/crates/lsp-ai/src/transformer_backends/open_ai/mod.rs
@@ -57,11 +57,6 @@ pub(crate) struct OpenAI {
     configuration: config::OpenAI,
 }
 
-#[derive(Deserialize, Serialize)]
-pub(crate) struct OpenAICompletionsChoice {
-    text: String,
-}
-
 #[derive(Deserialize, Serialize)]
 pub(crate) struct OpenAIError {
     pub(crate) error: Value,
@@ -69,7 +64,7 @@ pub(crate) struct OpenAIError {
 
 #[derive(Deserialize, Serialize)]
 pub(crate) struct OpenAIValidCompletionsResponse {
-    pub(crate) choices: Vec<OpenAICompletionsChoice>,
+    pub(crate) content: String,
 }
 
 #[derive(Deserialize, Serialize)]
@@ -163,7 +158,7 @@ impl OpenAI {
         );
         match res {
             OpenAICompletionsResponse::Success(mut resp) => {
-                Ok(std::mem::take(&mut resp.choices[0].text))
+                Ok(std::mem::take(&mut resp.content))
             }
             OpenAICompletionsResponse::Error(error) => {
                 anyhow::bail!(
```
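For context on why the patch swaps `choices[0].text` for `content`: the OpenAI completions spec returns `{"choices":[{"text":"..."}]}`, while the llama.cpp server (per ggerganov/llama.cpp#9219) returns `{"content":"..."}` on this endpoint. Below is a minimal sketch, assuming serde/serde_json and not lsp-ai's actual types, of how a backend could accept either shape without patching:

```rust
// Sketch only (not lsp-ai's code): accept either response shape with serde.
// Requires the `serde` (with the `derive` feature) and `serde_json` crates.
use serde::Deserialize;

#[derive(Deserialize)]
struct OpenAIChoice {
    text: String,
}

#[derive(Deserialize)]
#[serde(untagged)]
enum CompletionsResponse {
    // OpenAI-style: {"choices": [{"text": "..."}]}
    OpenAi { choices: Vec<OpenAIChoice> },
    // llama.cpp server style: {"content": "..."}
    LlamaCpp { content: String },
}

impl CompletionsResponse {
    fn into_text(self) -> String {
        match self {
            CompletionsResponse::OpenAi { choices } => choices
                .into_iter()
                .next()
                .map(|c| c.text)
                .unwrap_or_default(),
            CompletionsResponse::LlamaCpp { content } => content,
        }
    }
}

fn main() -> Result<(), serde_json::Error> {
    let openai = r#"{"choices":[{"text":"return a + b"}]}"#;
    let llama = r#"{"content":"return a + b"}"#;
    let a: CompletionsResponse = serde_json::from_str(openai)?;
    let b: CompletionsResponse = serde_json::from_str(llama)?;
    assert_eq!(a.into_text(), b.into_text());
    Ok(())
}
```

The patch above is the simpler route if you only target llama.cpp; an untagged enum like this would keep stock OpenAI-style endpoints working at the same time.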
I use micro instead of helix, and these are the relevant settings:

```json
"lsp.python": "{\"memory\":{\"file_store\":{}},\"models\":{\"model1\":{\"type\":\"open_ai\",\"completions_endpoint\":\"http://1.2.3.4:8080/v1/completions\",\"model\":\"Qwen/Qwen2.5-Coder-32B\",\"auth_token\":\"\"}},\"completion\":{\"model\":\"model1\",\"parameters\":{\"max_context\":4096,\"max_tokens\":512,\"top_p\":0.01,\"fim\":{\"start\":\"\u003c|fim_prefix|\u003e\",\"middle\":\"\u003c|fim_suffix|\u003e\",\"end\":\"\u003c|fim_middle|\u003e\"}}}}",
"lsp.server": "python=lsp-ai --use-seperate-log-file,go=gopls,typescript=deno lsp,rust=rls",
```
Hello, I am trying to configure lsp-ai to get Copilot-like completion in helix. I intend to use only models running locally.
Ideally, I would like to serve them through an OpenAI-compatible API; llama.cpp provides this, and so does the mlx_lm server that I would like to use.
The problem is that I end up with a 400 error from the server, which seems to receive a request in the wrong format. This happens with both llama.cpp and the mlx server, so it doesn't seem to be server-related but rather a problem with lsp-ai. Is there a way to monitor the requests sent back and forth?
There is a second option that I have tried, which is to use the direct llama_cpp feature from lsp-ai (how does it work? Does it spawn its own separate instance of the server? What if we already have one running on the same ports, and what about the memory usage of an additional server if the models are big, compared to just pointing at a running one through the OpenAI-compatible API?).
Using the internal llama_cpp feature, it seems to send requests properly; at least the helix logs don't show any error, but there is no completion displayed at all.
Instead, here is what I am getting (screenshot: example on the right, with the configuration shown on the left):
The `ai - text` entry does nothing when selected.
Has anybody had any luck with this kind of configuration?
Edit Oct 12th
I have tried with Ollama, following this discussion, and got the same issue as with llama.cpp. I also tried Visual Studio with a similar configuration and got the same behavior.
It seems to be the way the request is sent that is problematic, as it doesn't seem to be built for `completion` (in that there is not really any next word to predict for the given prompt) but for `chat completion`, although it follows the `completion` standard format. For example, enabling verbose mode on llama.cpp, I can see the request sent being:
and the response (note the empty content)
The same request also returns an empty response content with curl, while, if I send a request (with the same prompt) in the `chat completion` format with curl, the response seems to be spot on:
It could explain why it also fails with mlx and other OpenAI-compatible API servers that all follow the same format. I haven't been able to investigate why the direct llama.cpp type fails too, since I cannot control its logs and launch it in verbose mode, but I suspect the issue is the same.
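To make the difference concrete, here is a sketch of the two request shapes being compared (hypothetical payloads, not the exact JSON lsp-ai or the editors send): /v1/completions takes a single prompt string the model is expected to continue, while /v1/chat/completions takes a messages array that the server runs through the model's chat template. An instruct model given a bare FIM prompt via the completion endpoint can plausibly return an empty continuation, which would match the empty content observed here, whereas a coder base model is trained to continue that prompt directly (as the comment with the patch above also suggests).

```rust
// Hypothetical payloads, only to contrast the two request shapes discussed
// above. Requires the `serde_json` crate.
use serde_json::json;

fn main() {
    // /v1/completions: a single prompt string the model continues token by
    // token (what a FIM-tuned base model is trained for).
    let completion_request = json!({
        "model": "Qwen/Qwen2.5-Coder-32B",
        "prompt": "<|fim_prefix|>def add(a, b):\n    return <|fim_suffix|>\n<|fim_middle|>",
        "max_tokens": 512
    });

    // /v1/chat/completions: a messages array that the server wraps in the
    // model's chat template before generating a reply.
    let chat_request = json!({
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
        "messages": [
            { "role": "user", "content": "Complete the body of `def add(a, b):`" }
        ],
        "max_tokens": 512
    });

    println!("{completion_request}\n\n{chat_request}");
}
```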