Bug: llama-server crashes when started with --embeddings #8076
Comments
Text completion should not be used with the `--embeddings` option. cc @iamlemec
IMHO it's not a good idea to only be able to start llama-server as either a completion server or an embedding server (but not both). Sure, in "enterprise" RAG there would be two separate models for embedding and generation, but in "home" setups it makes sense to use a single model for both, since it saves memory.
Indeed. Can we just call …
We would also need to guarantee that a decoded batch contains only embedding tokens or only completion tokens. This could be determined by the first slot added to the batch: from then on, we add only slots whose task type matches the first one.
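A minimal sketch of that batching rule, using hypothetical slot and task-type names rather than the actual server.cpp structures:

```cpp
#include <vector>

// Hypothetical stand-ins for the server's slot/task bookkeeping.
enum class task_type { COMPLETION, EMBEDDING };

struct slot {
    int       id;
    task_type task;
    // ... prompt tokens, sampling state, etc.
};

// The first slot added to a batch decides its task type; any slot with a
// different task type is deferred to a later batch.
static std::vector<slot *> pick_batch_slots(std::vector<slot> & pending) {
    std::vector<slot *> selected;
    for (auto & s : pending) {
        if (selected.empty() || s.task == selected.front()->task) {
            selected.push_back(&s);
        }
        // otherwise: leave the slot for the next batch
    }
    return selected;
}
```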
I don't know if it's helpful, but completion and embedding coexisted peacefully (provided you didn't mix batches) up until commit 80ea089. I'm entirely unfamiliar with this codebase, but I took a look, and while it seemed like it should be simple to restore the previous behavior in llama.cpp without trashing the LLAMA_POOLING_TYPE_LAST stuff, a couple of hours later I don't really seem any closer.

If I've understood server.cpp correctly (which is not a given), this issue occurs because the server unconditionally passes the embedding setting into the internals if the --embeddings flag is given. Would it make sense / be difficult to make that conditional on whether a given request is actually on one of the embedding endpoints? This would be separate from the need to ensure that a single batch only contains one or the other, but it should at least restore the previous functionality. (If need be, it could be limited to the case where there's only one slot, which is certainly one way to avoid collisions.)

If that's acceptable, it's something I can try to look into, but there's definitely a learning curve here, and I'm at the bottom of it, staring upward with a dumb look on my face.
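For what it's worth, a rough sketch of that per-request idea, assuming a context-level setter like `llama_set_embeddings()` from llama.h and using hypothetical request/handler names (the batch-separation concern discussed above still applies):

```cpp
#include "llama.h"

// Hypothetical request shape; the real server tracks this per slot.
struct request {
    bool is_embedding_endpoint;   // true for the embedding endpoints, false for completion
    // ... prompt, sampling parameters, etc.
};

static void process_request(llama_context * ctx, const request & req) {
    // Enable embedding output only while serving an embedding request,
    // instead of honoring --embeddings unconditionally for every request.
    llama_set_embeddings(ctx, req.is_embedding_endpoint);

    // ... build the batch and call llama_decode(ctx, batch) as usual;
    // completion requests keep producing logits, embedding requests
    // produce embeddings instead.
}
```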
Thanks for the info @jdwx! Yeah, I just tried running a generative model with `--embeddings`. I think we basically have to go all the way and ensure pure separation of batches to fix this. I took a stab at it in iamlemec/llama.cpp:server-embed, but there are still some kinks to be worked out, especially as it relates to pooling.

Note that even with the old behavior, the embeddings that come out will be a bit arbitrary in the absence of a pooling specifier. In the case where it's a single-sequence batch, you'll get the embedding for the first token. But if it's a multi-sequence batch, it will be the n'th token for the n'th sequence. Right now, the pooling option is restricted to the …
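To make the pooling point concrete, here is a small sketch of how the extracted vector depends on the pooling type; it assumes the llama.h accessors shown, and the helper name and its arguments are mine:

```cpp
#include "llama.h"

// With mean/cls/last pooling there is one pooled vector per sequence; with
// LLAMA_POOLING_TYPE_NONE there are only per-token embeddings, so the caller
// has to pick a position explicitly rather than relying on whichever token
// happens to fall out of the batch.
static const float * get_sequence_embedding(llama_context * ctx,
                                            llama_seq_id seq_id,
                                            int32_t last_token_batch_idx) {
    if (llama_pooling_type(ctx) != LLAMA_POOLING_TYPE_NONE) {
        return llama_get_embeddings_seq(ctx, seq_id);
    }
    return llama_get_embeddings_ith(ctx, last_token_batch_idx);
}
```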
I believe I have a simple fix of only a few added lines. @ggerganov should I incorporate this into #8087 or make a new PR?
A new PR seems better. Thanks!
Waiting for this issue to be fixed. I have to run two llama-server instances for now, one for completion and one for embedding.
@beginor and other folks doing dual-use here: what sorts of models are you using? The only model I've seen that is designed for this type of thing is GritLM. For that one, you need to use causal attention for generation and non-causal attention for embeddings. Any fix that ensures batches are only either generation or embedding will have to take a stance on what to do about the attention type. Also, are you using this for token-level embeddings or sequence-level embeddings?
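To illustrate the stance a fix would have to take, here is a rough per-batch toggle for a GritLM-style dual-use setup, assuming the `llama_set_embeddings()` and `llama_set_causal_attn()` setters from llama.h (a sketch of the idea, not a claim about how the server will actually implement it):

```cpp
#include "llama.h"

// Embedding batches run non-causal and produce embeddings; generation
// batches run causal and produce logits. Mixed batches are never built.
static void prepare_batch_mode(llama_context * ctx, bool is_embedding_batch) {
    llama_set_embeddings (ctx,  is_embedding_batch);
    llama_set_causal_attn(ctx, !is_embedding_batch);
    // ... then llama_decode(ctx, batch) with a batch containing only
    // embedding slots or only completion slots.
}
```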
@iamlemec I have tried …, and I want to use them for sequence-level embeddings, following the getting-started guide from GraphRAG and changing the API base URL to point at llama-server.
@iamlemec we are using Gemma and Llama family models, for sequence level embeddings. We could separate out embeddings from generations, but being able to switch back and forth on a batch-by-batch basis lets us do less GPU juggling and keep our coordination code simpler. (And to answer the implicit question, we would be happy to use causal attention when generating embeddings.)
Thanks @beginor and @josharian! Just submitted #8420 to address this.
What happened?
System: Mac M1, latest OS (14.5), latest llama.cpp (b3204), built in a standard way (make).
Steps to reproduce: start llama-server with --embeddings (tested with llama3/8b/fp16 and mistral/7b/Q8_0), go to the GUI, type anything.
Expected result: the system completes/responds.
Actual result: llama-server segfaults: `llama_get_logits_ith: invalid logits id 23, reason: no logits` / `zsh: segmentation fault`.
Other notes: The same build/models behave normally when llama-server is started without --embeddings. Similar issue confirmed on Linux/CUDA.
Name and Version
version: 3204 (45c0e2e)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output