Summary
Speculative decoding not interfacing as expected.
My understanding is that a draft model should be specified, and that the draft model should itself be another llm (typically smaller, with similar/same token formatting). This understanding is derived from ggerganov/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.
There does not seem to be a direct interface for specifying another llm, yet there is a "draft_model" argument in Llama() which instead points to a LlamaPromptLookupDecoding object. This is where my confusion arises.
Expected Behavior
Adjusting the current example on the main page, I would have expected speculative decoding to operate something like the following:
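(Purely a hypothetical sketch of the interface I expected, not the actual API: draft_model does not currently accept a second Llama instance, and the paths are placeholders.)

from llama_cpp import Llama

llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=Llama(model_path="path/to/smaller-draft-model.gguf"),  # hypothetical: a second, smaller llm used as the draft model
)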
Current Behavior
This is the currently suggested operation:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)  # num_pred_tokens is the number of tokens to predict; 10 is the default and generally good for GPU, 2 performs better for CPU-only machines
)
Here, the draft_model is not another llm.
Just asking for some clarification, and whether it's currently possible or on the roadmap to implement the expected behaviour.
cheers!
I actually just got done dealing with this myself. The default method doesn't take an actual draft model; instead it looks for similar token sequences in the input, which, as mentioned here, works well for input-grounded tasks. For instance, if you pass in a bunch of code and it outputs some code, the output is likely going to contain many similar sequences.
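(Roughly, prompt lookup does something like the following sketch, which is illustrative rather than the library's actual implementation: match the most recent n-gram against the earlier context and propose whatever followed it there.)

import numpy as np

def prompt_lookup_draft(input_ids, max_ngram_size=2, num_pred_tokens=10):
    # Illustrative sketch of prompt-lookup decoding: find an earlier
    # occurrence of the most recent n-gram and propose the tokens that
    # followed it there as the draft.
    ids = list(input_ids)
    for n in range(max_ngram_size, 0, -1):
        ngram = ids[-n:]
        for start in range(len(ids) - n - 1, -1, -1):
            if ids[start:start + n] == ngram:
                return np.array(ids[start + n:start + n + num_pred_tokens], dtype=np.intp)
    return np.array([], dtype=np.intp)  # no match, nothing to draft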
If you want to try an actual draft model, though, you can extend the LlamaDraftModel class and do your own thing. Here is an example along the lines of what I've been testing in my application this afternoon.
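(A minimal sketch of that approach, assuming the draft and target models share the same tokenizer/vocabulary; LlamaModelDraft and the paths are illustrative names, not part of llama-cpp-python.)

import numpy as np
import numpy.typing as npt

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel


class LlamaModelDraft(LlamaDraftModel):
    """Wrap a second, smaller Llama instance as the draft model."""

    def __init__(self, draft_model: Llama, num_pred_tokens: int = 10):
        self.draft_model = draft_model
        self.num_pred_tokens = num_pred_tokens

    def __call__(self, input_ids: npt.NDArray[np.intp], /, **kwargs) -> npt.NDArray[np.intp]:
        # Feed the tokens accepted so far to the small model and let it
        # greedily propose up to num_pred_tokens candidate tokens.
        draft_ids = []
        for token in self.draft_model.generate(input_ids.tolist(), top_k=1, temp=0.0):
            draft_ids.append(token)
            if token == self.draft_model.token_eos() or len(draft_ids) >= self.num_pred_tokens:
                break
        return np.array(draft_ids, dtype=np.intp)


# The main model then takes the wrapper through the existing draft_model argument.
llama = Llama(
    model_path="path/to/model.gguf",
    draft_model=LlamaModelDraft(
        Llama(model_path="path/to/smaller-draft-model.gguf"),
        num_pred_tokens=10,
    ),
)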