
Confusion regarding operation/terminology of speculative decoding and sampling #1865

Open
MushroomHunting opened this issue Dec 15, 2024 · 2 comments


@MushroomHunting

Summary

Speculative decoding does not expose the interface I expected.

My understanding is that a draft model should be specified, and that the draft model should itself be another LLM (typically a smaller model with the same or similar tokenization). This understanding comes from ggerganov/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.

There does not seem to be a direct interface for specifying another LLM; instead, the "draft_model" argument in Llama() points to a LlamaPromptLookupDecoding object. This is where my confusion arises.

Expected Behavior

Adapting the current example on the main page, I would have expected speculative decoding to work something like the following:

from llama_cpp import Llama

llama_draft = Llama(
    model_path="path/to/small_draft_model.gguf"
)

llama = Llama(
    model_path="path/to/**big_primary_model**.gguf",
    draft_model=llama_draft
)

Current Behavior

This is the currently suggested usage:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of draft tokens to predict; 10 is the default
    # and generally good for GPU, while 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)

Here, the draft_model is not another LLM.

Just asking for some clarification, and whether the expected behaviour is currently possible or on the roadmap.

cheers!

@acasto

acasto commented Dec 16, 2024

I actually just got done dealing with this myself. The default method doesn't take an actual draft model; instead it looks for similar token sequences in the input, which, as mentioned here, works well for input-grounded tasks. For instance, if you pass in a bunch of code and it outputs some code, the output is likely to contain many similar sequences.
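
To illustrate the idea, here is a toy sketch of prompt-lookup drafting (not the library's actual implementation): match the trailing n-gram of the context against earlier occurrences in the same context, and propose the tokens that followed as the draft.

# Toy illustration of prompt lookup: find an earlier occurrence of the
# context's trailing n-gram and propose the tokens that followed it.
def prompt_lookup_draft(input_ids, max_ngram_size=2, num_pred_tokens=10):
    for ngram_size in range(max_ngram_size, 0, -1):
        ngram = input_ids[-ngram_size:]
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                draft = input_ids[start + ngram_size:start + ngram_size + num_pred_tokens]
                if draft:
                    return draft  # the main model then verifies these tokens
    return []

print(prompt_lookup_draft([5, 6, 7, 8, 5, 6], num_pred_tokens=3))  # -> [7, 8, 5]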

If you want to try an actual draft model, though, you can extend the LlamaDraftModel class and do your own thing. Here is an example I've been testing in my application this afternoon.
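
Roughly, something like the following. This is a minimal sketch, assuming LlamaDraftModel's __call__ receives and returns numpy arrays of token ids; the class name LlamaModelDraft and its constructor arguments are just illustrative.

import numpy as np
import numpy.typing as npt

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel


class LlamaModelDraft(LlamaDraftModel):
    """Propose draft tokens with a small Llama model for the main model to verify."""

    def __init__(self, draft_model_path: str, num_pred_tokens: int = 10, **llama_kwargs):
        self.num_pred_tokens = num_pred_tokens
        self.draft_llama = Llama(model_path=draft_model_path, **llama_kwargs)

    def __call__(self, input_ids: npt.NDArray[np.intc], /, **kwargs) -> npt.NDArray[np.intc]:
        # Greedily sample up to num_pred_tokens continuation tokens with the
        # small model; the primary model accepts or rejects them.
        draft_tokens = []
        for token in self.draft_llama.generate(input_ids.tolist(), temp=0.0):
            draft_tokens.append(token)
            if len(draft_tokens) >= self.num_pred_tokens:
                break
        return np.array(draft_tokens, dtype=np.intc)


llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=LlamaModelDraft("path/to/small_draft_model.gguf", num_pred_tokens=10),
)

The main caveat is that both models need to share a tokenizer/vocabulary, otherwise the token ids the draft model emits mean nothing to the primary model.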

@Vaskivo

Vaskivo commented Dec 17, 2024

@acasto Many thanks.

I've been delaying doing my own implementation of speculative decoding with a draft model for some time. I'll be testing your implementation soon.
