
Confusion regarding operation/terminology of speculative decoding and sampling #1865

Open
MushroomHunting opened this issue Dec 15, 2024 · 2 comments


@MushroomHunting

Summary

Speculative decoding does not expose the interface I expected.

My understanding is that a draft model should be specified, and that the draft model should itself be another LLM (typically a smaller model with the same or similar tokenization). This understanding comes from ggerganov/llama.cpp#2926, which is referenced in some llama-cpp-python issues, specifically #675. I've also come across #1120.

There does not seem to be a direct interface for specifying another LLM; instead, the "draft_model" argument in Llama() points to a LlamaPromptLookupDecoding object. This is where my confusion arises.

Expected Behavior

Adapting the current example on the main page, I would have expected speculative decoding to work something like the following:

from llama_cpp import Llama

llama_draft = Llama(
    model_path="path/to/small_draft_model.gguf"
)

llama = Llama(
    model_path="path/to/**big_primary_model**.gguf",
    draft_model=llama_draft
)

Current Behavior

This is the currently suggested usage:

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

llama = Llama(
    model_path="path/to/model.gguf",
    # num_pred_tokens is the number of draft tokens to predict; 10 is the default
    # and generally good for GPU, while 2 performs better on CPU-only machines.
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10)
)

Here, the draft_model is not another LLM.

Just asking for some clarification, and whether the expected behaviour is currently possible or on the roadmap.

cheers!

@acasto

acasto commented Dec 16, 2024

I actually just got done dealing with this myself. The default method doesn't take an actual draft model; instead it looks for similar token sequences in the input, which, as mentioned here, works well for input-grounded tasks. For instance, if you pass in a bunch of code and it outputs some code, the output is likely to contain many similar sequences.
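
To illustrate the idea, here is a toy sketch of prompt-lookup drafting (not the library's actual implementation): match the trailing n-gram of the context against earlier occurrences in the same context, and propose the tokens that followed as the draft.

# Toy illustration of prompt lookup: find an earlier occurrence of the
# context's trailing n-gram and propose the tokens that followed it.
def prompt_lookup_draft(input_ids, max_ngram_size=2, num_pred_tokens=10):
    for ngram_size in range(max_ngram_size, 0, -1):
        ngram = input_ids[-ngram_size:]
        for start in range(len(input_ids) - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                draft = input_ids[start + ngram_size:start + ngram_size + num_pred_tokens]
                if draft:
                    return draft  # the main model then verifies these tokens
    return []

print(prompt_lookup_draft([5, 6, 7, 8, 5, 6], num_pred_tokens=3))  # -> [7, 8, 5]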

If you want to try an actual draft model, though, you can extend the LlamaDraftModel class and do your own thing. Here is an example I've been testing in my application this afternoon.
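
Roughly, something like the following. This is a minimal sketch, assuming LlamaDraftModel's __call__ receives and returns numpy arrays of token ids; the class name LlamaModelDraft and its constructor arguments are just illustrative.

import numpy as np
import numpy.typing as npt

from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaDraftModel


class LlamaModelDraft(LlamaDraftModel):
    """Propose draft tokens with a small Llama model for the main model to verify."""

    def __init__(self, draft_model_path: str, num_pred_tokens: int = 10, **llama_kwargs):
        self.num_pred_tokens = num_pred_tokens
        self.draft_llama = Llama(model_path=draft_model_path, **llama_kwargs)

    def __call__(self, input_ids: npt.NDArray[np.intc], /, **kwargs) -> npt.NDArray[np.intc]:
        # Greedily sample up to num_pred_tokens continuation tokens with the
        # small model; the primary model accepts or rejects them.
        draft_tokens = []
        for token in self.draft_llama.generate(input_ids.tolist(), temp=0.0):
            draft_tokens.append(token)
            if len(draft_tokens) >= self.num_pred_tokens:
                break
        return np.array(draft_tokens, dtype=np.intc)


llama = Llama(
    model_path="path/to/big_primary_model.gguf",
    draft_model=LlamaModelDraft("path/to/small_draft_model.gguf", num_pred_tokens=10),
)

The main caveat is that both models need to share a tokenizer/vocabulary, otherwise the token ids the draft model emits mean nothing to the primary model.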

@Vaskivo

Vaskivo commented Dec 17, 2024

@acasto Many thanks.

I've been delaying doing my own implementation of speculative decoding with a draft model for some time. I'll be testing your implementation soon.
