Slow processing of follow-up prompt #54
It seems that the model always needs to re-evaluate its own previous answer as part of the prompt.

llama_print_timings: load time = 2561.54 ms

Shouldn't the model already have a tokenization of its previous answer? Or could it be that the applied chat template differs slightly from what the model produced in its answer, so it does not recognize it?
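One rough way to check this is a prefix comparison: llama-cpp-python reuses its cached state only for the longest token prefix shared with the previous call, so a single differing character early in the re-rendered history forces the whole conversation to be evaluated again. Below is a minimal sketch under stated assumptions (the model path and the Gemma-style turn strings are hypothetical illustrations, not taken from this issue):

```python
# Sketch: compare the tokenization of the conversation as the model produced it
# with the tokenization of the text the chat template re-renders for the next turn.
# If the shared token prefix is short, the cached context cannot be reused.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # hypothetical path

# Turn 1 as it sat in the context after generation (note the trailing "\n").
cached_text = b"<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\nHi!<end_of_turn>\n"
# Turn 1 as the chat template re-renders it for the follow-up prompt
# (hypothetically missing that trailing "\n").
rerendered = b"<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\nHi!<end_of_turn>"

a = llm.tokenize(cached_text, add_bos=True, special=True)
b = llm.tokenize(rerendered, add_bos=True, special=True)

shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1

print(f"cached: {len(a)} tokens, re-rendered: {len(b)} tokens, shared prefix: {shared}")
# If 'shared' is much smaller than len(b), the whole history gets re-processed.
```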
Maybe related to this?
Gemma needs another \n to avoid slow processing of follow-up prompts, see Maximilian-Winter#54
I provided fixes for the chat templates in #73.
In a multi-turn conversation, the combination of llama-cpp-python and llama-cpp-agent is much slower on the second prompt than the Python bindings of gpt4all; see the two screenshots below. Evaluation of the first prompt is faster, probably due to the recent speed improvements for prompt processing that have not yet been adopted in gpt4all. But when I reply to that first answer from the AI, gpt4all's second reply comes much faster than its first, whereas llama-cpp-python/llama-cpp-agent are even slower than on the first prompt. My setup is CPU only.

Do you have an idea why this is the case? Does gpt4all handle the context memory in a more efficient way?
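To narrow this down, a timing sketch like the one below could show whether the second turn only pays for the newly appended text (the model path and prompts are hypothetical, and it uses the plain completion API rather than llama-cpp-agent, so it only tests the caching behaviour of llama-cpp-python itself):

```python
# Sketch: time two turns where the second prompt extends the first prompt plus
# the generated answer verbatim. If prefix caching kicks in, turn 2 should be
# noticeably faster than turn 1 on a CPU-only setup.
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)  # hypothetical path

prompt1 = "User: Name three planets.\nAssistant:"
t0 = time.perf_counter()
out1 = llm(prompt1, max_tokens=64)
print(f"turn 1: {time.perf_counter() - t0:.1f}s")

# Build turn 2 by appending the answer exactly as generated, then the new user turn.
prompt2 = prompt1 + out1["choices"][0]["text"] + "\nUser: And three moons?\nAssistant:"
t0 = time.perf_counter()
llm(prompt2, max_tokens=64)
print(f"turn 2: {time.perf_counter() - t0:.1f}s  (should be faster if the prefix matched)")
```

If turn 2 is fast here but slow through llama-cpp-agent, that would point at the re-rendered chat template diverging from what was cached rather than at memory handling in the bindings.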