Slow processing of follow-up prompt #54
It seems that the model always needs to re-evaluate its own previous answer as part of the prompt.

llama_print_timings: load time = 2561.54 ms

Shouldn't the model already have a tokenization of its previous answer? Or could it be that the applied chat template differs slightly from what the model produced in its answer, so it does not recognize it?
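One rough way to check this is a prefix comparison: llama-cpp-python reuses its cached state only for the longest token prefix shared with the previous call, so a single differing character early in the re-rendered history forces the whole conversation to be evaluated again. Below is a minimal sketch under stated assumptions (the model path and the Gemma-style turn strings are hypothetical illustrations, not taken from this issue):

```python
# Sketch: compare the tokenization of the conversation as the model produced it
# with the tokenization of the text the chat template re-renders for the next turn.
# If the shared token prefix is short, the cached context cannot be reused.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", verbose=False)  # hypothetical path

# Turn 1 as it sat in the context after generation (note the trailing "\n").
cached_text = b"<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\nHi!<end_of_turn>\n"
# Turn 1 as the chat template re-renders it for the follow-up prompt
# (hypothetically missing that trailing "\n").
rerendered = b"<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\nHi!<end_of_turn>"

a = llm.tokenize(cached_text, add_bos=True, special=True)
b = llm.tokenize(rerendered, add_bos=True, special=True)

shared = 0
for x, y in zip(a, b):
    if x != y:
        break
    shared += 1

print(f"cached: {len(a)} tokens, re-rendered: {len(b)} tokens, shared prefix: {shared}")
# If 'shared' is much smaller than len(b), the whole history gets re-processed.
```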
Maybe related to this?
Gemma needs another \n to avoid slow processing of follow-up prompts, see Maximilian-Winter#54
I provided fixes for the chat templates in #73.
In a multi-turn conversation, the combination of llama-cpp-python and llama-cpp-agent is much slower on the second prompt than the Python bindings of gpt4all; see the two screenshots below. Evaluation of the first prompt is faster, probably due to the recent speed improvements for prompt processing that have not yet been adopted in gpt4all. But when I reply to that first answer from the AI, gpt4all's second reply comes much faster than its first, whereas llama-cpp-python/llama-cpp-agent are even slower than on the first prompt. My setup is CPU only.

Do you have an idea why this is the case? Does gpt4all handle the context memory in a more efficient way?
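To narrow this down, a timing sketch like the one below could show whether the second turn only pays for the newly appended text (the model path and prompts are hypothetical, and it uses the plain completion API rather than llama-cpp-agent, so it only tests the caching behaviour of llama-cpp-python itself):

```python
# Sketch: time two turns where the second prompt extends the first prompt plus
# the generated answer verbatim. If prefix caching kicks in, turn 2 should be
# noticeably faster than turn 1 on a CPU-only setup.
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)  # hypothetical path

prompt1 = "User: Name three planets.\nAssistant:"
t0 = time.perf_counter()
out1 = llm(prompt1, max_tokens=64)
print(f"turn 1: {time.perf_counter() - t0:.1f}s")

# Build turn 2 by appending the answer exactly as generated, then the new user turn.
prompt2 = prompt1 + out1["choices"][0]["text"] + "\nUser: And three moons?\nAssistant:"
t0 = time.perf_counter()
llm(prompt2, max_tokens=64)
print(f"turn 2: {time.perf_counter() - t0:.1f}s  (should be faster if the prefix matched)")
```

If turn 2 is fast here but slow through llama-cpp-agent, that would point at the re-rendered chat template diverging from what was cached rather than at memory handling in the bindings.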