Description
I upgraded from an older version and am now seeing a disturbingly long delay before generation starts.
The load on my machine is about the same (a bit higher with the Python binding, but that's understandable).
I tried to keep the environment identical in both cases: an NVIDIA card is installed, but I run with n_gpu_layers=0 (CPU only).
With the Python binding it can take several seconds before the response starts. Token generation itself runs at a similar speed in both cases, but the llama.cpp binary starts responding immediately, while the Python binding takes seconds.
Am I the only one experiencing this?
I am using a Llama 3 model.
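For reference, this is roughly how I load the model through llama-cpp-python (a minimal sketch; the model path and n_ctx are illustrative, the point is n_gpu_layers=0 and verbose=True so the timings block below gets printed):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=0,  # CPU only, matching the llama.cpp run
    n_ctx=4096,
    verbose=True,    # prints the llama_print_timings block quoted below
)

out = llm("Why is the sky blue?", max_tokens=128)
print(out["choices"][0]["text"])
```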
The timings from the original llama.cpp binary are:
llama_print_timings: sample time = 92.31 ms / 1160 runs ( 0.08 ms per token, 12565.67 tokens per second)
The llama-cpp-python timings are:
llama_print_timings: sample time = 99.82 ms / 144 runs ( 0.69 ms per token, 1442.57 tokens per second)
This seems like a big difference (roughly 9x in the reported sampling rate).
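To separate the startup delay from the sampling numbers, I measure the time to first token with streaming, roughly like this (a sketch, assuming `llm` was created as in the snippet above):

```python
import time

start = time.perf_counter()
first_token_at = None
for chunk in llm("Why is the sky blue?", max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first streamed chunk arrived

print(f"time to first token: {first_token_at - start:.2f} s")
```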