Description
I upgraded from an older version and am now seeing a disturbingly long delay before generation starts.
The load on my machine is about the same (a bit higher with the Python binding, but that's understandable).
I tried to keep the environment identical in both cases: an NVIDIA card is installed, but I run with n_gpu_layers=0 (CPU only).
With the Python binding it can take several seconds before the response starts. Token generation itself runs at a similar speed in both cases, but the llama.cpp binary starts responding immediately, while the Python binding takes seconds.
Am I the only one experiencing this?
I am using a Llama 3 model.
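For reference, this is roughly how I load the model through llama-cpp-python (a minimal sketch; the model path and n_ctx are illustrative, the point is n_gpu_layers=0 and verbose=True so the timings block below gets printed):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=0,  # CPU only, matching the llama.cpp run
    n_ctx=4096,
    verbose=True,    # prints the llama_print_timings block quoted below
)

out = llm("Why is the sky blue?", max_tokens=128)
print(out["choices"][0]["text"])
```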
The timings from the original llama.cpp binary are:
llama_print_timings: sample time = 92.31 ms / 1160 runs ( 0.08 ms per token, 12565.67 tokens per second)
The llama-cpp-python timings are:
llama_print_timings: sample time = 99.82 ms / 144 runs ( 0.69 ms per token, 1442.57 tokens per second)
This seems like a big difference (roughly 9x in the reported sampling rate).
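To separate the startup delay from the sampling numbers, I measure the time to first token with streaming, roughly like this (a sketch, assuming `llm` was created as in the snippet above):

```python
import time

start = time.perf_counter()
first_token_at = None
for chunk in llm("Why is the sky blue?", max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()  # first streamed chunk arrived

print(f"time to first token: {first_token_at - start:.2f} s")
```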