I upgraded from an older version and am now seeing a disturbingly long delay before the response starts (prompt processing / time to first token).
The load on my machine is about the same (a bit higher with Python, but that's understandable).
I tried to reproduce the same environment, using an NVIDIA card but with n_gpu_layers=0.
With the Python binding it can take several seconds before the response starts. Token generation itself runs at a similar speed, but with the llama.cpp binary the response starts immediately, while with the Python binding it takes seconds.
Am I the only one experiencing this?
I am using a Llama 3 model.
So, the original binary's timings are:
llama_print_timings: sample time = 92.31 ms / 1160 runs ( 0.08 ms per token, 12565.67 tokens per second)
The llama-cpp-python timings are:
llama_print_timings: sample time = 99.82 ms / 144 runs ( 0.69 ms per token, 1442.57 tokens per second)
This seems like a big difference.
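For anyone trying to reproduce this, here is a minimal sketch of how I measure time to first token with llama-cpp-python in CPU-only mode. It is not the exact script from my setup; the model path is a placeholder, and the prompt and max_tokens are arbitrary.

import time
from llama_cpp import Llama

# Placeholder path; substitute your own Llama 3 GGUF file.
llm = Llama(
    model_path="./Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=0,  # keep everything on the CPU, matching the comparison above
    verbose=True,    # prints llama_print_timings after each call
)

start = time.perf_counter()
first_token_at = None
for chunk in llm("Q: Name the planets in the solar system. A:",
                 max_tokens=128, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
print(f"time to first token: {first_token_at - start:.2f} s")

With the binary, the equivalent delay (prompt eval time plus load time) is negligible for me; with the binding it is several seconds.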
Maximilian-Winter/llama-cpp-agent#54
That is probably related to my finding that llama-cpp-python with llama-cpp-agent is slower than gpt4all on follow-up prompts.
The first prompt is fast.
I encountered a similar problem: the model loads abnormally slowly on GPU (unified memory on ARM platforms), and the CPU only uses a single core / single thread. This problem only exists in the last few releases; it was working fine before. A sketch of the workaround I'm testing follows below.
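This is not a confirmed fix, just what I'm experimenting with: passing the thread counts explicitly instead of relying on the defaults, so a changed default cannot silently drop generation to a single core. The model path and offload value are placeholders.

import multiprocessing
from llama_cpp import Llama

n_cores = multiprocessing.cpu_count()
llm = Llama(
    model_path="./model.gguf",   # placeholder path
    n_gpu_layers=-1,             # offload all layers (unified memory on ARM)
    n_threads=n_cores,           # threads used for generation
    n_threads_batch=n_cores,     # threads used for prompt processing
)

If the defaults did change between releases, this at least makes the comparison between versions apples-to-apples.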