Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response #6495

Merged
merged 6 commits into from
Apr 6, 2024

Conversation

phymbert
Copy link
Collaborator

@phymbert phymbert commented Apr 4, 2024

Motivation

In the context of:

The prompt processing (pp) per second was not accurate because streaming was not enabled, so it was also including the token generation time.

SSE will not be supported in k6 core: the GrafanaLabs team proposed to introduce a dedicated xk6 extension repository, so I introduced xk6-sse extension (appreciate if you can also have a look).

Changes

  • utils.hpp includes now the usage field in the last chunk, it is not OAI compatible but I think it does not hurt and it is useful for the client to retrieve the number of prompt tokens according to the tokenizer.
  • Add guidance to build the xk6 extension and run the bench
  • Fix PR bot comment

Tests

server: add tokens usage in stream mode
@phymbert phymbert requested review from ggerganov and ngxson April 4, 2024 22:12
@phymbert phymbert merged commit 75cd4c7 into master Apr 6, 2024
66 of 70 checks passed
@phymbert phymbert deleted the hp/server/bench/sse branch April 6, 2024 03:40
@ggerganov
Copy link
Owner

It's hard for me to review this because I'm not familiar with the technology.

Which part of the benchmark requires SSE and is there a way to avoid it? The proposed approach is not ideal since the benchmark will now depend on an external package / extension, while we want llama.cpp to be self-contained as much as possible. We already depended on k6, which is also not very desirable, but for these type of benchmarks it is sort of OK. In general, the long-term goals should be to reduce such kind of dependencies as much as possible

@phymbert
Copy link
Collaborator Author

phymbert commented Apr 8, 2024

Yes sorry I merged without waiting for your feedback. I was asking you if we can depend on xk6-sse then I deleted my comment and merged, apologize.

I need SSE to get the exact prompt processing time from client side, I get it at the Time to Emit First Token, but I need streaming enabled and k6 does not support it.

As it is only for server benchmark purpose, I think it is ok ? I am supporting it on my own..., or we can move it to examples/server/bench/xk6-sse but I have the feeling it has nothing to do here.

I think it can be difficult to implement benchmark without proper tool, but if you prefer we can try to replace k6 by plain old python or move to Gatling.

Up to you, I will take the time if you require to go to the right direction.

@ggerganov
Copy link
Owner

As it is only for server benchmark purpose, I think it is ok ?

We can work with this for now since we are not going to require users/devs to run the benchmarks on their own, but we have to look for alternatives to simplify.

I need SSE to get the exact prompt processing time from client side

I see. I can understand it is more accurate to measure from the client side, but it might be simpler to see if there is any significant difference between client side measured speed and server side reported speed. If there isn't a big difference (which is what we expect) then we can stick with the simpler solution of using server reported speed for now and avoid SSE requirement.

@phymbert
Copy link
Collaborator Author

phymbert commented Apr 8, 2024

Understood, actually there is a difference:

{"i":391,"req":{"p95":31109.26,"avg":11862.55},"pp":{"p95":734.81,"avg":146.57,"0":827.86},"tg":{"p95":24.75,"avg":23.72,"0":17.74}}

"0" is the average from prometheus /metrics endpoint and other values comes from k6 metrics. Maybe because on the client side, pp ends after the first token, so it includes 1 tg.

So probably better to remove all this :)

BTW it looks the T4 node is down:
https://github.com/ggerganov/llama.cpp/actions/workflows/bench.yml?query=is%3Aqueued

Can you please have a look ?

@ggerganov
Copy link
Owner

Restarted - should be running now

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
…okens usage in stream OAI response (ggerganov#6495)

* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as it is not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants