I got access to an Azure server with an NVIDIA H100 GPU, ran some very quick benchmarks, and wanted to share the results.
Bench params:
Model: HuggingFaceH4/zephyr-7b-beta
Backend: vLLM
Backend already started, model preloaded into GPU memory
System prompt: "You are a very succinct assistant and only do what you're told to do"
User prompts: "Just repeat the following number: [REQUEST_NUMBER]"
API path: /v1/chat/completions
Stream: true
LocalAI does the chat templating
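For concreteness, here's roughly what each request body looked like, as a minimal Python sketch (the payload follows the standard OpenAI-compatible chat format that LocalAI exposes on /v1/chat/completions; substituting the request's sequential index for [REQUEST_NUMBER] is my assumption):

```python
# Sketch of the per-request payload (OpenAI-compatible chat completion).
# request_number stands in for the [REQUEST_NUMBER] placeholder above.
def build_payload(request_number: int) -> dict:
    return {
        "model": "HuggingFaceH4/zephyr-7b-beta",
        "stream": True,
        "messages": [
            {
                "role": "system",
                "content": "You are a very succinct assistant and only do what you're told to do",
            },
            {
                "role": "user",
                "content": f"Just repeat the following number: {request_number}",
            },
        ],
    }
```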
I didn't spend much time on it and don't have tons of stats, but I ran 10K queries with some variation in the number of parallel requests and recorded the maximum times between request submission and result:
FT = longest time to first token (the longest time it took for LocalAI to start streaming the model answer)
LT = longest time to last token (the longest time it took for LocalAI to finish streaming the model answer)
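For anyone wanting to reproduce something similar, here's a minimal sketch of how FT and LT can be measured for a batch of parallel streaming requests. This is not the exact harness used here; it assumes a LocalAI endpoint at http://localhost:8080 and reuses the hypothetical `build_payload` helper from above:

```python
import asyncio
import time
import aiohttp

API_URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint

async def timed_request(session: aiohttp.ClientSession, request_number: int) -> tuple[float, float]:
    """Return (time to first token, time to last token) in seconds for one streaming request."""
    start = time.monotonic()
    first_token_at = None
    async with session.post(API_URL, json=build_payload(request_number)) as resp:
        # The streamed answer arrives as SSE chunks; we only care about when they arrive.
        async for _chunk in resp.content:
            if first_token_at is None:
                first_token_at = time.monotonic()
    end = time.monotonic()
    if first_token_at is None:  # nothing streamed back; count the whole request
        first_token_at = end
    return first_token_at - start, end - start

async def run_batch(parallel: int) -> tuple[float, float]:
    """Fire `parallel` requests at once and report the worst FT and LT in the batch."""
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(timed_request(session, i) for i in range(parallel)))
    return max(ft for ft, _ in results), max(lt for _, lt in results)

if __name__ == "__main__":
    worst_ft, worst_lt = asyncio.run(run_batch(parallel=100))
    print(f"FT={worst_ft:.2f}s LT={worst_lt:.2f}s")
```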
So yes, this is an extremely beefy GPU, a small model, and an easy prompt, and the performance degrades way past usability... but LocalAI handled 10K parallel requests and absolutely never errored out!
You've built an amazing thing @mudler!