I'm benchmarking latency on an A100 and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to (logs included below):
bs=1 vs bs=2: ~30% increase in latency
bs=2 vs bs=4: ~30% increase in latency
I'd love to know if I'm missing something or if this is expected!
Setup
I'm benchmarking with TheBloke's gptq_model-4bit-128g llama-2-13B-chat-GPTQ checkpoint.
I'm using test_benchmark_generation.py with some minimal modifications to run these benchmarks.
I'm instantiating the cache with the batch size, warming up with a batch of ids, and then generating tokens.
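Roughly, the flow is the following (a simplified sketch rather than my exact modified script; the model path, warm-up length, and greedy decode loop here are illustrative placeholders):

```python
import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig

batch_size = 4                                   # swept over 1, 2, 4
model_dir = "/models/llama-2-13B-chat-GPTQ"      # placeholder path

config = ExLlamaConfig(f"{model_dir}/config.json")
config.model_path = f"{model_dir}/gptq_model-4bit-128g.safetensors"
model = ExLlama(config)

# Cache sized for the whole batch up front
cache = ExLlamaCache(model, batch_size = batch_size)

# Warm-up: push a batch of ids through once so kernels/allocations are
# initialized before timing, then reset the cache
warmup_ids = torch.randint(0, 32000, (batch_size, 128))
model.forward(warmup_ids, cache, preprocess_only = True)
cache.current_seq_len = 0

# Timed decode loop: one token per step for the whole batch
# (greedy here, just for illustration)
ids = torch.randint(0, 32000, (batch_size, 1))
start = torch.cuda.Event(enable_timing = True)
end = torch.cuda.Event(enable_timing = True)
torch.cuda.synchronize()
start.record()
for _ in range(128):
    logits = model.forward(ids, cache)
    ids = logits[:, -1, :].argmax(dim = -1, keepdim = True).cpu()
end.record()
torch.cuda.synchronize()
print(f"bs={batch_size}: {start.elapsed_time(end) / 128:.2f} ms/token")
```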
The kernels are very specifically optimized for matrix-vector operations (batch size = 1). They also do well on matrix-matrix operations by reconstructing full-precision matrices on the fly and relying on cuBLAS. The in-between territory is problematic, but I guess the question is what sort of throughput you would expect?
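In rough pseudocode, the dispatch looks something like this (an illustrative sketch, not the actual CUDA code; `matvec_kernel` and `dequantize` stand in for the real fused kernel and reconstruction step):

```python
import torch

def quant_linear_forward(x, q_weight, scales, zeros, matvec_kernel, dequantize):
    """x: (rows, in_features) activations for one quantized linear layer."""
    if x.shape[0] == 1:
        # Single row (batch size 1): fused 4-bit matrix-vector kernel that
        # reads the quantized weights directly, no reconstruction needed.
        return matvec_kernel(x, q_weight, scales, zeros)

    # Multiple rows: reconstruct the full-precision weight matrix on the fly
    # and hand the matrix-matrix product to cuBLAS via torch.matmul. The
    # reconstruction cost is amortized poorly at small batch sizes (2-4),
    # which is the awkward in-between regime.
    w = dequantize(q_weight, scales, zeros)    # (in_features, out_features), fp16
    return torch.matmul(x, w)
```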
Thanks for the wonderful repo, @turboderp!