
Latency grows substantially as batch size increases, even with small batch sizes #202

Open
joehoover opened this issue Jul 27, 2023 · 2 comments

Comments

@joehoover

Thanks for the wonderful repo, @turboderp!

I'm benchmarking latency on an A100 and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to (logs included below):

  • bs=1 vs bs=2: ~30% increase in latency
  • bs=2 vs bs=4: ~30% increase in latency

I'd love to know if I'm missing something or if this is expected!

Setup

I'm benchmarking with TheBloke's llama-2-13B-chat-GPTQ checkpoint (gptq_model-4bit-128g).

I'm using test_benchmark_generation.py with some minimal modifications to run these benchmarks.

I'm instantiating the cache with the batch size and warming up with a batch of ids.
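
Roughly, the setup looks like this (a sketch, not the exact benchmark code; the paths are placeholders and the cache reset at the end is my assumption about how state gets cleared between runs):

import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig

batch_size = 4
config = ExLlamaConfig("/path/to/config.json")                   # placeholder path
config.model_path = "/path/to/gptq_model-4bit-128g.safetensors"  # placeholder path
model = ExLlama(config)

# size the cache for the whole batch
cache = ExLlamaCache(model, batch_size = batch_size)

# warm up with a batched prompt so the timed runs don't pay one-time costs
warmup_ids = torch.randint(0, 31999, (batch_size, 4)).cuda()
model.forward(warmup_ids, cache)
cache.current_seq_len = 0  # assumed: reset cache state before the timed generation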

I'm generating tokens like:

ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).repeat(batch_size, 1).cuda()  # one random prompt, repeated across the batch

...

for i in range(gen_tokens):

    # greedy-decode one token per sequence from the last position's logits
    logits = logits[:, -1, :]
    id_per_batch = torch.argmax(logits, dim=-1)
    assert id_per_batch.shape == (batch_size,), f"{id_per_batch.shape} != {(batch_size,)}"

    # append the new column of token ids and run one decode step for the whole batch
    next_id_per_batch = id_per_batch.unsqueeze(-1)
    sequence = torch.cat((sequence, next_id_per_batch), dim=-1)
    logits = next_logits(next_id_per_batch, lora)

bs=1

 ** Batch size: 1
         ** Latency: 46.10 tokens/second
         ** Throughput: 46.10 tokens/second
         ** Total time: 2.776369094848633
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 1
         ** Latency: 54.13 tokens/second
         ** Throughput: 54.13 tokens/second
         ** Total time: 2.364816427230835

bs=2

-- Generating 128 tokens, 1920 token prompt...
 ** Batch size: 2
         ** Latency: 33.04 tokens/second
         ** Throughput: 66.08 tokens/second
         ** Total time: 3.87422776222229
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 2
         ** Latency: 41.16 tokens/second
         ** Throughput: 82.31 tokens/second
         ** Total time: 3.1100616455078125

bs=4

 -- Generating 128 tokens, 1920 token prompt...
 ** Batch size: 4
         ** Latency: 23.13 tokens/second
         ** Throughput: 92.50 tokens/second
         ** Total time: 5.535064220428467
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 4
         ** Latency: 28.07 tokens/second
         ** Throughput: 112.26 tokens/second
         ** Total time: 4.560673952102661
@turboderp
Owner

The kernels are very specifically optimized for matrix-vector operations (batch size = 1). ExLlama also does well on matrix-matrix operations by reconstructing full-precision matrices on the fly and relying on cuBLAS. The in-between territory is problematic, but I guess the question is what sort of throughput you would expect?
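
Conceptually, the split looks something like this (a hypothetical PyTorch sketch, not ExLlama's actual kernel code; the int8 weight and toy dequantization are stand-ins for real 4-bit GPTQ storage):

import torch

def dequantize_toy(w_q, scale):
    # stand-in for on-the-fly quantized -> full-precision reconstruction
    return w_q.to(torch.float32) * scale

def quant_linear_toy(x, w_q, scale):
    if x.shape[0] == 1:
        # matrix-vector case: a fused kernel can read the quantized weights directly
        # and never materialize a full-precision matrix (only emulated here)
        return x @ dequantize_toy(w_q, scale)
    else:
        # matrix-matrix case: reconstruct the full matrix, then hand the GEMM to
        # cuBLAS via torch.matmul; the reconstruction cost is fixed, so it only
        # amortizes well at larger batch sizes, leaving batch sizes 2-4 in between
        w = dequantize_toy(w_q, scale)
        return x @ w

w_q = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8)
scale = 0.01
x1 = torch.randn(1, 4096)   # batch size 1: matvec-style path
x4 = torch.randn(4, 4096)   # batch size 4: dequantize + GEMM path
y1, y4 = quant_linear_toy(x1, w_q, scale), quant_linear_toy(x4, w_q, scale)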

@fxmarty

fxmarty commented Aug 18, 2023

The kernels from the NVIDIA folks at https://github.com/tlc-pack/cutlass_fpA_intB_gemm are probably interesting for the batched scenario.
