
Latency grows substantially as batch size increases, even with small batch sizes #202

Open
joehoover opened this issue Jul 27, 2023 · 2 comments

Comments

@joehoover

Thanks for the wonderful repo, @turboderp!

I'm benchmarking latency on an A100 and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to (logs included below):

  • bs=1 vs bs=2: ~30% increase in latency
  • bs=2 vs bs=4: ~30% increase in latency

I'd love to know if I'm missing something or if this is expected!

Setup

I'm benchmarking with TheBloke's llama-2-13B-chat-GPTQ checkpoint (gptq_model-4bit-128g).

I'm using test_benchmark_generation.py with some minimal modifications to run these benchmarks.

I'm instantiating the cache with the batch size and warming up with a batch of ids.
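
Roughly, the setup looks like this (a sketch, not the exact benchmark code; the paths are placeholders and the cache reset at the end is my assumption about how state gets cleared between runs):

import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig

batch_size = 4
config = ExLlamaConfig("/path/to/config.json")                   # placeholder path
config.model_path = "/path/to/gptq_model-4bit-128g.safetensors"  # placeholder path
model = ExLlama(config)

# size the cache for the whole batch
cache = ExLlamaCache(model, batch_size = batch_size)

# warm up with a batched prompt so the timed runs don't pay one-time costs
warmup_ids = torch.randint(0, 31999, (batch_size, 4)).cuda()
model.forward(warmup_ids, cache)
cache.current_seq_len = 0  # assumed: reset cache state before the timed generation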

I'm generating tokens like:

ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).repeat(batch_size, 1).cuda()  # one random prompt, repeated across the batch

...

for i in range(gen_tokens):

    # greedy-decode one token per sequence from the last position's logits
    logits = logits[:, -1, :]
    id_per_batch = torch.argmax(logits, dim=-1)
    assert id_per_batch.shape == (batch_size,), f"{id_per_batch.shape} != {(batch_size,)}"

    # append the new column of token ids and run one decode step for the whole batch
    next_id_per_batch = id_per_batch.unsqueeze(-1)
    sequence = torch.cat((sequence, next_id_per_batch), dim=-1)
    logits = next_logits(next_id_per_batch, lora)

bs=1

 ** Batch size: 1
         ** Latency: 46.10 tokens/second
         ** Throughput: 46.10 tokens/second
         ** Total time: 2.776369094848633
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 1
         ** Latency: 54.13 tokens/second
         ** Throughput: 54.13 tokens/second
         ** Total time: 2.364816427230835

bs=2

-- Generating 128 tokens, 1920 token prompt...
 ** Batch size: 2
         ** Latency: 33.04 tokens/second
         ** Throughput: 66.08 tokens/second
         ** Total time: 3.87422776222229
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 2
         ** Latency: 41.16 tokens/second
         ** Throughput: 82.31 tokens/second
         ** Total time: 3.1100616455078125

bs=4

 -- Generating 128 tokens, 1920 token prompt...
 ** Batch size: 4
         ** Latency: 23.13 tokens/second
         ** Throughput: 92.50 tokens/second
         ** Total time: 5.535064220428467
 -- Generating 128 tokens, 4 token prompt...
 ** Batch size: 4
         ** Latency: 28.07 tokens/second
         ** Throughput: 112.26 tokens/second
         ** Total time: 4.560673952102661
@turboderp
Owner

The kernels are very specifically optimized for matrix-vector operations (batch size = 1). ExLlama also does well on matrix-matrix operations by reconstructing full-precision matrices on the fly and relying on cuBLAS. The in-between territory is problematic, but I guess the question is what sort of throughput you would expect?
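
Conceptually, the split looks something like this (a hypothetical PyTorch sketch, not ExLlama's actual kernel code; the int8 weight and toy dequantization are stand-ins for real 4-bit GPTQ storage):

import torch

def dequantize_toy(w_q, scale):
    # stand-in for on-the-fly quantized -> full-precision reconstruction
    return w_q.to(torch.float32) * scale

def quant_linear_toy(x, w_q, scale):
    if x.shape[0] == 1:
        # matrix-vector case: a fused kernel can read the quantized weights directly
        # and never materialize a full-precision matrix (only emulated here)
        return x @ dequantize_toy(w_q, scale)
    else:
        # matrix-matrix case: reconstruct the full matrix, then hand the GEMM to
        # cuBLAS via torch.matmul; the reconstruction cost is fixed, so it only
        # amortizes well at larger batch sizes, leaving batch sizes 2-4 in between
        w = dequantize_toy(w_q, scale)
        return x @ w

w_q = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8)
scale = 0.01
x1 = torch.randn(1, 4096)   # batch size 1: matvec-style path
x4 = torch.randn(4, 4096)   # batch size 4: dequantize + GEMM path
y1, y4 = quant_linear_toy(x1, w_q, scale), quant_linear_toy(x4, w_q, scale)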

@fxmarty

fxmarty commented Aug 18, 2023

The kernels from the NVIDIA folks at https://github.com/tlc-pack/cutlass_fpA_intB_gemm are probably interesting for the batched scenario.
