
Commit

fix: Fix extra meta
Hugoch committed Sep 27, 2024
1 parent 4e28b9e commit f98d37c
Showing 2 changed files with 52 additions and 34 deletions.
65 changes: 41 additions & 24 deletions README.md
@@ -3,35 +3,35 @@
A lightweight benchmarking tool for LLM inference servers.
Benchmarks using constant arrival rate or constant virtual user count.



![ui.png](assets/ui.png)

## Table of contents

<!-- TOC -->

* [Table of contents](#table-of-contents)
* [TODO](#todo)
* [Get started](#get-started)
* [Run a benchmark](#run-a-benchmark)
* [Configure your benchmark](#configure-your-benchmark)
* [Benchmark mode](#benchmark-mode)
* [Dataset configuration](#dataset-configuration)
* [Prompt configuration](#prompt-configuration)
* [Development](#development)
* [Frequently Asked Questions](#frequently-asked-questions)

<!-- TOC -->

## TODO

- [X] Customizable token count and variance
- [ ] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [ ] Push results to Optimum benchmark backend
- [X] Script to generate plots from results


## Get started

### Run a benchmark
@@ -57,8 +57,7 @@ $ docker run \
--decode-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```

Results will be saved in JSON format in the current directory.
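
The schema of the results file is not documented here, so the snippet below is only an illustrative sketch, not part of the tool: it loads whatever JSON the run produced and lists its top-level fields. It assumes the `serde_json` crate and a file named `results.json`; adjust the name to match your run.

```rust
// Assumed dependency: serde_json = "1"
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Adjust the file name to whatever the benchmark run wrote out.
    let raw = fs::read_to_string("results.json")?;
    let results: serde_json::Value = serde_json::from_str(&raw)?;
    // Print the top-level fields without assuming any particular schema.
    if let Some(map) = results.as_object() {
        for key in map.keys() {
            println!("top-level field: {key}");
        }
    }
    Ok(())
}
```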

### Configure your benchmark

@@ -68,18 +67,20 @@ In default mode, the tool runs a `sweep` benchmark. It first runs a throughput test to find the maximum throughput, then
sweeps on QPS values up to the maximum throughput.

Available modes:

- `sweep`: runs a sweep benchmark
- `rate`: runs a benchmark at a fixed request rate
- `throughput`: runs a benchmark at a fixed throughput (constant VUs)


#### Dataset configuration

Prompts are sampled from a Hugging Face dataset file, using a [subset of ShareGPT
as default](https://huggingface.co/datasets/hlarcher/share_gpt_small). You can specify a different dataset file using
the `--dataset` and `--dataset-file` options.

The dataset is expected to be JSON with the following format:

```json
[
{
@@ -94,6 +95,7 @@
```

To benchmark with prefix caching, you can add a system prompt that will be sent with each request of a conversation.

```json
[
{
@@ -111,31 +113,46 @@
]
```


#### Prompt configuration

For consistent results, you can configure the prompt token count and variance. The tool samples prompt token counts
from a normal distribution with the specified variance.

```shell
--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```
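
As an illustration of the sampling described above, here is a minimal sketch that draws a prompt length from a normal distribution with the given variance and keeps it between `min_tokens` and `max_tokens`. The clamping behaviour and the `rand`/`rand_distr` crates are assumptions made for the sketch, not a description of the tool's actual implementation.

```rust
// Assumed dependencies: rand = "0.8", rand_distr = "0.4"
use rand::thread_rng;
use rand_distr::{Distribution, Normal};

// Draw a token count around `num_tokens` with the given variance, assumed here
// to be bounded by `min_tokens` and `max_tokens`.
fn sample_num_tokens(num_tokens: f64, variance: f64, min_tokens: i64, max_tokens: i64) -> i64 {
    let normal = Normal::new(num_tokens, variance.sqrt()).expect("invalid distribution parameters");
    let draw = normal.sample(&mut thread_rng()).round() as i64;
    draw.clamp(min_tokens, max_tokens)
}

fn main() {
    // Mirrors --prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
    let lengths: Vec<i64> = (0..5).map(|_| sample_num_tokens(50.0, 10.0, 40, 60)).collect();
    println!("sampled prompt lengths: {lengths:?}");
}
```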

#### Decode options

You can also configure the decoding options for the model. The number of generated tokens is sampled from a normal
distribution with the specified variance.

```shell
--decode-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```

## Development

You need [Rust](https://rustup.rs/) installed to build the benchmarking tool.

```shell
$ make build
```


## Frequently Asked Questions

* **What's the difference between constant arrival rate and constant virtual user count?**
    * **Constant virtual user count** means that the number of virtual users is fixed. Each virtual user sends a
      single request and waits for the server response. It essentially simulates a fixed number of users querying the
      server.
    * **Constant arrival rate** means that the rate of requests is fixed and the number of virtual users is adjusted to
      maintain that rate. Queries hit the server independently of response performance.

  **Constant virtual user count** is a closed-loop model where the server's response time dictates the number of
  iterations. **Constant arrival rate** is an open-loop model more representative of real-life workloads (see the first
  sketch after this FAQ).

* **What is the influence of CUDA graphs?**
  CUDA graphs are used to optimize GPU usage by minimizing the overhead of launching kernels. This can lead to better
  performance in some cases, but can also lead to worse performance in others.
  If your CUDA graph sizes are not evenly distributed, you may see a performance drop at some request rates: the actual
  batch size can fall into a larger CUDA graph batch size, wasting compute on excessive padding (see the second sketch
  after this FAQ).
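
To make the closed-loop / open-loop distinction concrete, here is the first sketch referenced above. It is not the benchmarker's actual scheduler; it assumes the `tokio` runtime and a dummy `send_request` function in place of a real call to the inference server.

```rust
// Cargo.toml assumption: tokio = { version = "1", features = ["full"] }
use std::time::Duration;
use tokio::time::{interval, sleep};

// Stand-in for a real HTTP request to the inference server.
async fn send_request(id: u32) {
    sleep(Duration::from_millis(200)).await; // pretend the server takes 200 ms
    println!("response {id} received");
}

// Closed loop (constant virtual users): each user waits for a response before
// sending the next request, so the effective request rate follows server latency.
async fn closed_loop_user(requests: u32) {
    for id in 0..requests {
        send_request(id).await;
    }
}

// Open loop (constant arrival rate): requests are fired on a fixed schedule,
// regardless of how long earlier responses take.
async fn open_loop(requests: u32, qps: u64) {
    let mut ticker = interval(Duration::from_millis(1000 / qps));
    for id in 0..requests {
        ticker.tick().await;            // wait for the next arrival slot
        tokio::spawn(send_request(id)); // do not wait for the response
    }
    sleep(Duration::from_millis(300)).await; // let in-flight requests finish
}

#[tokio::main]
async fn main() {
    closed_loop_user(3).await;
    open_loop(3, 10).await;
}
```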
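And the second sketch referenced above: a back-of-the-envelope calculation of the padding effect. The bucket sizes are hypothetical examples, not the server's actual CUDA graph configuration.

```rust
// Find the smallest CUDA graph bucket that can hold the current batch.
fn graph_bucket(batch: usize, buckets: &[usize]) -> Option<usize> {
    buckets.iter().copied().find(|&b| b >= batch)
}

fn main() {
    let buckets = [1, 2, 4, 8, 16, 32, 64]; // hypothetical capture sizes
    let batch = 17;
    if let Some(bucket) = graph_bucket(batch, &buckets) {
        let padded = bucket - batch;
        let wasted = 100.0 * padded as f64 / bucket as f64;
        // batch 17 lands in the 32-slot graph: 15 padded slots, ~47% wasted compute
        println!("batch {batch} runs in the {bucket}-slot graph: {padded} padded slots (~{wasted:.0}% wasted)");
    }
}
```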
21 changes: 11 additions & 10 deletions src/main.rs
@@ -81,11 +81,10 @@ struct Args {
/// File to use in the Dataset
#[clap(default_value = "share_gpt_filtered_small.json", long, env)]
dataset_file: String,
    /// Extra metadata to include in the benchmark results file, comma-separated key-value pairs.
    /// It can be, for example, used to include information about the configuration of the
    /// benched server.
    /// Example: --extra-meta "key1=value1,key2=value2"
#[clap(long, env, value_parser(parse_key_val))]
extra_meta: Option<HashMap<String, String>>,
}
@@ -102,15 +101,17 @@ fn parse_url(s: &str) -> Result<String, Error> {
}

fn parse_key_val(s: &str) -> Result<HashMap<String, String>, Error> {
    let mut key_val_map = HashMap::new();
    // Accept comma-separated key=value pairs, e.g. "key1=value1,key2=value2".
    let items = s.split(",").collect::<Vec<&str>>();
    for item in items.iter() {
        let key_value = item.split("=").collect::<Vec<&str>>();
        if key_value.len() % 2 != 0 {
            return Err(Error::new(InvalidValue));
        }
        for i in 0..key_value.len() / 2 {
            key_val_map.insert(key_value[i * 2].to_string(), key_value[i * 2 + 1].to_string());
        }
    }

Ok(key_val_map)
}
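// Illustrative addition (not part of this commit): a quick check that the new
// comma-separated parsing behaves as the README example suggests, e.g.
// --extra-meta "key1=value1,key2=value2".
#[cfg(test)]
mod extra_meta_tests {
    use super::*;

    #[test]
    fn parses_comma_separated_pairs() {
        let parsed = parse_key_val("key1=value1,key2=value2").unwrap();
        assert_eq!(parsed.get("key1"), Some(&"value1".to_string()));
        assert_eq!(parsed.get("key2"), Some(&"value2".to_string()));
    }
}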


0 comments on commit f98d37c
