
Commit

fix: Fix extra meta
Hugoch committed Sep 27, 2024
1 parent 4e28b9e commit f98d37c
Showing 2 changed files with 52 additions and 34 deletions.
65 changes: 41 additions & 24 deletions README.md
@@ -3,35 +3,35 @@
A lightweight benchmarking tool for LLM inference servers.
Benchmarks using constant arrival rate or constant virtual user count.



![ui.png](assets/ui.png)

## Table of contents

<!-- TOC -->

* [Table of contents](#table-of-contents)
* [TODO](#todo)
* [Get started](#get-started)
* [Run a benchmark](#run-a-benchmark)
* [Configure your benchmark](#configure-your-benchmark)
* [Benchmark mode](#benchmark-mode)
* [Dataset configuration](#dataset-configuration)
* [Prompt configuration](#prompt-configuration)
* [Development](#development)
* [Frequently Asked Questions](#frequently-asked-questions)

<!-- TOC -->

## TODO

- [X] Customizable token count and variance
- [ ] Check results
- [X] Allow for system prompts for prefix caching
- [ ] Allow for multi-turn prompts
- [ ] Push results to Optimum benchmark backend
- [X] Script to generate plots from results


## Get started

### Run a benchmark
@@ -57,8 +57,7 @@ $ docker run \
--decode-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```

Results will be saved in JSON format in the current directory.
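
The schema of the results file is not documented here, so the snippet below is only an illustrative sketch, not part of the tool: it loads whatever JSON the run produced and lists its top-level fields. It assumes the `serde_json` crate and a file named `results.json`; adjust the name to match your run.

```rust
// Assumed dependency: serde_json = "1"
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Adjust the file name to whatever the benchmark run wrote out.
    let raw = fs::read_to_string("results.json")?;
    let results: serde_json::Value = serde_json::from_str(&raw)?;
    // Print the top-level fields without assuming any particular schema.
    if let Some(map) = results.as_object() {
        for key in map.keys() {
            println!("top-level field: {key}");
        }
    }
    Ok(())
}
```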

### Configure your benchmark

@@ -68,18 +67,20 @@ In default mode, the tool runs a `sweep` benchmark. It first runs a throughput test to find the maximum throughput, then
sweeps on QPS values up to the maximum throughput.

Available modes:

- `sweep`: runs a sweep benchmark
- `rate`: runs a benchmark at a fixed request rate
- `throughput`: runs a benchmark at a fixed throughput (constant VUs)


#### Dataset configuration

Prompts are sampled from a Hugging Face dataset file, using a [subset of ShareGPT
as default](https://huggingface.co/datasets/hlarcher/share_gpt_small). You can specify a different dataset file using
the `--dataset` and `--dataset-file` options.

The dataset is expected to be JSON with the following format:

```json
[
{
@@ -94,6 +95,7 @@
```

To benchmark with prefix caching, you can add a system prompt that will be sent with each request of a conversation.

```json
[
{
@@ -111,31 +113,46 @@
]
```


#### Prompt configuration

For consistent results, you can configure the prompt token count and variance. The tool samples prompt token counts
from a normal distribution with the specified variance.

```shell
--prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```
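
As an illustration of the sampling described above, here is a minimal sketch that draws a prompt length from a normal distribution with the given variance and keeps it between `min_tokens` and `max_tokens`. The clamping behaviour and the `rand`/`rand_distr` crates are assumptions made for the sketch, not a description of the tool's actual implementation.

```rust
// Assumed dependencies: rand = "0.8", rand_distr = "0.4"
use rand::thread_rng;
use rand_distr::{Distribution, Normal};

// Draw a token count around `num_tokens` with the given variance, assumed here
// to be bounded by `min_tokens` and `max_tokens`.
fn sample_num_tokens(num_tokens: f64, variance: f64, min_tokens: i64, max_tokens: i64) -> i64 {
    let normal = Normal::new(num_tokens, variance.sqrt()).expect("invalid distribution parameters");
    let draw = normal.sample(&mut thread_rng()).round() as i64;
    draw.clamp(min_tokens, max_tokens)
}

fn main() {
    // Mirrors --prompt-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
    let lengths: Vec<i64> = (0..5).map(|_| sample_num_tokens(50.0, 10.0, 40, 60)).collect();
    println!("sampled prompt lengths: {lengths:?}");
}
```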

#### Decode options

You can also configure the decoding options for the model. The number of generated tokens is sampled from a normal
distribution with the specified variance.

```shell
--decode-options "num_tokens=50,max_tokens=60,min_tokens=40,variance=10"
```

## Development

You need [Rust](https://rustup.rs/) installed to build the benchmarking tool.

```shell
$ make build
```


## Frequently Asked Questions

* **What's the difference between constant arrival rate and constant virtual user count?**
    * **Constant virtual user count** means that the number of virtual users is fixed. Each virtual user sends a
      single request and waits for the server response. It essentially simulates a fixed number of users querying the
      server.
    * **Constant arrival rate** means that the rate of requests is fixed and the number of virtual users is adjusted to
      maintain that rate. Queries hit the server independently of response performance.

  **Constant virtual user count** is a closed-loop model where the server's response time dictates the number of
  iterations. **Constant arrival rate** is an open-loop model more representative of real-life workloads (see the first
  sketch after this FAQ).

* **What is the influence of CUDA graphs?**
  CUDA graphs are used to optimize GPU usage by minimizing the overhead of launching kernels. This can lead to better
  performance in some cases, but can also lead to worse performance in others.
  If your CUDA graph sizes are not evenly distributed, you may see a performance drop at some request rates: the actual
  batch size can fall into a larger CUDA graph batch size, wasting compute on excessive padding (see the second sketch
  after this FAQ).
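
To make the closed-loop / open-loop distinction concrete, here is the first sketch referenced above. It is not the benchmarker's actual scheduler; it assumes the `tokio` runtime and a dummy `send_request` function in place of a real call to the inference server.

```rust
// Cargo.toml assumption: tokio = { version = "1", features = ["full"] }
use std::time::Duration;
use tokio::time::{interval, sleep};

// Stand-in for a real HTTP request to the inference server.
async fn send_request(id: u32) {
    sleep(Duration::from_millis(200)).await; // pretend the server takes 200 ms
    println!("response {id} received");
}

// Closed loop (constant virtual users): each user waits for a response before
// sending the next request, so the effective request rate follows server latency.
async fn closed_loop_user(requests: u32) {
    for id in 0..requests {
        send_request(id).await;
    }
}

// Open loop (constant arrival rate): requests are fired on a fixed schedule,
// regardless of how long earlier responses take.
async fn open_loop(requests: u32, qps: u64) {
    let mut ticker = interval(Duration::from_millis(1000 / qps));
    for id in 0..requests {
        ticker.tick().await;            // wait for the next arrival slot
        tokio::spawn(send_request(id)); // do not wait for the response
    }
    sleep(Duration::from_millis(300)).await; // let in-flight requests finish
}

#[tokio::main]
async fn main() {
    closed_loop_user(3).await;
    open_loop(3, 10).await;
}
```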
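And the second sketch referenced above: a back-of-the-envelope calculation of the padding effect. The bucket sizes are hypothetical examples, not the server's actual CUDA graph configuration.

```rust
// Find the smallest CUDA graph bucket that can hold the current batch.
fn graph_bucket(batch: usize, buckets: &[usize]) -> Option<usize> {
    buckets.iter().copied().find(|&b| b >= batch)
}

fn main() {
    let buckets = [1, 2, 4, 8, 16, 32, 64]; // hypothetical capture sizes
    let batch = 17;
    if let Some(bucket) = graph_bucket(batch, &buckets) {
        let padded = bucket - batch;
        let wasted = 100.0 * padded as f64 / bucket as f64;
        // batch 17 lands in the 32-slot graph: 15 padded slots, ~47% wasted compute
        println!("batch {batch} runs in the {bucket}-slot graph: {padded} padded slots (~{wasted:.0}% wasted)");
    }
}
```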
21 changes: 11 additions & 10 deletions src/main.rs
@@ -81,11 +81,10 @@ struct Args {
/// File to use in the Dataset
#[clap(default_value = "share_gpt_filtered_small.json", long, env)]
dataset_file: String,
    /// Extra metadata to include in the benchmark results file, comma-separated key-value pairs.
    /// It can be, for example, used to include information about the configuration of the
    /// benched server.
    /// Example: --extra-meta "key1=value1,key2=value2"
#[clap(long, env, value_parser(parse_key_val))]
extra_meta: Option<HashMap<String, String>>,
}
@@ -102,15 +101,17 @@ fn parse_url(s: &str) -> Result<String, Error> {
}

fn parse_key_val(s: &str) -> Result<HashMap<String, String>, Error> {
    let mut key_val_map = HashMap::new();
    // Accept comma-separated key=value pairs, e.g. "key1=value1,key2=value2".
    let items = s.split(",").collect::<Vec<&str>>();
    for item in items.iter() {
        let key_value = item.split("=").collect::<Vec<&str>>();
        if key_value.len() % 2 != 0 {
            return Err(Error::new(InvalidValue));
        }
        for i in 0..key_value.len() / 2 {
            key_val_map.insert(key_value[i * 2].to_string(), key_value[i * 2 + 1].to_string());
        }
    }

Ok(key_val_map)
}
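// Illustrative addition (not part of this commit): a quick check that the new
// comma-separated parsing behaves as the README example suggests, e.g.
// --extra-meta "key1=value1,key2=value2".
#[cfg(test)]
mod extra_meta_tests {
    use super::*;

    #[test]
    fn parses_comma_separated_pairs() {
        let parsed = parse_key_val("key1=value1,key2=value2").unwrap();
        assert_eq!(parsed.get("key1"), Some(&"value1".to_string()));
        assert_eq!(parsed.get("key2"), Some(&"value2".to_string()));
    }
}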


0 comments on commit f98d37c
