diff --git a/README.md b/README.md
index 1e73a9b..0758a7b 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
-Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inference
+Scale Efficiently: Evaluate and Enhance Your LLM Deployments for Real-World Inference
[![GitHub Release](https://img.shields.io/github/release/neuralmagic/guidellm.svg?label=Version)](https://github.com/neuralmagic/guidellm/releases) [![Documentation](https://img.shields.io/badge/Documentation-8A2BE2?logo=read-the-docs&logoColor=%23ffffff&color=%231BC070)](https://github.com/neuralmagic/guidellm/tree/main/docs) [![License](https://img.shields.io/github/license/neuralmagic/guidellm.svg)](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [![PyPI Release](https://img.shields.io/pypi/v/guidellm.svg?label=PyPI%20Release)](https://pypi.python.org/pypi/guidellm) [![Pypi Release](https://img.shields.io/pypi/v/guidellm-nightly.svg?label=PyPI%20Nightly)](https://pypi.python.org/pypi/guidellm-nightly) [![Python Versions](https://img.shields.io/badge/Python-3.8--3.12-orange)](https://pypi.python.org/pypi/guidellm) [![Nightly Build](https://img.shields.io/github/actions/workflow/status/neuralmagic/guidellm/nightly.yml?branch=main&label=Nightly%20Build)](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)
@@ -20,7 +20,7 @@ Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inf
-**GuideLLM** is a powerful tool for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM helps users gauge the performance, resource needs, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.
+**GuideLLM** is a powerful tool for evaluating and enhancing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM helps users gauge the performance, resource needs, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.
### Key Features
@@ -48,7 +48,7 @@ For detailed installation instructions and requirements, see the [Installation G
### Quick Start
-#### 1a. Start an OpenAI Compatible Server (vLLM)
+#### 1. Start an OpenAI Compatible Server (vLLM)
GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. To start a vLLM server with a Llama 3.1 8B quantized model, run the following command:
@@ -56,23 +56,11 @@ GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https:/
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```
-For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
-
-#### 1b. Start an OpenAI Compatible Server (Hugging Face TGI)
-
-GuideLLM requires an OpenAI-compatible server to run evaluations. [Text Generation Inference](https://github.com/huggingface/text-generation-inference) can be used here. To start a TGI server with a Llama 3.1 8B using docker, run the following command:
+For more information on installing vLLM, see the [vLLM Installation Documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html).
-```bash
-docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
- -e MODEL_ID=https://huggingface.co/llhf/Meta-Llama-3.1-8B-Instruct \
- -e NUM_SHARD=1 \
- -e MAX_INPUT_TOKENS=4096 \
- -e MAX_TOTAL_TOKENS=6000 \
- -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
- ghcr.io/huggingface/text-generation-inference:2.2.0
-```
+For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
-For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
+To view more GuideLLM compatible backends such as TGI, llama.cpp, and DeepSparse, check out our [Supported Backends documentation](https://github.com/neuralmagic/guidellm/blob/main/docs/guides/supported_backends.md).
#### 2. Run a GuideLLM Evaluation
@@ -108,7 +96,7 @@ The end of the output will include important performance summary metrics such as
#### 4. Use the Results
-The results from GuideLLM are used to optimize your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.
+The results from GuideLLM are used to enhance your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.
For example, if we deploy a latency-sensitive chat application, we likely want to optimize for low time to first token (TTFT) and inter-token latency (ITL). A reasonable threshold will depend on the application requirements. Still, we may want to ensure time to first token (TTFT) is under 200ms and inter-token latency (ITL) is under 50ms (20 updates per second). From the example results above, we can see that the model can meet these requirements on average at a request rate of 2.37 requests per second for each server. If you'd like to target a higher percentage of requests meeting these requirements, you can use the **Performance Stats by Benchmark** section to determine the rate at which 90% or 95% of requests meet these requirements.
diff --git a/docs/assets/perf_stats.png b/docs/assets/perf_stats.png
new file mode 100644
index 0000000..010e5ee
Binary files /dev/null and b/docs/assets/perf_stats.png differ
diff --git a/docs/assets/perf_summary.png b/docs/assets/perf_summary.png
new file mode 100644
index 0000000..cf456fb
Binary files /dev/null and b/docs/assets/perf_summary.png differ
diff --git a/docs/assets/request_data.png b/docs/assets/request_data.png
new file mode 100644
index 0000000..d8d9a51
Binary files /dev/null and b/docs/assets/request_data.png differ
diff --git a/docs/assets/tokens_data.png b/docs/assets/tokens_data.png
new file mode 100644
index 0000000..ab66959
Binary files /dev/null and b/docs/assets/tokens_data.png differ
diff --git a/docs/guides/cli.md b/docs/guides/cli.md
index d30962b..f65f4c5 100644
--- a/docs/guides/cli.md
+++ b/docs/guides/cli.md
@@ -1 +1,176 @@
-# Coming Soon
+
+# GuideLLM CLI User Guide
+
+For more details on setup and installation, see the [Installation](https://github.com/neuralmagic/guidellm?tab=readme-ov-file#installation) section of the main README.
+
+## About GuideLLM
+
+The GuideLLM CLI is a performance benchmarking tool that lets you evaluate and enhance your LLM inference serving before you deploy to production. GuideLLM simulates real user workloads through rigorous benchmarking so you can understand how your deployment performs, identify bottlenecks in your inference serving pipeline, and make changes before your users are affected by your LLM application.
+
+## GuideLLM CLI Quickstart
+
+#### 1. Start an OpenAI Compatible Server (vLLM)
+
+GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. To start a vLLM server with a Llama 3.1 8B quantized model, run the following command:
+
+```bash
+vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
+```
+
+For more information on installing vLLM, see the [vLLM Installation Documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+
+For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
+
+#### 2. Run a GuideLLM Evaluation
+
+To run a GuideLLM evaluation, use the `guidellm` command with the appropriate model name and options on the server hosting the model or one with network access to the deployment server. For example, to evaluate the full performance range of the previously deployed Llama 3.1 8B model, run the following command:
+
+```bash
+guidellm \
+ --target "http://localhost:8000/v1" \
+ --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
+```
+
+The above command will begin the evaluation and display live progress updates in the terminal.
+
+Notes:
+
+- The `--target` flag specifies the server hosting the model. In this case, it is a local vLLM server.
+- The `--model` flag specifies the model to evaluate. The model name should match the name of the model deployed on the server.
+- By default, GuideLLM will run a `sweep` of performance evaluations across different request rates, each lasting 120 seconds. The results will be saved to a local directory. (See the example below for overriding these defaults.)
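+
+For example, to replace the default sweep with a single constant-rate benchmark, you can use the `--rate-type`, `--rate`, and `--max-seconds` arguments described later in this guide. This is a minimal sketch; the rate and duration values shown are illustrative:
+
+```bash
+guidellm \
+  --target "http://localhost:8000/v1" \
+  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
+  --rate-type constant \
+  --rate 2 \
+  --max-seconds 60
+```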
+
+#### 3. Analyze the Results
+
+After the evaluation is completed, GuideLLM will output a summary of the results, including various performance metrics. The results will also be saved to a local directory for further analysis.
+
+The output results will start with a summary of the evaluation, followed by the requested data for each benchmark run. For example, the start of the output will look like the following:
+
+
+
+The end of the output will include important performance summary metrics such as request latency, time to first token (TTFT), inter-token latency (ITL), and more:
+
+
+
+
+
+## GuideLLM CLI Details
+### Input Metrics
+The input arguments are split up into 3 sections:
+
+- **Workload Overview**
+- **Workload Data**
+- **Workload Type**
+
+Once you fill out these arguments and run the command, GuideLLM will run the simulated workload. Note that the duration of each benchmark can be set with `--max-seconds`, though the total runtime also depends on the hardware and model.
+
+#### Workload Overview
+
+This section of input parameters covers what to actually benchmark, including the target host location, model, and task. The full list of arguments and their definitions is presented below, followed by an example command:
+
+- **--target** (str, default: localhost with chat completions API for vLLM): Target for benchmarking
+
+ - optional breakdown args if the target isn't specified:
+
+ - **--host** (str): Host URL for benchmarking
+
+ - **--port** (str): Port available for benchmarking
+
+- **--backend** (str, default: server_openai [vLLM, TGI, llama.cpp, DeepSparse, and many other popular servers match this format]): Backend type for benchmarking
+
+- **--model** (str, default: auto-populated from the vLLM server): Model being benchmarked, as deployed on the inference server
+
+- **--task** (str, optional): Task to use for benchmarking
+
+- **--output-path** (str, optional): Path to save the report to
+
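+For example, the following command benchmarks the Quick Start vLLM deployment and writes the report to a custom path (a minimal sketch; the output path is an illustrative placeholder):
+
+```bash
+guidellm \
+  --target "http://localhost:8000/v1" \
+  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
+  --output-path "guidellm-report.json"
+```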
+
+
+#### Workload Data
+
+This section of input parameters covers the data arguments that need to be supplied, such as a reference to the dataset and tokenizer. The list of arguments and their definitions is presented below, followed by an example command:
+
+- **--data** (str): Data file or alias for benchmarking
+
+- **--data-type** (ENUM, default: emulated; [emulated, file, transformers]): The type of data provided for benchmarking
+
+- **--tokenizer** (str): Tokenizer to use for benchmarking
+
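+For example, the following command benchmarks using prompts from a local plain-text file (a minimal sketch; the file path is a hypothetical placeholder, and the tokenizer simply reuses the model's Hugging Face ID):
+
+```bash
+guidellm \
+  --target "http://localhost:8000/v1" \
+  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
+  --data-type file \
+  --data "path/to/prompts.txt" \
+  --tokenizer "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
+```
+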
+#### Workload Type
+
+This section of input parameters covers the type of workload to run, representing the load you expect on your server in production, such as the requests-per-second rate and how requests are distributed over time. The full list of arguments and their definitions is presented below, followed by an example command:
+
+- **--rate-type** (ENUM, default: sweep; [sweep, serial, constant, poisson]): Type of rate generation for benchmarking. `sweep` covers a range of constant request rates to ensure saturation of the server, `serial` sends one request at a time, `constant` sends requests at a constant rate, and `poisson` sends requests at a rate sampled from a Poisson distribution with the given mean.
+
+- **--rate** (float): Rate to use for constant and Poisson rate types
+
+- **--max-seconds** (integer): Number of seconds to run each request rate for
+
+- **--max-requests** (integer): Number of requests to send for each rate
+
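+For example, the following command sends Poisson-distributed traffic at an average of 5 requests per second and stops after 200 requests per rate (a minimal sketch; the rate and request count are illustrative):
+
+```bash
+guidellm \
+  --target "http://localhost:8000/v1" \
+  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
+  --rate-type poisson \
+  --rate 5 \
+  --max-requests 200
+```
+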
+### Output Metrics via GuideLLM Benchmarks Report
+
+Once your GuideLLM run is complete, the output metrics are displayed as a GuideLLM Benchmarks Report via the Terminal in the following 4 sections:
+
+- **Requests Data by Benchmark**
+- **Tokens Data by Benchmark**
+- **Performance Stats by Benchmark**
+- **Performance Summary by Benchmark**
+
+The GuideLLM Benchmarks Report surfaces key LLM metrics to help you determine the health and performance of your inference server. You can use the numbers generated by the GuideLLM Benchmarks Report to make decisions around server request processing, Service Level Objective (SLO) success/failure for your task, general model performance, and hardware impact.
+
+#### Requests Data by Benchmark
+
+This section shows the request statistics for the benchmarks that were run. Request Data statistics highlight the details of the requests hitting the inference server. Viewing this information is essential to understanding the health of your server as it processes the requests sent by GuideLLM, and it can surface potential software and hardware issues in your inference serving pipeline.
+
+![Requests Data by Benchmark](../assets/request_data.png)
+
+This table includes:
+- **Benchmark:** Synchronous or Asynchronous@X req/sec
+- **Requests Completed:** the number of successful requests handled
+- **Requests Failed:** the number of failed requests
+- **Duration (sec):** the time taken to run the specific benchmark, determined by `--max-seconds`
+- **Start Time (HH:MM:SS):** local timestamp at which the GuideLLM benchmark started
+- **End Time (HH:MM:SS):** local timestamp at which the GuideLLM benchmark ended
+
+
+#### Tokens Data by Benchmark
+This section shows the prompt and output token distribution statistics for the benchmarks that were run. Token Data statistics highlight the details of your dataset in terms of prompts and generated outputs from the model. Viewing this information is integral to understanding model performance on your task and ensuring you are able to hit SLOs required to guarantee a good user experience from your application.
+
+![Tokens Data by Benchmark](../assets/tokens_data.png)
+
+This table includes:
+- **Benchmark:** Synchronous or Asynchronous@X req/sec
+- **Prompt (token length):** the average prompt length in tokens
+- **Prompt (1%, 5%, 50%, 95%, 99%):** the distribution of prompt token lengths
+- **Output (token length):** the average output length in tokens
+- **Output (1%, 5%, 50%, 95%, 99%):** the distribution of output token lengths
+
+#### Performance Stats by Benchmark
+This section shows the LLM performance statistics for the benchmarks that were run. Performance Statistics highlight the performance of the model across the key LLM performance metrics, including Request Latency, Time to First Token (TTFT), Inter Token Latency (ITL or TPOT), and Output Token Throughput. To understand the definitions and importance of these LLM performance metrics further, check out the [Nvidia Metrics Guide](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html). Viewing these key metrics is integral to ensuring the performance of your inference server for your task on your designated hardware.
+
+![Performance Stats by Benchmark](../assets/perf_stats.png)
+
+This table includes:
+- **Benchmark:** Synchronous or Asynchronous@X req/sec
+- **Request Latency [1%, 5%, 10%, 50%, 90%, 95%, 99%] (sec)**: the time it takes from submitting a query to receiving the full response, including the performance of your queueing/batching mechanisms and network latencies
+- **Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, 99%] (ms)**: the time it takes from submitting the query to receiving the first token (if the response is not empty); often abbreviated as TTFT
+- **Inter Token Latency [1%, 5%, 10%, 50%, 90%, 95%, 99%] (ms)**: the time between consecutive tokens, also known as time per output token (TPOT)
+- **Output Token Throughput (tokens/sec)**: the total output tokens per second throughput, accounting for all the requests happening simultaneously
+
+
+#### Performance Summary by Benchmark
+This section shows the averages of the LLM performance statistics for the benchmarks that were run. The average Performance Statistics provide an overall summary of model performance across the key LLM performance metrics. Viewing these summary metrics is integral to ensuring the performance of your inference server for your task on your designated hardware.
+
+![Performance Summary by Benchmark](../assets/perf_summary.png)
+
+This table includes:
+- **Benchmark:** Synchronous or Asynchronous@X req/sec
+- **Request Latency (sec)**: the average time it takes from submitting a query to receiving the full response, including the performance of your queueing/batching mechanisms and network latencies
+- **Time to First Token (ms)**: the average time it takes from submitting the query to receiving the first token (if the response is not empty); often abbreviated as TTFT
+- **Inter Token Latency (ms)**: the average time between consecutive tokens, also known as time per output token (TPOT)
+- **Output Token Throughput (tokens/sec)**: the total average output tokens per second throughput, accounting for all the requests happening simultaneously
+
+
+## Report a Bug
+
+To report a bug, file an issue on [GitHub Issues](https://github.com/neuralmagic/guidellm/issues).
diff --git a/docs/guides/supported_backends.md b/docs/guides/supported_backends.md
new file mode 100644
index 0000000..56287bd
--- /dev/null
+++ b/docs/guides/supported_backends.md
@@ -0,0 +1,51 @@
+# Supported Backends
+
+
+GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose; however, GuideLLM is compatible with many backend inference servers such as TGI, llama.cpp, and DeepSparse.
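+
+For reference, the recommended vLLM command from the Quick Start is shown below; it serves a quantized Llama 3.1 8B model with an OpenAI-compatible API that GuideLLM can target directly:
+
+```bash
+vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
+```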
+
+## OpenAI/HTTP Backends
+
+### Text Generation Inference
+[Text Generation Inference](https://github.com/huggingface/text-generation-inference) can be used with GuideLLM. To start a TGI server with Llama 3.1 8B using Docker, run the following command:
+```bash
+docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
+ -e MODEL_ID=https://huggingface.co/llhf/Meta-Llama-3.1-8B-Instruct \
+ -e NUM_SHARD=1 \
+ -e MAX_INPUT_TOKENS=4096 \
+ -e MAX_TOTAL_TOKENS=6000 \
+ -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+ ghcr.io/huggingface/text-generation-inference:2.2.0
+```
+
+For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
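+
+Once the container is up, you can point GuideLLM at it by adjusting `--target` to the mapped host port. This is a hypothetical sketch: it assumes your TGI version exposes an OpenAI-compatible API under `/v1` (check the TGI documentation for your release) and that the model name matches the `MODEL_ID` used above:
+
+```bash
+guidellm \
+  --target "http://localhost:8080/v1" \
+  --model "llhf/Meta-Llama-3.1-8B-Instruct"
+```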
+
+
+### llama.cpp
+[llama.cpp](https://github.com/ggerganov/llama.cpp) can be used with GuideLLM. To start a llama.cpp server with a Llama 3 8B model, run the following command:
+```bash
+llama-server --hf-repo Nialixus/Meta-Llama-3-8B-Q4_K_M-GGUF --hf-file meta-llama-3-8b-q4_k_m.gguf -c 2048
+```
+
+For more information on starting a llama.cpp server, see the [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md).
+
+
+### DeepSparse
+[DeepSparse](https://github.com/neuralmagic/deepsparse) can be used with GuideLLM to run LLM inference on CPUs. To start a DeepSparse Server with a Llama 2 7B model, run the following command:
+```bash
+deepsparse.server --integration openai --task text-generation --model_path neuralmagic/Llama2-7b-chat-pruned50-quant-ds --port 8000
+```
+
+For more information on starting a DeepSparse Server, see the [DeepSparse Documentation](https://github.com/neuralmagic/deepsparse).
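+
+Since the DeepSparse Server above uses the OpenAI integration on port 8000, a GuideLLM run against it might look like the following. This is a hypothetical sketch: it assumes the server exposes its OpenAI-compatible API under `/v1` and that the model name matches the `model_path` used above:
+
+```bash
+guidellm \
+  --target "http://localhost:8000/v1" \
+  --model "neuralmagic/Llama2-7b-chat-pruned50-quant-ds"
+```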
+
+
+## Python Backends
+Coming Soon!
+
+
+## Contribute a new backend
+
+We appreciate contributions to the code, examples, integrations, documentation, bug reports, and feature requests! Your feedback and involvement are crucial in helping GuideLLM grow and improve. Below are some ways you can get involved:
+
+- [**DEVELOPING.md**](https://github.com/neuralmagic/guidellm/blob/main/DEVELOPING.md) - Development guide for setting up your environment and making contributions.
+- [**CONTRIBUTING.md**](https://github.com/neuralmagic/guidellm/blob/main/CONTRIBUTING.md) - Guidelines for contributing to the project, including code standards, pull request processes, and more.
+- [**CODE_OF_CONDUCT.md**](https://github.com/neuralmagic/guidellm/blob/main/CODE_OF_CONDUCT.md) - Our expectations for community behavior to ensure a welcoming and inclusive environment.