Docs Updates #61

Open · wants to merge 9 commits into base: main
26 changes: 7 additions & 19 deletions README.md
@@ -6,7 +6,7 @@
</p>

<h3 align="center">
Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inference
Scale Efficiently: Evaluate and Enhance Your LLM Deployments for Real-World Inference
</h3>

[![GitHub Release](https://img.shields.io/github/release/neuralmagic/guidellm.svg?label=Version)](https://github.com/neuralmagic/guidellm/releases) [![Documentation](https://img.shields.io/badge/Documentation-8A2BE2?logo=read-the-docs&logoColor=%23ffffff&color=%231BC070)](https://github.com/neuralmagic/guidellm/tree/main/docs) [![License](https://img.shields.io/github/license/neuralmagic/guidellm.svg)](https://github.com/neuralmagic/guidellm/blob/main/LICENSE) [![PyPI Release](https://img.shields.io/pypi/v/guidellm.svg?label=PyPI%20Release)](https://pypi.python.org/pypi/guidellm) [![Pypi Release](https://img.shields.io/pypi/v/guidellm-nightly.svg?label=PyPI%20Nightly)](https://pypi.python.org/pypi/guidellm-nightly) [![Python Versions](https://img.shields.io/badge/Python-3.8--3.12-orange)](https://pypi.python.org/pypi/guidellm) [![Nightly Build](https://img.shields.io/github/actions/workflow/status/neuralmagic/guidellm/nightly.yml?branch=main&label=Nightly%20Build)](https://github.com/neuralmagic/guidellm/actions/workflows/nightly.yml)
@@ -20,7 +20,7 @@ Scale Efficiently: Evaluate and Optimize Your LLM Deployments for Real-World Inf
</picture>
</p>

**GuideLLM** is a powerful tool for evaluating and optimizing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM helps users gauge the performance, resource needs, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.
**GuideLLM** is a powerful tool for evaluating and enhancing the deployment of large language models (LLMs). By simulating real-world inference workloads, GuideLLM helps users gauge the performance, resource needs, and cost implications of deploying LLMs on various hardware configurations. This approach ensures efficient, scalable, and cost-effective LLM inference serving while maintaining high service quality.

### Key Features

@@ -48,31 +48,19 @@ For detailed installation instructions and requirements, see the [Installation G

### Quick Start

#### 1a. Start an OpenAI Compatible Server (vLLM)
#### 1. Start an OpenAI Compatible Server (vLLM)

GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. To start a vLLM server with a Llama 3.1 8B quantized model, run the following command:

```bash
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```

For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

#### 1b. Start an OpenAI Compatible Server (Hugging Face TGI)

GuideLLM requires an OpenAI-compatible server to run evaluations. [Text Generation Inference](https://github.com/huggingface/text-generation-inference) can be used here. To start a TGI server with a Llama 3.1 8B using docker, run the following command:
For more information on installing vLLM, see the [vLLM Installation Documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html).

```bash
docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
-e MODEL_ID=https://huggingface.co/llhf/Meta-Llama-3.1-8B-Instruct \
-e NUM_SHARD=1 \
-e MAX_INPUT_TOKENS=4096 \
-e MAX_TOTAL_TOKENS=6000 \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
ghcr.io/huggingface/text-generation-inference:2.2.0
```
For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
To view more GuideLLM compatible backends such as TGI, llama.cpp, and DeepSparse, check out our [Supported Backends documentation](https://github.com/neuralmagic/guidellm/blob/main/docs/guides/supported_backends.md).

#### 2. Run a GuideLLM Evaluation

@@ -108,7 +96,7 @@ The end of the output will include important performance summary metrics such as

#### 4. Use the Results

The results from GuideLLM are used to optimize your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.
The results from GuideLLM are used to enhance your LLM deployment for performance, resource efficiency, and cost. By analyzing the performance metrics, you can identify bottlenecks, determine the optimal request rate, and select the most cost-effective hardware configuration for your deployment.

For example, if we deploy a latency-sensitive chat application, we likely want to optimize for low time to first token (TTFT) and inter-token latency (ITL). A reasonable threshold will depend on the application requirements. Still, we may want to ensure time to first token (TTFT) is under 200ms and inter-token latency (ITL) is under 50ms (20 updates per second). From the example results above, we can see that the model can meet these requirements on average at a request rate of 2.37 requests per second for each server. If you'd like to target a higher percentage of requests meeting these requirements, you can use the **Performance Stats by Benchmark** section to determine the rate at which 90% or 95% of requests meet these requirements.
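
As a quick sanity check on the arithmetic behind these thresholds (purely illustrative; the values are the ones quoted above):

```bash
# A 50 ms inter-token latency corresponds to 1000 / 50 = 20 token updates per second.
itl_ms=50
echo "Updates per second at ${itl_ms} ms ITL: $((1000 / itl_ms))"   # prints 20
```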

Binary file added docs/assets/perf_stats.png
Binary file added docs/assets/perf_summary.png
Binary file added docs/assets/request_data.png
Binary file added docs/assets/tokens_data.png
177 changes: 176 additions & 1 deletion docs/guides/cli.md
@@ -1 +1,176 @@
# Coming Soon

# GuideLLM CLI User Guide

For more details on setup and installation, see the [Installation](https://github.com/neuralmagic/guidellm?tab=readme-ov-file#installation) section of the README.

## About GuideLLM

The GuideLLM CLI is a performance benchmarking tool that lets you evaluate and enhance your LLM inference serving before you deploy to production. GuideLLM simulates real user workloads through rigorous benchmarking so that you can understand how your deployment performs, surface bottlenecks in your inference serving pipeline, and make changes before your users are affected by your LLM application.

## GuideLLM CLI Quickstart

#### 1. Start an OpenAI Compatible Server (vLLM)

GuideLLM requires an OpenAI-compatible server to run evaluations. It's recommended that [vLLM](https://github.com/vllm-project/vllm) be used for this purpose. To start a vLLM server with a Llama 3.1 8B quantized model, run the following command:

```bash
vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```

For more information on installing vLLM, see the [vLLM Installation Documentation](https://docs.vllm.ai/en/latest/getting_started/installation.html).

For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).

#### 2. Run a GuideLLM Evaluation

To run a GuideLLM evaluation, use the `guidellm` command with the appropriate model name and options, either on the server hosting the model or on a machine with network access to the deployment server. For example, to evaluate the full performance range of the previously deployed Llama 3.1 8B model, run the following command:

```bash
guidellm \
--target "http://localhost:8000/v1" \
--model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```

The above command will begin the evaluation and output progress updates similar to the following: <img src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-benchmark.gif" />

Notes:

- The `--target` flag specifies the server hosting the model. In this case, it is a local vLLM server.
- The `--model` flag specifies the model to evaluate. The model name should match the name of the model deployed on the server.
- By default, GuideLLM will run a `sweep` of performance evaluations across different request rates, each lasting 120 seconds. The results will be saved to a local directory.
> **Reviewer comment (Contributor) on lines +38 to +40:**
>
> 1. I would rename "flag" to "parameter" since our CLI supports both parameters and flags. If you specify a flag, there is no value next to it; if you specify a parameter, a value is required.
> 2. In some cases we may get an error if the tokenizer is not specified. I would add another item here. Suggested text is below:
>    - The `--tokenizer` parameter specifies the tokenizer used to count the number of tokens in the dataset. If you face any issues, try using `--tokenizer neuralmagic/Meta-Llama-3.1-8B-quantized.w8a8`.
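
Following the reviewer's suggestion, here is a minimal sketch of passing an explicit tokenizer (it assumes the same vLLM deployment as in the Quickstart; the tokenizer name is the one proposed in the comment above):

```bash
# Hypothetical example: evaluate the Quickstart deployment while pinning the
# tokenizer used to count dataset tokens, as suggested in the review comment.
guidellm \
  --target "http://localhost:8000/v1" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --tokenizer "neuralmagic/Meta-Llama-3.1-8B-quantized.w8a8"
```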


#### 3. Analyze the Results

After the evaluation is completed, GuideLLM will output a summary of the results, including various performance metrics. The results will also be saved to a local directory for further analysis.

The output results will start with a summary of the evaluation, followed by the requested data for each benchmark run. For example, the start of the output will look like the following:

<img alt="Sample GuideLLM benchmark start output" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-output-start.png" />

The end of the output will include important performance summary metrics such as request latency, time to first token (TTFT), inter-token latency (ITL), and more:

<img alt="Sample GuideLLM benchmark end output" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/sample-output-end.png" />



## GuideLLM CLI Details
### Input Metrics
The input arguments are split up into 3 sections:

- **Workload Overview**
- **Workload Data**
- **Workload Type**

Once you fill out these arguments and run the command, GuideLLM will run the simulated workload. Note that the duration of each benchmark can be set with <em>max_seconds</em>, but the total run time may also depend on the hardware and model.

#### Workload Overview

This section of input parameters covers what to benchmark, including the target host location, model, and task. The full list of arguments and their definitions is presented below, followed by a brief example invocation:

- **--target** <em>(str, default: localhost with chat completions API for vLLM)</em>: Target for benchmarking

- optional breakdown args if the target isn't specified:

- **--host** <em>(str)</em>: Host URL for benchmarking

- **--port** <em>(str)</em>: Port available for benchmarking

- **--backend** <em>(str, default: server_openai [vllm, TGI, llama.cpp, DeepSparse, and many popular servers match this format])</em>: Backend type for benchmarking

- **--model** <em>(str, default: auto-populated from vllm server)</em>: Model being used for benchmarking, running on the inference server

- **--task** <em>(str, optional)</em>: Task to use for benchmarking

- **--output-path** <em>(str, optional)</em>: Path to save the report to
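
For illustration, here is a sketch of an invocation that spells out the target via the breakdown arguments instead of `--target` (the host, port, and output path are placeholder values chosen for this example, not tool defaults):

```bash
# Hypothetical example: point GuideLLM at a local OpenAI-compatible server
# using --host/--port/--backend and write the report to a chosen path.
guidellm \
  --host "localhost" \
  --port "8000" \
  --backend "server_openai" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --output-path "guidellm-report.json"
```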



#### Workload Data

This section of input parameters covers the data arguments that need to be supplied, such as a reference to the dataset and tokenizer. The list of arguments and their definitions is presented below, followed by a brief example:

- **--data** <em>(str)</em>: Data file or alias for benchmarking

- **--data-type** <em>(ENUM, default: emulated; [file, transformers])</em>: The type of data given for benchmarking

- **--tokenizer** <em>(str)</em>: Tokenizer to use for benchmarking
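
For instance, a minimal sketch that points GuideLLM at a local data file and an explicit tokenizer (the file name and format are placeholders, and reusing the model ID as the tokenizer is an assumption for this example):

```bash
# Hypothetical example: benchmark with prompts read from a local file instead
# of the default emulated data, using an explicitly specified tokenizer.
guidellm \
  --target "http://localhost:8000/v1" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --data "prompts.txt" \
  --data-type "file" \
  --tokenizer "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
```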

#### Workload Type

This section of input parameters covers the type of workload to run, representing the load you expect on your server in production, such as the request rate and the frequency of requests. The full list of arguments and their definitions is presented below, followed by a brief example:

- **--rate-type** <em>(ENUM, default: sweep; [serial, constant, poisson])</em>: Type of rate generation for benchmarking. `sweep` covers a range of constant request rates and ensures saturation of the server, `serial` sends one request at a time, `constant` sends requests at a constant rate, and `poisson` samples the request rate from a Poisson distribution with the given mean.

- **--rate** <em>(float)</em>: Rate to use for constant and Poisson rate types

- **--max-seconds** <em>(integer)</em>: Number of seconds to run each request rate for

- **--max-requests** <em>(integer)</em>: Number of requests to send for each rate
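
To make this concrete, here is a sketch of a fixed-rate run at a constant 5 requests per second for 120 seconds per benchmark (the rate and duration are illustrative values, not defaults):

```bash
# Hypothetical example: send a constant 5 requests per second for 120 seconds
# rather than running the default sweep across request rates.
guidellm \
  --target "http://localhost:8000/v1" \
  --model "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16" \
  --rate-type "constant" \
  --rate 5 \
  --max-seconds 120
```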

### Output Metrics via GuideLLM Benchmarks Report

Once your GuideLLM run is complete, the output metrics are displayed as a GuideLLM Benchmarks Report in the terminal, in the following four sections:

- **Requests Data by Benchmark**
- **Tokens Data by Benchmark**
- **Performance Stats by Benchmark**
- **Performance Summary by Benchmark**

The GuideLLM Benchmarks Report surfaces key LLM metrics to help you determine the health and performance of your inference server. You can use the numbers generated by the GuideLLM Benchmarks Report to make decisions around server request processing, Service Level Objective (SLO) success/failure for your task, general model performance, and hardware impact.

#### Requests Data by Benchmark

This section shows the request statistics for the benchmarks that were run. Request Data statistics highlight the details of the requests hitting the inference server. Viewing this information is essential to understanding the health of your server as it processes the requests sent by GuideLLM, and it can surface potential software and hardware issues in your inference serving pipeline.

<img alt="Sample Requests Data by Benchmark" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/request_data.png" />

This table includes:
- **Benchmark:** Synchronous or Asynchronous@X req/sec
- **Requests Completed:** the number of successful requests handled
- **Requests Failed:** the number of failed requests
- **Duration (sec):** the time taken to run the specific benchmark, determined by <em>max_seconds</em>
- **Start Time (HH:MI:SS):** local timestamp the GuideLLM benchmark started
- **End Time (HH:MI:SS):** local timestamp the GuideLLM benchmark ended


#### Tokens Data by Benchmark
This section shows the prompt and output token distribution statistics for the benchmarks that were run. Token Data statistics highlight the details of your dataset in terms of prompts and generated outputs from the model. Viewing this information is integral to understanding model performance on your task and ensuring you are able to hit SLOs required to guarantee a good user experience from your application.

<img alt="Sample Tokens Data by Benchmark" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/tokens_data.png" />

This table includes:
- **Benchmark:** Synchronous or Asynchronous@X req/sec
- **Prompt (token length):** the average prompt length in tokens
- **Prompt (1%, 5%, 50%, 95%, 99%):** the distribution of prompt token lengths
- **Output (token length):** the average output length in tokens
- **Output (1%, 5%, 50%, 95%, 99%):** the distribution of output token lengths

#### Performance Stats by Benchmark
This section shows the LLM performance statistics for the benchmarks that were run. Performance Statistics highlight the performance of the model across the key LLM performance metrics, including Request Latency, Time to First Token (TTFT), Inter Token Latency (ITL or TPOT), and Output Token Throughput. To understand the definitions and importance of these LLM performance metrics further, check out the [Nvidia Metrics Guide](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html). Viewing these key metrics is integral to ensuring the performance of your inference server for your task on the designated hardware.

<img alt="Sample Perf Stats by Benchmark" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/perf_stats.png" />

This table includes:
- **Benchmark:** Synchronous or Asynchronous@X req/sec
- **Request Latency [1%, 5%, 10%, 50%, 90%, 95%, 99%] (sec)**: the time it takes from submitting a query to receiving the full response, including the performance of your queueing/batching mechanisms and network latencies
- **Time to First Token [1%, 5%, 10%, 50%, 90%, 95%, 99%] (ms)**: the time it takes from submitting the query to receiving the first token (if the response is not empty); often abbreviated as TTFT
- **Inter Token Latency [1%, 5%, 10%, 50%, 90%, 95%, 99%] (ms)**: the time between consecutive tokens, also known as time per output token (TPOT)
- **Output Token Throughput (tokens/sec)**: the total output tokens per second throughput, accounting for all the requests happening simultaneously


#### Performance Summary by Benchmark
This section shows the averages of the LLM performance statistics for the benchmarks that were run. The average Performance Statistics provide an overall summary of the model performance across the key LLM performance metrics. Viewing these summary metrics is integral to ensuring the performance of your inference server for your task on the designated hardware.

<img alt="Sample Perf Summary by Benchmark" src="https://github.com/neuralmagic/guidellm/blob/main/docs/assets/perf_summary.png" />

This table includes:
- **Benchmark:** Synchronous or Asynchronous@X req/sec
- **Request Latency (sec)**: the average time it takes from submitting a query to receiving the full response, including the performance of your queueing/batching mechanisms and network latencies
- **Time to First Token (ms)**: the average time it takes from submitting the query to receiving the first token (if the response is not empty); often abbreviated as TTFT
- **Inter Token Latency (ms)**: the average time between consecutive tokens, also known as time per output token (TPOT)
- **Output Token Throughput (tokens/sec)**: the total average output tokens per second throughput, accounting for all the requests happening simultaneously


## Report a Bug

To report a bug, file an issue on [GitHub Issues](https://github.com/neuralmagic/guidellm/issues).