Add support table for llama models,
Add section in docs describing the process of running with a sharded Llama model
stbaione committed Feb 5, 2025
1 parent e5cf360 commit e5bf0fd
Showing 1 changed file with 155 additions and 20 deletions.
175 changes: 155 additions & 20 deletions docs/shortfin/llm/user/llama_serving.md
@@ -1,5 +1,19 @@
# Llama end-to-end serving instructions

## Supported Models

The following models are supported for serving:

<!-- TODO(https://github.com/iree-org/iree/issues/19832): Determine lower-bound of tp required for 405b -->
| Model Name | HuggingFace Model | Tensor Parallelism Range |
| ------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------ |
| `Llama-3.1-8B` | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | tp1-tp8 |
| `Llama-3.1-8B-Instruct` | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | tp1-tp8 |
| `Llama-3.1-70B` | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | tp1-tp8 |
| `Llama-3.1-70B-Instruct` | [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | tp1-tp8 |
| `Llama-3.1-405B`          | [meta-llama/Llama-3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B)                     | tp8                      |
| `Llama-3.1-405B-Instruct` | [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)   | tp8                      |

## Introduction

This guide demonstrates how to serve the
@@ -22,6 +36,8 @@ Overview:
2. Download model files then compile the model for our accelerator(s) of choice
3. Start a server using the compiled model files
4. Send chat requests to the server and receive chat responses back
5. Run with a sharded model
6. Explore server options

## 1. Setup

@@ -120,9 +136,7 @@ These variables configure the model export and compilation process:
export MLIR_PATH=$EXPORT_DIR/model.mlir
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
export VMFB_PATH=$EXPORT_DIR/model.vmfb
export EXPORT_BATCH_SIZES=4
```
### Export to MLIR using sharktank
@@ -202,7 +216,8 @@ python -m shortfin_apps.llm.server \
--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip \
--device_ids 0 |& tee shortfin_llm_server.log &
shortfin_process=$!
```
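Once the server is running, you can confirm it is responding before sending requests. This is a minimal sketch; it assumes the server is listening on the default port `8000` and exposes the `/health` endpoint (adjust the port if you passed `--port`):

```bash
# Give the server a moment to load the model, then poll its health endpoint.
sleep 10
curl -s http://localhost:8000/health
# Inspect recent log output if the health check fails.
tail -n 20 shortfin_llm_server.log
```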
@@ -283,7 +298,127 @@ If you want to find the process again:
ps -f | grep shortfin
```
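Once you have the process ID, the server can be stopped with `kill`, for example using the `$shortfin_process` variable captured at launch:

```bash
# Stop the background server started earlier.
kill $shortfin_process
```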

## 5. Running with sharded models

<!-- TODO(#402): Streamline the way that models are sharded/exported/compiled for server. -->

For models that require sharding, like [Llama-3.1-405B](#supported-models), we
will use the [`sharktank.examples.sharding.shard_llm_dataset`](https://github.com/nod-ai/shark-ai/blob/main/sharktank/sharktank/examples/sharding/shard_llm_dataset.py)
script, which exports our model as sharded `irpa` files.

> [!NOTE]
> The `--tensor-parallelism-size` argument specifies the number of shards to
> create. For the Llama-3.1-405B model, we will use a `tensor-parallelism-size`
> of 8.

### Shard a `gguf` file

```bash
python -m sharktank.examples.sharding.shard_llm_dataset \
--gguf-file /path/to/model/llama3.1-405b.gguf \
--output-irpa /path/to/output/llama3.1-405b.irpa \
--tensor-parallelism-size 8
```

### Shard an `irpa` file

```bash
python -m sharktank.examples.sharding.shard_llm_dataset \
--irpa-file /path/to/model/llama3.1-405b.irpa \
--output-irpa /path/to/output/llama3.1-405b.irpa \
--tensor-parallelism-size 8
```

This will create `tensor_parallelism_size + 1` `irpa` files in our output
directory: one unranked file plus one ranked file per shard.

For example, our command above with `tensor-parallelism-size=8` will produce
the following files in our output directory:

```text
llama3.1-405b.irpa
llama3.1-405b.rank0.irpa
llama3.1-405b.rank1.irpa
llama3.1-405b.rank2.irpa
llama3.1-405b.rank3.irpa
llama3.1-405b.rank4.irpa
llama3.1-405b.rank5.irpa
llama3.1-405b.rank6.irpa
llama3.1-405b.rank7.irpa
```
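A quick sanity check that all expected files were produced (a minimal sketch; `9` here is `tensor-parallelism-size + 1` for `tp8`):

```bash
# Count the sharded output files; expect tensor-parallelism-size + 1 (9 for tp8).
ls /path/to/output/llama3.1-405b*.irpa | wc -l
```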

### Exporting to MLIR

To export a sharded model to `mlir`, we target the unranked `irpa` file
in our export command:

```bash
python -m sharktank.examples.export_paged_llm_v1 \
--irpa-file /path/to/output/llama3.1-405b.irpa \
--output-mlir /path/to/output/llama3.1-405b.mlir \
--output-config /path/to/output/llama3.1-405b.config.json \
--bs 4
```

### Compiling to VMFB

For compiling a sharded model to `vmfb`, we must ensure that the number of
devices we specify is equal to our `tensor-parallelism-size`:

```bash
iree-compile /path/to/output/llama3.1-405b.mlir \
-o /path/to/output/llama3.1-405b.vmfb \
--iree-hal-target-device=hip[0] \
--iree-hal-target-device=hip[1] \
--iree-hal-target-device=hip[2] \
--iree-hal-target-device=hip[3] \
--iree-hal-target-device=hip[4] \
--iree-hal-target-device=hip[5] \
--iree-hal-target-device=hip[6] \
--iree-hal-target-device=hip[7] \
--iree-hip-target=gfx942 \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hal-memoization=true \
--iree-opt-strip-assertions
```
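Writing out one `--iree-hal-target-device` flag per shard by hand is error-prone. Below is a minimal sketch that builds the flags programmatically, assuming `TP_SIZE` matches the `tensor-parallelism-size` used when sharding:

```bash
# Build one --iree-hal-target-device flag per shard so the device count
# always matches the tensor-parallelism-size.
TP_SIZE=8
DEVICE_FLAGS=()
for ((i = 0; i < TP_SIZE; i++)); do
  DEVICE_FLAGS+=("--iree-hal-target-device=hip[$i]")
done
# Splice "${DEVICE_FLAGS[@]}" into the iree-compile invocation above.
printf '%s\n' "${DEVICE_FLAGS[@]}"
```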

### Run the server

Now that we have compiled our sharded model, we can run the server, passing the
`--parameters` flag to specify the paths to our sharded model files:

> [!NOTE]
> For running a sharded model, we must specify each `irpa` file in
> `--parameters`, and the number of devices in `--device_ids` must equal the
> `tensor-parallelism-size` of the model.

```bash
python -m shortfin_apps.llm.server \
--tokenizer_json /path/to/output/tokenizer.json \
--model_config /path/to/output/llama3.1-405b.config.json \
--vmfb /path/to/output/llama3.1-405b.vmfb \
--parameters \
/path/to/output/llama3.1-405b.irpa \
/path/to/output/llama3.1-405b.rank0.irpa \
/path/to/output/llama3.1-405b.rank1.irpa \
/path/to/output/llama3.1-405b.rank2.irpa \
/path/to/output/llama3.1-405b.rank3.irpa \
/path/to/output/llama3.1-405b.rank4.irpa \
/path/to/output/llama3.1-405b.rank5.irpa \
/path/to/output/llama3.1-405b.rank6.irpa \
/path/to/output/llama3.1-405b.rank7.irpa \
--device=hip \
--device_ids 0 1 2 3 4 5 6 7 |& tee shortfin_llm_server.log &
shortfin_process=$!
```
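With the sharded server running, requests look the same as in the unsharded case. A hedged example, assuming the default port `8000` and the `/generate` request shape from section 4 of this guide:

```bash
# Send a simple completion request to the sharded server.
curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}'
```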

## 6. Server Options

To see all of the options available when running the server, use the
following command:
@@ -296,18 +431,18 @@ python -m shortfin_apps.llm.server --help

A full list of options can be found below:

| Argument | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--host HOST` | Specify the host to bind the server. |
| `--port PORT` | Specify the port to bind the server. |
| `--root-path ROOT_PATH` | Root path to use for installing behind a path-based proxy. |
| `--timeout-keep-alive TIMEOUT_KEEP_ALIVE` | Keep-alive timeout duration. |
| `--tokenizer_json TOKENIZER_JSON` | Path to a `tokenizer.json` file. |
| `--tokenizer_config_json TOKENIZER_CONFIG_JSON` | Path to a `tokenizer_config.json` file. |
| `--model_config MODEL_CONFIG` | Path to the model config file. |
| `--vmfb VMFB` | Model [VMFB](https://iree.dev/developers/general/developer-tips/#inspecting-vmfb-files) to load. |
| `--parameters [FILE ...]` | Parameter archives to load (supports: `gguf`, `irpa`, `safetensors`). |
| `--device {local-task,hip,amdgpu}` | Device to serve on (e.g., `local-task`, `hip`). Same options as [iree-run-module --list_drivers](https://iree.dev/guides/deployment-configurations/gpu-rocm/#get-the-iree-runtime). |
| `--device_ids [DEVICE_IDS ...]` | Device IDs visible to the system builder. Defaults to None (full visibility). Can be an index or a device ID like `amdgpu:0:0@0`. The number of `device_ids` should be equal to the tensor parallelism of the model. |
| `--isolation {none,per_fiber,per_call}` | Concurrency control: How to isolate programs. |
| `--amdgpu_async_allocations` | Enable asynchronous allocations for AMD GPU device contexts. |
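For example, to serve on a specific host and port, the flags above can be combined with the earlier invocation. This is a sketch reusing the environment variables from the export step; `$TOKENIZER_PATH` is an assumed variable pointing at the `tokenizer.json` downloaded with the model:

```bash
# Assumes $TOKENIZER_PATH points at the model's tokenizer.json;
# the other variables come from the export step above.
python -m shortfin_apps.llm.server \
  --tokenizer_json=$TOKENIZER_PATH \
  --model_config=$OUTPUT_CONFIG_PATH \
  --vmfb=$VMFB_PATH \
  --parameters=$MODEL_PARAMS_PATH \
  --device=hip \
  --host 0.0.0.0 \
  --port 8080
```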
