Llama support table + sharding docs #915

Merged Feb 5, 2025 (4 commits)
172 changes: 152 additions & 20 deletions docs/shortfin/llm/user/llama_serving.md
@@ -1,5 +1,19 @@
# Llama end to end serving instructions
Member

This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the huggingface repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing specific local files to be specified: #402

See also #691. I'd really like for some of the ideas in those issues to be incorporated into our development procedures soon. Improving the "user" workflows should also improve our "developer" workflows, which are quite fragmented (scripts in external repositories, scripts on specific datacenter servers that only a few team members have access to, etc.).

For example, iree-org/iree#19911 could have reproduction steps like

pip install shark-ai==[some nightly version]

shark-compile \
  --hf-model=meta-llama/Llama-3.1-8B \
  --hf-cache-dir=/shark-dev/cache \
  --compile-target=gfx942 \
  --sharding-mode=tp8 \
  --output-artifacts-dir=~/artifacts/llama-3.1-8b \
  --output-dev-artifacts-dir=~/dev-artifacts/llama-3.1-8b

iree-benchmark-module \
  --flagfile=~/dev-artifacts/llama-3.1-8b/benchmark-flags.txt

(of course we would iterate on the specific flags, artifact formats, cache defaults, environment variable settings, etc., but we should start somewhere and add utilities as we go)

Contributor Author

Yeah, I really like this shark-compile idea. I was thinking of something like this, or moving it to the server command itself and adding a mutually exclusive list of args. Kinda like how sglang does it.

For example:

python -m shortfin_apps.llm.server \
   --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
   --device hip \
   --tp 8 \
   --compile-target gfx942 \
   --device_ids 0 1 2 3 4 5 6 7

We would default model artifacts to be saved to .cache/shark/Llama-3.1-405b-Instruct, and add an optional arg --cache-dir if they wanna save them somewhere else. Maybe we also include some flags for regenerating cached artifacts, like --ensure-export, --ensure-compile.

We would still wanna support the case of pre-compiled artifacts, so maybe we keep around our original set of args for that scenario:

python -m shortfin_apps.llm.server \
   --tokenizer_json /path/to/output/tokenizer.json \
   --model_config /path/to/output/llama3.1-405b.config.json \
   --vmfb /path/to/output/llama3.1-405b.vmfb \
   --parameters \
      /path/to/output/llama3.1-405b.irpa \
      /path/to/output/llama3.1-405b.rank0.irpa \
      /path/to/output/llama3.1-405b.rank1.irpa \
      /path/to/output/llama3.1-405b.rank2.irpa \
      /path/to/output/llama3.1-405b.rank3.irpa \
      /path/to/output/llama3.1-405b.rank4.irpa \
      /path/to/output/llama3.1-405b.rank5.irpa \
      /path/to/output/llama3.1-405b.rank6.irpa \
      /path/to/output/llama3.1-405b.rank7.irpa \
   --device=hip \
   --device_ids 0 1 2 3 4 5 6 7


## Supported Models

The following models are supported for serving:

<!-- TODO(https://github.com/iree-org/iree/issues/19832): Determine lower-bound of tp required for 405b -->
| Model Name | HuggingFace Model | Tensor Parallelism Range |
| ------------------------- | ----------------------------------------------------------------------------------------------- | ------------------------ |
| `Llama-3.1-8B` | [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | tp1-tp8 |
| `Llama-3.1-8B-Instruct` | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | tp1-tp8 |
| `Llama-3.1-70B` | [meta-llama/Llama-3.1-70B](https://huggingface.co/meta-llama/Llama-3.1-70B) | tp1-tp8 |
| `Llama-3.1-70B-Instruct` | [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | tp1-tp8 |
| `Llama-3.1-405B`          | [meta-llama/Llama-3.1-405B](https://huggingface.co/meta-llama/Llama-3.1-405B)                    | tp8                      |
| `Llama-3.1-405B-Instruct` | [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)  | tp8                      |

## Introduction

This guide demonstrates how to serve the
@@ -22,6 +36,8 @@ Overview:
2. Download model files then compile the model for our accelerator(s) of choice
3. Start a server using the compiled model files
4. Send chat requests to the server and receive chat responses back
5. Running with sharded models
6. Server options

## 1. Setup

@@ -120,9 +136,7 @@ These variables configure the model export and compilation process:
export MLIR_PATH=$EXPORT_DIR/model.mlir
export OUTPUT_CONFIG_PATH=$EXPORT_DIR/config.json
export VMFB_PATH=$EXPORT_DIR/model.vmfb
export EXPORT_BATCH_SIZES=1,4
# NOTE: This is temporary, until multi-device is fixed
export ROCR_VISIBLE_DEVICES=1
export EXPORT_BATCH_SIZES=4
```

### Export to MLIR using sharktank
@@ -202,7 +216,8 @@ python -m shortfin_apps.llm.server \
--model_config=$OUTPUT_CONFIG_PATH \
--vmfb=$VMFB_PATH \
--parameters=$MODEL_PARAMS_PATH \
--device=hip > shortfin_llm_server.log 2>&1 &
--device=hip \
--device_ids 0 |& tee shortfin_llm_server.log &
shortfin_process=$!
```

@@ -283,7 +298,124 @@ If you want to find the process again:
ps -f | grep shortfin
```

## Server Options
## 5. Running with sharded models

<!-- TODO(#402): Streamline the way that models are sharded/exported/compiled for server. -->

For models that require sharding, like [Llama-3.1-405B](#supported-models), we
will use the [`sharktank.examples.sharding.shard_llm_dataset`](https://github.com/nod-ai/shark-ai/blob/main/sharktank/sharktank/examples/sharding/shard_llm_dataset.py)
script, which exports our model as sharded `irpa` files.

> [!NOTE]
> The `--tensor-parallelism-size` argument specifies the number of shards to
> create. For the Llama-3.1-405b model, we will use a `tensor-parallelism-size`
> of 8.

### Shard a `gguf` file

```bash
python -m sharktank.examples.sharding.shard_llm_dataset \
--gguf-file /path/to/model/llama3.1-405b.gguf \
--output-irpa /path/to/output/llama3.1-405b.irpa \
--tensor-parallelism-size 8
```

### Shard an `irpa` file

```bash
python -m sharktank.examples.sharding.shard_llm_dataset \
--irpa-file /path/to/model/llama3.1-405b.irpa \
--output-irpa /path/to/output/llama3.1-405b.irpa \
--tensor-parallelism-size 8
```

This will create `tensor_parallelism_size + 1` `irpa` files in our output
directory: one unranked file plus one file per shard.

For example, our command above with `tensor-parallelism-size=8` will produce
the following files in our output directory:

```text
llama3.1-405b.irpa
llama3.1-405b.rank0.irpa
llama3.1-405b.rank1.irpa
llama3.1-405b.rank2.irpa
llama3.1-405b.rank3.irpa
llama3.1-405b.rank4.irpa
llama3.1-405b.rank5.irpa
llama3.1-405b.rank6.irpa
llama3.1-405b.rank7.irpa
```
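
As a quick sanity check, we can confirm that the number of `irpa` files matches
`tensor_parallelism_size + 1` (a minimal sketch, assuming the output paths used
above):

```bash
# Count the irpa files produced above; for tensor-parallelism-size 8 we expect
# 9 files: the unranked irpa plus rank0 through rank7.
ls /path/to/output/llama3.1-405b*.irpa | wc -l
```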

### Exporting to MLIR

For exporting a sharded model to `mlir`, we will target the unranked `irpa` file
in our export command:

```bash
python -m sharktank.examples.export_paged_llm_v1 \
--irpa-file /path/to/output/llama3.1-405b.irpa \
--output-mlir /path/to/output/llama3.1-405b.mlir \
--output-config /path/to/output/llama3.1-405b.config.json \
--bs 4
```

### Compiling to VMFB

For compiling a sharded model to `vmfb`, we must ensure that the number of
devices specified is equal to our `tensor-parallelism-size`:

```bash
iree-compile /path/to/output/llama3.1-405b.mlir \
-o /path/to/output/llama3.1-405b.vmfb \
--iree-hal-target-device=hip[0] \
--iree-hal-target-device=hip[1] \
--iree-hal-target-device=hip[2] \
--iree-hal-target-device=hip[3] \
--iree-hal-target-device=hip[4] \
--iree-hal-target-device=hip[5] \
--iree-hal-target-device=hip[6] \
--iree-hal-target-device=hip[7] \
--iree-hip-target=gfx942 \
--iree-dispatch-creation-enable-aggressive-fusion=true \
--iree-global-opt-propagate-transposes=true \
--iree-opt-aggressively-propagate-transposes=true \
--iree-opt-data-tiling=false \
--iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
--iree-hal-indirect-command-buffers=true \
--iree-stream-resource-memory-model=discrete \
--iree-hal-memoization=true \
--iree-opt-strip-assertions
```
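
Writing one `--iree-hal-target-device` flag per shard by hand is easy to get
wrong; the following is a minimal sketch (assuming the same paths, `gfx942`
target, and compile flags as above) that derives the device flags from the
tensor-parallelism size instead:

```bash
# Build one --iree-hal-target-device=hip[N] flag per shard so the device count
# always matches the tensor-parallelism size used when sharding.
TP_SIZE=8
DEVICE_FLAGS=()
for ((i = 0; i < TP_SIZE; i++)); do
  DEVICE_FLAGS+=("--iree-hal-target-device=hip[$i]")
done

iree-compile /path/to/output/llama3.1-405b.mlir \
  -o /path/to/output/llama3.1-405b.vmfb \
  "${DEVICE_FLAGS[@]}" \
  --iree-hip-target=gfx942 \
  --iree-dispatch-creation-enable-aggressive-fusion=true \
  --iree-global-opt-propagate-transposes=true \
  --iree-opt-aggressively-propagate-transposes=true \
  --iree-opt-data-tiling=false \
  --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' \
  --iree-hal-indirect-command-buffers=true \
  --iree-stream-resource-memory-model=discrete \
  --iree-hal-memoization=true \
  --iree-opt-strip-assertions
```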

### Run the server

> [!NOTE]
> For running a sharded model, we must specify each `irpa` file in `--parameters`,
> and the number of devices in `--device_ids` should be equal to the
> `tensor-parallelism-size` of the model.

```bash
python -m shortfin_apps.llm.server \
--tokenizer_json /path/to/output/tokenizer.json \
--model_config /path/to/output/llama3.1-405b.config.json \
--vmfb /path/to/output/llama3.1-405b.vmfb \
--parameters \
/path/to/output/llama3.1-405b.irpa \
/path/to/output/llama3.1-405b.rank0.irpa \
/path/to/output/llama3.1-405b.rank1.irpa \
/path/to/output/llama3.1-405b.rank2.irpa \
/path/to/output/llama3.1-405b.rank3.irpa \
/path/to/output/llama3.1-405b.rank4.irpa \
/path/to/output/llama3.1-405b.rank5.irpa \
/path/to/output/llama3.1-405b.rank6.irpa \
/path/to/output/llama3.1-405b.rank7.irpa \
--device=hip \
--device_ids 0 1 2 3 4 5 6 7 |& tee shortfin_llm_server.log &
shortfin_process=$!
```
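
Listing every rank file by hand also becomes unwieldy as the shard count grows;
the following is a minimal sketch (assuming the output layout produced above)
that builds the `--parameters` list and matching `--device_ids` from the
tensor-parallelism size:

```bash
# Gather the unranked irpa plus every rank file, and derive --device_ids from
# the same tensor-parallelism size so the two always agree.
TP_SIZE=8
OUTPUT_DIR=/path/to/output
PARAMS=("$OUTPUT_DIR/llama3.1-405b.irpa")
DEVICE_IDS=()
for ((i = 0; i < TP_SIZE; i++)); do
  PARAMS+=("$OUTPUT_DIR/llama3.1-405b.rank${i}.irpa")
  DEVICE_IDS+=("$i")
done

python -m shortfin_apps.llm.server \
  --tokenizer_json "$OUTPUT_DIR/tokenizer.json" \
  --model_config "$OUTPUT_DIR/llama3.1-405b.config.json" \
  --vmfb "$OUTPUT_DIR/llama3.1-405b.vmfb" \
  --parameters "${PARAMS[@]}" \
  --device=hip \
  --device_ids "${DEVICE_IDS[@]}" |& tee shortfin_llm_server.log &
shortfin_process=$!
```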

## 6. Server Options

To run the server with different options, you can use the
following command to see the available flags:
@@ -296,18 +428,18 @@ python -m shortfin_apps.llm.server --help

A full list of options can be found below:

| Argument | Description |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--host HOST` | Specify the host to bind the server. |
| `--port PORT` | Specify the port to bind the server. |
| `--root-path ROOT_PATH` | Root path to use for installing behind a path-based proxy. |
| `--timeout-keep-alive TIMEOUT_KEEP_ALIVE` | Keep-alive timeout duration. |
| `--tokenizer_json TOKENIZER_JSON` | Path to a `tokenizer.json` file. |
| `--tokenizer_config_json TOKENIZER_CONFIG_JSON` | Path to a `tokenizer_config.json` file. |
| `--model_config MODEL_CONFIG` | Path to the model config file. |
| `--vmfb VMFB` | Model [VMFB](https://iree.dev/developers/general/developer-tips/#inspecting-vmfb-files) to load. |
| `--parameters [FILE ...]` | Parameter archives to load (supports: `gguf`, `irpa`, `safetensors`). |
| `--device {local-task,hip,amdgpu}` | Device to serve on (e.g., `local-task`, `hip`). Same options as [iree-run-module --list_drivers](https://iree.dev/guides/deployment-configurations/gpu-rocm/#get-the-iree-runtime). |
| `--device_ids [DEVICE_IDS ...]` | Device IDs visible to the system builder. Defaults to None (full visibility). Can be an index or a device ID like `amdgpu:0:0@0`. |
| `--isolation {none,per_fiber,per_call}` | Concurrency control: How to isolate programs. |
| `--amdgpu_async_allocations` | Enable asynchronous allocations for AMD GPU device contexts. |
| Argument | Description |
| ----------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--host HOST` | Specify the host to bind the server. |
| `--port PORT` | Specify the port to bind the server. |
| `--root-path ROOT_PATH` | Root path to use for installing behind a path-based proxy. |
| `--timeout-keep-alive TIMEOUT_KEEP_ALIVE` | Keep-alive timeout duration. |
| `--tokenizer_json TOKENIZER_JSON` | Path to a `tokenizer.json` file. |
| `--tokenizer_config_json TOKENIZER_CONFIG_JSON` | Path to a `tokenizer_config.json` file. |
| `--model_config MODEL_CONFIG` | Path to the model config file. |
| `--vmfb VMFB` | Model [VMFB](https://iree.dev/developers/general/developer-tips/#inspecting-vmfb-files) to load. |
| `--parameters [FILE ...]` | Parameter archives to load (supports: `gguf`, `irpa`, `safetensors`). |
| `--device {local-task,hip,amdgpu}` | Device to serve on (e.g., `local-task`, `hip`). Same options as [iree-run-module --list_drivers](https://iree.dev/guides/deployment-configurations/gpu-rocm/#get-the-iree-runtime). |
| `--device_ids [DEVICE_IDS ...]` | Device IDs visible to the system builder. Defaults to None (full visibility). Can be an index or a device ID like `amdgpu:0:0@0`. The number of `device_ids` should be equal to the tensor parallelism of the model. |
| `--isolation {none,per_fiber,per_call}` | Concurrency control: How to isolate programs. |
| `--amdgpu_async_allocations` | Enable asynchronous allocations for AMD GPU device contexts. |
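
For illustration, several of these flags combine as in the hypothetical
single-device invocation below (paths are placeholders):

```bash
# Example unsharded launch combining host/port selection with the artifact
# flags documented in the table above.
python -m shortfin_apps.llm.server \
  --host 0.0.0.0 \
  --port 8000 \
  --tokenizer_json /path/to/output/tokenizer.json \
  --model_config /path/to/output/config.json \
  --vmfb /path/to/output/model.vmfb \
  --parameters /path/to/output/model.irpa \
  --device hip \
  --device_ids 0
```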