Llama support table + sharding docs #915
Conversation
Add section in docs describing the process of running with a sharded Llama model
Thanks! This looks good to merge with maybe a few edits. Agreed about the need for some tools that help put together the command sequences.
```diff
@@ -1,5 +1,19 @@
 # Llama end to end serving instructions
```
This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing for specific local files to be specified: #402
See also #691. I'd really like for some of the ideas in those issues to be incorporated into our development procedures soon. Improving the "user" workflows should also improve our "developer" workflows, which are quite fragmented (scripts in external repositories, scripts on specific datacenter servers that only a few team members have access to, etc.).
For example, iree-org/iree#19911 could have reproduction steps like
```bash
pip install shark-ai==[some nightly version]

shark-compile \
  --hf-model=meta-llama/Llama-3.1-8B \
  --hf-cache-dir=/shark-dev/cache \
  --compile-target=gfx942 \
  --sharding-mode=tp8 \
  --output-artifacts-dir=~/artifacts/llama-3.1-8b \
  --output-dev-artifacts-dir=~/dev-artifacts/llama-3.1-8b

iree-benchmark-module \
  --flagfile=~/dev-artifacts/llama-3.1-8b/benchmark-flags.txt
```
(of course we would iterate on the specific flags, artifact formats, cache defaults, environment variable settings, etc., but we should start somewhere and add utilities as we go)
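To make the flagfile piece concrete, here is a sketch of what a generated `benchmark-flags.txt` could contain. The paths, function name, and repetition count are placeholder assumptions; only the flag names are existing `iree-benchmark-module` flags, and a real file would also need `--input=` entries matching the exported function's signature.

```
--device=hip
--module=/path/to/artifacts/llama-3.1-8b/llama-3.1-8b.vmfb
--parameters=model=/path/to/artifacts/llama-3.1-8b/llama-3.1-8b.irpa
--function=prefill_bs4
--benchmark_repetitions=5
```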
Yeah, I really like this `shark-compile` idea. I was thinking of something like this, or moving it into the server command itself and adding a mutually exclusive list of args. Kinda like how sglang does it.
For example:
```bash
python -m shortfin_apps.llm.server \
  --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
  --device hip \
  --tp 8 \
  --compile-target gfx942 \
  --device_ids 0 1 2 3 4 5 6 7
```
We would default model artifacts to be saved to `.cache/shark/Llama-3.1-405b-Instruct`, and add an optional arg `--cache-dir` if they wanna save them somewhere else. Maybe we also include some flags for regenerating cached artifacts, like `--ensure-export` and `--ensure-compile`.
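As a sketch of how that could fit together (note that `--cache-dir` and `--ensure-compile` below are hypothetical flags from this proposal, as are `--model-path`, `--tp`, and `--compile-target`; none of them are options the server accepts today):

```bash
# Hypothetical invocation: reuse cached artifacts from a custom location,
# but force a fresh compile (e.g. after an IREE version bump).
python -m shortfin_apps.llm.server \
  --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
  --device hip \
  --tp 8 \
  --compile-target gfx942 \
  --cache-dir /mnt/scratch/shark-cache \
  --ensure-compile \
  --device_ids 0 1 2 3 4 5 6 7
```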
We would still wanna support the case of pre-compiled artifacts, so maybe we keep around our original set of args for that scenario:
```bash
python -m shortfin_apps.llm.server \
  --tokenizer_json /path/to/output/tokenizer.json \
  --model_config /path/to/output/llama3.1-405b.config.json \
  --vmfb /path/to/output/llama3.1-405b.vmfb \
  --parameters \
    /path/to/output/llama3.1-405b.irpa \
    /path/to/output/llama3.1-405b.rank0.irpa \
    /path/to/output/llama3.1-405b.rank1.irpa \
    /path/to/output/llama3.1-405b.rank2.irpa \
    /path/to/output/llama3.1-405b.rank3.irpa \
    /path/to/output/llama3.1-405b.rank4.irpa \
    /path/to/output/llama3.1-405b.rank5.irpa \
    /path/to/output/llama3.1-405b.rank6.irpa \
    /path/to/output/llama3.1-405b.rank7.irpa \
  --device=hip \
  --device_ids 0 1 2 3 4 5 6 7
```
Add brief explanation + doc link on how sharding works; add brief description of which technique we use in `sharktank`.
This PR adds a support table to our `llama_serving.md` guide, specifying our supported variants of `llama3.1-8b`, `llama3.1-70b`, and `llama3.1-405b`. I used vLLM's Supported Models page as a reference. I changed the structure a bit to make more sense for our server, but it's relatively similar to how they set up their table.
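For a rough picture of the layout being described (the column names and entries here are placeholders, not the actual table added in this PR; the only concrete data point from this discussion is that 405b shards up to `tp8`):

| Model family | Variant | Sharding |
| ------------ | ------- | -------- |
| Llama 3.1 | 8b | TBD |
| Llama 3.1 | 70b | TBD |
| Llama 3.1 | 405b | up to `tp8` |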
For sharding instructions, I added a section at the end of our doc. I debated creating a separate `md` file for it, but I think it actually makes sense where it is. It gives our doc a flow where we give detailed descriptions of what's going on while the user gets set up with the lowest-barrier model. Following that, you get to the more advanced sharding section. The details aren't as specific, and it's not an exact copy-and-paste flow like the one above. The assumption is that, after reading the above section, the user shouldn't need as much hand-holding in this section.

Currently, I know we can shard 405b with an upper bound of `tp8`. I need to run some tests to see what the supported lower bound is.

This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing for specific local files to be specified: #402, #691