Llama support table + sharding docs #915
Conversation
Add section in docs describing the process of running with a sharded Llama model
Thanks! This looks good to merge with maybe a few edits. Agreed about the need for some tools that help put together the command sequences.
```diff
@@ -1,5 +1,19 @@
 # Llama end to end serving instructions
```
This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing for specific local files to be specified: #402
See also #691. I'd really like for some of the ideas in those issues to be incorporated into our development procedures soon. Improving the "user" workflows should also improve our "developer" workflows, which are quite fragmented (scripts in external repositories, scripts on specific datacenter servers that only a few team members have access to, etc.).
For example, iree-org/iree#19911 could have reproduction steps like
```bash
pip install shark-ai==[some nightly version]

shark-compile \
  --hf-model=meta-llama/Llama-3.1-8B \
  --hf-cache-dir=/shark-dev/cache \
  --compile-target=gfx942 \
  --sharding-mode=tp8 \
  --output-artifacts-dir=~/artifacts/llama-3.1-8b \
  --output-dev-artifacts-dir=~/dev-artifacts/llama-3.1-8b

iree-benchmark-module \
  --flagfile=~/dev-artifacts/llama-3.1-8b/benchmark-flags.txt
```
(of course we would iterate on the specific flags, artifact formats, cache defaults, environment variable settings, etc., but we should start somewhere and add utilities as we go)
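To make the flagfile piece concrete, here is a sketch of what a generated `benchmark-flags.txt` could contain. The paths, function name, and repetition count are placeholder assumptions; only the flag names are existing `iree-benchmark-module` flags, and a real file would also need `--input=` entries matching the exported function's signature.

```
--device=hip
--module=/path/to/artifacts/llama-3.1-8b/llama-3.1-8b.vmfb
--parameters=model=/path/to/artifacts/llama-3.1-8b/llama-3.1-8b.irpa
--function=prefill_bs4
--benchmark_repetitions=5
```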
Yeah, I really like this `shark-compile` idea. I was thinking of something like this, or moving it into the server command itself and adding a mutually exclusive list of args. Kinda like how sglang does it.
For example:
```bash
python -m shortfin_apps.llm.server \
  --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
  --device hip \
  --tp 8 \
  --compile-target gfx942 \
  --device_ids 0 1 2 3 4 5 6 7
```
We would default model artifacts to be saved to `.cache/shark/Llama-3.1-405b-Instruct`, and add an optional arg `--cache-dir` if they wanna save them somewhere else. Maybe we also include some flags for regenerating cached artifacts, like `--ensure-export` and `--ensure-compile`.
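As a sketch of how that could fit together (note that `--cache-dir` and `--ensure-compile` below are hypothetical flags from this proposal, as are `--model-path`, `--tp`, and `--compile-target`; none of them are options the server accepts today):

```bash
# Hypothetical invocation: reuse cached artifacts from a custom location,
# but force a fresh compile (e.g. after an IREE version bump).
python -m shortfin_apps.llm.server \
  --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
  --device hip \
  --tp 8 \
  --compile-target gfx942 \
  --cache-dir /mnt/scratch/shark-cache \
  --ensure-compile \
  --device_ids 0 1 2 3 4 5 6 7
```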
We would still wanna support the case of pre-compiled artifacts, so maybe we keep around our original set of args for that scenario:
```bash
python -m shortfin_apps.llm.server \
  --tokenizer_json /path/to/output/tokenizer.json \
  --model_config /path/to/output/llama3.1-405b.config.json \
  --vmfb /path/to/output/llama3.1-405b.vmfb \
  --parameters \
    /path/to/output/llama3.1-405b.irpa \
    /path/to/output/llama3.1-405b.rank0.irpa \
    /path/to/output/llama3.1-405b.rank1.irpa \
    /path/to/output/llama3.1-405b.rank2.irpa \
    /path/to/output/llama3.1-405b.rank3.irpa \
    /path/to/output/llama3.1-405b.rank4.irpa \
    /path/to/output/llama3.1-405b.rank5.irpa \
    /path/to/output/llama3.1-405b.rank6.irpa \
    /path/to/output/llama3.1-405b.rank7.irpa \
  --device=hip \
  --device_ids 0 1 2 3 4 5 6 7
```
Add brief explanation + doc link on how sharding works; add brief description of which technique we use in `sharktank`.
This PR adds a support table to our `llama_serving.md` guide, specifying our supported variants of `llama3.1-8b`, `llama3.1-70b`, and `llama3.1-405b`. I used vLLM's Supported Models page as a reference. I changed the structure a bit to make more sense for our server, but it's relatively similar to how they set up their table.
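For a rough picture of the layout being described (the column names and entries here are placeholders, not the actual table added in this PR; the only concrete data point from this discussion is that 405b shards up to `tp8`):

| Model family | Variant | Sharding |
| ------------ | ------- | -------- |
| Llama 3.1 | 8b | TBD |
| Llama 3.1 | 70b | TBD |
| Llama 3.1 | 405b | up to `tp8` |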
For sharding instructions, I added a section at the end of our doc. I debated creating a separate `md` file for it, but I think it actually makes sense where it is. It gives our doc a flow where we give detailed descriptions of what's going on while the user gets set up with the lowest-barrier model. Following that, you get to the more advanced sharding section. The details aren't as specific, and it's not an exact copy-and-paste flow like the one above. The assumption is that, after reading the above section, the user shouldn't need as much hand-holding in this section.

Currently, I know we can shard 405b with an upper bound of `tp8`. I need to run some tests to see what the supported lower bound is.

This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing for specific local files to be specified: #402, #691