
Llama support table + sharding docs #915

Merged: 4 commits into main, Feb 5, 2025

Conversation

@stbaione (Contributor) commented on Feb 5, 2025:

This PR adds a support table to our llama_serving.md guide, specifying our supported variants of llama3.1-8b, llama3.1-70b, and llama3.1-405b.

I used the vLLM Supported Models page as a reference. I changed the structure a bit to make more sense for our server, but it's relatively similar to how they set up their table.
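
To give a rough idea of the shape (the column names and entries below are an illustrative sketch, not copied verbatim from the guide):

| Model | Hugging Face repo | Supported |
| --- | --- | --- |
| Llama 3.1 8B | meta-llama/Llama-3.1-8B | ✅ |
| Llama 3.1 70B | meta-llama/Llama-3.1-70B | ✅ |
| Llama 3.1 405B | meta-llama/Llama-3.1-405B | ✅ (sharded, up to tp8) |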

For sharding instructions, I added a section at the end of our doc. I debated creating a separate md file for it, but I think it actually makes sense where it is. It gives our doc a flow: we give detailed descriptions of what's going on while the user gets set up with the lowest-barrier model, and then they reach the more advanced sharding section. The details there aren't as specific, and it's not an exact copy + paste flow like the section above; the assumption is that after reading the earlier section, the user shouldn't need as much hand-holding.

Currently, I know we can shard 405b with an upper bound of tp8. I still need to run some tests to see what the supported lower bound is.
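
For anyone skimming this PR, the core of the new sharding section is producing per-rank .irpa files that get passed to the server's --parameters flag. Roughly (the exact module path and flag names below are my shorthand and may differ from the doc, so double-check against the guide):

# sketch only: module path and flags may not match the doc exactly
python -m sharktank.examples.sharding.shard_llm_dataset \
   --irpa-file /path/to/llama3.1-405b.irpa \
   --output-irpa-file /path/to/output/llama3.1-405b.irpa \
   --tensor-parallelism-size 8

For tp8 this yields the base llama3.1-405b.irpa plus llama3.1-405b.rank0.irpa through llama3.1-405b.rank7.irpa, all of which get passed to --parameters when starting the server (see the example further down in this thread).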

This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing specific local files to be specified: #402, #691

Add section in docs describing the process of running with a sharded Llama model
@ScottTodd (Member) left a comment:

Thanks! This looks good to merge with maybe a few edits. Agreed about the need for some tools that help put together the command sequences.

@@ -1,5 +1,19 @@
# Llama end to end serving instructions
@ScottTodd (Member) commented on the diff:

> This also further highlights that we need to streamline our export/compile process. We should allow the user to specify just the Hugging Face repo when starting the server, while we take care of downloading safetensors, exporting, and compiling. We should do this while still allowing specific local files to be specified: #402

See also #691. I'd really like for some of the ideas in those issues to be incorporated into our development procedures soon. Improving the "user" workflows should also improve our "developer" workflows, which are quite fragmented (scripts in external repositories, scripts on specific datacenter servers that only a few team members have access to, etc.).

For example, iree-org/iree#19911 could have reproduction steps like

pip install shark-ai==[some nightly version]

shark-compile \
  --hf-model=meta-llama/Llama-3.1-8B \
  --hf-cache-dir=/shark-dev/cache \
  --compile-target=gfx942 \
  --sharding-mode=tp8 \
  --output-artifacts-dir=~/artifacts/llama-3.1-8b \
  --output-dev-artifacts-dir=~/dev-artifacts/llama-3.1-8b

iree-benchmark-module \
  --flagfile=~/dev-artifacts/llama-3.1-8b/benchmark-flags.txt

(of course we would iterate on the specific flags, artifact formats, cache defaults, environment variable settings, etc., but we should start somewhere and add utilities as we go)

@stbaione (Contributor, Author) replied:

Yeah, I really like this shark-compile idea. I was thinking of something like this, or moving it to the server command itself and adding a mutually exclusive list of args. Kinda like how sglang does it.

For example:

python -m shortfin_apps.llm.server \
   --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
   --device hip \
   --tp 8 \
   --compile-target gfx942 \
   --device_ids 0 1 2 3 4 5 6 7

We would default model artifacts to be saved to .cache/shark/Llama-3.1-405b-Instruct, and add an optional --cache-dir arg if the user wants to save them somewhere else. Maybe we also include some flags for regenerating cached artifacts, like --ensure-export and --ensure-compile.
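
Purely as a sketch of how those proposed flags might compose (none of these flags exist yet, and the cache path here is a placeholder):

# proposed flags only; nothing below is implemented yet
python -m shortfin_apps.llm.server \
   --model-path https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct \
   --device hip \
   --tp 8 \
   --compile-target gfx942 \
   --cache-dir /mnt/shark-cache \
   --ensure-compile \
   --device_ids 0 1 2 3 4 5 6 7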

We would still want to support the case of pre-compiled artifacts, so maybe we keep around our original set of args for that scenario:

python -m shortfin_apps.llm.server \
   --tokenizer_json /path/to/output/tokenizer.json \
   --model_config /path/to/output/llama3.1-405b.config.json \
   --vmfb /path/to/output/llama3.1-405b.vmfb \
   --parameters \
      /path/to/output/llama3.1-405b.irpa \
      /path/to/output/llama3.1-405b.rank0.irpa \
      /path/to/output/llama3.1-405b.rank1.irpa \
      /path/to/output/llama3.1-405b.rank2.irpa \
      /path/to/output/llama3.1-405b.rank3.irpa \
      /path/to/output/llama3.1-405b.rank4.irpa \
      /path/to/output/llama3.1-405b.rank5.irpa \
      /path/to/output/llama3.1-405b.rank6.irpa \
      /path/to/output/llama3.1-405b.rank7.irpa \
   --device=hip \
   --device_ids 0 1 2 3 4 5 6 7

Add brief explanation + doc link on how sharding works,
Add brief description of which technique we use in `sharktank`
@stbaione merged commit 58e89c1 into main on Feb 5, 2025 (30 of 33 checks passed).
@stbaione deleted the users/stbaione/sharded-llama-docs branch on February 5, 2025 at 18:03.