TTFT: Time To First Token (time taken from processing of the prompt to the first token generated)

ITL: Inter-Token Latency; average time between each new token generated in the decode phase (second token onwards)
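
As a concrete illustration of the two metrics, here is a minimal sketch that computes TTFT and ITL from client-side timestamps; the timestamps and variable names are hypothetical.

```python
# Minimal sketch: computing TTFT and ITL from client-side timestamps.
# request_start is when the prompt was submitted; token_times[i] is the
# wall-clock time at which generated token i was received (hypothetical data).
request_start = 0.00
token_times = [0.85, 0.90, 0.96, 1.01, 1.07]  # seconds, 5 generated tokens

ttft = token_times[0] - request_start  # time to first token
gaps = [b - a for a, b in zip(token_times, token_times[1:])]  # decode-phase gaps
itl = sum(gaps) / len(gaps)  # average gap, second token onwards

print(f"TTFT: {ttft:.3f} s, ITL: {itl:.3f} s")  # TTFT: 0.850 s, ITL: 0.055 s
```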

# llama3.1 Inference User Instructions
Read the [cookbooks](https://github.com/nod-ai/shark-ai/tree/main/docs/shortfin/llm/user) for user-style inference run instructions.

The goal is a flow where a user can download open GGUF weights, compile the model, and launch the shortfin server (a hypothetical client sketch for the final inference step follows the table).
| Steps | Instructions | Owner |
|-------|--------------|-------|
| Compile Model: given a GGUF, import weights, export MLIR, and get a vmfb | [cookbooks](https://github.com/nod-ai/shark-ai/tree/main/docs/shortfin/llm/user) | @scott |
| Launch Shortfin: launch the shortfin server for {single node, cluster of nodes} | TBD | @xida/@sai |
| Run Inference: use the sglang front-end to drive inference | TBD | @stephen |
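
Once the shortfin server is running, the "Run Inference" step can be driven from any HTTP client. The sketch below is illustrative only: the host/port, the `/generate` route, and the JSON field names are assumptions rather than a documented shortfin API; the cookbooks linked above are the supported path.

```python
# Hypothetical sketch of the "Run Inference" step against a locally running
# shortfin server. The endpoint, port, and JSON schema are assumptions and may
# differ from the actual shortfin / sglang interface.
import json
import urllib.request

payload = {
    "text": "What is the capital of France?",  # prompt
    "sampling_params": {"max_completion_tokens": 64, "temperature": 0.7},
}
req = urllib.request.Request(
    "http://localhost:8000/generate",  # assumed host, port, and route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```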
# Benchmarking
Read [benchmarking](https://github.com/nod-ai/llm-dev/blob/main/llama_benchmarking.md) to get set up for collecting performance numbers.
## Latest Tracy Profiles
| Model | Tracy Profile |
|-------|---------------|
| llama3.1 8B FP16, unsharded, non-decomposed (i.e., using Flash Attention 2) | [Tracy Profile](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_8b/8b_f16_nondecomposed_32.tracy) |

# Schedule
(Model is assumed to be llama3.1 in the following table, e.g. "8B FP8" means "llama3.1 8B FP8 model")
|Item | Current Week (Dec 9-13) | Next Week (Dec 16-20) |
|------------------------------|-----------------------|--------------------------|
| Sharktank Modeling | - @Ian finish Flux VAE decode (DONE: 12/11) <br> - @Kyle finish Flux model (DONE: 12/11) <br> - @Boian Flux CLIP model export and compile for bf16 (DONE: 12/11) <br> - @Dan finish and merge FP8 llama PR (ETA: 12/12) | |
| IREE code generation | - @kunvar decode flash attention (DONE: 12/11) | |
| Serving | - @ean flesh out bf16 Flux in shortfin (ETA: 12/12) <br> - @Xida fix flakiness in batch handling (DONE: 12/12) <br> - @Stephen test and ensure sglang/shortfin batch runs work (ETA: 12/12) | |
| Test Automation | - @Avi refresh benchmarking of decode and prefill for 8B and 70B (ETA: 12/12) <br> - @Archana shortfin PPL debugging (ETA: 12/10) <br> - @Rob debug multi-device (ETA: 12/11) | |
| Performance Tuning | - @Avi Tracy profile for decode (ETA: 12/11) | |

# Run Instructions
- [shortfin SDXL](https://github.com/nod-ai/SHARK-Platform/tree/main/shortfin/python/shortfin_apps/sd)
- [sglang-shortfin llama3.1](https://github.com/stbaione/SHARK-Platform/blob/sglang-user-doc/docs/shortfin/llm/user/shortfin_with_sglang_frontend_language.md)

# Nightly Test Reports
See the latest [CI/Nightly Test Report](https://nod-ai.github.io/shark-ai/). Use the [Nod.AI Lab](https://confluence.amd.com/pages/viewpage.action?spaceKey=ENGIT&title=Nod.AI+Lab) page to SSH into the SharkMi300X machine and find logs and artifacts for triaging failures. File an issue (if not already filed/listed) and add it to the Issues table below.

# Issues
| Component | Issue | Assignee | Notes |
|---|---|---|---|
|quark quantization | [QUARK-71](https://jira.xilinx.com/browse/QUARK-71) | Bowen Bow | FP8 matmul should be used in attention|
|iree codegen | [18864](https://github.com/iree-org/iree/issues/18864)| Ian Wood | OOM for 70B |
|iree Negative Memory | [19077](https://github.com/iree-org/iree/issues/19077) | unassigned | op uses -131072 bytes of shared memory |


# Status-Numerics
The following naming convention should be used for weights and artifacts (on SharkMI300x and other similar machines); a small path-construction sketch follows the two examples below.
Example: /data/llama-3.1/weights/405b/fp16/tp8/llama3.1_405b_fp16_tp8_parameters

Example: /data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir
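
The two examples above suggest the directory and file-name pattern sketched below; this helper is a hypothetical illustration inferred from those examples, not a script that exists in the repository (the weights example above is truncated, so the exact file suffix is not shown).

```python
# Hypothetical helpers illustrating the naming convention inferred from the two
# examples above (weights vs. exported artifacts on SharkMI300x-class machines).
from typing import Optional


def weights_dir(size: str, dtype: str, tp: Optional[int] = None) -> str:
    """e.g. weights_dir("405b", "fp16", tp=8) -> "/data/llama-3.1/weights/405b/fp16/tp8"."""
    tp_part = f"/tp{tp}" if tp else ""
    return f"/data/llama-3.1/weights/{size}/{dtype}{tp_part}"


def artifact_path(size: str, dtype: str, attention: str, tp: int, bs: int) -> str:
    """e.g. artifact_path("405b", "fp16", "nondecomposed", tp=8, bs=4) ->
    "/data/llama-3.1/artifacts/405b/llama3.1_405b_fp16_nondecomposed_tp8_bs4.mlir"."""
    return (f"/data/llama-3.1/artifacts/{size}/"
            f"llama3.1_{size}_{dtype}_{attention}_tp{tp}_bs{bs}.mlir")
```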

## llama3.1 decomposed
To generate artifacts on SharkMI300x, follow the sharktank [setup instructions](https://gist.github.com/stbaione/be38bfb214d990a4b765804223d6b948), then run:
`python -m sharktank.examples.export_paged_llm_v1 --irpa-file=/data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa --output-mlir f16_dc.mlir --bs=1 --attention-kernel=decomposed`
## llama3.1 (non-decomposed, uses Flash Attention 2)
To generate artifacts for llama3.1 8B on SharkMI300x, follow the sharktank [setup instructions](https://gist.github.com/stbaione/be38bfb214d990a4b765804223d6b948), then run:
`python -m sharktank.examples.export_paged_llm_v1 --irpa-file=/data/llama-3.1/weights/8b/fp16/llama3.1_8b_fp16.irpa --output-mlir f16_dc.mlir --bs=1`

## llama3.1 405B TP8 commands
To generate sharded llama3.1 405B artifacts, do the following:
1. Shard the irpa file:

2. Export MLIR (prefill + decode, non-decomposed):

`
python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=test.irpa --output-mlir=405b_f16_tp8_decomposed.mlir --output-config=405b_f16_tp8_decomposed.json --bs=4 --attention-kernel torch
`

3. Compile:

`
iree-compile 405b_f16_tp8_decomposed.mlir --iree-hip-target=gfx942 --iree-hal-target-backends=rocm -o=405b_f16_tp8_decomposed.vmfb --iree-hal-target-device=hip[0] --iree-hal-target-device=hip[1] --iree-hal-target-device=hip[2] --iree-hal-target-device=hip[3] --iree-hal-target-device=hip[4] --iree-hal-target-device=hip[5] --iree-hal-target-device=hip[6] --iree-hal-target-device=hip[7] --iree-dispatch-creation-enable-aggressive-fusion=true --iree-global-opt-propagate-transposes=true --iree-opt-aggressively-propagate-transposes=true --iree-opt-data-tiling=false --iree-preprocessing-pass-pipeline='builtin.module(util.func(iree-preprocessing-generalize-linalg-matmul-experimental))' --iree-hal-indirect-command-buffers=true --iree-stream-resource-memory-model=discrete --iree-hip-legacy-sync=false --iree-hal-memoization=true --iree-opt-strip-assertions
`

(MI300X GPU, SPX Mode)
|Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric [Perplexity](https://github.com/nod-ai/shark-ai/tree/main/sharktank/sharktank/evaluate) | Serving numeric |
|------------------------------------------|---------------|-----------------|-----------------|--------------|-----------------|
| llama3.1-8B-FP16-decomposed |PASS [TP1 mlir](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_8b/8b_f16_decomposed_11_22.mlir) [gguf](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/llama8b_f16.gguf) [irpa](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/8b_f16.irpa) |PASS [vmfb](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/8b_f16.vmfb) | PASS |14.99 | tbd
| llama3.1-8B-FP16-decomposed-TP8 | PASS ([MLIR](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_8b/llama3.1_8b_fp16_decomposed_tp8.mlir)) | PASS | PASS | FAIL (probably) | tbd
| llama3.1-70B-FP16-decomposed |PASS [TP1 mlir](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_70b/llama70b_f16.mlir) [gguf](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_70b/llama70b_f16.gguf) [irpa](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_70b/70b_f16.irpa) |PASS [vmfb](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_70b/70b_f16.vmfb) | FAIL [OOM](https://github.com/iree-org/iree/issues/18864) | tbd | tbd
| llama3.1-405B-FP16-decomposed |PASS [TP1 mlir](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_405b/llama3.1_405b_fp16_TP1.mlir) [gguf](https://sharkblobs.blob.core.windows.net/halo-models/llm-dev/llama3_405b/llama405b_fp16.gguf) | tbd | tbd | tbd | tbd
| llama3.1-405B-FP16-decomposed-TP8 | PASS [MLIR](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_405b/llama3.1_405b_f16_tp8.mlir) | PASS [vmfb](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_405b/llama3.1_405b_f16_tp8.vmfb) | FAIL [Registers](https://github.com/iree-org/iree/issues/18923) | tbd | tbd
| llama3.1-8B-FP8-decomposed |PASS [TP1 mlir](https://sharkpublic.blob.core.windows.net/sharkpublic/dan/native_fp8_e4m3fnuz_llama3_8b.mlir) [irpa](https://sharkpublic.blob.core.windows.net/sharkpublic/dan/native_fp8_e4m3fnuz_llama3_8b.irpa) | Fails in iree, [patch](https://github.com/iree-org/iree/pull/18890) | tbd | tbd | tbd
| llama3.1-70B-FP8-decomposed |PASS [TP1 mlir](https://sharkpublic.blob.core.windows.net/sharkpublic/dan/native_fp8_e4m3fnuz_llama3_70b.mlir) [irpa](https://sharkpublic.blob.core.windows.net/sharkpublic/dan/native_fp8_e4m3fnuz_llama3_70b.irpa) |Fails in iree, [patch](https://github.com/iree-org/iree/pull/18890) | tbd | tbd | tbd
| llama3.1-405B-FP8-decomposed | tbd | tbd | tbd | tbd | tbd

## llama3.1 non-decomposed
### Artifacts
(MI300X GPU, SPX Mode)
|Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
|------------------------------------------|---------------|-----------------|-----------------|--------------|-----------------|
| llama3.1-405B-FP8 |ETA: 11/5 | tbd | tbd | tbd | tbd
| llama-toy-size-FP32-TP2-CPU | PASS | PASS | tbd | tbd | tbd

## llama3.1 decodeposed
(only decode is decomposed)
(MI300X GPU, SPX Mode)
|Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
|------------------------------------------|---------------|-----------------|-----------------|--------------|-----------------|
| llama3.1-8B-FP16 |PASS [mlir_tp1](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_8b/llama8b_f16_tp1_decodeposed_bs4.mlir) | tbd | tbd | tbd | tbd
| llama3.1-70B-FP16 | tbd | tbd | tbd | tbd | tbd
| llama3.1-405B-FP16 |PASS [mlir_tp8](https://sharkpublic.blob.core.windows.net/sharkpublic/halo-models/llm-dev/llama3_405b/llama3_405b_f16_tp8_decodeposed.mlir) | tbd | tbd | tbd | tbd
| llama3.1-8B-FP8 | PASS [mlir](https://sharkpublic.blob.core.windows.net/sharkpublic/dan/f8_half_ndc.mlir) | Fail (attention, Dan currently looking into this) | tbd | tbd | tbd
| llama3.1-70B-FP8 | tbd | tbd | tbd | tbd | tbd
| llama3.1-405B-FP8 | tbd | tbd | tbd | tbd | tbd
| llama-toy-size-FP32-TP2-CPU | tbd | tbd | tbd | tbd | tbd

## Flux.1 dev
|Item | Generate MLIR | Compile to vmfb | IREE invocation | IREE numeric | Serving numeric |
|------------------|---------------|-----------------|-----------------|--------------|-----------------|
