Tracking running llama models through IREE #22
For the spirv-vulkan backend, here's the minimal repro:
error: |
Pulling in some of the comments from the chat.
For the spirv-vulkan backend, I have posted the minimal repro above.
I think that assert can be safely dropped at the torch level in the same way as the broadcast asserts: when in strict mode from torch, the invariant being checked for dynamic legality must be true (torch enforces it).
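For illustration (an editor's sketch, not from the thread), this is roughly the kind of frontend guarantee being relied on: in strict-mode `torch.export`, a shape invariant recorded with `torch._check` is enforced while tracing, so a later lowering can treat the matching runtime assert as redundant. The module and shapes below are hypothetical.

```python
import torch

class Repro(torch.nn.Module):
    def forward(self, x, y):
        # The exporter records and enforces this invariant at trace time,
        # so a downstream compiler can drop the equivalent runtime assert.
        torch._check(x.shape[0] == y.shape[0])
        return x + y

ep = torch.export.export(Repro(), (torch.ones(4, 8), torch.ones(4, 8)), strict=True)
print(ep.graph_module.code)
```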
Thanks, I'm also able to compile for |
Compilation correctness
GGUF version 2 vs version 3
Running just decode, with zeroed arguments:
Vulkan:
CPU (local-task): assert hit,
Next: figure out the runtime errors. Miscompile? Going over some runtime limits? local-sync and local-task have different errors. Look at the VM IR and see if anything stands out.
I created a mock version of
Compile with:
Run prefill with:
Run decode with:
I'm planning on loading that into Python and standing up an IREE version of https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/examples/paged_llm_v1.py . Once the real model compiles, I'll substitute it.
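As a rough sketch (an editor's reconstruction, not the actual script) of what that Python harness could look like with the IREE runtime API: the module path, driver, entry-point name, and argument order below are assumptions, the input shapes are copied from the iree-run-module invocation later in this thread, and this would only work for a mock module that doesn't need external parameters.

```python
import numpy as np
import iree.runtime as ireert

# Load the compiled module and bind it to a device (path and driver are assumptions).
module = ireert.load_vm_flatbuffer_file(
    "/tmp/open_llama_3b_v2_f16_cpu.vmfb", driver="local-task")

# Zero-filled decode arguments, mirroring the --input flags used elsewhere in this thread.
tokens = np.zeros(4, dtype=np.int64)
seq_lens = np.zeros(4, dtype=np.int64)
start_positions = np.zeros(4, dtype=np.int64)
seq_block_ids = np.zeros(4, dtype=np.int64)
cache_state = np.zeros((1, 2662400), dtype=np.float32)

# Entry-point name is a guess based on the exported batch size of 4.
logits = module.decode_bs4(tokens, seq_lens, start_positions, seq_block_ids, cache_state)
print(np.asarray(logits).shape)
```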
This upstream patch removes these assertions and implements a more direct lowering (no more switchy stuff): llvm/torch-mlir#3319
Latest attempt:
Compile for Vulkan:
Run on Vulkan:
Vulkan output:
Compile for CPU:
Run on CPU:
CPU crashes inside a dispatch (
Will trace execution and look at individual dispatches to go deeper.
Currently debugging a runtime crash in decode still with @rsuderman . We're suspecting that the in-place scatter operations are writing out of bounds. The exported programs had a sequence of scatters back to back so Rob has a branch (https://github.com/rsuderman/sharktank/tree/rework_update) that makes the key value store updates use a single scatter (if I'm understanding correctly). The model fails to compile after those changes. I have a reduced test case of just a single |
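To make the suspicion concrete, here is a hedged PyTorch sketch (an editor's reconstruction, not the sharktank code) of the paged cache update pattern being discussed: two back-to-back in-place scatters for K and V versus a single combined scatter. Shapes and names are hypothetical.

```python
import torch

# [pages, k/v partition, heads, block_seq, head_dim] -- hypothetical layout
cache = torch.zeros(8, 2, 16, 32, 100, dtype=torch.float16)
# An index >= 8 here corresponds to the suspected out-of-bounds write
# (eager PyTorch would raise, but a compiled scatter may not check).
page_ids = torch.tensor([3])

k = torch.randn(1, 16, 32, 100, dtype=torch.float16)
v = torch.randn(1, 16, 32, 100, dtype=torch.float16)

# Back-to-back updates: two separate in-place scatters into the cache.
cache[page_ids, 0] = k
cache[page_ids, 1] = v

# Reworked form: stack K and V and write both partitions with a single scatter.
cache[page_ids] = torch.stack([k, v], dim=1)
```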
A different reduced test (IR here, starting from the full llama model) was hitting an assert while compiling: https://gist.github.com/ScottTodd/366fe4b993c3d8e9776c40eddc4a6493

Some debugging around the callstack also pointed at scatter ops:

--- areOpsFusable ---
producer:
%48 = iree_linalg_ext.scatter dimension_map = [0, 1, 2, 3] unique_indices(false) ins(%expanded_27, %47 : tensor<1x1x1x1x32x100xf16>, tensor<1x4xi32>) outs(%expanded_21 : tensor<?x26x2x16x32x100xf16>) {
^bb0(%arg7: f16, %arg8: f16):
iree_linalg_ext.yield %arg7 : f16
} -> tensor<?x26x2x16x32x100xf16>
consumer:
%51 = iree_linalg_ext.scatter {__root_op__ = 17 : i64} dimension_map = [0, 1, 2, 3] unique_indices(false) ins(%expanded_35, %50 : tensor<1x1x1x1x32x100xf16>, tensor<1x4xi32>) outs(%48 : tensor<?x26x2x16x32x100xf16>) {
^bb0(%arg7: f16, %arg8: f16):
iree_linalg_ext.yield %arg7 : f16
} -> tensor<?x26x2x16x32x100xf16>

I'm not sure if that is worth debugging further; it may have been a buggy test case reduction. Going to follow up on the minimal |
Filed llvm/torch-mlir#3433 for the |
This occurs in the full model too. Can work around it by disabling all dispatch region fusions (add a |
Compiling with |
Tried to change dtypes in the model from i64 to i32 (https://github.com/nod-ai/sharktank/compare/main...ScottTodd:llama-i32?expand=1) and ran into errors compiling after export, like this:

~/iree-build/tools/iree-compile ~/scratch/open_llama_3b_v2_f16_i32more3_1block.mlir -o ~/scratch/open_llama_3b_v2_f16_i32more3_1block_asan.vmfb --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-link-embedded=false --iree-llvmcpu-sanitize=address

/home/scotttodd/scratch/open_llama_3b_v2_f16_i32more3_1block.mlir:11259:11: error: 'arith.cmpi' op requires all operands to have the same type
%43 = torch.aten.index.Tensor %0, %42 : !torch.vtensor<[2048,50],complex<f32>>, !torch.list<optional<vtensor>> -> !torch.vtensor<[4,1,50],complex<f32>>
      ^
/home/scotttodd/scratch/open_llama_3b_v2_f16_i32more3_1block.mlir:11259:11: note: see current operation: %3741 = "arith.cmpi"(%arg275, %3740) <{predicate = 2 : i64}> : (i32, i64) -> i1

It sounds like iree-org/iree#17696 fixes decode crashes while still using i32 types.
Confirmed that these patches help
Altogether, I see decode appearing to work (outputs appear sensible and aligned with prefill). Can continue to validate.
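One cheap way to tighten "appears aligned with prefill" into a check (a sketch under assumptions, not the current harness): with the same token history, the logits decode produces for a step should match the logits prefill produced at that position, so comparing argmax and the max absolute difference catches gross miscompiles. The .npy dumps below are hypothetical.

```python
import numpy as np

prefill_logits = np.load("prefill_logits.npy")  # [batch, seq_len, vocab], hypothetical dump
decode_logits = np.load("decode_logits.npy")    # [batch, 1, vocab], hypothetical dump

last_step = prefill_logits[:, -1, :]
decode_step = decode_logits[:, 0, :]
print("argmax match:", np.array_equal(last_step.argmax(-1), decode_step.argmax(-1)))
print("max abs diff:", np.abs(last_step - decode_step).max())
```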
Ideas for next steps / follow-up tasks:
Collecting some lessons learned and debugging tips from nod-ai/sharktank#22 into a single document.
Still seeing a crash in
I'll wrap all my repro steps (documented here: #69) into a script and run that script across my machines. Hopefully just a case of needing the cache (that |
Progress on #22. TODOs sprinkled throughout. Immediate next steps I'm considering:

1. Add a test / CI workflow that follows these steps (likely in Bash to start, then later in Python once more pieces are connected seamlessly)
2. Sanity check with other models from Hugging Face, other IREE backend targets, different batch sizes, etc. (could parameterize a test script on those options 🤔)
3. Extract some real inputs/outputs for use with `iree-run-module`, then plug in to https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests to get presubmit coverage for `iree-compile` (guarding against compilation correctness regressions in LLVM integrates and other changes)
Progress on #22. Sample runs on my fork:

* https://github.com/ScottTodd/sharktank/actions/runs/9670685134
* https://github.com/ScottTodd/sharktank/actions/runs/9715408887

I decided to run this on a nightly `schedule` and on `workflow_dispatch`. It takes around 10 minutes, so it _could_ run on `pull_request` if we want it to. As these components stabilize and we spend less time hacking on individual steps using the full toolkit (python -> manual `iree-compile` vs. using the in-process compiler API), we can switch the test from a bash script to a pytest file. Need to start somewhere :)
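A sketch of what the eventual pytest form could look like, assuming it shells out to the same iree-compile / iree-run-module commands the bash script runs today; paths, input shapes, and the entry-point name are placeholders rather than the actual test.

```python
import subprocess
from pathlib import Path

MLIR = Path("/tmp/open_llama_3b_v2_f16.mlir")
VMFB = Path("/tmp/open_llama_3b_v2_f16_cpu.vmfb")
GGUF = Path("/tmp/open_llama_3b_v2_gguf/open-llama-3b-v2-f16.gguf")


def test_compile():
    # Mirrors the manual iree-compile invocation from earlier in this thread.
    subprocess.run(
        ["iree-compile", str(MLIR), "--iree-hal-target-backends=llvm-cpu", "-o", str(VMFB)],
        check=True,
    )


def test_run_prefill():
    # Crash-free execution only for now, no numerical checks.
    # The function name and input shapes are guesses, not the real signature.
    subprocess.run(
        [
            "iree-run-module",
            f"--module={VMFB}",
            "--device=local-task",
            "--function=prefill_bs4",
            "--input=4x16xi64",
            "--input=4xi64",
            f"--parameters=model={GGUF}",
        ],
        check=True,
    )
```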
Progress on nod-ai/sharktank#22. This adds one test for a llama model running through https://github.com/nod-ai/sharktank. That project is still getting set up, so new docs for this particular workflow are coming in at nod-ai/sharktank#69 and tests in that repo are in nod-ai/sharktank#70.

Specifically, this exercises:

* [`sharktank/models/llama/llama.py`](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/models/llama/llama.py)
* [`sharktank/examples/export_paged_llm_v1.py`](https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/examples/export_paged_llm_v1.py) with batch sizes == [4]
* The `open-llama-3b-v2-f16.gguf` file from https://huggingface.co/SlyEcho/open_llama_3b_v2_gguf
* Compilation and crashless execution, _not_ numerical correctness (yet)

Ideas for future work:

* Test cases for the same model/parameters
  * Other batch sizes
  * `decode()` as well as `prefill()`
  * Real inputs with expected outputs (`decode()` crashes on some faked inputs still 🤔)
  * Other flag combinations and target configurations (starting simple though)
* Test cases for other models/parameters
  * 8b / 70b parameter models
  * Mistral, Mixtral, Gemma, etc.
Progress on nod-ai/sharktank#22. See nod-ai/SHARK-TestSuite#272 for the specifics of what the new test is exercising. The "models" tests now include `pytorch/models/` and `sharktank/`, so all test names are qualified relative to `iree_tests/` in the test suite repo. (Totally inflating my commit stats here, sorry :P) ci-exactly: build_packages,regression_test
Goal
Run a llama model from https://github.com/nod-ai/sharktank/blob/main/sharktank/sharktank/models/llama/llama.py through IREE
Starting with `open_llama_3b_v2_f16_gguf` since we have that in docs. Could try another model or data type, but we should eventually get all sorts of variants working.

Approach
https://github.com/nod-ai/sharktank/tree/main/sharktank/sharktank/examples has a few files already:
Next steps from there could be:

* Compile the exported program with `iree-compile` and run it using `iree-run-module`
* An IREE version of `paged_llm_v1.py` that could either

Worklog
Export -> try compiling the entire program ("prefill" and "decode")
* `llvm-cpu` with default flags: `iree-compile open_llama_3b_v2_f16.mlir --iree-hal-target-backends=llvm-cpu -o /tmp/open_llama_3b_v2_f16_cpu.vmfb --iree-hal-executable-debug-level=3 --iree-hal-dump-executable-files-to=/tmp/open_llama_3b_v2_f16_cpu`. That got stuck compiling after `LLVMCPUVectorTransferLowering`: Serialize Executables crashing when compiling LLaMa on async-cpu, iree-org/iree#17244 (comment)
* `vulkan-spirv` with default flags: `iree-compile open_llama_3b_v2_f16.mlir --iree-hal-target-backends=vulkan-spirv -o /tmp/open_llama_3b_v2_f16_vulkan.vmfb`. That hit two different spirv codegen issues: Vulkan compile errors for llama model from sharktank, iree-org/iree#17304 (`failed to legalize operation 'arith.extui'` with `i1 -> i64` on CPU, `'spirv.IAdd'` with `i1` on Vulkan)

Next: continue triaging compilation errors for prefill.
Export and run just "decode"
* sharktank/sharktank/examples/export_paged_llm_v1.py, line 131 in 3be773c
* `iree-compile open_llama_3b_v2_f16_decode_only.mlir --iree-hal-target-backends=vulkan-spirv --iree-vulkan-target-triple=turing-unknown-unknown -o /tmp/open_llama_3b_v2_f16_vulkan_decode_only.vmfb --iree-hal-executable-debug-level=3`
* To run with `iree-run-module`, I need the inputs and a parameter file:
  * Inputs: `--input=4xi64 --input=4xi64 --input=4xi64 --input=4xi64 --input=1x2662400xf32` (need to verify)
  * Parameters: `huggingface-cli download --local-dir /tmp/open_llama_3b_v2_gguf SlyEcho/open_llama_3b_v2_gguf` (that folder then contains `/tmp/open_llama_3b_v2_gguf/open-llama-3b-v2-f16.gguf`)
* `iree-run-module --module=/tmp/open_llama_3b_v2_f16_vulkan_decode_only.vmfb --device=vulkan --input=4xi64 --input=4xi64 --input=4xi64 --input=4xi64 --input=1x2662400xf32 --parameters=model=/tmp/open_llama_3b_v2_gguf/open-llama-3b-v2-f16.gguf` produces this error: `iree\runtime\src\iree\io\formats\gguf\gguf_parser.c:678: UNIMPLEMENTED; GGUF format version 2 is unsupported; expected version 3`

Next: try upgrading GGUF version 2 to 3? Load from safetensors? Convert to IRPA?
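As a quick way to confirm which version a given file is before handing it to the runtime (a small sketch based on the public GGUF spec, not part of sharktank): the file starts with the 4-byte magic "GGUF" followed by a little-endian uint32 version field.

```python
import struct

def gguf_version(path):
    """Read the GGUF header: b"GGUF" magic, then a little-endian uint32 version."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file")
        (version,) = struct.unpack("<I", f.read(4))
    return version

print(gguf_version("/tmp/open_llama_3b_v2_gguf/open-llama-3b-v2-f16.gguf"))
```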