[Docs][Tuner] Add initial tuning spec docs (iree-org#19462)
General intro, list the main flags, show an example. Issue: iree-org#19214
Showing 3 changed files with 138 additions and 0 deletions.
---
icon: octicons/meter-16
---

# Tuning

This page documents support for IREE dispatch tuning. The compiler supports
both default and user-provided tuning specs (specifications) that override the
compiler heuristics guiding dispatch code generation. In our experience, tuning
specs can provide a meaningful speedup in model execution. For example, we
achieved a ~10% improvement on the Stable Diffusion XL (SDXL) model with the
MI300X GPU.

## Tuning specs

The default specs are shipped with the IREE compiler and are target-specific.
We aim to provide default tuning specs that cover the most in-demand hardware
and dispatches from the most popular ML models, although we do not guarantee
completeness.

User-provided tuning specs are a mechanism that allows users to get the best
performance on custom models and hardware targets without having to modify the
compiler source code or needlessly special-case compiler heuristics.

Currently, the dispatch tuner that generates tuning specs is still experimental
and hosted
[in an external repo](https://github.com/nod-ai/shark-ai/tree/main/tuner). This
document describes how to work with tuning specs generated by the SHARK Tuner
or produced manually, but it does not go into detail on how to generate these
specs.

## Flags

The use of tuning specs in `iree-compile` is controlled with the following
flags:

* `--iree-codegen-enable-default-tuning-specs` -- enables or disables the
  default tuning specs shipped with the compiler.
* `--iree-codegen-tuning-spec-path` -- loads a user-specified tuning spec.
* `--iree-codegen-dump-tuning-specs-to` -- dumps the final tuning specs to a
  directory or to standard output.

Note that both default and user-provided specs can be enabled at the same time.
The compiler will link them together and invoke the user-provided spec before
attempting the default one.
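For example, both kinds of specs can be enabled in a single `iree-compile`
invocation. The following sketch is illustrative only: `model.mlir` and
`my_spec.mlir` are hypothetical file names, target-selection flags are omitted,
and passing `-` to the dump flag is assumed to select standard output:

```shell
# Compile with both the default specs and a user-provided spec enabled,
# and print the final (linked) tuning spec for inspection.
# (Target-selection flags omitted for brevity.)
iree-compile model.mlir -o model.vmfb \
  --iree-codegen-enable-default-tuning-specs=true \
  --iree-codegen-tuning-spec-path=my_spec.mlir \
  --iree-codegen-dump-tuning-specs-to=-
```

Dumping the final specs is a convenient way to verify what the compiler ends up
using after linking.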
## Anatomy of a tuning spec

### Example
```mlir
module @my_spec attributes { transform.with_named_sequence } {
  transform.named_sequence @apply_op_config(%op: !transform.any_op {transform.readonly},
                                            %config: !transform.any_param {transform.readonly}) {
    transform.annotate %op "compilation_info" = %config : !transform.any_op, !transform.any_param
    transform.yield
  }
  transform.named_sequence
  @match_mmt_f16_f16_f32(%root: !transform.any_op {transform.readonly}) -> !transform.any_op {
    transform.match.operation_name %root ["linalg.generic"] : !transform.any_op
    %ins, %outs = transform.iree.match.cast_compatible_dag_from_root %root {
    ^bb0(%lhs: tensor<?x?xf16>, %rhs: tensor<?x?xf16>, %out: tensor<?x?xf32>):
      %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>,
                                            affine_map<(d0, d1, d2) -> (d1, d2)>,
                                            affine_map<(d0, d1, d2) -> (d0, d1)>],
                           iterator_types = ["parallel", "parallel", "reduction"]}
        ins(%lhs, %rhs : tensor<?x?xf16>, tensor<?x?xf16>) outs(%out : tensor<?x?xf32>) {
      ^bb0(%in: f16, %in_0: f16, %acc: f32):
        %8 = arith.extf %in : f16 to f32
        %9 = arith.extf %in_0 : f16 to f32
        %10 = arith.mulf %8, %9 : f32
        %11 = arith.addf %acc, %10 : f32
        linalg.yield %11 : f32
      } -> tensor<?x?xf32>
    } : (!transform.any_op) -> (!transform.any_value, !transform.any_value)
    transform.yield %root : !transform.any_op
  }
  transform.named_sequence
  @match_mmt_2048x1280x5120_f16_f16_f32(%matmul: !transform.any_op {transform.readonly})
    -> (!transform.any_op, !transform.any_param) {
    %mmt = transform.include @match_mmt_f16_f16_f32 failures(propagate) (%matmul)
      : (!transform.any_op) -> !transform.any_op
    %lhs = transform.get_operand %matmul[0] : (!transform.any_op) -> !transform.any_value
    %rhs = transform.get_operand %matmul[1] : (!transform.any_op) -> !transform.any_value
    transform.iree.match.cast_compatible_type %lhs = tensor<2048x5120xf16> : !transform.any_value
    transform.iree.match.cast_compatible_type %rhs = tensor<1280x5120xf16> : !transform.any_value
    %config = transform.param.constant #iree_codegen.compilation_info<
      lowering_config = #iree_gpu.lowering_config<{promote_operands = [0, 1],
                                                   mma_kind = #iree_gpu.mma_layout<MFMA_F32_16x16x16_F16>,
                                                   subgroup_m_count = 2, subgroup_n_count = 2,
                                                   reduction = [0, 0, 64],
                                                   workgroup = [64, 128, 0]}>,
      translation_info = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute
                          workgroup_size = [256, 1, 1] subgroup_size = 64,
                          {gpu_pipeline_options = #iree_gpu.pipeline_options<prefetch_shared_memory = true>}>
      > -> !transform.any_param
    transform.yield %matmul, %config : !transform.any_op, !transform.any_param
  }
  transform.named_sequence
  @__kernel_config(%variant_op: !transform.any_op {transform.consumed}) -> !transform.any_op
    attributes { iree_codegen.tuning_spec_entrypoint } {
    %res = transform.foreach_match in %variant_op
      @match_mmt_2048x1280x5120_f16_f16_f32 -> @apply_op_config
      : (!transform.any_op) -> !transform.any_op
    transform.yield %res : !transform.any_op
  }
}
```
### Explanation

Tuning specs are
[transform dialect](https://mlir.llvm.org/docs/Dialects/Transform/) libraries
that conform to the following format:

* All tuning spec entry points (named sequence ops) are marked with the
  `iree_codegen.tuning_spec_entrypoint` attribute. They have a single argument
  of type `!transform.any_op` and return a single value of type
  `!transform.any_op`.
* All entry points in the final tuning specs must either read
  (`transform.readonly`) or consume (`transform.consumed`) the argument.

The tuning spec above attempts to match `linalg.generic` ops that correspond to
a matmul with the RHS operand transposed (a.k.a. mmt) of shape
`2048x1280x5120`, with `f16` operand element types and an `f32` result element
type.

If the match succeeds, the tuning spec applies the `compilation_info` attribute
that drives the code generation. This attribute is considered a compiler
implementation detail; in general, each codegen pipeline has its own
requirements as to what constitutes valid compilation info and how to interpret
it.

Tuning specs are executed by the `Materialize User Configs` pass.
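For reference, a minimal spec that satisfies the format described above might
look like the sketch below. The module name is illustrative, and the entry
point performs no matching; it simply yields the root op without applying any
configuration:

```mlir
// A no-op tuning spec sketch: a single entry point, marked with the
// entrypoint attribute, that reads its argument and returns it unchanged.
module @minimal_spec attributes { transform.with_named_sequence } {
  transform.named_sequence
  @__kernel_config(%root: !transform.any_op {transform.readonly}) -> !transform.any_op
    attributes { iree_codegen.tuning_spec_entrypoint } {
    transform.yield %root : !transform.any_op
  }
}
```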