
Conversation


@Lukinon Lukinon commented Oct 28, 2025

This PR introduces initial support for running tilelang on the Hygon DCU backend (tested on the BW200 platform). The key changes include:

  • Integration of Hygon DCU as a supported backend in tilelang.
  • Implementation of Matrix Core instruction support, enabling tilelang to generate and execute DCU-specific matrix operations.
  • Successful execution of existing examples on the DCU, demonstrating functional correctness.
  • Addition of dedicated DCU test cases under the testing/ directory to validate backend functionality and ensure future compatibility.

Summary by CodeRabbit

  • New Features

    • Added DCU-target optimizations and a new high-performance GEMM path (MMAC/MFMA) for GPU matrix multiply, including FP8/bfloat16/int8 support and swizzle/layout improvements
    • New utilities for index conversion and GPU reductions/prefix-sum primitives
    • New example and tests demonstrating the accelerated GEMM path and correctness/benchmarking
  • Chores

    • Adjusted HIP/ROCm build and compilation defaults and flags for the DCU path

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀


coderabbitai bot commented Oct 28, 2025

Walkthrough

Adds DCU/MMAC GEMM support and DCU-targeted HIP device templates, new layout/fragments, MMAC intrinsic emission and codegen handling, a HIP vertical-slash index kernel, Python examples/tests, and several build/tooling adjustments for ROCm/DCU paths.

Changes

Cohort / File(s) Summary
DCU GEMM example & tests
examples/gemm/example_gemm_intrinsics_dcu.py, testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py
New example and tests implementing a TileLang MMAC GEMM path with swizzle layouts, MatrixCore intrin emitter usage, JIT kernel generation, profiling, benchmarking and correctness checks.
Vertical-slash index kernel
examples/minference/ops/vertical_slash_index.hip
New HIP kernel, device helpers, host launcher, and PyBind/Torch binding that compute block/column counts and indices from batched vertical/slash index inputs.
Layout factories
src/layout/gemm_layouts.cc, src/layout/layout.h
Added makeGemmFragmentCDCU fragment generator and its declaration; minor layout-path comment changes.
GEMM operator layout inference
src/op/gemm.cc
Selects CDCU fragment when TargetIsDCU(target) in GEMM InferLayout; added include.
Target utilities
src/target/utils.h, src/target/utils.cc
Added TargetIsDCU(Target) declaration and implementation (ROCm/gfx936 check). (Implementation references mcpu — review needed.)
HIP codegen & intrinsics
src/target/codegen_hip.cc, src/target/intrin_rule_hip.cc
Codegen: added handling for tl::tvm_mmac() intrinsic path (12-arg mmac emission) alongside mfma; injects HIP_ENABLE_WARP_SYNC_BUILTINS and switches includes to dcu_hip templates. Intrin rules: registered tir.hip.__shfl.
DCU HIP template headers
src/tl_templates/dcu_hip/common.h, .../core.hpp, .../copy.h, .../debug.h, .../gemm.h, .../ldsm.h, .../reduce.h, .../threadblock_swizzle.h, .../hip_fp8.h
New comprehensive dcu_hip header suite: type/vector aliases, FP8 wrappers, atomic helpers, async/cp primitives, device debug prints, MFMA/MMAC-capable gemm templates, reduce and rasterization helpers.
MMAC macro generator
tilelang/intrinsics/mmac_macro_generator.py
New MatrixCoreIntrinEmitter and MatrixCorePreshuffleIntrinEmitter for emitting MMAC/ldmatrix/stmatrix/mmac intrinsics with preshuffle support.
HIP toolchain / ROCm defaults
tilelang/contrib/hipcc.py, tilelang/contrib/rocm.py, tilelang/engine/lower.py
Compilation flags adjusted (use -O1, add -Wno-invalid-constexpr), default ROCm path changed to /opt/dtk, and HIP compile flags updated in lowering.
TileLang contrib: HIP compilation
tilelang/contrib/hipcc.py
Lowered optimization level and added constexpr warning suppression flag.

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python (tl_matmul)
    participant Lower as TileLang Lowering
    participant Codegen as HIP Codegen
    participant Templates as dcu_hip templates
    participant Device as GPU kernel

    Py->>Lower: Request JIT GEMM (MatrixCoreIntrinEmitter)
    Lower->>Codegen: Lower to emit intrinsics
    Codegen->>Codegen: if TargetIsDCU -> select CDCU fragment & emit tl::tvm_mmac() path
    Codegen->>Templates: Include dcu_hip templates (ldmatrix/mmac/stmatrix)
    Codegen-->>Device: Emit HIP kernel with __builtin_amdgcn_mmac_* calls
    Device->>Device: ldmatrix A/B -> mmac (accumulate) -> stmatrix -> store C
    Device-->>Py: Return kernel, profiler, results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Files requiring extra attention:
    • src/target/codegen_hip.cc: correctness of the new tl::tvm_mmac() emission, dtype mapping, and duplicated insertion sites.
    • src/target/utils.cc: TargetIsDCU implementation references mcpu without visible declaration.
    • src/tl_templates/dcu_hip/gemm.h and related templates: verify indexing, swizzle, mfma/mmac usage and type specializations (int8, fp16, bf16, FP8).
    • tilelang/intrinsics/mmac_macro_generator.py: ldmatrix/stmatrix index maps and preshuffle edge cases.
    • examples/minference/ops/vertical_slash_index.hip: grid/launch correctness and PyBind interface.

Possibly related PRs

Suggested labels

enhancement

Poem

🐰 Hop, I tuned the DCU trail,

MMAC sparks where fragments sail,
Swizzled tiles and kernels hum,
Threads align — the accumulations come,
Code compiled, the little rabbit drums ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 10.28%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title directly and clearly describes the main objective of the PR: adding support for the Hygon DCU backend to tilelang, which is the primary focus of all the changes.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

🧹 Nitpick comments (16)
tilelang/contrib/rocm.py (1)

230-230: Note: Inconsistent default paths across the file.

While this change targets DCU support, other functions in this file still default to /opt/rocm (see lines 57, 140-143, 280). For a complete DCU integration, you may want to apply similar path detection logic to those functions as well, or alternatively use find_rocm_path() (which checks environment variables) as the source of truth for the installation path.
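
A possible shape for that fallback (a sketch, not the PR's implementation; it reuses the module's existing find_rocm_path for the environment-variable lookup and only then probes the hardcoded roots):

import os

def _resolve_rocm_root(candidates=("/opt/dtk", "/opt/rocm")):
    # Prefer whatever the environment-variable based lookup finds.
    try:
        path = find_rocm_path()
    except Exception:
        path = None
    if path and os.path.exists(path):
        return path
    # Otherwise probe the known DTK/ROCm install roots in order.
    for root in candidates:
        if os.path.exists(root):
            return root
    return candidates[-1]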

examples/minference/ops/vertical_slash_index.hip (3)

1-14: Remove unused include and tighten device helper annotations.

ATen/dtk_macros.h appears unused and may break some builds. Also consider adding restrict/inline hints for device helpers.

Apply:

-#include <ATen/dtk_macros.h>
+#include <ATen/ATen.h>

Optional tweaks:

-__device__ void save_blocks(
+__device__ __forceinline__ void save_blocks(

And annotate kernel params as restrict for better codegen:

-    const int* seqlens,
+    const int* __restrict__ seqlens,

(Apply restrict to other pointer params similarly.)


21-35: Consider documenting input ordering and monotonicity assumptions.

Kernel assumes slash/vertical indexes are per-(batch, head) slices and monotonically increasing; otherwise logic breaks. Add comments or TORCH_CHECKs to assert sortedness in debug paths.

Would you like a lightweight CPU validator to assert monotonicity/sizes before launch?
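
For example, a minimal host-side validator along these lines could run before the launch (a sketch only; shapes and dtypes are inferred from the kernel's pointer arithmetic, and the expected sort direction for slash_indexes should be adjusted to whatever the kernel actually assumes):

import torch

def validate_vertical_slash_inputs(seqlens, vertical_indexes, slash_indexes):
    # Shapes assumed from the kernel's indexing:
    #   seqlens:          [BATCH]                 int32
    #   vertical_indexes: [BATCH, N_HEADS, NNZ_V] int32
    #   slash_indexes:    [BATCH, N_HEADS, NNZ_S] int32
    assert seqlens.dtype == torch.int32 and seqlens.dim() == 1
    assert vertical_indexes.dim() == 3 and slash_indexes.dim() == 3
    assert vertical_indexes.size(0) == seqlens.size(0) == slash_indexes.size(0)
    # Per-(batch, head) sortedness; flip the comparison if the kernel expects
    # descending slash indexes.
    assert bool((vertical_indexes.diff(dim=-1) >= 0).all()), \
        "vertical_indexes must be sorted per (batch, head) slice"
    assert bool((slash_indexes.diff(dim=-1) >= 0).all()), \
        "slash_indexes must be sorted per (batch, head) slice"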


95-113: Optional: add launch_bounds to guide occupancy.

Since N_THREADS is fixed at 64, annotate kernel for better occupancy/compile-time checks.

Apply:

-__global__ void convert_vertical_slash_indexes_kernel(
+__launch_bounds__(64)
+__global__ void convert_vertical_slash_indexes_kernel(
tilelang/intrinsics/mmac_macro_generator.py (2)

19-31: Consider annotating class-level constants with ClassVar.

The dtype_abbrv dictionary is a class-level constant that should be annotated with typing.ClassVar to clarify intent and satisfy static analysis.

Apply this diff:

+from typing import ClassVar
+
 class MatrixCoreIntrinEmitter(object):
     """
     To eliminate Python syntax within TIR Macro.
     """
 
     M_DIM = 16
     N_DIM = 16
     WARP_SIZE = 64
-    dtype_abbrv = {
+    dtype_abbrv: ClassVar[dict[str, str]] = {
         "float16": "fp16",
         "bfloat16": "bf16",
         "float32": "fp32",
         "int8": "int8",
         "int32": "int32",
         "float8_e4m3": "e4m3",
         "float8_e5m2": "e5m2",
         "float8_e4m3fnuz": "e4m3fnuz",
     }

582-582: Consider using consistent notation for thread transformations.

Lines 582 and 592 use bitwise operations ((tx & 15) >> 2) + ((tx & 3) << 2) + ((tx >> 4) << 4), while lines 293 and 304 use arithmetic operations (tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16 for the same transformation. Both are functionally equivalent but mixing styles reduces readability.

Also applies to: 592-592
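
For reference, a standalone check (not part of the PR) confirming the two forms agree for every lane id of a 64-wide wave:

# Exhaustive equivalence check over the 64 lane ids.
for tx in range(64):
    bitwise = ((tx & 15) >> 2) + ((tx & 3) << 2) + ((tx >> 4) << 4)
    arithmetic = (tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16
    assert bitwise == arithmetic

Whichever form is kept, using it consistently in both the base emitter and the preshuffle variant would make the mapping easier to audit.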

src/layout/gemm_layouts.cc (1)

750-750: Clarify the purpose of the commented alternative.

The commented-out call to makeHalfBankSwizzleLayout suggests an alternative swizzling strategy. Consider either:

  • Removing it if it's no longer relevant
  • Adding a brief comment explaining why this alternative exists and under what conditions it might be preferred
src/op/gemm.cc (1)

7-7: Remove unused include.

The <fstream> header appears to be unused in this file. Consider removing it to keep the codebase clean.

-#include <fstream>
src/tl_templates/dcu_hip/threadblock_swizzle.h (1)

1-46: Consider extracting the duplicated ceil_div lambda.

Both rasterization2DRow and rasterization2DColumn define identical ceil_div lambdas. While this duplication is minor and the functions are templates (so there's no runtime cost), extracting it to a shared helper would improve maintainability.

namespace tl {

namespace detail {
constexpr auto ceil_div = [](int a, int b) { return (a + b - 1) / b; };
}

template <int panel_width> TL_DEVICE dim3 rasterization2DRow() {
  const unsigned int block_idx = blockIdx.x + blockIdx.y * gridDim.x;
  const unsigned int grid_size = gridDim.x * gridDim.y;
  const unsigned int panel_size = panel_width * gridDim.x;
  const unsigned int panel_offset = block_idx % panel_size;
  const unsigned int panel_idx = block_idx / panel_size;
  const unsigned int total_panel = detail::ceil_div(grid_size, panel_size);
  // ... rest of implementation
}

template <int panel_width> TL_DEVICE dim3 rasterization2DColumn() {
  const unsigned int block_idx = blockIdx.x + blockIdx.y * gridDim.x;
  const unsigned int grid_size = gridDim.x * gridDim.y;
  const unsigned int panel_size = panel_width * gridDim.y;
  const unsigned int panel_offset = block_idx % panel_size;
  const unsigned int panel_idx = block_idx / panel_size;
  const unsigned int total_panel = detail::ceil_div(grid_size, panel_size);
  // ... rest of implementation
}

} // namespace tl
testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (1)

66-66: Remove unused variable.

The variable cache_write_shared is assigned but never used. Consider removing it to keep the code clean.

     shared_scope = "shared"
-    cache_write_shared = False
examples/gemm/example_gemm_intrinsics_dcu.py (1)

17-25: Swizzle enablement is overly strict

The 512-bit row check only enables swizzle for 64B rows (e.g., fp16, K=32). For int8 (row_bytes=32B) swizzle remains off, though the layout util supports 32B/64B/128B. Consider enabling when row_bytes % 32 == 0 and passing swizzle_bytes accordingly.
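
For illustration, the gate could be relaxed along these lines (a sketch; the helper name and how the chosen byte width is passed to get_swizzle_layout are assumptions):

def pick_swizzle_bytes(cols, dtype_bits):
    # Bytes occupied by one shared-memory row.
    row_bytes = cols * dtype_bits // 8
    if row_bytes % 32 != 0:
        return None  # fall back to the linear layout
    # Pick the widest variant the layout util supports (128B / 64B / 32B).
    for candidate in (128, 64, 32):
        if row_bytes % candidate == 0:
            return candidate

With the int8 case cited above (row_bytes = 32B) this yields 32, so the int8 path would get a swizzled layout instead of silently staying linear.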

src/tl_templates/dcu_hip/reduce.h (1)

44-53: AllReduce requires non-null red_buf for threads ≥ warpSize

When offset >= warpSize, you unconditionally write to red_buf. Document and assert this precondition, or switch to an internal shared-memory buffer.

Would you like me to wire a templated shared-memory scratch allocation path for this?

src/tl_templates/dcu_hip/copy.h (2)

58-62: Clarify fence selection in cp_async_wait.

The function defaults to async_gld_fence(N) with a commented alternative async_gld_sld_fence(N). Consider documenting when each fence type should be used, or provide template parameters to select the appropriate fence based on memory access patterns.


64-73: Unused template parameter pre_nop.

The template parameter pre_nop is declared but never used in the function body. Consider removing it if not needed, or add a TODO comment if it's reserved for future functionality.

src/tl_templates/dcu_hip/gemm.h (2)

32-51: Address const-correctness in bfloat16 MFMA.

Similar to the int8 case, lines 39-40 use const_cast to remove constness. Additionally, the manual loop (lines 43-46) to copy data could be simplified if alignment is guaranteed.

Consider:

  1. Remove const_cast and use const pointers throughout
  2. If alignment is guaranteed, use direct casting instead of manual copying:
-    short *b_short = reinterpret_cast<short *>(const_cast<bfloat16_t *>(b));
-    short *a_short = reinterpret_cast<short *>(const_cast<bfloat16_t *>(a));
+    const short *b_short = reinterpret_cast<const short *>(b);
+    const short *a_short = reinterpret_cast<const short *>(a);
     
-    // Copy the data
-    for (int i = 0; i < 4; ++i) {
-      b_vec[i] = b_short[i];
-      a_vec[i] = a_short[i];
-    }
+    b_vec = *reinterpret_cast<const bfloat16x4_vec *>(b_short);
+    a_vec = *reinterpret_cast<const bfloat16x4_vec *>(a_short);

72-72: Resolve commented-out static_assert.

Line 72 has a commented-out static_assert for clear_accum. Either uncomment it if the feature is unsupported, or remove it if support has been added.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60567ba and 9a64085.

📒 Files selected for processing (23)
  • examples/gemm/example_gemm_intrinsics_dcu.py (1 hunks)
  • examples/minference/ops/vertical_slash_index.hip (1 hunks)
  • src/layout/gemm_layouts.cc (2 hunks)
  • src/layout/layout.h (1 hunks)
  • src/op/gemm.cc (2 hunks)
  • src/target/codegen_hip.cc (3 hunks)
  • src/target/intrin_rule_hip.cc (2 hunks)
  • src/target/utils.cc (2 hunks)
  • src/target/utils.h (1 hunks)
  • src/tl_templates/dcu_hip/common.h (1 hunks)
  • src/tl_templates/dcu_hip/copy.h (1 hunks)
  • src/tl_templates/dcu_hip/core.hpp (1 hunks)
  • src/tl_templates/dcu_hip/debug.h (1 hunks)
  • src/tl_templates/dcu_hip/gemm.h (1 hunks)
  • src/tl_templates/dcu_hip/hip_fp8.h (1 hunks)
  • src/tl_templates/dcu_hip/ldsm.h (1 hunks)
  • src/tl_templates/dcu_hip/reduce.h (1 hunks)
  • src/tl_templates/dcu_hip/threadblock_swizzle.h (1 hunks)
  • testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (1 hunks)
  • tilelang/contrib/hipcc.py (1 hunks)
  • tilelang/contrib/rocm.py (1 hunks)
  • tilelang/engine/lower.py (1 hunks)
  • tilelang/intrinsics/mmac_macro_generator.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (13)
src/op/gemm.cc (2)
src/target/utils.cc (2)
  • TargetIsDCU (82-90)
  • TargetIsDCU (82-82)
src/layout/gemm_layouts.cc (4)
  • makeGemmFragmentCDCU (159-174)
  • makeGemmFragmentCDCU (159-161)
  • makeGemmFragmentCCDNA (176-191)
  • makeGemmFragmentCCDNA (176-178)
src/layout/layout.h (1)
src/layout/gemm_layouts.cc (2)
  • makeGemmFragmentCDCU (159-174)
  • makeGemmFragmentCDCU (159-161)
src/tl_templates/dcu_hip/reduce.h (2)
src/tl_templates/dcu_hip/core.hpp (20)
  • T (31-34)
  • T (37-40)
  • T (43-46)
  • T (69-72)
  • T (75-78)
  • T (81-84)
  • max (31-31)
  • max (37-37)
  • max (43-43)
  • max (49-49)
  • max (55-58)
  • max (55-55)
  • max (62-62)
  • min (69-69)
  • min (75-75)
  • min (81-81)
  • min (87-87)
  • min (93-96)
  • min (93-93)
  • min (100-100)
src/tl_templates/dcu_hip/common.h (1)
  • __half (113-113)
examples/gemm/example_gemm_intrinsics_dcu.py (10)
tilelang/intrinsics/mma_layout.py (1)
  • get_swizzle_layout (166-201)
tilelang/intrinsics/mmac_macro_generator.py (7)
  • MatrixCoreIntrinEmitter (14-396)
  • ldmatrix_a (227-266)
  • ldmatrix_a (453-525)
  • ldmatrix_b (268-312)
  • ldmatrix_b (527-602)
  • mmac (314-346)
  • stmatrix (348-396)
tilelang/transform/simplify.py (1)
  • simplify_prim_func (53-59)
tilelang/env.py (1)
  • disable_cache (271-272)
tilelang/language/allocate.py (2)
  • alloc_shared (24-39)
  • alloc_local (42-53)
tilelang/language/annotations.py (2)
  • annotate_layout (25-36)
  • use_swizzle (17-22)
tilelang/language/fill.py (1)
  • clear (24-48)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/profiler/__init__.py (1)
  • assert_allclose (77-146)
src/tl_templates/dcu_hip/threadblock_swizzle.h (3)
tilelang/carver/roller/rasterization.py (1)
  • panel_width (16-18)
tilelang/carver/roller/hint.py (1)
  • stride (45-46)
src/tl_templates/cuda/threadblock_swizzle.h (1)
  • rasterization2DColumn (25-41)
src/tl_templates/dcu_hip/common.h (2)
src/tl_templates/dcu_hip/gemm.h (4)
  • half (23-29)
  • bfloat16_t (32-51)
  • void (155-234)
  • void (236-296)
src/tl_templates/dcu_hip/reduce.h (3)
  • _Float16 (32-32)
  • T (92-164)
  • __half (29-29)
src/target/utils.h (1)
src/target/utils.cc (2)
  • TargetIsDCU (82-90)
  • TargetIsDCU (82-82)
src/tl_templates/dcu_hip/copy.h (1)
src/tl_templates/cuda/copy.h (1)
  • cp_async_wait (20-26)
src/tl_templates/dcu_hip/hip_fp8.h (1)
src/tl_templates/cuda/cuda_fp8.h (3)
  • fp8_e4_2_t (9-12)
  • make_fp8_e4_4_t (88-97)
  • make_fp8_e4_8_t (100-109)
testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (12)
tilelang/intrinsics/mma_layout.py (1)
  • get_swizzle_layout (166-201)
tilelang/intrinsics/mmac_macro_generator.py (7)
  • MatrixCoreIntrinEmitter (14-396)
  • ldmatrix_a (227-266)
  • ldmatrix_a (453-525)
  • ldmatrix_b (268-312)
  • ldmatrix_b (527-602)
  • mmac (314-346)
  • stmatrix (348-396)
tilelang/transform/simplify.py (1)
  • simplify_prim_func (53-59)
tilelang/testing/__init__.py (1)
  • set_random_seed (30-35)
tilelang/env.py (1)
  • disable_cache (271-272)
tilelang/language/allocate.py (2)
  • alloc_shared (24-39)
  • alloc_local (42-53)
tilelang/language/annotations.py (2)
  • annotate_layout (25-36)
  • use_swizzle (17-22)
tilelang/language/fill.py (1)
  • clear (24-48)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/copy.py (1)
  • copy (11-87)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
tilelang/jit/__init__.py (1)
  • compile (30-79)
src/tl_templates/dcu_hip/debug.h (1)
src/tl_templates/dcu_hip/common.h (1)
  • half_t (116-116)
src/tl_templates/dcu_hip/gemm.h (1)
tilelang/intrinsics/mfma_layout.py (1)
  • make_mfma_swizzle_layout (130-152)
tilelang/intrinsics/mmac_macro_generator.py (5)
tilelang/intrinsics/utils.py (1)
  • mfma_store_index_map (85-86)
tilelang/tileop/gemm/gemm_base.py (2)
  • k_pack (111-112)
  • chunk (63-64)
tilelang/language/kernel.py (2)
  • threads (215-219)
  • KernelLaunchFrame (95-226)
tilelang/intrinsics/mfma_layout.py (16)
  • shared_16x4_to_local_64x1_layout_A (6-8)
  • shared_4x16_to_local_64x1_layout_B (17-19)
  • shared_16x16_to_local_64x4_layout_A (46-49)
  • shared_16x16_to_local_64x4_layout_B (58-61)
  • shared_16x32_to_local_64x8_layout_A (88-91)
  • shared_16x32_to_local_64x8_layout_B (100-103)
  • shared_16x64_to_local_64x16_layout_A (112-115)
  • shared_16x64_to_local_64x16_layout_B (124-127)
  • thread_id_shared_access_64x1_to_16x4_layout_A (11-14)
  • thread_id_shared_access_64x1_to_4x16_layout_B (22-25)
  • thread_id_shared_access_64x4_to_16x16_layout_A (40-43)
  • thread_id_shared_access_64x4_to_16x16_layout_B (52-55)
  • thread_id_shared_access_64x8_to_16x32_layout_A (82-85)
  • thread_id_shared_access_64x8_to_16x32_layout_B (94-97)
  • thread_id_shared_access_64x16_to_16x64_layout_A (106-109)
  • thread_id_shared_access_64x16_to_16x64_layout_B (118-121)
tilelang/language/ast/ir.py (2)
  • index_map (1673-1679)
  • meta_var (1731-1750)
🪛 Ruff (0.14.1)
testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py

66-66: Local variable cache_write_shared is assigned to but never used

Remove assignment to unused variable cache_write_shared

(F841)

tilelang/intrinsics/mmac_macro_generator.py

22-31: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


96-96: Avoid specifying long messages outside the exception class

(TRY003)


194-194: Avoid specifying long messages outside the exception class

(TRY003)


253-253: Ambiguous variable name: l

(E741)


261-261: Ambiguous variable name: l

(E741)


294-294: Ambiguous variable name: l

(E741)


305-305: Ambiguous variable name: l

(E741)


481-481: Ambiguous variable name: l

(E741)


490-490: Ambiguous variable name: l

(E741)


509-509: Ambiguous variable name: l

(E741)


519-519: Ambiguous variable name: l

(E741)


555-555: Ambiguous variable name: l

(E741)


564-564: Ambiguous variable name: l

(E741)


583-583: Ambiguous variable name: l

(E741)


593-593: Ambiguous variable name: l

(E741)

🔇 Additional comments (29)
examples/minference/ops/vertical_slash_index.hip (1)

121-158: The review comment is incorrect—Python binding exists.

The pybind11 binding for convert_vertical_slash_indexes is already present in examples/minference/ops/kernels.cpp (lines 13-16), explicitly exposing the function to Python via m.def("convert_vertical_slash_indexes", ...). No action is needed.

Likely an incorrect or invalid review comment.

src/layout/layout.h (1)

153-155: LGTM! DCU fragment factory follows established patterns.

The new makeGemmFragmentCDCU declaration is consistent with existing GEMM fragment factories (CCDNA, Hopper, etc.) and has a corresponding implementation.

src/target/utils.h (1)

25-25: LGTM! Target predicate follows established conventions.

The TargetIsDCU declaration is consistent with existing target detection APIs.

src/target/intrin_rule_hip.cc (1)

243-251: LGTM! Intrinsic registration follows HIP patterns.

The tir.hip.__shfl registration correctly implements a non-sync shuffle variant with appropriate arguments (var, lane, width).

tilelang/intrinsics/mmac_macro_generator.py (3)

227-266: LGTM! Matrix load logic correctly handles transposed and non-transposed layouts.

The ldmatrix_a implementation properly:

  • Extracts thread bindings and applies reverse index mapping
  • Handles both transposed and non-transposed cases with appropriate coordinate calculations
  • Uses vectorized loads for efficiency

314-346: LGTM! MMAC intrinsic invocation correctly handles vectorized types.

The method properly:

  • Constructs vectorized data type strings when local_size > 1
  • Computes correct buffer offsets for A, B, and C matrices
  • Invokes the tvm_mmac intrinsic with appropriate layout and type parameters

293-293: Request manual verification of thread ID transformation logic for DCU MMAC B matrix.

Lines 293 and 304 in the base MatrixCoreIntrinEmitter.ldmatrix_b() method apply thread ID transformation: (tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16. This transformation is also present (in equivalent bit-shift form) at lines 582 and 592 in the derived class's preshuffle implementation, but notably absent from the preshuffle variant's main path (lines 554, 563).

The transformation appears intentional but lacks documentation or tests explaining:

  • Why DCU MMAC B-matrix requires thread reordering while A-matrix doesn't (ldmatrix_a uses plain tx)
  • Why the preshuffle optimization eliminates this transformation requirement
  • Whether this formula is architecturally correct for gfx936

Verify this transformation is correct for your target DCU architecture and document the design rationale.

tilelang/engine/lower.py (1)

109-109: Verify the optimization level change is intentional.

The addition of -O1 represents a significant reduction from the typical -O3 level. While this may be necessary for DCU compatibility or to work around compiler issues, it could negatively impact performance.

Please confirm:

  1. Is this optimization level required for DCU correctness?
  2. Have performance implications been evaluated?
  3. Can a higher optimization level be used once DCU support matures?
src/tl_templates/dcu_hip/ldsm.h (1)

1-3: LGTM! Minimal DCU HIP header structure is appropriate.

The header correctly uses #pragma once and includes the common DCU HIP definitions.

tilelang/contrib/hipcc.py (1)

64-65: Verify compiler flag changes for DCU.

Two significant changes:

  1. Optimization level: Changed to -O1 (matching the change in tilelang/engine/lower.py). This reduces optimization but may be required for DCU compatibility.
  2. Warning suppression: Added -Wno-invalid-constexpr. This suggests the generated HIP code contains constexpr usage that doesn't meet HIP compiler requirements.

Please confirm:

  • Are these flags specifically required for DCU/gfx936 targets?
  • What constexpr issues does the warning suppression address?
  • Is there a plan to fix the underlying constexpr issues rather than suppressing warnings?
src/layout/gemm_layouts.cc (1)

159-174: LGTM! DCU-specific fragment layout follows the established pattern.

The new makeGemmFragmentCDCU function mirrors the structure of makeGemmFragmentCCDNA with the key distinction being the final Repeat call parameters (true, true) vs (true, false). This differentiation aligns with DCU-specific layout requirements while maintaining consistency with the existing codebase architecture.

src/op/gemm.cc (1)

831-840: LGTM! DCU fragment selection follows the established pattern.

The conditional selection between makeGemmFragmentCDCU and makeGemmFragmentCCDNA based on TargetIsDCU(T.target) correctly routes DCU targets to their specialized fragment path while preserving existing CDNA behavior.

src/target/codegen_hip.cc (2)

140-140: LGTM! Macro enables required warp synchronization features.

Defining HIP_ENABLE_WARP_SYNC_BUILTINS before including hip_runtime.h is necessary for accessing warp-level synchronization primitives on HIP.


150-155: LGTM! Include path updates align with DCU-specific template organization.

The switch from tl_templates/hip/... to tl_templates/dcu_hip/... correctly routes to the DCU-specific implementations introduced in this PR.

testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (2)

15-27: LGTM! Swizzle layout helper correctly implements bank conflict avoidance.

The make_swizzle_layout function appropriately checks for 512-bit row alignment before applying swizzling, ensuring optimal shared memory access patterns for DCU.


30-179: LGTM! Matrix multiplication kernel demonstrates proper DCU intrinsic usage.

The tl_matmul function correctly orchestrates:

  • Shared memory allocation with appropriate swizzling
  • Pipelined outer loop for K-dimension blocking
  • MMAC intrinsic-based computation via MatrixCoreIntrinEmitter
  • Proper store-back from local fragments to global memory

The implementation serves as a solid reference for DCU-based GEMM operations.

examples/gemm/example_gemm_intrinsics_dcu.py (1)

123-124: Confirm panel_size=10 is supported by the swizzle pattern

Threadblock swizzle device function is templated by panel size; many implementations assume powers-of-two. Please confirm 10 is valid for your target, or change to 8/16.

src/tl_templates/dcu_hip/core.hpp (1)

28-66: LGTM for min/max helpers and host/device guards

The overload set covers scalar and variadic forms with proper device specializations for float/double.

Also applies to: 68-105

src/tl_templates/dcu_hip/copy.h (4)

1-14: LGTM: Type aliases and includes are well-structured.

The type aliases are clear and the use of ck_tile::int32x4_t from the common header provides consistency across the DCU backend.


16-31: LGTM: Buffer resource construction follows DCU patterns.

The buffer_resource struct and make_wave_buffer_resource function correctly construct a buffer descriptor and normalize lanes using __builtin_amdgcn_readfirstlane, which is the appropriate pattern for ensuring uniform values across a wave.


33-39: LGTM: M0 register manipulation is correct.

These helpers correctly use inline assembly to manipulate the M0 register, which is standard practice for controlling LDS (Local Data Share) operations on AMDGPU.


82-86: Pointer arithmetic analysis verified—clarify or simplify N=4 access pattern.

Your analysis is mathematically correct. The pointer operations cause all threads to access the same location:

  • Base: global_base_ptr - threadIdx.x * 4 (in bytes)
  • Offset: threadIdx.x * 4 (i.e., threadIdx.x * N when N = 4)
  • Effective: (global_base_ptr - threadIdx.x * 4) + threadIdx.x * 4 = global_base_ptr

This same pattern appears in both cp_async_gs and cp_async_gs_conditional functions in src/tl_templates/dcu_hip/copy.h (lines 84-85, 102-103) and mirrors the same code in src/tl_templates/hip/copy.h.

The unusual pointer subtraction followed by offset reinstatement should either be:

  1. Documented to clarify the wave-level intent, or
  2. Simplified to make_wave_buffer_resource(global_base_ptr), threadIdx.x * N if per-thread access is intended.

Given this pattern's consistency across multiple locations, verify whether this is intentional wave-level behavior or an unintended cancellation that needs correction.

src/tl_templates/dcu_hip/debug.h (2)

1-99: LGTM: Debug print specializations are comprehensive and type-safe.

The debug_print_var specializations cover a good range of types with appropriate format specifiers. The casting of narrow types (char, short) to int/unsigned int for printf is correct practice for variadic functions.


101-191: LGTM: Buffer value printing is well-implemented.

The debug_print_buffer_value specializations correctly handle buffer context along with value printing. The half_t specialization appropriately converts to float for printf compatibility.

src/tl_templates/dcu_hip/gemm.h (5)

23-29: LGTM: Half-precision MFMA correctly uses vector types.

The specialization appropriately casts to float16x4 pointers for the MFMA intrinsic. This assumes proper alignment, which should be guaranteed by the calling context.


53-64: LGTM: FP8 specialization correctly handles const pointers.

The FP8 MFMA specialization properly uses reinterpret_cast<const int64_t *> without const_cast violations. Good example of proper const-correctness.


117-139: LGTM: Swizzle layout matches reference implementation.

The make_mfma_swizzle_layout function correctly implements the bank conflict avoidance pattern consistent with the Python reference in tilelang/intrinsics/mfma_layout.py.


155-234: LGTM: GEMM body implements standard MFMA pattern.

The shared-to-register GEMM computation correctly implements the fetch-and-compute pattern for MFMA instructions. The lane ID swizzling (line 168) and transpose logic are appropriate for the MFMA 16x16 instruction.


236-323: LGTM: Register-shared GEMM and wrappers are correctly implemented.

The body_rs function appropriately handles the case where matrix A is already in registers, and the free function wrappers provide a clean API for both shared-shared and register-shared GEMM variants.

Comment on lines +50 to +52
if out_dtype == "int32":
micro_size_k = 32


⚠️ Potential issue | 🟠 Major

micro_size_k should depend on in_dtype, not out_dtype

Int8 paths require K-fragment=32 regardless of the final out dtype. Gate on in_dtype to avoid incorrect tiling when out_dtype is int32 but inputs aren’t int8.

Apply this diff:

-    if out_dtype == "int32":
-        micro_size_k = 32
+    if in_dtype == "int8":
+        micro_size_k = 32
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if out_dtype == "int32":
micro_size_k = 32
if in_dtype == "int8":
micro_size_k = 32
🤖 Prompt for AI Agents
In examples/gemm/example_gemm_intrinsics_dcu.py around lines 50 to 52, the code
sets micro_size_k based on out_dtype but it should be gated on in_dtype: change
the condition to check if in_dtype == "int8" (or equivalent int8 type sentinel
used elsewhere) and set micro_size_k = 32 for int8 input paths; leave other
dtype branches unchanged so int8 inputs always use K-fragment=32 even if
out_dtype is int32.

Comment on lines +65 to +68
block_M = block_row_warps * warp_row_tiles
block_N = block_col_warps * warp_col_tiles
block_K = chunk


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Tile edge handling missing; kernel reads/writes out of bounds for non-multiple sizes

Grid uses ceildiv, but loads/stores assume perfect multiples of block_M/N/K. Either add guards in load/store loops or assert divisibility.

Minimal safety precondition (quick fix):

     block_K = chunk
@@
-    A_shape = (M, K)
+    # Precondition for this example kernel (simplifies control flow)
+    assert M % block_M == 0 and N % block_N == 0 and K % block_K == 0, \
+        "Example kernel currently requires M,N,K to be multiples of block sizes"
+    A_shape = (M, K)

Recommended follow-up (optional): add bounds checks in the A/B loads and C stores so arbitrary M,N,K work. I can draft that if you want.

Also applies to: 108-159


I need to examine the actual code to verify the concerns raised in the review comment.


🏁 Script executed:

#!/bin/bash
# Retrieve the file and examine the kernel implementation
if [ -f "examples/gemm/example_gemm_intrinsics_dcu.py" ]; then
    echo "=== File Content ===" 
    cat examples/gemm/example_gemm_intrinsics_dcu.py
else
    echo "File not found at examples/gemm/example_gemm_intrinsics_dcu.py"
    find . -name "example_gemm_intrinsics_dcu.py" -type f 2>/dev/null
fi

Length of output: 5708


Add bounds-checking guards or assertion for non-multiple tensor dimensions

The kernel grid uses ceildiv (line 99) but load/store loops assume perfect multiples. With M, N, or K not divisible by block_M, block_N, or block_K, the loops read/write out of bounds. The current test (M=N=K=16384) masks this because all are divisible by their block sizes.

Either add bounds checks in the A/B loads (lines 117–121) and C store (lines 138–145), or add an early assertion:

     block_K = chunk
@@
+    assert M % block_M == 0 and N % block_N == 0 and K % block_K == 0, \
+        "Example kernel requires M, N, K to be multiples of block sizes (M:%d, N:%d, K:%d; block_M:%d, block_N:%d, block_K:%d)" % (M, N, K, block_M, block_N, block_K)
     A_shape = (M, K)
🤖 Prompt for AI Agents
In examples/gemm/example_gemm_intrinsics_dcu.py around lines 65–68, the computed
block sizes (block_M, block_N, block_K) are used later by loads/stores that
assume tensor dimensions are exact multiples; add either runtime assertions
early (assert M % block_M == 0 and N % block_N == 0 and K % block_K == 0) or,
preferably, add bounds-check guards around A/B loads (lines ~117–121) and the C
store (lines ~138–145): before reading A or B elements check the computed global
row/col indices against M/N/K and substitute zero (or a safe value) for
out-of-bounds loads; before writing C check indices and skip stores outside M/N,
ensuring no out-of-bounds memory access.

Comment on lines +40 to +54
int seqlen = seqlens[batch_idx];
int block_idx_m = group_idx * blockDim.x + threadIdx.x;
int start_m = block_idx_m * BLOCK_SIZE_M;
if (start_m >= seqlen) {
return;
}
int end_m = start_m + BLOCK_SIZE_M;
vertical_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_V;
slash_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_S;
int row_offset = (batch_idx * N_HEADS + head_idx) * N_ROWS + block_idx_m;
block_count += row_offset;
block_offset += row_offset * NNZ_S;
column_count += row_offset;
column_index += row_offset * NNZ_V;


⚠️ Potential issue | 🔴 Critical

Guard against row index overflow.

Threads with block_idx_m ≥ N_ROWS can still compute start_m < seqlen (whenever seqlen exceeds N_ROWS * BLOCK_SIZE_M), causing an out-of-bounds row_offset. Add an explicit guard.

Apply:

   int seqlen = seqlens[batch_idx];
   int block_idx_m = group_idx * blockDim.x + threadIdx.x;
+  if (block_idx_m >= N_ROWS) {
+      return;
+  }
   int start_m = block_idx_m * BLOCK_SIZE_M;
   if (start_m >= seqlen) {
       return;
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int seqlen = seqlens[batch_idx];
int block_idx_m = group_idx * blockDim.x + threadIdx.x;
int start_m = block_idx_m * BLOCK_SIZE_M;
if (start_m >= seqlen) {
return;
}
int end_m = start_m + BLOCK_SIZE_M;
vertical_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_V;
slash_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_S;
int row_offset = (batch_idx * N_HEADS + head_idx) * N_ROWS + block_idx_m;
block_count += row_offset;
block_offset += row_offset * NNZ_S;
column_count += row_offset;
column_index += row_offset * NNZ_V;
int seqlen = seqlens[batch_idx];
int block_idx_m = group_idx * blockDim.x + threadIdx.x;
if (block_idx_m >= N_ROWS) {
return;
}
int start_m = block_idx_m * BLOCK_SIZE_M;
if (start_m >= seqlen) {
return;
}
int end_m = start_m + BLOCK_SIZE_M;
vertical_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_V;
slash_indexes += (batch_idx * N_HEADS + head_idx) * NNZ_S;
int row_offset = (batch_idx * N_HEADS + head_idx) * N_ROWS + block_idx_m;
block_count += row_offset;
block_offset += row_offset * NNZ_S;
column_count += row_offset;
column_index += row_offset * NNZ_V;
🤖 Prompt for AI Agents
In examples/minference/ops/vertical_slash_index.hip around lines 40 to 54,
threads where block_idx_m >= N_ROWS can still have start_m < seqlen and will
compute row_offset and use it causing out-of-bounds accesses; add an explicit
guard right after computing block_idx_m (before computing row_offset and any
row-dependent offsets) that returns when block_idx_m >= N_ROWS so subsequent
uses of row_offset, block_count/offset, column_count/index are safe.

Comment on lines +55 to +62
int tmp_col_cnt = 0, tmp_blk_cnt = 0;
int s = 0, v = 0;
int v_idx = vertical_indexes[v++];
int s_idx = slash_indexes[s++];
while (s_idx >= end_m) {
s_idx = slash_indexes[s++];
}
s_idx = max(end_m - s_idx, BLOCK_SIZE_M);

⚠️ Potential issue | 🔴 Critical

Fix OOB reads when NNZ_S/NNZ_V are zero and bound the pre-scan.

Accessing vertical_indexes[v++] and slash_indexes[s++] without checking NNZ_* risks OOB. The pre-loop while also lacks a bound on s.

Apply:

-    int tmp_col_cnt = 0, tmp_blk_cnt = 0;
-    int s = 0, v = 0;
-    int v_idx = vertical_indexes[v++];
-    int s_idx = slash_indexes[s++];
-    while (s_idx >= end_m) {
-        s_idx = slash_indexes[s++];
-    }
-    s_idx = max(end_m - s_idx, BLOCK_SIZE_M);
+    int tmp_col_cnt = 0, tmp_blk_cnt = 0;
+    int s = 0, v = 0;
+    // Safe init of v_idx
+    int v_idx = (NNZ_V > 0) ? vertical_indexes[v++] : (end_m + BLOCK_SIZE_M);
+    // Handle NNZ_S == 0 early
+    if (NNZ_S == 0) {
+        block_count[0] = 0;
+        column_count[0] = 0;
+        return;
+    }
+    int s_idx = slash_indexes[s++];
+    while (s < NNZ_S && s_idx >= end_m) {
+        s_idx = slash_indexes[s++];
+    }
+    if (s_idx >= end_m) {
+        // No slash indices relevant for this row
+        block_count[0] = 0;
+        column_count[0] = 0;
+        return;
+    }
+    s_idx = max(end_m - s_idx, BLOCK_SIZE_M);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
int tmp_col_cnt = 0, tmp_blk_cnt = 0;
int s = 0, v = 0;
int v_idx = vertical_indexes[v++];
int s_idx = slash_indexes[s++];
while (s_idx >= end_m) {
s_idx = slash_indexes[s++];
}
s_idx = max(end_m - s_idx, BLOCK_SIZE_M);
int tmp_col_cnt = 0, tmp_blk_cnt = 0;
int s = 0, v = 0;
// Safe init of v_idx
int v_idx = (NNZ_V > 0) ? vertical_indexes[v++] : (end_m + BLOCK_SIZE_M);
// Handle NNZ_S == 0 early
if (NNZ_S == 0) {
block_count[0] = 0;
column_count[0] = 0;
return;
}
int s_idx = slash_indexes[s++];
while (s < NNZ_S && s_idx >= end_m) {
s_idx = slash_indexes[s++];
}
if (s_idx >= end_m) {
// No slash indices relevant for this row
block_count[0] = 0;
column_count[0] = 0;
return;
}
s_idx = max(end_m - s_idx, BLOCK_SIZE_M);

Comment on lines +12 to +20
template <> struct MfmaTraits<int8_t> {
template <typename AccType>
static TL_DEVICE void mfma_op(const int8_t *b, const int8_t *a, AccType *c) {
int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));

*c = __builtin_amdgcn_mmac_i32_16x16x32i8(*b_packed, *a_packed, *c);
}
};

⚠️ Potential issue | 🟠 Major

Remove const_cast from const pointers.

Lines 15-16 use const_cast to remove constness from the input pointers before casting to int64_t*. While the MFMA intrinsic likely doesn't modify the data, casting away const violates const-correctness and can lead to undefined behavior if the underlying data is actually const.

The intrinsic should accept const pointers. If it doesn't, consider using reinterpret_cast<const int64_t*> directly without const_cast, or verify the intrinsic signature:

-    int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
-    int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));
+    const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);
+    const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);

Then update line 18 if needed based on the intrinsic's signature.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template <> struct MfmaTraits<int8_t> {
template <typename AccType>
static TL_DEVICE void mfma_op(const int8_t *b, const int8_t *a, AccType *c) {
int64_t *b_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(b));
int64_t *a_packed = reinterpret_cast<int64_t *>(const_cast<int8_t *>(a));
*c = __builtin_amdgcn_mmac_i32_16x16x32i8(*b_packed, *a_packed, *c);
}
};
template <> struct MfmaTraits<int8_t> {
template <typename AccType>
static TL_DEVICE void mfma_op(const int8_t *b, const int8_t *a, AccType *c) {
const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);
const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);
*c = __builtin_amdgcn_mmac_i32_16x16x32i8(*b_packed, *a_packed, *c);
}
};
🤖 Prompt for AI Agents
In src/tl_templates/dcu_hip/gemm.h around lines 12 to 20, remove the const_cast
usage that strips constness from the input pointers: change the local packed
pointer types to const int64_t* and use reinterpret_cast<const int64_t*>(b) and
reinterpret_cast<const int64_t*>(a) instead of const_cast; then pass *b_packed
and *a_packed to the MFMA intrinsic as before (or adjust the call if the
intrinsic requires non-pointer/value types) so that const-correctness is
preserved.

Comment on lines +46 to +55
__device__ fp8_e4_4_t make_fp8_e4_4_t(fp8_e4_t x, fp8_e4_t y, fp8_e4_t z,
fp8_e4_t w) {
// reinterpret the 4 fp8_e4_t values to signed char value and shift
signed char x_char = *reinterpret_cast<signed char *>(&x);
signed char y_char = *reinterpret_cast<signed char *>(&y);
signed char z_char = *reinterpret_cast<signed char *>(&z);
signed char w_char = *reinterpret_cast<signed char *>(&w);
int res = (w_char << 24) | (z_char << 16) | (y_char << 8) | x_char;
return *reinterpret_cast<fp8_e4_4_t *>(&res);
}

⚠️ Potential issue | 🟠 Major

Avoid UB from type-punning via reinterpret_cast; construct via fields

The current helpers reinterpret int buffers to fp8_e4_4_t/fp8_e4_8_t, which risks aliasing/alignment UB. Build via named members instead.

Apply this diff:

-__device__ fp8_e4_4_t make_fp8_e4_4_t(fp8_e4_t x, fp8_e4_t y, fp8_e4_t z,
-                                      fp8_e4_t w) {
-  // reinterpret the 4 fp8_e4_t values to signed char value and shift
-  signed char x_char = *reinterpret_cast<signed char *>(&x);
-  signed char y_char = *reinterpret_cast<signed char *>(&y);
-  signed char z_char = *reinterpret_cast<signed char *>(&z);
-  signed char w_char = *reinterpret_cast<signed char *>(&w);
-  int res = (w_char << 24) | (z_char << 16) | (y_char << 8) | x_char;
-  return *reinterpret_cast<fp8_e4_4_t *>(&res);
-}
+__device__ fp8_e4_4_t make_fp8_e4_4_t(fp8_e4_t x, fp8_e4_t y, fp8_e4_t z,
+                                      fp8_e4_t w) {
+  fp8_e4_4_t out{};
+  out.x = x; out.y = y; out.z = z; out.w = w;
+  return out;
+}
@@
-__device__ fp8_e4_8_t make_fp8_e4_8_t(fp8_e4_t x, fp8_e4_t y, fp8_e4_t z,
-                                      fp8_e4_t w, fp8_e4_t v, fp8_e4_t u,
-                                      fp8_e4_t t, fp8_e4_t s) {
-  signed char x_char = *reinterpret_cast<signed char *>(&x);
-  signed char y_char = *reinterpret_cast<signed char *>(&y);
-  signed char z_char = *reinterpret_cast<signed char *>(&z);
-  signed char w_char = *reinterpret_cast<signed char *>(&w);
-  signed char v_char = *reinterpret_cast<signed char *>(&v);
-  signed char u_char = *reinterpret_cast<signed char *>(&u);
-  signed char t_char = *reinterpret_cast<signed char *>(&t);
-  signed char s_char = *reinterpret_cast<signed char *>(&s);
-  int a = (w_char << 24) | (z_char << 16) | (y_char << 8) | x_char;
-  int b = (s_char << 24) | (t_char << 16) | (u_char << 8) | v_char;
-  fp8_e4_8_t res;
-  res.x = *reinterpret_cast<fp8_e4_4_t *>(&a);
-  res.y = *reinterpret_cast<fp8_e4_4_t *>(&b);
-  return res;
-}
+__device__ fp8_e4_8_t make_fp8_e4_8_t(fp8_e4_t x, fp8_e4_t y, fp8_e4_t z,
+                                      fp8_e4_t w, fp8_e4_t v, fp8_e4_t u,
+                                      fp8_e4_t t, fp8_e4_t s) {
+  fp8_e4_8_t out{};
+  out.x = make_fp8_e4_4_t(x, y, z, w);
+  out.y = make_fp8_e4_4_t(v, u, t, s);
+  return out;
+}

Also applies to: 57-74

🤖 Prompt for AI Agents
In src/tl_templates/dcu_hip/hip_fp8.h around lines 46-55 (and similarly 57-74),
the function uses reinterpret_cast to type-pun fp8 bytes into signed char and to
reinterpret an int as fp8_e4_4_t, which can cause aliasing/alignment UB; replace
the reinterpret_casts by directly constructing and returning the fp8_e4_4_t
using its named byte/member fields (assign x,y,z,w to the struct members in the
correct order) and similarly for the 8-wide variant, avoiding any pointer casts
or reinterpreting integers as the struct — ensure you preserve byte ordering and
use explicit member assignment to build the result.

Comment on lines +89 to +99
static_assert(threads == 1024 or threads == 512 or threads == 256 or
threads == 128 or threads == 64 or threads == 32);
template <typename T, int SEG = 32>
static TL_DEVICE T run(const T *__restrict__ src, T *__restrict__ dst, int H,
int W) {

constexpr int TILE_H = threads / SEG;
constexpr uint64_t MASK = 0xffffffffffffffffULL;
const int num_blocks = (H + TILE_H - 1) / TILE_H;
const int tid = threadIdx.x;
const int lane = tid % 64;

⚠️ Potential issue | 🔴 Critical

Fix compile/runtime issues in CumSum2D

  • Return type should be void (it writes into dst).
  • Don’t return; inside the row-bounds check; use continue;.
  • Use 32-bit mask type expected by HIP intrinsics.
  • Pass integer lane indices to __shfl_sync, not T-typed zeros.

Apply this diff:

-  template <typename T, int SEG = 32>
-  static TL_DEVICE T run(const T *__restrict__ src, T *__restrict__ dst, int H,
-                         int W) {
+  template <typename T, int SEG = 32>
+  static TL_DEVICE void run(const T *__restrict__ src, T *__restrict__ dst, int H,
+                            int W) {
@@
-    constexpr uint64_t MASK = 0xffffffffffffffffULL;
+    const unsigned MASK = 0xFFFFFFFFu;
@@
-      if (gRow >= H)
-        return;
+      if (gRow >= H)
+        continue;
@@
-          T segSum = (T)__shfl_sync(MASK, val, (T)0);
+          T segSum = (T)__shfl_sync(MASK, val, 0);
@@
-          carry = (T)__shfl_sync(MASK, carry, (T)0);
+          carry = (T)__shfl_sync(MASK, carry, 0);

Also applies to: 102-107, 131-135, 157-161

🤖 Prompt for AI Agents
In src/tl_templates/dcu_hip/reduce.h around lines 89-99 (and similarly adjust
102-107, 131-135, 157-161), fix the CumSum2D routine to: change the function
signature return type from T to void since it writes into dst; replace any early
"return;" inside the row-bounds check with "continue;" to let threads proceed to
next iteration; use a 32-bit mask type (uint32_t) for HIP shuffle/warp
intrinsics instead of uint64_t; and when calling __shfl_sync pass integer lane
indices (e.g., int lane or int offset) rather than T-typed zeros so the shuffle
arguments are integer types. Ensure these changes are applied consistently in
the other referenced line ranges.

B.T.to(torch.float32)).to(getattr(torch, out_dtype))
elif a_transposed and not b_transposed:
# Get Reference Result
ref_c = torch.matmul(A.Tto(torch.float32),

⚠️ Potential issue | 🔴 Critical

Fix syntax error in transpose operation.

Line 224 contains a typo: A.Tto(torch.float32) should be A.T.to(torch.float32). This will cause a runtime AttributeError.

-        ref_c = torch.matmul(A.Tto(torch.float32),
+        ref_c = torch.matmul(A.T.to(torch.float32),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ref_c = torch.matmul(A.Tto(torch.float32),
ref_c = torch.matmul(A.T.to(torch.float32),
🤖 Prompt for AI Agents
In testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py around line 224,
there's a typo in the transpose/convert chain: replace the incorrect attribute
call `A.Tto(torch.float32)` with `A.T.to(torch.float32)` so the tensor is first
transposed (A.T) and then converted to float32 (A.T.to(...)); update that call
accordingly to avoid the AttributeError.


@tvm.ffi.register_func("tvm_callback_rocm_get_arch", override=True)
def get_rocm_arch(rocm_path="/opt/rocm"):
def get_rocm_arch(rocm_path="/opt/dtk"):

⚠️ Potential issue | 🟠 Major

Breaking change: default path breaks standard ROCm installations.

Changing the default from /opt/rocm to /opt/dtk will cause existing ROCm users to fall back to the default gfx900 architecture instead of detecting their actual GPU. This could lead to suboptimal performance or compatibility issues.

Consider trying multiple paths in order to support both standard ROCm and Hygon DCU installations:

-def get_rocm_arch(rocm_path="/opt/dtk"):
+def get_rocm_arch(rocm_path=None):
     """Utility function to get the AMD GPU architecture
 
     Parameters
     ----------
     rocm_path : str
         The path to rocm installation directory
 
     Returns
     -------
     gpu_arch : str
         The AMD GPU architecture
     """
     gpu_arch = "gfx900"
+    # Try multiple common installation paths if not specified
+    if rocm_path is None:
+        for candidate in ["/opt/dtk", "/opt/rocm"]:
+            if os.path.exists(candidate):
+                rocm_path = candidate
+                break
+        else:
+            print("ROCm/DTK not detected, using default gfx900")
+            return gpu_arch
+    
     # check if rocm is installed
     if not os.path.exists(rocm_path):
-        print("ROCm not detected, using default gfx900")
+        print(f"ROCm/DTK not found at {rocm_path}, using default gfx900")
         return gpu_arch

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tilelang/contrib/rocm.py around line 230, the function get_rocm_arch
currently defaults to "/opt/dtk", which breaks standard ROCm installs; change
the implementation to try multiple install roots in order (e.g. ['/opt/rocm',
'/opt/dtk']) rather than a single hardcoded default: accept an optional path or
list, iterate over the candidate roots, attempt detection against each root
until one succeeds, fall back to existing gfx900 behavior only if none match,
and update the function signature/docstring to reflect that it checks multiple
standard locations.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (9)
src/target/utils.cc (1)

81-89: Undefined variable mcpu causes compilation error.

Line 86 references mcpu without declaring it. Following the pattern from TargetIsCDNA above (lines 70-79), you must extract the attribute first.

Apply this diff:

 bool TargetIsDCU(Target target) {
   if (!TargetIsRocm(target))
     return false;
   if (target->attrs.count("mcpu")) {
+    std::string mcpu = Downcast<String>(target->attrs.at("mcpu"));
     // if mcpu start with "gfx936", it is DCU
     return mcpu.find("gfx936") == 0;
   }
   return false;
 }
examples/gemm/example_gemm_intrinsics_dcu.py (3)

51-53: Incorrect dtype check for micro_size_k setting.

The condition checks out_dtype but should check in_dtype. Int8 inputs require K-fragment=32 regardless of the accumulator/output dtype.

Apply this diff:

-    if out_dtype == "int32":
+    if in_dtype == "int8":
         micro_size_k = 32

109-145: Missing bounds checks for non-multiple tensor dimensions.

The kernel grid uses ceildiv (line 109), but the A/B load loops (lines 131-136) and C store loop (lines 153-159) assume M, N, K are exact multiples of block_M, block_N, block_K. Edge blocks will read/write out of bounds.

Add an assertion to document the constraint:

     block_K = chunk
+    assert M % block_M == 0 and N % block_N == 0 and K % block_K == 0, \
+        f"Example requires M, N, K to be multiples of block sizes (M={M}, N={N}, K={K}; block_M={block_M}, block_N={block_N}, block_K={block_K})"
 
     A_shape = (M, K)

Or add bounds checks in the loops:

for i, k in T.Parallel(block_M, block_K):
    if by * block_M + i < M and ko * block_K + k < K:
        A_shared[i, k] = A[by * block_M + i, ko * block_K + k]

154-159: C_shared indexing is transposed.

C_shared is declared with shape (block_M // micro_size_x, block_N // micro_size_y, micro_size_x, micro_size_y) (lines 74-79), representing [M_tiles, N_tiles, tile_M, tile_N]. However, the access swaps the first two indices.

Apply this diff:

                 C[by * block_M + i, bx * block_N + j] = C_shared[
-                    j // micro_size_y,
-                    i // micro_size_x,  
+                    i // micro_size_x,
+                    j // micro_size_y,
                     i % micro_size_x,
                     j % micro_size_y,
                 ]
src/tl_templates/dcu_hip/reduce.h (1)

94-108: Fix compile/runtime issues in CumSum2D (previously flagged).

The issues identified in previous reviews are still present:

  • Return type should be void (function writes into dst)
  • Don't return; inside the row-bounds check at line 108; use continue;
  • Use 32-bit mask type (line 99): const unsigned MASK = 0xFFFFFFFFu; instead of uint64_t
  • Pass integer lane indices to __shfl_sync (lines 134, 137, 160, 163), not T-typed zeros

Apply the diff suggested in the previous review to fix these issues.
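Since that diff is not reproduced in this round, here is a minimal sketch of the requested shape, with illustrative names and a one-warp-per-row layout rather than the actual CumSum2D signature (assumes cols <= kWarpSize and that HIP's warp-sync builtins are available):

template <typename T, int kWarpSize = 64>
__device__ void row_inclusive_scan(T *dst, const T *src, int rows, int cols) {
  const unsigned MASK = 0xFFFFFFFFu;            // 32-bit full mask, not uint64_t
  const int lane = threadIdx.x % kWarpSize;
  const int warp = threadIdx.x / kWarpSize;
  const int n_warps = blockDim.x / kWarpSize;
  for (int r0 = 0; r0 < rows; r0 += n_warps) {
    const int r = r0 + warp;
    if (r >= rows)
      continue;                                 // skip this row; do not return
    T acc = (lane < cols) ? src[r * cols + lane] : T(0);
    for (int offset = 1; offset < kWarpSize; offset <<= 1) {
      // pass an integer lane index, not a T-typed zero
      T up = __shfl_sync(MASK, acc, lane - offset, kWarpSize);
      if (lane >= offset)
        acc += up;
    }
    if (lane < cols)
      dst[r * cols + lane] = acc;               // results go through dst; return type is void
  }
}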

src/tl_templates/dcu_hip/hip_fp8.h (1)

46-74: Avoid UB from type-punning via reinterpret_cast (previously flagged).

The issues identified in previous reviews are still present. The current helpers reinterpret int buffers to fp8 structures via reinterpret_cast, which risks aliasing/alignment undefined behavior. Build via named members instead as suggested in the previous review.

Apply the diff suggested in the previous review to construct the structures using their named fields directly.
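For reference, a hedged sketch of the named-member pattern; the two-element struct below is hypothetical and only stands in for the real fp8 wrapper layout:

// Hypothetical wrapper; the element type stands in for the fp8 storage byte.
struct pair_e4_t {
  unsigned char lo;
  unsigned char hi;
};

// Aliasing-prone (the flagged pattern): reinterpret an int buffer as the wrapper.
//   pair_e4_t v = *reinterpret_cast<const pair_e4_t *>(&packed_int);

// Aliasing-safe: populate the named members from the individual elements.
__device__ __forceinline__ pair_e4_t make_pair_e4(unsigned char lo, unsigned char hi) {
  pair_e4_t v;
  v.lo = lo;
  v.hi = hi;
  return v;
}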

src/tl_templates/dcu_hip/copy.h (1)

99-107: Fix buffer overflow in conditional zero-fill (previously flagged).

The issue identified in previous reviews is still present. When N=4, the else branch (line 105) writes 16 bytes via uint4, but only 4 bytes should be written based on the template parameter.

Apply the diff suggested in the previous review to write the correct number of bytes based on N.
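One possible shape for a size-correct fill, keyed off the template parameter N in bytes (helper name and dispatch are illustrative, not the file's actual code):

template <int N>
__device__ __forceinline__ void zero_fill_bytes(void *smem_ptr) {
  static_assert(N == 16 || N == 8 || N == 4, "unsupported fill size");
  if constexpr (N == 16) {
    *reinterpret_cast<uint4 *>(smem_ptr) = make_uint4(0, 0, 0, 0);
  } else if constexpr (N == 8) {
    *reinterpret_cast<uint2 *>(smem_ptr) = make_uint2(0, 0);
  } else {
    *reinterpret_cast<unsigned int *>(smem_ptr) = 0u;  // 4 bytes, not a 16-byte uint4
  }
}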

src/tl_templates/dcu_hip/gemm.h (2)

12-20: Remove const_cast from const pointers (previously flagged).

The issue identified in previous reviews is still present. Lines 15-16 use const_cast to remove constness from input pointers, which violates const-correctness and can lead to undefined behavior.

Apply the diff suggested in the previous review to use const int64_t* pointers without const_cast.
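In other words, something along these lines (pointer names assumed, not the file's actual identifiers):

const int64_t *a_packed = reinterpret_cast<const int64_t *>(a);
const int64_t *b_packed = reinterpret_cast<const int64_t *>(b);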


32-51: Similar const_cast issue in bfloat16 specialization.

Lines 39-40 also use const_cast to remove constness, similar to the int8 specialization. This should be fixed to preserve const-correctness.

Apply this diff:

-    short *b_short = reinterpret_cast<short *>(const_cast<bfloat16_t *>(b));
-    short *a_short = reinterpret_cast<short *>(const_cast<bfloat16_t *>(a));
+    const short *b_short = reinterpret_cast<const short *>(b);
+    const short *a_short = reinterpret_cast<const short *>(a);
🧹 Nitpick comments (4)
tilelang/intrinsics/mmac_macro_generator.py (2)

252-263: Ambiguous variable name l reduces readability.

The variable l (lowercase L) at lines 252, 260, and throughout the file is easily confused with the digit 1. Consider more descriptive names like row_start, row_offset, or left_idx.

Example refactor for line 252:

-                        l, r = (rk * chunk + ki * (k_pack * micro_size_k),
+                        row_offset, col_offset = (rk * chunk + ki * (k_pack * micro_size_k),
                                 warp_m * warp_row_tiles + i * micro_size_x)
-                        A_local_buf[i * k_pack * local_size_a + local_id] = A_shared_buf[l + row,
-                                                                                         r + col]
+                        A_local_buf[i * k_pack * local_size_a + local_id] = A_shared_buf[row_offset + row,
+                                                                                         col_offset + col]

This pattern applies to lines 260, 295, 308, 484, 493, 512, 522, 558, 567, 588, 600.


293-307: Extract repeated bit-manipulation expression to improve clarity.

The expression (tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16 appears at lines 293, 306 and similar patterns at 585, 597. This appears to perform lane ID swizzling for DCU warps. Extracting it to a named helper would clarify intent and reduce duplication.

Add a helper method:

def _swizzle_lane_id_for_b(self, tx):
    """Swizzle lane ID for B matrix access pattern on DCU."""
    return (tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16

Then replace usages:

                     for local_id in T.vectorized(k_pack * local_size_b):
                         row, col = T.meta_var(
-                            reverse_index_map((tx & 15) // 4 + (tx & 3) * 4 + (tx // 16) * 16,
+                            reverse_index_map(self._swizzle_lane_id_for_b(tx),
                                               local_id))
src/op/gemm.cc (1)

8-8: Remove unused <fstream> include.

The <fstream> header was added but doesn't appear to be used anywhere in this file.

Apply this diff:

 #include "gemm.h"
 #include "builtin.h"
-#include <fstream>
 #include <tvm/tir/builtin.h>
src/layout/gemm_layouts.cc (1)

750-752: Remove or document commented-out swizzle code.

The commented-out call to makeHalfBankSwizzleLayout creates ambiguity. Either remove it if obsolete, or add a comment explaining why makeFullBankSwizzleLayout is preferred.

If obsolete, apply this diff:

   else if (mat_continuous % (vector_size * 8) == 0)
-    // return makeHalfBankSwizzleLayout(mat_stride, mat_continuous,
-    // element_size);
     return makeFullBankSwizzleLayout(mat_stride, mat_continuous, element_size);

Or document the choice:

   else if (mat_continuous % (vector_size * 8) == 0)
+    // Note: Full bank swizzle provides better performance than half-bank for this path
-    // return makeHalfBankSwizzleLayout(mat_stride, mat_continuous,
-    // element_size);
     return makeFullBankSwizzleLayout(mat_stride, mat_continuous, element_size);
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a64085 and 8f4628e.

📒 Files selected for processing (16)
  • examples/gemm/example_gemm_intrinsics_dcu.py (1 hunks)
  • src/layout/gemm_layouts.cc (2 hunks)
  • src/layout/layout.h (1 hunks)
  • src/op/gemm.cc (2 hunks)
  • src/target/intrin_rule_hip.cc (2 hunks)
  • src/target/utils.cc (1 hunks)
  • src/tl_templates/dcu_hip/common.h (1 hunks)
  • src/tl_templates/dcu_hip/copy.h (1 hunks)
  • src/tl_templates/dcu_hip/core.hpp (1 hunks)
  • src/tl_templates/dcu_hip/debug.h (1 hunks)
  • src/tl_templates/dcu_hip/gemm.h (1 hunks)
  • src/tl_templates/dcu_hip/hip_fp8.h (1 hunks)
  • src/tl_templates/dcu_hip/reduce.h (1 hunks)
  • src/tl_templates/dcu_hip/threadblock_swizzle.h (1 hunks)
  • testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (1 hunks)
  • tilelang/intrinsics/mmac_macro_generator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py
  • src/tl_templates/dcu_hip/common.h
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • examples/gemm/example_gemm_intrinsics_dcu.py
  • src/tl_templates/dcu_hip/hip_fp8.h
  • src/tl_templates/dcu_hip/gemm.h
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • examples/gemm/example_gemm_intrinsics_dcu.py
  • src/tl_templates/dcu_hip/hip_fp8.h
  • src/layout/gemm_layouts.cc
  • src/tl_templates/dcu_hip/gemm.h
🧬 Code graph analysis (11)
src/op/gemm.cc (2)
src/target/utils.cc (2)
  • TargetIsDCU (81-89)
  • TargetIsDCU (81-81)
src/layout/gemm_layouts.cc (4)
  • makeGemmFragmentCDCU (159-174)
  • makeGemmFragmentCDCU (159-161)
  • makeGemmFragmentCCDNA (176-191)
  • makeGemmFragmentCCDNA (176-178)
src/tl_templates/dcu_hip/threadblock_swizzle.h (1)
src/tl_templates/cuda/threadblock_swizzle.h (1)
  • rasterization2DColumn (25-41)
examples/gemm/example_gemm_intrinsics_dcu.py (9)
tilelang/intrinsics/mma_layout.py (1)
  • get_swizzle_layout (166-201)
tilelang/intrinsics/mmac_macro_generator.py (7)
  • MatrixCoreIntrinEmitter (13-399)
  • ldmatrix_a (226-265)
  • ldmatrix_a (456-528)
  • ldmatrix_b (267-315)
  • ldmatrix_b (530-609)
  • mmac (317-349)
  • stmatrix (351-399)
tilelang/env.py (1)
  • disable_cache (271-272)
testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (2)
  • make_swizzle_layout (16-28)
  • transform_func (24-26)
tilelang/language/allocate.py (1)
  • alloc_shared (24-39)
tilelang/language/annotations.py (2)
  • annotate_layout (25-36)
  • use_swizzle (17-22)
tilelang/language/fill.py (1)
  • clear (24-48)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/parallel.py (1)
  • Parallel (9-29)
src/layout/layout.h (1)
src/layout/gemm_layouts.cc (2)
  • makeGemmFragmentCDCU (159-174)
  • makeGemmFragmentCDCU (159-161)
tilelang/intrinsics/mmac_macro_generator.py (6)
tilelang/intrinsics/utils.py (1)
  • mfma_store_index_map (85-86)
tilelang/tileop/gemm/gemm_base.py (2)
  • k_pack (111-112)
  • chunk (63-64)
tilelang/language/kernel.py (2)
  • threads (215-219)
  • KernelLaunchFrame (95-226)
tilelang/intrinsics/mfma_layout.py (16)
  • shared_16x4_to_local_64x1_layout_A (6-8)
  • shared_4x16_to_local_64x1_layout_B (17-19)
  • shared_16x16_to_local_64x4_layout_A (46-49)
  • shared_16x16_to_local_64x4_layout_B (58-61)
  • shared_16x32_to_local_64x8_layout_A (88-91)
  • shared_16x32_to_local_64x8_layout_B (100-103)
  • shared_16x64_to_local_64x16_layout_A (112-115)
  • shared_16x64_to_local_64x16_layout_B (124-127)
  • thread_id_shared_access_64x1_to_16x4_layout_A (11-14)
  • thread_id_shared_access_64x1_to_4x16_layout_B (22-25)
  • thread_id_shared_access_64x4_to_16x16_layout_A (40-43)
  • thread_id_shared_access_64x4_to_16x16_layout_B (52-55)
  • thread_id_shared_access_64x8_to_16x32_layout_A (82-85)
  • thread_id_shared_access_64x8_to_16x32_layout_B (94-97)
  • thread_id_shared_access_64x16_to_16x64_layout_A (106-109)
  • thread_id_shared_access_64x16_to_16x64_layout_B (118-121)
tilelang/language/ast/ir.py (2)
  • index_map (1673-1679)
  • meta_var (1731-1750)
tilelang/utils/language.py (1)
  • is_global (12-22)
src/tl_templates/dcu_hip/core.hpp (1)
src/tl_templates/dcu_hip/reduce.h (1)
  • T (95-167)
src/tl_templates/dcu_hip/debug.h (2)
src/tl_templates/cuda/gemm_sp_sm80.h (2)
  • signed (128-130)
  • unsigned (124-126)
src/tl_templates/dcu_hip/common.h (1)
  • half_t (112-112)
src/tl_templates/dcu_hip/reduce.h (2)
src/tl_templates/dcu_hip/core.hpp (20)
  • T (30-30)
  • T (32-34)
  • T (36-38)
  • T (54-54)
  • T (56-58)
  • T (60-62)
  • max (30-30)
  • max (32-32)
  • max (36-36)
  • max (40-40)
  • max (44-46)
  • max (44-44)
  • max (49-49)
  • min (54-54)
  • min (56-56)
  • min (60-60)
  • min (64-64)
  • min (68-70)
  • min (68-68)
  • min (73-73)
src/tl_templates/dcu_hip/common.h (1)
  • __half (110-110)
src/tl_templates/dcu_hip/hip_fp8.h (1)
src/tl_templates/cuda/cuda_fp8.h (3)
  • fp8_e4_2_t (9-12)
  • make_fp8_e4_4_t (88-97)
  • make_fp8_e4_8_t (100-109)
src/tl_templates/dcu_hip/gemm.h (3)
tilelang/intrinsics/mfma_layout.py (1)
  • make_mfma_swizzle_layout (130-152)
examples/gemm/example_gemm_intrinsics_dcu.py (1)
  • make_swizzle_layout (14-26)
testing/python/dcu/test_tilelang_gemm_mmac_intrinsic.py (1)
  • make_swizzle_layout (16-28)
src/tl_templates/dcu_hip/copy.h (1)
src/tl_templates/cuda/copy.h (1)
  • cp_async_wait (20-26)
🪛 Ruff (0.14.4)
tilelang/intrinsics/mmac_macro_generator.py

21-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


95-95: Avoid specifying long messages outside the exception class

(TRY003)


193-193: Avoid specifying long messages outside the exception class

(TRY003)


252-252: Ambiguous variable name: l

(E741)


260-260: Ambiguous variable name: l

(E741)


295-295: Ambiguous variable name: l

(E741)


308-308: Ambiguous variable name: l

(E741)


484-484: Ambiguous variable name: l

(E741)


493-493: Ambiguous variable name: l

(E741)


512-512: Ambiguous variable name: l

(E741)


522-522: Ambiguous variable name: l

(E741)


558-558: Ambiguous variable name: l

(E741)


567-567: Ambiguous variable name: l

(E741)


588-588: Ambiguous variable name: l

(E741)


600-600: Ambiguous variable name: l

(E741)

🔇 Additional comments (16)
src/layout/layout.h (1)

153-155: LGTM: Fragment constructor declaration follows established pattern.

The makeGemmFragmentCDCU declaration is consistent with the existing CDNA and Hopper fragment constructors, with appropriate parameters for DCU-specific layout generation.

src/op/gemm.cc (1)

831-839: DCU-specific fragment selection logic is correct once TargetIsDCU is fixed.

The conditional correctly selects makeGemmFragmentCDCU for DCU targets and falls back to makeGemmFragmentCCDNA for other CDNA targets. However, this depends on fixing the compilation error in TargetIsDCU (src/target/utils.cc line 86).

Verify that the CDCU fragment produces the expected layout after fixing the TargetIsDCU bug.

src/layout/gemm_layouts.cc (1)

159-174: LGTM: CDCU fragment layout correctly differentiates from CDNA.

The implementation follows the CDNA pattern but uses Repeat({warp_m / 16, warp_n / 16}, false, false) instead of false, true at the warp level. This intentional difference reflects DCU's distinct thread-to-fragment mapping compared to standard CDNA architectures.

src/tl_templates/dcu_hip/threadblock_swizzle.h (2)

7-24: LGTM: Row-major rasterization helper correctly implements panel-based swizzling.

The rasterization2DRow template correctly computes 2D block indices with panel-based swizzling for improved L2 cache locality. The logic matches the established CUDA pattern.


26-43: LGTM: Column-major rasterization helper is symmetric to row variant.

The rasterization2DColumn template correctly implements column-major traversal by swapping gridDim.x and gridDim.y and adjusting the index calculations accordingly.
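For readers new to the scheme, a hypothetical sketch of panel-based reordering (not the actual template; assumes gridDim.x is a multiple of panel_width):

__device__ __forceinline__ dim3 column_panel_order(int panel_width) {
  const int linear   = blockIdx.y * gridDim.x + blockIdx.x;  // launch order
  const int panel_sz = panel_width * gridDim.y;              // blocks per panel
  const int panel    = linear / panel_sz;
  const int within   = linear % panel_sz;
  const int bx = panel * panel_width + within % panel_width; // column inside the panel
  const int by = within / panel_width;                       // walk the panel row by row
  return dim3(bx, by, blockIdx.z);
}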

src/target/intrin_rule_hip.cc (1)

243-251: LGTM: HIP shuffle intrinsic registration syntax is correct.

The registration correctly defines tir.hip.__shfl with 3 inputs (var, lane, width), mapping to the __shfl device function. AMD GPUs guarantee all warp lanes are executed in lockstep, so the non-sync variant is valid for DCU. However, this intrinsic is registered but not currently invoked anywhere in the codebase—it appears to be either forward-looking or used only in generated/dynamic code paths.

src/tl_templates/dcu_hip/core.hpp (1)

1-77: LGTM - HIP/CUDA compatibility layer well structured.

The header provides a clean abstraction layer for HIP/CUDA with appropriate host/device qualifiers and platform-specific buffer resource configuration. The max/min utilities with specializations for built-in operations are well organized.

src/tl_templates/dcu_hip/reduce.h (1)

7-90: LGTM - Reduction operations well structured.

The SumOp, MaxOp, MinOp functors and the AllReduce template with half-precision support are well implemented with proper synchronization and warp-level operations.

src/tl_templates/dcu_hip/hip_fp8.h (1)

1-44: LGTM - FP8 wrapper types well structured.

The FP8 type aliases and wrapper struct fp8_e4_4_t with its union-based layout, constructors, and conversion operators provide a clean interface for HIP FP8 operations. The aligned wrapper structs for fp8_e4_8_t and fp8_e4_16_t are appropriately defined.

src/tl_templates/dcu_hip/copy.h (2)

1-62: LGTM - HIP resource utilities well implemented.

The buffer resource structure, wave buffer creation, M0 register manipulation, and memory fence operations provide a solid foundation for HIP device-side memory operations. The cp_async_wait implementation with appropriate fence operations is correct.


75-87: LGTM - Async copy implementation for different sizes.

The cp_async_gs template handles different copy sizes (16, 8, 4 bytes) appropriately, using direct loads for larger sizes and buffer operations for the 4-byte case.

src/tl_templates/dcu_hip/gemm.h (4)

67-97: LGTM - Template parameters and constants well organized.

The GemmTensorOp class template parameters and compile-time constants for tile sizes, warp configuration, and padding are clearly defined and properly structured for MFMA-based GEMM operations.


117-153: LGTM - Swizzle layout implementation.

The make_mfma_swizzle_layout function correctly implements bank conflict avoidance through XOR-based swizzling, matching the pattern from the Python reference implementation in the codebase.
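As a rough illustration of the idea only (not the exact make_mfma_swizzle_layout formula; assumes the per-row vector count is a power of two that is at least 8):

__device__ __forceinline__ int xor_swizzle_offset(int row, int col, int row_stride) {
  constexpr int kVec = 8;                    // elements per swizzle unit (assumed)
  const int vec_col  = col / kVec;
  const int swizzled = vec_col ^ (row % 8);  // XOR in the row phase so consecutive
                                             // rows land in different banks
  return row * row_stride + swizzled * kVec + col % kVec;
}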


155-233: LGTM - GEMM body implementation.

The main GEMM computation in body() properly orchestrates register tiling, shared memory loads with swizzling, and MFMA instruction dispatch through the MfmaTraits mechanism. The transpose handling and index mapping are correct.


235-296: LGTM - Register-shared GEMM variant.

The body_rs() variant for register-shared computation correctly adapts the algorithm for when the A operand is already in registers, avoiding redundant shared memory loads for A.

src/tl_templates/dcu_hip/debug.h (1)

1-191: LGTM - Debug utilities well implemented.

The device-side debug printing templates with specializations for various types (signed/unsigned char, int, float, double, bool, short, half) are consistently implemented with appropriate format specifiers and type conversions. The buffer-value variant adds useful context for debugging array/buffer contents.

}

template <> CK_TILE_DEVICE float max(float x, float y) {
return __builtin_fmaxf(x, y); // can resultin v_max3_f32
Contributor

⚠️ Potential issue | 🟡 Minor

Fix typo in comment.

The comment contains a typo: "resultin" should be "result in".

Apply this diff:

-  return __builtin_fmaxf(x, y); // can resultin v_max3_f32
+  return __builtin_fmaxf(x, y); // can result in v_max3_f32
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return __builtin_fmaxf(x, y); // can resultin v_max3_f32
return __builtin_fmaxf(x, y); // can result in v_max3_f32
🤖 Prompt for AI Agents
In src/tl_templates/dcu_hip/core.hpp around line 41, the inline comment has a
typo: "resultin" should be "result in"; update the comment text to read "can
result in v_max3_f32" so the spacing is correct.

Comment on lines +44 to +46
template <> CK_TILE_DEVICE double max(double x, double y) {
return __builtin_fmax(x, y); // maybe still v_max3_f32
}
Contributor

⚠️ Potential issue | 🟡 Minor

Fix misleading comment.

The comment mentions "v_max3_f32" for a max function on doubles, which seems incorrect or misleading since the instruction name suggests float32 operations.

Apply this diff to clarify:

-template <> CK_TILE_DEVICE double max(double x, double y) {
-  return __builtin_fmax(x, y); // maybe still v_max3_f32
-}
+template <> CK_TILE_DEVICE double max(double x, double y) {
+  return __builtin_fmax(x, y); // v_max_f64 or equivalent
+}
🤖 Prompt for AI Agents
In src/tl_templates/dcu_hip/core.hpp around lines 44 to 46, the inline comment
"maybe still v_max3_f32" is misleading for a double-precision max
implementation; replace or remove it and document the correct intrinsic (e.g.,
use __builtin_fmax for double or reference the appropriate v_max_f64 name if
applicable) so the comment matches the double type and implementation.
