[SWP] attempt to remove a workaround for a triton llvm codegen bug (#4774)

Triton LLVM codegen has a bug where local_loads from #shared to #mma
layout can lead to invalid code if the loaded shape is smaller than the
mma tile. Remove the workaround.

See triton-lang/triton#3561.

Verified with the test case https://pastebin.com/xxP3cFmy (test.mlir) that running

triton-opt test.mlir -tritongpu-pipeline=num-stages=3 --convert-scf-to-cf --allocate-shared-memory --convert-triton-gpu-to-llvm

completes without issue.

The unit test case added in triton-lang/triton#4798 also shows no issue.
manman-ren authored Sep 27, 2024
1 parent 2ef33c6 commit e7ec3fe
Showing 2 changed files with 2 additions and 32 deletions.
lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp (0 additions, 31 deletions)

@@ -441,37 +441,6 @@ assignMemoryLayouts(llvm::SmallVector<std::tuple<Operation *, int, Operation *>>
     } else if (auto dot = dyn_cast<tt::DotOp>(use)) {
       loadInfo.sharedEncoding =
           getSharedEncIfAllUsersAreDotEnc(op->getResult(0)).value_or(nullptr);
-
-      // HACK: Triton LLVM codegen has a bug where local_loads from #shared to
-      // #mma layout can lead to invalid code if the loaded shape is smaller
-      // than the mma tile (e.g. loading a 128x1 tensor for an MMAv2 dot with
-      // tile {16,8} is bad because 1 < 8). To work around this, don't
-      // pipeline such loads.
-      //
-      // The codegen bug is caught by an assertion, so if you think you've
-      // fixed it, feel free to delete this code and see if the assert still
-      // fails. :)
-      if (!loadInfo.sharedEncoding) {
-        if (auto dotEnc = dyn_cast<ttg::NvidiaMmaEncodingAttr>(
-                dot.getResult().getType().getEncoding())) {
-          auto loadTy = cast<RankedTensorType>(op->getResultTypes()[0]);
-          auto mmaInstrShape = dotEnc.getInstrShape();
-          if (loadTy.getRank() < mmaInstrShape.size())
-            continue;
-          bool ok = true;
-          for (int i = 0; i < mmaInstrShape.size(); i++) {
-            if (loadTy.getShape()[loadTy.getRank() - mmaInstrShape.size() +
-                                  i] < mmaInstrShape[i]) {
-              ok = false;
-              break;
-            }
-          }
-          // If this load might trigger the bug, don't do the fallback logic
-          // below, which might allow the load to be pipelined.
-          if (!ok)
-            continue;
-        }
-      }
     }
   } else if (auto loadOp = dyn_cast<tt::LoadOp>(use)) {
     // The use of this loadOp is another loadOp. If the use is not in the
test/TritonGPU/loop-pipeline.mlir (2 additions, 1 deletion)

@@ -1460,7 +1460,8 @@ module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 2 :
 // -----

 // COMMON-LABEL: @dont_pipeline_128x1
-// COMMON-NOT: local_load{{.*}}128x1
+// AMD-NOT: local_load{{.*}}128x1
+// CHECK: local_load{{.*}}128x1
 #blocked = #triton_gpu.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 1], warpsPerCTA = [4, 1], order = [0, 1]}>
 #mma = #triton_gpu.nvidia_mma<{versionMajor = 2, versionMinor = 0, warpsPerCTA = [4, 1], instrShape = [16, 8]}>
 module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 4 : i32} {
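The directive change above splits one COMMON check into per-backend expectations: the AMD path must still not emit a 128x1 local_load, while the default (NVIDIA) path is now expected to. A simplified Python sketch of those two FileCheck directive semantics (real FileCheck is stream-based and order-sensitive; this only illustrates match-vs-absence):

```python
import re

def not_directive(output: str, pattern: str) -> bool:
    # AMD-NOT: passes only if no line of the output matches the pattern.
    return not any(re.search(pattern, line) for line in output.splitlines())

def check_directive(output: str, pattern: str) -> bool:
    # CHECK: passes if some line of the output matches the pattern.
    return any(re.search(pattern, line) for line in output.splitlines())

# Hypothetical compiler output lines, for illustration only.
nvidia_out = "%1 = triton_gpu.local_load %0 : tensor<128x1xf32>"
amd_out = "%1 = tt.load %ptr : tensor<128x1xf32>"

assert check_directive(nvidia_out, r"local_load.*128x1")
assert not_directive(amd_out, r"local_load.*128x1")
```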
