[TileAndFuse] Add thread groups for convolution ops #695

Merged: 4 commits merged into nod-ai:main from thread_groups_for_conv on Sep 4, 2024

Conversation

@newling (Contributor) commented Aug 22, 2024

This works for now because, with our tiling strategy, the sizes of dimensions DimZ and LinearDim0 are 1, so DimY and DimX map to the rows and columns of the AIE array. Follow-up work: a pass to collapse scf.forall ops with more than 2 induction variables down to just 2. So instead of, for example,

(i, j, k) in (2, 3, 5)

we could have

(i, l) in (2, 15) and then j = l / 5, k = l % 5.
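
For illustration, a minimal MLIR sketch of what such a collapsed scf.forall could look like. This is an assumption about the eventual pass output, not code from this PR; an actual implementation might use affine.delinearize_index instead of the two affine.apply ops:

scf.forall (%i, %l) in (2, 15) {
  // Recover the original induction variables from the collapsed one.
  %j = affine.apply affine_map<(d0) -> (d0 floordiv 5)>(%l)
  %k = affine.apply affine_map<(d0) -> (d0 mod 5)>(%l)
  // ... original loop body, using %j and %k ...
}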

@newling (Contributor, Author) commented Aug 22, 2024

IR for AIR lowering is below. The compilation fails in air-dependency-canonicalize with

error: 'air.launch' op found non-herd op with core id

@erwei-xilinx is this something that can be fixed (assuming the IR below looks valid)? Let me know if you'd like the full IR trace.

FWIW, this allocation of blocks and threads to convolution works OK with the objectFifo pipeline (which is currently running out of BDs in a late mlir-aie pass).

// -----// IR Dump Before AMDAIEBridgeToAIR (iree-amdaie-bridge-to-air) //----- //
module {
  func.func @conv_2d_nhwc_hwcf_dispatch_0_conv_2d_nhwc_hwcf_2x12x12x64x3x3x32_i32() attributes {translation_info = #iree_codegen.translation_info<Custom>} {
    %c8 = arith.constant 8 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %c3 = arith.constant 3 : index
    %c0_i32 = arith.constant 0 : i32
    %c0 = arith.constant 0 : index
    %0 = hal.interface.binding.subspan layout(<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>], flags = Indirect>]>) set(0) binding(0) alignment(64) offset(%c0) flags(ReadOnly) : memref<2x14x14x32xi32>
    memref.assume_alignment %0, 64 : memref<2x14x14x32xi32>
    %1 = hal.interface.binding.subspan layout(<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>], flags = Indirect>]>) set(0) binding(1) alignment(64) offset(%c0) flags(ReadOnly) : memref<3x3x32x64xi32>
    memref.assume_alignment %1, 64 : memref<3x3x32x64xi32>
    %2 = hal.interface.binding.subspan layout(<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>], flags = Indirect>]>) set(0) binding(2) alignment(64) offset(%c0) : memref<2x12x12x64xi32>
    memref.assume_alignment %2, 64 : memref<2x12x12x64xi32>
    scf.forall (%arg0, %arg1, %arg2) = (0, 0, 0) to (12, 12, 64) step (4, 4, 4) {
      %subview = memref.subview %0[0, %arg0, %arg1, 0] [2, 6, 6, 32] [1, 1, 1, 1] : memref<2x14x14x32xi32> to memref<2x6x6x32xi32, strided<[6272, 448, 32, 1], offset: ?>>
      %subview_0 = memref.subview %1[0, 0, 0, %arg2] [3, 3, 32, 4] [1, 1, 1, 1] : memref<3x3x32x64xi32> to memref<3x3x32x4xi32, strided<[6144, 2048, 64, 1], offset: ?>>
      %subview_1 = memref.subview %2[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : memref<2x12x12x64xi32> to memref<2x4x4x4xi32, strided<[9216, 768, 64, 1], offset: ?>>
      %alloc = memref.alloc() : memref<2x6x6x32xi32, 1 : i32>
      linalg.copy ins(%subview : memref<2x6x6x32xi32, strided<[6272, 448, 32, 1], offset: ?>>) outs(%alloc : memref<2x6x6x32xi32, 1 : i32>)
      %alloc_2 = memref.alloc() : memref<3x3x32x4xi32, 1 : i32>
      linalg.copy ins(%subview_0 : memref<3x3x32x4xi32, strided<[6144, 2048, 64, 1], offset: ?>>) outs(%alloc_2 : memref<3x3x32x4xi32, 1 : i32>)
      %alloc_3 = memref.alloc() : memref<2x4x4x4xi32, 1 : i32>
      scf.forall (%arg3, %arg4, %arg5, %arg6) = (0, 0, 0, 0) to (2, 4, 4, 4) step (1, 1, 4, 4) {
        %subview_4 = memref.subview %alloc_3[%arg3, %arg4, %arg5, %arg6] [1, 1, 4, 4] [1, 1, 1, 1] : memref<2x4x4x4xi32, 1 : i32> to memref<1x1x4x4xi32, strided<[64, 16, 4, 1], offset: ?>, 1 : i32>
        %alloc_5 = memref.alloc() : memref<1x1x4x4xi32, 2 : i32>
        linalg.fill ins(%c0_i32 : i32) outs(%alloc_5 : memref<1x1x4x4xi32, 2 : i32>)
        scf.for %arg7 = %c0 to %c3 step %c1 {
          scf.for %arg8 = %c0 to %c3 step %c1 {
            scf.for %arg9 = %c0 to %c32 step %c8 {
              %3 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%arg4, %arg7]
              %4 = affine.apply affine_map<()[s0, s1] -> (s0 + s1)>()[%arg5, %arg8]
              %subview_6 = memref.subview %alloc[%arg3, %3, %4, %arg9] [1, 1, 4, 8] [1, 1, 1, 1] : memref<2x6x6x32xi32, 1 : i32> to memref<1x1x4x8xi32, strided<[1152, 192, 32, 1], offset: ?>, 1 : i32>
              %subview_7 = memref.subview %alloc_2[%arg7, %arg8, %arg9, %arg6] [1, 1, 8, 4] [1, 1, 1, 1] : memref<3x3x32x4xi32, 1 : i32> to memref<1x1x8x4xi32, strided<[384, 128, 4, 1], offset: ?>, 1 : i32>
              %alloc_8 = memref.alloc() : memref<1x1x4x8xi32, 2 : i32>
              linalg.copy ins(%subview_6 : memref<1x1x4x8xi32, strided<[1152, 192, 32, 1], offset: ?>, 1 : i32>) outs(%alloc_8 : memref<1x1x4x8xi32, 2 : i32>)
              %alloc_9 = memref.alloc() : memref<1x1x8x4xi32, 2 : i32>
              linalg.copy ins(%subview_7 : memref<1x1x8x4xi32, strided<[384, 128, 4, 1], offset: ?>, 1 : i32>) outs(%alloc_9 : memref<1x1x8x4xi32, 2 : i32>)
              %subview_10 = memref.subview %alloc_8[0, 0, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : memref<1x1x4x8xi32, 2 : i32> to memref<1x4x8xi32, strided<[32, 8, 1]>, 2 : i32>
              %subview_11 = memref.subview %alloc_9[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : memref<1x1x8x4xi32, 2 : i32> to memref<1x8x4xi32, strided<[32, 4, 1]>, 2 : i32>
              %subview_12 = memref.subview %alloc_5[0, 0, 0, 0] [1, 1, 4, 4] [1, 1, 1, 1] : memref<1x1x4x4xi32, 2 : i32> to memref<1x4x4xi32, strided<[16, 4, 1]>, 2 : i32>
              linalg.conv_1d_nwc_wcf {dilations = dense<1> : vector<1xi64>, strides = dense<1> : vector<1xi64>} ins(%subview_10, %subview_11 : memref<1x4x8xi32, strided<[32, 8, 1]>, 2 : i32>, memref<1x8x4xi32, strided<[32, 4, 1]>, 2 : i32>) outs(%subview_12 : memref<1x4x4xi32, strided<[16, 4, 1]>, 2 : i32>)
              memref.dealloc %alloc_8 : memref<1x1x4x8xi32, 2 : i32>
              memref.dealloc %alloc_9 : memref<1x1x8x4xi32, 2 : i32>
            }
          }
        }
        linalg.copy ins(%alloc_5 : memref<1x1x4x4xi32, 2 : i32>) outs(%subview_4 : memref<1x1x4x4xi32, strided<[64, 16, 4, 1], offset: ?>, 1 : i32>)
        memref.dealloc %alloc_5 : memref<1x1x4x4xi32, 2 : i32>
      } {mapping = [#gpu.thread<y>, #gpu.thread<x>, #gpu.thread<z>, #gpu.thread<linear_dim_0>]}
      linalg.copy ins(%alloc_3 : memref<2x4x4x4xi32, 1 : i32>) outs(%subview_1 : memref<2x4x4x4xi32, strided<[9216, 768, 64, 1], offset: ?>>)
      memref.dealloc %alloc : memref<2x6x6x32xi32, 1 : i32>
      memref.dealloc %alloc_2 : memref<3x3x32x4xi32, 1 : i32>
      memref.dealloc %alloc_3 : memref<2x4x4x4xi32, 1 : i32>
    } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
    return
  }
}

@newling (Contributor, Author) commented Aug 22, 2024
The suggestion from @erwei-xilinx to run canonicalization to reduce the 4-d scf.parallel to a 2-d scf.parallel works perfectly. We wonder why canonicalization doesn't do the same for scf.forall.

@newling force-pushed the thread_groups_for_conv branch from 9ba7444 to 46bb421 (August 26, 2024)
@yzhang93 (Contributor) left a comment

This is a good starting point. I suggest putting this in a separate utility function.

uint32_t nbIndVars = std::count_if(tileSizesVal.begin(), tileSizesVal.end(),
                                   [](int64_t t) { return t != 0; });

// See mlir::gpu::MappingId enum: there are currently 13 values.
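
A minimal sketch of such a utility; the function name and placement are hypothetical, not from this PR:

#include <algorithm>
#include <cstdint>
#include "llvm/ADT/ArrayRef.h"

// A tile size of 0 means "do not tile this dimension", so the number of
// induction variables of the generated scf.forall is the number of
// non-zero entries.
static uint32_t getNumberOfInductionVariables(llvm::ArrayRef<int64_t> tileSizes) {
  return std::count_if(tileSizes.begin(), tileSizes.end(),
                       [](int64_t t) { return t != 0; });
}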

Comment on lines 189 to 194
if (i == 0)
mapping.push_back(getMappingAttributeForDimension(1));
else if (i == 1)
mapping.push_back(getMappingAttributeForDimension(0));
else
mapping.push_back(getMappingAttributeForDimension(i));

This might be problematic if we have a batch dimension. Typically, we should assign DimZ to the batch dimension.
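
(For example, if the batch is dimension 0, the swap above gives the batch dimension the attribute for index 1, presumably DimY, rather than DimZ.)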

@newling force-pushed the thread_groups_for_conv branch 3 times, most recently from 57ae4ac to 38c7885 (August 29, 2024)
@newling requested a review from yzhang93 (August 29, 2024)
@@ -32,7 +32,25 @@ func.func @conv_2d_nhwc_hwcf(%arg0: tensor<2x14x14x32xbf16>, %arg1: tensor<3x3x3
// TILE-LEVEL-0-SAME: {
// TILE-LEVEL-0: linalg.fill
// TILE-LEVEL-0: linalg.conv_2d_nhwc_hwcf
// TILE-LEVEL-0: }
// TILE-LEVEL-0: } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
yzhang93 (Contributor):
In my opinion, the orders of the mapping attributes for blocks and threads should correspond to each other. Although we are not using the block attributes at the moment, it's good to keep the attributes in the same order.

newling (Contributor, Author):

Yeah, so the logic is actually identical for threads and blocks. What's happening in this example is that the tiling sizes at thread and block levels are different:

For blocks: [0, 4, 4, 4, 0, 0, 0]
For threads: [1, 1, 4, 4, 0, 0, 0]

The logic implemented in this PR works as follows:

First, assign attributes to dimensions with tile size greater than 1. For threads, that is dimensions 2 and 3.
Second, assign attributes to dimensions with tile size equal to 1. For threads, that is dimensions 0 and 1.

The attributes are assigned in the order y, then x, then z, then linear_dim_0, linear_dim_1, etc.

For [1, 1, 4, 4, 0, 0, 0], after step 1 the assigned dimensions are
[none, none, y, x, none, none, none]
and then after step 2 the assigned dimensions are
[z, linear_dim_0, y, x, none, none, none].

And that is why at the thread level we end up with [z, linear_dim_0, y, x].
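
To make the two steps concrete, here is a minimal self-contained sketch; std::string stands in for the real GPU mapping attributes, and all names are hypothetical:

#include <cstdint>
#include <string>
#include <vector>

// Assign mapping attributes in two passes over the tile sizes, handing
// them out in the order y, x, z, linear_dim_0, linear_dim_1, ...
std::vector<std::string> assignThreadMapping(const std::vector<int64_t> &tileSizes) {
  const std::vector<std::string> order = {"y", "x", "z", "linear_dim_0",
                                          "linear_dim_1"};
  std::vector<std::string> mapping(tileSizes.size(), "none");
  size_t next = 0;
  // Step 1: dimensions with tile size > 1 get attributes first.
  for (size_t d = 0; d < tileSizes.size(); ++d)
    if (tileSizes[d] > 1) mapping[d] = order[next++];
  // Step 2: then dimensions with tile size == 1.
  for (size_t d = 0; d < tileSizes.size(); ++d)
    if (tileSizes[d] == 1) mapping[d] = order[next++];
  return mapping;
}

// assignThreadMapping({1, 1, 4, 4, 0, 0, 0}) returns
// {"z", "linear_dim_0", "y", "x", "none", "none", "none"}.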

yzhang93 (Contributor) commented Sep 3, 2024:

Thanks for the explanation. Yeah, I can follow the steps used to set up the mapping attributes. It's just not typical for the same dim to get different attributes in the block and thread mappings. I don't have a better solution than hardcoding the dimensions. On the other hand, since the block attributes are not used anyway, maybe we could remove them?

newling (Contributor, Author):

I just tried removing block dimension tiling, but for small matmul examples where the block tile sizes are all 1, there's a problem: an scf.forall without mapping attributes gets canonicalized away when its iteration space has size 1, i.e. the block-level scf.forall gets removed completely (see the sketch below). Ideally we'd be able to work without this outer scf.forall, but the passes currently aren't set up to handle that. So removing block dimensions for matmuls isn't something we can do immediately.

I could remove them for convolution, but we might hit the same issue with small convolutions.

So yeah, I'm not sure. Maybe, for now, it's OK to keep the block dimensions as they are for convolution?
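
For illustration, a sketch of the canonicalization behavior in question; the op shapes here are hypothetical:

// With all trip counts equal to 1 and no mapping attribute, canonicalization
// inlines the body and deletes this loop entirely:
scf.forall (%i, %j) in (1, 1) {
  // ... body ...
}
// The same op with {mapping = [#gpu.block<y>, #gpu.block<x>]} is left in
// place, which is why the block mapping attributes are kept for now.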

yzhang93 (Contributor):

I see... We should find a way to solve the scf.forall canonicalization problem. Could you add a TODO comment for the block mapping attributes?

newling (Contributor, Author):

Ok, I'll add a comment. Thanks for accepting :)

// TILE-LEVEL-1-SAME: {
// TILE-LEVEL-1: linalg.fill
// TILE-LEVEL-1: linalg.conv_2d_nhwc_hwcf
// TILE-LEVEL-1: } {mapping = [#gpu.thread<z>, #gpu.thread<linear_dim_0>, #gpu.thread<y>, #gpu.thread<x>]}
yzhang93 (Contributor):

This looks good to me.

@newling force-pushed the thread_groups_for_conv branch from 38c7885 to 282b2ff (September 4, 2024)
@newling enabled auto-merge (squash) (September 4, 2024)
@newling merged commit 994f6e3 into nod-ai:main on Sep 4, 2024; 6 checks passed
@newling deleted the thread_groups_for_conv branch (December 12, 2024)