
[quidditch_snitch] Reintroduce tensor.microkernel op #106

Merged

zero9178 merged 5 commits from tensor-microkernel-op into main on Aug 15, 2024

Conversation

zero9178
Member

1b20d75 previously removed the `tensor.microkernel` operation because, at the time, it did not seem worth the extra code.

Since then, we have noted that microkernels execute in a more asynchronous manner due to Snitch's asynchronous FPU, requiring the use of an explicit `microkernel_fence` operation. Optimizing the placement of these fences is easier in tensor land, making the operation more worthwhile. Additionally, more experience with bufferization led to a simpler implementation by restricting `microkernel_yield` to tensor operands only.

A tensor counterpart of `microkernel_fence` called `sync_tensor` has also been added, which makes a result tensor of a `tensor.microkernel` operation available. It bufferizes to `microkernel_fence`, and its placement could be further optimized in the future. The conservative placement of `microkernel_fence` operations was also removed from `specialize-dma-code`, leading to fewer barriers and `microkernel_fence`s.
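To illustrate, a hypothetical use of the reinstated ops might look like the following sketch. The assembly syntax, dialect prefix, and the `linalg` body are assumptions for illustration, not taken verbatim from the dialect definition:

```mlir
// A microkernel region whose body runs on Snitch's asynchronous FPU.
// Per this PR, microkernel_yield is restricted to tensor values only.
%result = quidditch_snitch.tensor.microkernel -> tensor<32xf64> {
  %computed = linalg.map { math.absf }
      ins(%input : tensor<32xf64>)
      outs(%init : tensor<32xf64>)
  quidditch_snitch.microkernel_yield %computed : tensor<32xf64>
}

// sync_tensor is the tensor-level counterpart of microkernel_fence:
// it makes the asynchronously computed result available to later ops
// and bufferizes to a microkernel_fence.
%ready = quidditch_snitch.sync_tensor %result : tensor<32xf64>
```

Because `sync_tensor` is an ordinary tensor op, fence placement can be optimized with tensor-level analyses before bufferization, which is the motivation given above.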

The `compute_core_index` op, as currently used and specified, returns the index of a compute core (0 to 8 exclusive in our current simulations) and has unspecified behavior on the DMA core. This will be used in a later PR to properly lower it when doing DMA code specialization.
Barriers may be inserted in a loop resulting from the lowering of an `scf.forall`. The lowering of `compute_core_index` unfortunately returns `num_compute_cores + 1` on the DMA core, which is outside the specified range of the operation and causes such loops, and the barriers within them, to be skipped by the DMA core.
As the pass already assumes non-divergent control flow, we can fix this by specializing `compute_core_index` to any integer in the defined range of the op.
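The fix described above can be sketched as follows. The op names come from this PR's context; the exact IR produced by the specialization pass is an assumption for illustration:

```mlir
// Before specialization: on the DMA core this lowers to
// num_compute_cores + 1, which falls outside the specified range
// [0, num_compute_cores). An scf.forall lowered to a loop bounded by
// this index is then skipped entirely, skipping its barriers too.
%idx = quidditch_snitch.compute_core_index : index

// After specialization on the DMA core: since the pass assumes
// non-divergent control flow, any constant in the defined range is
// valid, e.g. 0, so loops containing barriers are still executed.
%idx = arith.constant 0 : index
```

Choosing a constant in the defined range keeps the DMA core's control flow identical to the compute cores', so barrier participation stays balanced.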
@zero9178 zero9178 force-pushed the tensor-microkernel-op branch from 0b4285b to 50406dc Compare August 15, 2024 10:39
@zero9178 zero9178 changed the base branch from main to dma-compute-core-index-lowering August 15, 2024 10:39
Base automatically changed from dma-compute-core-index-lowering to main August 15, 2024 10:51
# Conflicts:
#	codegen/tests/Dialect/Snitch/Transforms/specialize-dma-code.mlir
@zero9178 zero9178 merged commit 5803d44 into main Aug 15, 2024
1 check passed
@zero9178 zero9178 deleted the tensor-microkernel-op branch August 15, 2024 12:45