
[quidditch_snitch] Reintroduce tensor.microkernel op #106

Merged

zero9178 merged 5 commits from tensor-microkernel-op into main on Aug 15, 2024

Conversation

zero9178
Member

1b20d75 previously removed the `tensor.microkernel` operation because, at the time, it did not seem worth the extra code.

Since then, we have noted that microkernels execute in a more asynchronous manner due to Snitch's asynchronous FPU, requiring the use of an explicit `microkernel_fence` operation. Optimizing the placement of these fences is easier in tensor land, making the operation more worthwhile. Additionally, more experience with bufferization led to a simpler implementation by restricting `microkernel_yield` to tensor operands only.

A tensor counterpart of `microkernel_fence` called `sync_tensor` has also been added, which makes a result tensor of a `tensor.microkernel` operation available. It bufferizes to `microkernel_fence`, and its placement could be further optimized in the future. The conservative placement of `microkernel_fence` operations was also removed from `specialize-dma-code`, leading to fewer barriers and `microkernel_fence`s.
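To illustrate, a hypothetical use of the reinstated ops might look like the following sketch. The assembly syntax, dialect prefix, and the `linalg` body are assumptions for illustration, not taken verbatim from the dialect definition:

```mlir
// A microkernel region whose body runs on Snitch's asynchronous FPU.
// Per this PR, microkernel_yield is restricted to tensor values only.
%result = quidditch_snitch.tensor.microkernel -> tensor<32xf64> {
  %computed = linalg.map { math.absf }
      ins(%input : tensor<32xf64>)
      outs(%init : tensor<32xf64>)
  quidditch_snitch.microkernel_yield %computed : tensor<32xf64>
}

// sync_tensor is the tensor-level counterpart of microkernel_fence:
// it makes the asynchronously computed result available to later ops
// and bufferizes to a microkernel_fence.
%ready = quidditch_snitch.sync_tensor %result : tensor<32xf64>
```

Because `sync_tensor` is an ordinary tensor op, fence placement can be optimized with tensor-level analyses before bufferization, which is the motivation given above.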

The `compute_core_index` op, as currently used and specified, returns the index of a compute core (0 to 8 exclusive in our current simulations) and has unspecified behavior on the DMA core. This will be used in a later PR to properly lower it when doing DMA code specialization.
Barriers may be inserted in a loop resulting from the lowering of an `scf.forall`. The lowering of `compute_core_index` unfortunately returns `num_compute_cores + 1` on the DMA core, which is outside the specified range of the operation and causes such loops, and the barriers within them, to be skipped by the DMA core.
As the pass already assumes non-divergent control flow, we can fix this by specializing `compute_core_index` to any integer in the defined range of the op.
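The fix described above can be sketched as follows. The op names come from this PR's context; the exact IR produced by the specialization pass is an assumption for illustration:

```mlir
// Before specialization: on the DMA core this lowers to
// num_compute_cores + 1, which falls outside the specified range
// [0, num_compute_cores). An scf.forall lowered to a loop bounded by
// this index is then skipped entirely, skipping its barriers too.
%idx = quidditch_snitch.compute_core_index : index

// After specialization on the DMA core: since the pass assumes
// non-divergent control flow, any constant in the defined range is
// valid, e.g. 0, so loops containing barriers are still executed.
%idx = arith.constant 0 : index
```

Choosing a constant in the defined range keeps the DMA core's control flow identical to the compute cores', so barrier participation stays balanced.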
@zero9178 zero9178 force-pushed the tensor-microkernel-op branch from 0b4285b to 50406dc Compare August 15, 2024 10:39
@zero9178 zero9178 changed the base branch from main to dma-compute-core-index-lowering August 15, 2024 10:39
Base automatically changed from dma-compute-core-index-lowering to main August 15, 2024 10:51
# Conflicts:
#	codegen/tests/Dialect/Snitch/Transforms/specialize-dma-code.mlir
@zero9178 zero9178 merged commit 5803d44 into main Aug 15, 2024
1 check passed
@zero9178 zero9178 deleted the tensor-microkernel-op branch August 15, 2024 12:45