Add a pass to fold DMA waits #962
Conversation
}
}

// -----
Please add comments about what is being tested in each case. I can see there are fewer dma_wait operations after the pass, but it's not clear to me which ones are being removed. Also, I'm a bit surprised that there are no CHECK-NOT or CHECK-NEXT statements.
@@ -378,6 +378,8 @@ struct AMDAIEDeviceModel {
  DenseMap<uint32_t, SmallVector<uint32_t>> getChannelToValidBdIds(
      AMDAIETileType tileType) const;

  uint8_t getDmaMaxQueueSize(uint8_t col, uint8_t row);
Really unfortunate that this can't be const. I see that getTileType uses a const_cast to work around the lack of const-correctness in aie_rt; it seems this was first done here:
6c4f905#diff-17008229092a63d5df9831105108aa99c799a226441b0dd9d8327708e57fc2aeR218
I tracked this down because I saw you passing AMDAIE::AMDAIEDeviceModel deviceModel by value, which isn't great: this isn't a pointer-like type, so there is copy overhead. But if getDmaMaxQueueSize isn't const, deviceModel can't be a const ref where you use it...
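For illustration only, a minimal sketch of the pattern under discussion, with hypothetical stand-in types (DevInst, legacyQueryQueueSize, and this DeviceModel are invented for the sketch, not the actual aie_rt or device-model API): a logically-const query method const_casts around a non-const-correct C signature, which in turn lets callers take the model by const reference instead of by value.

```cpp
#include <cstdint>

// Stand-in for a C API (like aie_rt) that takes a non-const pointer even
// though it does not modify the device: the missing const-correctness.
struct DevInst {};
uint8_t legacyQueryQueueSize(DevInst *dev, uint8_t col, uint8_t row) {
  (void)dev; (void)col; (void)row;
  return 4;  // hypothetical: every DMA channel queue has depth 4
}

struct DeviceModel {
  DevInst dev;
  // Logically const: the query does not mutate the model, so we const_cast
  // away the constness imposed by the legacy C signature.
  uint8_t getDmaMaxQueueSize(uint8_t col, uint8_t row) const {
    return legacyQueryQueueSize(const_cast<DevInst *>(&dev), col, row);
  }
};

// With the method const, callers can take the model by const reference,
// avoiding the copy that pass-by-value would incur.
uint8_t query(const DeviceModel &deviceModel) {
  return deviceModel.getDmaMaxQueueSize(/*col=*/0, /*row=*/0);
}
```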
<< "expected to operate on an `amdaie.flow`"; | ||
return WalkResult::interrupt(); | ||
} | ||
if (maybeFlowOp->getIsPacketFlow()) return WalkResult::advance(); |
This advance -- so you skip to the next waitOp, effectively making toErase = false for this waitOp? I think the logic would be easier to follow if you made standalone functions. Maybe a function like
LogicalResult canFoldWaitOp(WaitOp waitOp, AMDAIE::AMDAIEDeviceModel deviceModel, ...) {
}
// CHECK: %[[TOKEN_0:.+]] = amdaie.npu.half_dma_cpy_nd async %[[CONNECTION]](%[[OBJECT_FIFO_1]] [] [] [] bd_id = %[[BD_ID_0]] channel = %[[CHANNEL_0]]) : !amdaie.logicalobjectfifo<memref<2048xi32>>
// CHECK: amdaie.npu.dma_wait(%[[TOKEN_0]] : !amdaie.async_token)
Why do we have a wait on the first dma_cpy_nd?
I think we should expect:
dma_cpy_nd
dma_cpy_nd
dma_cpy_nd
%0 = dma_cpy_nd
dma_wait(%0)
Instead of:
%0 = dma_cpy_nd
dma_wait(%0)
dma_cpy_nd
dma_cpy_nd
%1 = dma_cpy_nd
dma_wait(%1)
I am traversing the controlcode in reverse order, and the example actually is:
%0 = dma_cpy_nd
dma_wait(%0)
dma_cpy_nd
dma_cpy_nd
dma_cpy_nd
%1 = dma_cpy_nd
dma_wait(%1)
There are four dma_cpy_nd ops in between.
Right, I missed the fourth op between the waits, but we still shouldn't have a wait on the first op. This does matter if the number of dma_cpy_nd ops is smaller than or equal to 4.
Sorry, not only for nb_dma_cpy_nd_ops <= 4; also in the example above, you're using two waits for what could be implemented with one.
But if we don't have a wait on the first op, then we will have to do this:
dma_cpy_nd
dma_cpy_nd
dma_cpy_nd
%0 = dma_cpy_nd
dma_wait(%0)
%1 = dma_cpy_nd
dma_wait(%1)
We still have two waits, since we need one wait at the end of the controlcode anyway, right? That's why I chose to traverse in reverse order.
> Otherwise, I need to check if the current wait is the last one for each connection

You can traverse in forward order and keep track of every 4th and the last DMA on a connection; then, in a second pass, keep only those waits. I don't see how that's more complex?
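A hedged sketch of this forward, two-pass idea, with purely illustrative types and names (Dma and waitsToKeep are invented for the sketch, not the actual pass API): the first pass records, per connection, every 4th DMA plus the last one; the second pass would then keep only the waits attached to those DMAs.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Illustrative stand-in for a DMA op: its connection and its position in the
// controlcode.
struct Dma {
  int connection;
  std::size_t index;
};

// Pass 1: count DMAs per connection, marking every `queueDepth`-th one and
// remembering the last one seen. Pass 2: the last DMA on each connection
// always keeps its wait. The returned indices are the waits to keep.
std::set<std::size_t> waitsToKeep(const std::vector<Dma> &dmas,
                                  std::size_t queueDepth = 4) {
  std::map<int, std::size_t> countPerConnection;
  std::map<int, std::size_t> lastPerConnection;
  std::set<std::size_t> keep;
  for (const Dma &d : dmas) {
    std::size_t n = ++countPerConnection[d.connection];
    if (n % queueDepth == 0) keep.insert(d.index);
    lastPerConnection[d.connection] = d.index;
  }
  for (const auto &entry : lastPerConnection) keep.insert(entry.second);
  return keep;
}
```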
The reason I think this is important is that the output IR should be what one would intuitively expect. A lot of the time I have to debug issues just by reading and understanding the IR, and non-intuitive output is no fun when you're doing that.
I tried something, but then I realized that traversing in forward order complicates the management of BD IDs. When encountering a duplicate BD ID, we need to keep the wait for the last DMA which used that BD ID. Also, all the erasure decisions between those two DMAs need to be updated accordingly.
Ok, could you add documentation to the function on why you iterate in reverse order and an example to show expected output for one of these more quirky cases above?
Thanks, added now
Thanks for the comments in the lit tests, they helped me
    AMDAIE::ConnectionOp connectionOp = maybeConnectionOp.value();
    // Retrieve the flow op.
    std::optional<AMDAIE::FlowOp> maybeFlowOp =
        maybeConnectionOp->getFlowOp();
Suggested change:
-        maybeConnectionOp->getFlowOp();
+        connectionOp.getFlowOp();
      [&](AMDAIE::NpuDmaWaitOp waitOp) {
        bool toErase = true;
        for (Value token : waitOp.getAsyncTokens()) {
          if (auto npuHalfDmaCpyNdOp =
I still kinda think a function at the level of
FailureOr canFoldBasedOnNpuHalfDmaCpyNdOp(...)
would make for slightly easier-to-read (less indented) code.
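For what it's worth, a rough sketch of the shape of such an extraction, using simplified stand-in types rather than the real MLIR/AMDAIE ones (HalfDmaCpyNdOp and WaitOp here are invented; std::optional stands in for FailureOr):

```cpp
#include <optional>

// Simplified stand-ins for the real ops; illustrative only.
struct HalfDmaCpyNdOp {
  bool isPacketFlow;
};
struct WaitOp {
  std::optional<HalfDmaCpyNdOp> producer;  // the DMA op producing the token
};

// Standalone predicate at the level the review suggests: std::nullopt plays
// the role of failure, the bool of "can this wait be folded away". Pulling
// the lambda body out into this shape flattens the nesting in the walk.
std::optional<bool> canFoldBasedOnNpuHalfDmaCpyNdOp(const WaitOp &waitOp) {
  if (!waitOp.producer) return std::nullopt;        // failure: no producer
  if (waitOp.producer->isPacketFlow) return false;  // packet flows keep waits
  return true;
}
```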
Yes, made it a function now
      uint32_t row = getConstantIndexOrAssert(tileOp.getRow());
      uint32_t maxQueueSize = deviceModel.getDmaMaxQueueSize(col, row);
      // Keep wait op if, either reaches the maximum queue size, or there
      // is a duplicate BD ID in the same tile.
Suggested change:
-      // is a duplicate BD ID in the same tile.
+      // is a duplicate BD ID in the same tile, or packet flow, or the queue is empty
?
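To make the combined condition concrete, a hypothetical predicate bundling the cases named in this thread (illustrative only, not the pass's actual code):

```cpp
#include <cstddef>

// A wait must be kept if the channel's task queue reached its maximum size,
// if a BD ID repeats on the same tile, if the flow is a packet flow, or if
// the queue is empty (nothing earlier to fold the synchronization into).
bool mustKeepWait(std::size_t queueSize, std::size_t maxQueueSize,
                  bool duplicateBdId, bool isPacketFlow) {
  return queueSize >= maxQueueSize || duplicateBdId || isPacketFlow ||
         queueSize == 0;
}
```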
// -----

// Expect no DMA waits to be folded, since the same BD ID is used.
// CHECK-LABEL: @fold_dma_waits_same_bd_id
Could just be:
CHECK-COUNT-2: dma_wait
CHECK-NOT: dma_wait
I'm unblocking this and letting Jorn accept/reject, as I don't have enough context to know if this is good to land.
LGTM
This is an enhancement for #962. In the previous PR, DMA waits on the same `connection` (and the same tile) could be folded, exploiting the fact that each DMA channel has a queue size of 4. In this PR, DMA waits across multiple `columns` can also be folded, provided their corresponding `row`, `channel`, and `direction` are the same. This optimization leverages the ability to specify `colNum` in `TCTSync`, where the range `[col, col + colNum)` can be addressed.

The numbers in the following table show the instruction size in words.

| Test (MxKxN) | No Folding | Only Fold by Connection | Only Fold by Column | Fold Both |
|--------------|------------|-------------------------|---------------------|-----------|
| 512x4096x512 | 1228 | 1132 | 1120 | 1096 |
| 512x512x4096 | 820 | 772 | 748 | 736 |
| 4096x512x512 | 4628 | 4244 | 4220 | 4124 |
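As a loose illustration of the cross-column idea (hypothetical names throughout; `tctSync` below is invented, the real `TCTSync` lives in aie_rt): waits that share `row`, `channel`, and `direction` but sit in contiguous columns can be replaced by a single sync covering `[col, col + colNum)`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct PendingWait {
  int col, row, channel, direction;
};

// Given waits already grouped by (row, channel, direction), emit one sync per
// contiguous column range instead of one sync per wait.
void emitFoldedSyncs(std::vector<PendingWait> waits) {
  std::sort(waits.begin(), waits.end(),
            [](const PendingWait &a, const PendingWait &b) {
              return a.col < b.col;
            });
  for (std::size_t i = 0; i < waits.size();) {
    std::size_t j = i + 1;
    // Extend the run while columns stay contiguous.
    while (j < waits.size() && waits[j].col == waits[j - 1].col + 1) ++j;
    int col = waits[i].col;
    int colNum = static_cast<int>(j - i);  // addresses [col, col + colNum)
    // hypothetical: tctSync(col, waits[i].row, waits[i].channel,
    //                       waits[i].direction, colNum);
    (void)col; (void)colNum;
    i = j;
  }
}
```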
Each DMA channel has a task queue with a depth of 4, so a DMA wait is only required once every 4 pushes, reducing unnecessary synchronization.
Example: https://gist.github.com/Yu-Zhewen/5f569b56c7b1f1a8715a7c4c3bf9e609
Results compared to 7c4b985:
This optimization is orthogonal to DMA chaining #931.