
Add DMA loop subsumption transformation #512

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

jtuyls
Collaborator

@jtuyls jtuyls commented Jul 8, 2024

Addresses #495.

DMA loop iteration subsumption tries to fold scf.for loops into the DMA operations themselves by encoding each loop's iteration in the DMA access patterns and hoisting the DMA operations out of the loop. See the issue above for more details.

NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline, as making this work end-to-end requires some changes to how BD IDs are assigned to amdaie.npu.dma_memcpy_nd operations. Previously, all operations would use id == 0, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is no longer the case, some new logic is needed to assign ids to the amdaie.npu.dma_memcpy_nd operations. This will be addressed in a follow-up PR.
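The core of the transformation can be sketched as plain access-pattern arithmetic. The snippet below is a hypothetical, simplified model (the helper name and the flat size/stride lists are assumptions, not the pass's actual API): a loop whose induction variable contributes `iv * iv_stride` to a DMA offset is replaced by one extra outermost access-pattern dimension.

```python
def subsume_loop(sizes, strides, lb, ub, step, iv_stride):
    """Fold an scf.for loop around a strided DMA op into one extra
    outermost access-pattern dimension.

    Hypothetical helper for illustration only: `sizes`/`strides` model a
    DMA access pattern, and the loop contributes `iv * iv_stride` to the
    offset for iv in range(lb, ub, step).
    """
    num_iters = (ub - lb + step - 1) // step     # loop trip count
    new_sizes = [num_iters] + sizes              # new outermost dimension
    new_strides = [iv_stride * step] + strides   # advance per iteration
    new_base_offset = lb * iv_stride             # offset from the lower bound
    return new_base_offset, new_sizes, new_strides
```

For example, a 6-iteration loop around a copy of shape [1, 8, 16] (as in the test discussed below) becomes a single copy of shape [6, 1, 8, 16].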

//
//===----------------------------------------------------------------------===//
//
// In absence of a complete hardware model interface, this file contains some
Collaborator Author

@makslevental Any thoughts on building a complete hardware/device model and how to stage it? I am aware of #462 and the logic in this PR could be ported to use that instead of this file.

Collaborator

@makslevental makslevental Jul 8, 2024

Yes, indeed it should: that PR makes aie-rt the canonical device model (modulo whatever small bugs they currently have). These constants can/should go into that device model, but we should be careful lest we become like mlir-aie's model, which diverges from aie-rt for exactly these kinds of things (e.g. NbDims can be recovered, probably painfully, from XAie_DmaDesc's XAie_DmaTensor).

Collaborator

So we should land #462 soon (it's basically ready for review).

Contributor

@newling newling left a comment

Thanks for the clearly commented code!

Comments/questions here, and nitty suggestions in this PR: jtuyls#13

I am not "requesting changes" because there isn't anything which is a blocker for me.


/// Main rewrite function for a doubly strided operation with a `scf.for`
/// parent operation. Only handles a loop induction variable with an
/// `affine.apply` user for now.
Contributor

If the induction variable is a direct operand, it is not used? So you have a map like

#map = affine_map<(d0) -> (d0)>

which is folded away, instead of

#map = affine_map<(d0) -> (2 * d0)>

Then this pass doesn't apply? Do you think identity maps like this are unlikely, or is this just future work?

Collaborator Author

@jtuyls jtuyls Jul 9, 2024

I think we won't ever see this: these identity maps get folded away, and then this just becomes the case where the offset is the induction variable itself instead of an affine.apply on an induction variable. In any case, I added a test case for it, since I have now added support for plain induction variables as well: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-c50b6a09faa065aeac408d16634baa440e56664a3545daadfe435c7b8bc1eee4R404


// -----

// CHECK-LABEL: @npu_dma_cpy_nd_with_for_dependency_on_target
Contributor

Let me check that I understand the logic here.

  • There's a circular_dma_cpy_nd operation from L2 to L2 (maybe this should be L3 to L2?)
  • There's a control code (lx6) instruction (aka the dma_cpy_nd op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16].
  • This loop of copies is transformed by the new pass into a single copy, of a tensor of size [6, 1, 8, 16].

Is the idea that back pressure will ensure that the data gets copied in manageable size?

Collaborator Author

There's a circular_dma_cpy_nd operation from L2 to L2 (maybe this should be L3 to L2?)

This pass doesn't make any assumption on the source/target of DMA operations, so L2 -> L2 should work as well. L2 -> L2 can be implemented on HW btw.

There's a control code (lx6) instruction (aka the dma_cpy_nd op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16] .
This loop of copies is transformed by the new pass into a single copy, of a tensor of size [6, 1, 8, 16].
Is the idea that back pressure will ensure that the data gets copied in manageable size?

The core optimization is to avoid having to program the shim DMAs over and over as the lx6 is not very fast and also to keep the shim DMA busy for a very long time with a single instruction.

There might be back-pressure on the L2 side depending on how fast the consumer can write into L2 (while synchronized with other DMAs), but it might also be the case that the data fabric connection to DDR can't keep up with the shim DMA's requests. Regardless, the shim DMA might work a very long time on this single buffer descriptor instruction.

size_t maxNbDims) -> LogicalResult {
size_t counter = 0;
for (Value offset : dynamicOffsets)
if (allApplyValues.contains(offset)) counter++;
Contributor

Perhaps add a test where there is a dynamic dimension which is not from an induction variable (just to cover all paths through the code).

Collaborator Author

if (applyValues.contains(offsetValue)) {
auto applyOp =
dyn_cast<affine::AffineApplyOp>(offsetValue.getDefiningOp());
if (!applyOp) return failure();
Contributor

Can this ever happen? Based on applyValues.contains(offsetValue) above, I'd have thought it cannot.

Collaborator Author

I was thinking about the case where the offset would be an induction variable instead of an affine.apply operating on an induction variable. I originally didn't handle that case, since we don't see it right now and the PR was becoming rather large already, so I intended to handle it in a follow-up. But I have now added it to this PR itself.

if (!applyOp) return failure();
AffineMap affineMap = applyOp.getAffineMap();
RetrieveStrideSize retriever;
retriever.visit(affineMap.getResult(0));
Contributor

I suspect this is where affine maps which aren't nice like "#map = affine_map<(d0) -> (d0 * 16)>" get filtered out, but I don't understand how. Am I correct? If so, please add a comment.

Collaborator Author

Yeah, the affine expressions are filtered here via the affine expression visitor. I added some comments here.
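To make the filtering concrete, here is a hedged sketch of what such an affine-expression visitor does (the expression encoding and function name are invented for illustration; the actual `RetrieveStrideSize` visitor operates on MLIR `AffineExpr` nodes in C++): only `d0` or `d0 * constant` yields a stride, and any other expression is rejected.

```python
def retrieve_stride(expr):
    """Return the stride encoded by an affine expression, or None if the
    expression has an unsupported shape.

    Expressions are modeled as nested tuples (an assumption for this
    sketch): ("dim", i), ("cst", c), or ("mul", lhs, rhs).
    """
    kind = expr[0]
    if kind == "dim":
        # Plain induction variable: d0 -> stride 1.
        return 1
    if kind == "mul":
        lhs, rhs = expr[1], expr[2]
        # d0 * cst, in either operand order, yields the constant stride.
        if lhs[0] == "dim" and rhs[0] == "cst":
            return rhs[1]
        if lhs[0] == "cst" and rhs[0] == "dim":
            return lhs[1]
    # Anything else (additions, floordivs, ...) is filtered out.
    return None
```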

/// (through semaphore locks), the inter-iteration access pattern is typically
/// used as an additional intra-iteration access pattern, resulting in 4 DMA
/// dimensions which can be used to address global memory data.
static const int64_t kAMDAIEShimDmaNbDims = 4;
Contributor

The decision of when to consider a BD's inter-iteration dimensions as true dimensions seems a bit fluid, based on your description. Would it be less risky to keep these constants local to the scope they're used in, rather than leak them as global constants? I think having these global constants might be confusing, especially as the .td description of npu.dma_cpy_nd says it has 'unlimited number of dimensions'.

Collaborator Author

Separated these constants into pure hardware parameters (which will be read from the device model once that lands; see discussion with Max above). And moved the intricacies of how they're used to the local scope: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-06f23460d4354d096b514e0c3a4c3ea0abcd54e766adc9c515ea30973badc9a3R356

@@ -1,4 +1,4 @@
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-dma-loop-subsumption,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s
Contributor

Can/should the new pass be added to the C++ pipeline?

Collaborator Author

I want to do that in a follow-up:

NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline as making this work in E2E needs some changes to how BD IDs are assigned to amdaie.npu.dma_memcpy_nd operations. Earlier, all operations would use id == 0, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is not the case anymore, some new logic is needed to assign ids to the amdaie.npu.dma_memcpy_nd operations. This will be addressed in a follow-up PR.

I manually verified correctness in E2E though.

@jtuyls jtuyls force-pushed the dma-loop-subsumption branch 2 times, most recently from 8c75ba4 to 74eeefa Compare July 9, 2024 12:44
@jtuyls
Collaborator Author

jtuyls commented Jul 9, 2024

Comments/questions here, and nitty suggestions in this PR: jtuyls#13

Merged this, but made some adjustments to it.

Addressing a comment from your PR here:

Also, I'm wondering if a pattern approach is best here, rather than a simple pass which walks the IR once. We've had discussions about this previously in the team and the general consensus was that simple IR walks should be used unless you really need the fixed point convergence of patterns. I find it easier (not impossible) to emit errors and signal pass failures when not using patterns.

There can be nested-loop cases where, after a rewrite on the strided operation happens, you want to consider it again within its new scope.
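The nested-loop argument can be illustrated with a small self-contained sketch (hypothetical names and data layout, not the pass's real structures): once the DMA op is hoisted out of the innermost loop, it sits directly inside the next enclosing loop and the same rewrite applies again, which is the fixed-point behavior a pattern driver gives for free.

```python
def subsume_nested(loops, sizes, strides):
    """Repeatedly apply loop subsumption from innermost to outermost loop.

    `loops` is a list of (lb, ub, step, iv_stride) tuples, innermost
    first; each tuple models a loop whose induction variable contributes
    `iv * iv_stride` to the DMA offset (an assumed encoding for this
    sketch).
    """
    offset = 0
    for lb, ub, step, iv_stride in loops:
        n = (ub - lb + step - 1) // step       # trip count of this loop
        sizes = [n] + sizes                    # each hoist adds a dimension
        strides = [iv_stride * step] + strides
        offset += lb * iv_stride               # accumulate lower-bound offsets
    return offset, sizes, strides
```

Each iteration of the driver loop corresponds to one application of the rewrite pattern after the previous hoist exposed the op to the next loop.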

Contributor

@newling newling left a comment

Thanks for the feedback

@jtuyls jtuyls force-pushed the dma-loop-subsumption branch from 74eeefa to e52c573 Compare July 9, 2024 17:23
@jtuyls jtuyls force-pushed the dma-loop-subsumption branch from e52c573 to 70a3b42 Compare July 9, 2024 17:25
@jtuyls jtuyls merged commit 0ee892f into nod-ai:main Jul 9, 2024
2 checks passed
@jtuyls jtuyls deleted the dma-loop-subsumption branch July 9, 2024 17:55
3 participants