
Add DMA loop subsumption transformation #512

Merged
merged 1 commit into from
Jul 9, 2024

Conversation

jtuyls
Collaborator

@jtuyls jtuyls commented Jul 8, 2024

Addresses #495.

DMA loop iteration subsumption tries to fold scf.for loops into the DMA operations themselves by encoding each loop's iteration in the DMA access patterns and hoisting the DMA operations out of the loop. See the issue above for more details.

NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline, as making this work end-to-end requires some changes to how BD IDs are assigned to amdaie.npu.dma_memcpy_nd operations. Previously, all operations would use id == 0, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is no longer the case, some new logic is needed to assign ids to the amdaie.npu.dma_memcpy_nd operations. This will be addressed in a follow-up PR.
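The core of the transformation can be sketched as plain access-pattern arithmetic. The snippet below is a hypothetical, simplified model (the helper name and the flat size/stride lists are assumptions, not the pass's actual API): a loop whose induction variable contributes `iv * iv_stride` to a DMA offset is replaced by one extra outermost access-pattern dimension.

```python
def subsume_loop(sizes, strides, lb, ub, step, iv_stride):
    """Fold an scf.for loop around a strided DMA op into one extra
    outermost access-pattern dimension.

    Hypothetical helper for illustration only: `sizes`/`strides` model a
    DMA access pattern, and the loop contributes `iv * iv_stride` to the
    offset for iv in range(lb, ub, step).
    """
    num_iters = (ub - lb + step - 1) // step     # loop trip count
    new_sizes = [num_iters] + sizes              # new outermost dimension
    new_strides = [iv_stride * step] + strides   # advance per iteration
    new_base_offset = lb * iv_stride             # offset from the lower bound
    return new_base_offset, new_sizes, new_strides
```

For example, a 6-iteration loop around a copy of shape [1, 8, 16] (as in the test discussed below) becomes a single copy of shape [6, 1, 8, 16].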

//
//===----------------------------------------------------------------------===//
//
// In absence of a complete hardware model interface, this file contains some
Collaborator Author

@makslevental Any thoughts on building a complete hardware/device model and how to stage it? I am aware of #462 and the logic in this PR could be ported to use that instead of this file.

Collaborator

@makslevental makslevental Jul 8, 2024

Yes, indeed it should: that PR makes aie-rt the canonical device model (modulo whatever small bugs they currently have). These constants can/should go into that device model, but we should be careful lest we become like mlir-aie's model, which diverges from aie-rt for exactly these kinds of things (e.g. NbDims can be recovered, probably painfully, from XAie_DmaDesc's XAie_DmaTensor).

Collaborator

So we should land #462 soon (it's basically ready for review).

Contributor

@newling newling left a comment

Thanks for the clearly commented code!

Comments/questions here, and nitty suggestions in this PR: jtuyls#13

I am not "requesting changes" because there isn't anything which is a blocker for me.


/// Main rewrite function for a doubly strided operation with a `scf.for`
/// parent operation. Only handles a loop induction variable with an
/// `affine.apply` user for now.
Contributor

If the induction variable is a direct operand, it is not used? So you have a map like

#map = affine_map<(d0) -> (d0)>

which is folded away, instead of

#map = affine_map<(d0) -> (2 * d0)>

Then this pass doesn't apply? Do you think identity maps like this are unlikely, or is this just future work?

Collaborator Author

@jtuyls jtuyls Jul 9, 2024

I think we won't ever see this: these identity maps get folded away, and then this just becomes the case where the offset is the induction variable itself instead of an affine.apply on an induction variable. In any case, I added a test case for it, since I have now added support for plain induction variables as well: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-c50b6a09faa065aeac408d16634baa440e56664a3545daadfe435c7b8bc1eee4R404


// -----

// CHECK-LABEL: @npu_dma_cpy_nd_with_for_dependency_on_target
Contributor

Let me check that I understand the logic here.

  • There's a circular_dma_cpy_nd operation from L2 to L2 (maybe this should be L3 to L2?)
  • There's a control code (lx6) instruction (aka the dma_cpy_nd op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16].
  • This loop of copies is transformed by the new pass into a single copy, of a tensor of size [6, 1, 8, 16].

Is the idea that back pressure will ensure that the data gets copied in manageable size?

Collaborator Author

There's a circular_dma_cpy_nd operation from L2 to L2 (maybe this should be L3 to L2?)

This pass doesn't make any assumption on the source/target of DMA operations, so L2 -> L2 should work as well. L2 -> L2 can be implemented on HW btw.

There's a control code (lx6) instruction (aka the dma_cpy_nd op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16] .
This loop of copies is transformed by the new pass into a single copy, of a tensor of size [6, 1, 8, 16].
Is the idea that back pressure will ensure that the data gets copied in manageable size?

The core optimization is to avoid having to program the shim DMAs over and over as the lx6 is not very fast and also to keep the shim DMA busy for a very long time with a single instruction.

There might be back-pressure on the L2 side depending on how fast the consumer can write into L2 (while synchronized with other DMAs), but it might also be the case that the data fabric connection to DDR can't keep up with the shim DMA's requests. Regardless, the shim DMA might work a very long time on this single buffer descriptor instruction.

size_t maxNbDims) -> LogicalResult {
size_t counter = 0;
for (Value offset : dynamicOffsets)
if (allApplyValues.contains(offset)) counter++;
Contributor

Perhaps add a test where there is a dynamic dimension which is not from an induction variable (just to cover all paths through the code).

Collaborator Author

if (applyValues.contains(offsetValue)) {
auto applyOp =
dyn_cast<affine::AffineApplyOp>(offsetValue.getDefiningOp());
if (!applyOp) return failure();
Contributor

Can this ever happen? Based on applyValues.contains(offsetValue) above, I'd have thought it cannot.

Collaborator Author

I was thinking about the case where the offset would be an induction variable instead of an affine.apply operating on an induction variable. I originally didn't handle that case, since we don't see it right now and the PR was becoming rather large already, so I intended to handle it in a follow-up. But I have now added it to this PR itself.

if (!applyOp) return failure();
AffineMap affineMap = applyOp.getAffineMap();
RetrieveStrideSize retriever;
retriever.visit(affineMap.getResult(0));
Contributor

I suspect this is where affine maps which aren't nice like "#map = affine_map<(d0) -> (d0 * 16)>" get filtered out, but I don't understand how. Am I correct? If so, please add a comment.

Collaborator Author

Yeah, the affine expressions are filtered here via the affine expression visitor. I added some comments here.
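To make the filtering concrete, here is a hedged sketch of what such an affine-expression visitor does (the expression encoding and function name are invented for illustration; the actual `RetrieveStrideSize` visitor operates on MLIR `AffineExpr` nodes in C++): only `d0` or `d0 * constant` yields a stride, and any other expression is rejected.

```python
def retrieve_stride(expr):
    """Return the stride encoded by an affine expression, or None if the
    expression has an unsupported shape.

    Expressions are modeled as nested tuples (an assumption for this
    sketch): ("dim", i), ("cst", c), or ("mul", lhs, rhs).
    """
    kind = expr[0]
    if kind == "dim":
        # Plain induction variable: d0 -> stride 1.
        return 1
    if kind == "mul":
        lhs, rhs = expr[1], expr[2]
        # d0 * cst, in either operand order, yields the constant stride.
        if lhs[0] == "dim" and rhs[0] == "cst":
            return rhs[1]
        if lhs[0] == "cst" and rhs[0] == "dim":
            return lhs[1]
    # Anything else (additions, floordivs, ...) is filtered out.
    return None
```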

/// (through semaphore locks), the inter-iteration access pattern is typically
/// used as an additional intra-iteration access pattern, resulting in 4 DMA
/// dimensions which can be used to address global memory data.
static const int64_t kAMDAIEShimDmaNbDims = 4;
Contributor

The decision of when to consider a BD's inter-iteration dimensions as true dimensions seems a bit fluid, based on your description. Would it be less risky to keep these constants local to the scope they're used in, rather than leak them as global constants? I think having these global constants might be confusing, especially as the .td description of npu.dma_cpy_nd says it has 'unlimited number of dimensions'.

Collaborator Author

Separated these constants into pure hardware parameters (which will be read from the device model once that lands; see discussion with Max above). And moved the intricacies of how they're used to the local scope: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-06f23460d4354d096b514e0c3a4c3ea0abcd54e766adc9c515ea30973badc9a3R356

@@ -1,4 +1,4 @@
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-dma-loop-subsumption,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s
Contributor

Can/should the new pass be added to the C++ pipeline?

Collaborator Author

I want to do that in a follow-up:

NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline as making this work in E2E needs some changes to how BD IDs are assigned to amdaie.npu.dma_memcpy_nd operations. Earlier, all operations would use id == 0, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is not the case anymore, some new logic is needed to assign ids to the amdaie.npu.dma_memcpy_nd operations. This will be addressed in a follow-up PR.

I manually verified correctness in E2E though.

@jtuyls jtuyls force-pushed the dma-loop-subsumption branch 2 times, most recently from 8c75ba4 to 74eeefa Compare July 9, 2024 12:44
@jtuyls
Collaborator Author

jtuyls commented Jul 9, 2024

Comments/questions here, and nitty suggestions in this PR: jtuyls#13

Merged this, but made some adjustments to it.

Addressing a comment from your PR here:

Also, I'm wondering if a pattern approach is best here, rather than a simple pass which walks the IR once. We've had discussions about this previously in the team and the general consensus was that simple IR walks should be used unless you really need the fixed point convergence of patterns. I find it easier (not impossible) to emit errors and signal pass failures when not using patterns.

There can be nested-loop cases where, after a rewrite on the strided operation happens, you want to consider it again within its new scope.
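The nested-loop argument can be illustrated with a small self-contained sketch (hypothetical names and data layout, not the pass's real structures): once the DMA op is hoisted out of the innermost loop, it sits directly inside the next enclosing loop and the same rewrite applies again, which is the fixed-point behavior a pattern driver gives for free.

```python
def subsume_nested(loops, sizes, strides):
    """Repeatedly apply loop subsumption from innermost to outermost loop.

    `loops` is a list of (lb, ub, step, iv_stride) tuples, innermost
    first; each tuple models a loop whose induction variable contributes
    `iv * iv_stride` to the DMA offset (an assumed encoding for this
    sketch).
    """
    offset = 0
    for lb, ub, step, iv_stride in loops:
        n = (ub - lb + step - 1) // step       # trip count of this loop
        sizes = [n] + sizes                    # each hoist adds a dimension
        strides = [iv_stride * step] + strides
        offset += lb * iv_stride               # accumulate lower-bound offsets
    return offset, sizes, strides
```

Each iteration of the driver loop corresponds to one application of the rewrite pattern after the previous hoist exposed the op to the next loop.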

Contributor

@newling newling left a comment

Thanks for the feedback

@jtuyls jtuyls force-pushed the dma-loop-subsumption branch from 74eeefa to e52c573 Compare July 9, 2024 17:23
@jtuyls jtuyls force-pushed the dma-loop-subsumption branch from e52c573 to 70a3b42 Compare July 9, 2024 17:25
@jtuyls jtuyls merged commit 0ee892f into nod-ai:main Jul 9, 2024
2 checks passed
@jtuyls jtuyls deleted the dma-loop-subsumption branch July 9, 2024 17:55
3 participants