Add DMA loop subsumption transformation #512
Conversation
//
//===----------------------------------------------------------------------===//
//
// In absence of a complete hardware model interface, this file contains some
@makslevental Any thoughts on building a complete hardware/device model and how to stage it? I am aware of #462 and the logic in this PR could be ported to use that instead of this file.
Yes, indeed it should - that PR makes `aie-rt` the canonical device model (modulo whatever small bugs they currently have). These constants can/should go into that device model, but we should be careful lest we become like `mlir-aie`'s model, which diverges from `aie-rt` for exactly these kinds of things (e.g. `NbDims` can be recovered, probably painfully, from `XAie_DmaDesc` / `XAie_DmaTensor`).
So we should land #462 soon (it's basically ready for review).
Thanks for the clearly commented code!
Comments/questions here, and nitty suggestions in this PR: jtuyls#13
I am not "requesting changes" because there isn't anything which is a blocker for me.
/// Main rewrite function for a doubly strided operation with a `scf.for`
/// parent operation. Only handle a loop induction variable with an
/// `affine.apply` user for now.
If the induction variable is a direct operand, it is not handled? That is, if you have a map like `#map = affine_map<(d0) -> (d0)>`, which is folded away, instead of `#map = affine_map<(d0) -> (2 * d0)>`, then this pass doesn't apply? Do you think identity maps like this are unlikely, or is this just future work?
I think we won't ever see this, as these identity maps get folded away, and then this just becomes the case where the offset is the induction variable itself instead of an `affine.apply` on an induction variable. But in any case, I added support for plain induction variables as well now, together with a test case: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-c50b6a09faa065aeac408d16634baa440e56664a3545daadfe435c7b8bc1eee4R404
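A minimal sketch of the two cases being discussed (the helper `classifyOffset` and `OffsetKind` are illustrative only, not the PR's actual code):

```cpp
#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/SCF/IR/SCF.h"

enum class OffsetKind { InductionVar, AffineApplyOnInductionVar, Other };

// Identity maps like affine_map<(d0) -> (d0)> get folded away, so the offset
// is then the induction variable directly; otherwise it may be the result of
// an affine.apply (e.g. affine_map<(d0) -> (2 * d0)>) on the induction variable.
static OffsetKind classifyOffset(mlir::Value offset, mlir::scf::ForOp forOp) {
  if (offset == forOp.getInductionVar()) return OffsetKind::InductionVar;
  if (auto applyOp = offset.getDefiningOp<mlir::affine::AffineApplyOp>()) {
    for (mlir::Value operand : applyOp.getMapOperands())
      if (operand == forOp.getInductionVar())
        return OffsetKind::AffineApplyOnInductionVar;
  }
  return OffsetKind::Other;
}
```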
// -----
// CHECK-LABEL: @npu_dma_cpy_nd_with_for_dependency_on_target |
Let me check that I understand the logic here.
- There's a `circular_dma_cpy_nd` operation from L2 to L2 (maybe this should be L3 to L2?)
- There's a control code (lx6) instruction (aka the `dma_cpy_nd` op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16].
- This loop of copies is transformed by the new pass into a single copy of a tensor of size [6, 1, 8, 16].

Is the idea that back pressure will ensure that the data gets copied in manageable sizes?
> There's a `circular_dma_cpy_nd` operation from L2 to L2 (maybe this should be L3 to L2?)

This pass doesn't make any assumptions about the source/target of DMA operations, so L2 -> L2 should work as well. L2 -> L2 can be implemented on HW, btw.

> There's a control code (lx6) instruction (aka the `dma_cpy_nd` op) in a loop, where each iteration of the loop starts a copy of a chunk of shape [1, 8, 16].
> This loop of copies is transformed by the new pass into a single copy of a tensor of size [6, 1, 8, 16].
> Is the idea that back pressure will ensure that the data gets copied in manageable sizes?

The core optimization is to avoid having to program the shim DMAs over and over, as the lx6 is not very fast, and also to keep the shim DMA busy for a very long time with a single instruction.

There might be back-pressure on the L2 side depending on how fast the consumer can write into L2 (while synchronized with other DMAs), but it might also be the case that the data fabric connection to DDR can't keep up with the shim DMA's requests. Regardless, the shim DMA might work for a very long time on this single buffer descriptor instruction.
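To make the [6, 1, 8, 16] example concrete, here is a rough, standalone sketch of the access-pattern rewrite that subsumes the loop; the struct and function names are mine (not the pass's), and it assumes a normalized `scf.for` loop (lower bound 0, step 1):

```cpp
#include <cstdint>
#include <vector>

struct AccessPattern {
  std::vector<int64_t> offsets;
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;
};

// Subsume a normalized scf.for loop with `tripCount` iterations and a
// per-iteration stride of `iterStride` (in elements) by prepending one extra
// dimension to the DMA access pattern. E.g. a loop of 6 copies of shape
// [1, 8, 16] becomes a single copy of shape [6, 1, 8, 16].
AccessPattern subsumeLoop(AccessPattern ap, int64_t tripCount,
                          int64_t iterStride) {
  ap.offsets.insert(ap.offsets.begin(), 0);
  ap.sizes.insert(ap.sizes.begin(), tripCount);
  ap.strides.insert(ap.strides.begin(), iterStride);
  return ap;
}
```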
size_t maxNbDims) -> LogicalResult {
  size_t counter = 0;
  for (Value offset : dynamicOffsets)
    if (allApplyValues.contains(offset)) counter++;
Perhaps add a test where there is a dynamic dimension which is not from an induction variable (just to cover all paths through the code).
if (applyValues.contains(offsetValue)) {
  auto applyOp =
      dyn_cast<affine::AffineApplyOp>(offsetValue.getDefiningOp());
  if (!applyOp) return failure();
Can this ever happen? Based on the `applyValues.contains(offsetValue)` check above, I'd have thought it cannot.
I was thinking about the case where the offset is an induction variable instead of an `affine.apply` operating on an induction variable. I initially didn't handle that case, as we don't see it right now and the PR was becoming rather large already, so I wanted to handle it in a follow-up. But I added it to this PR itself now.
if (!applyOp) return failure();
AffineMap affineMap = applyOp.getAffineMap();
RetrieveStrideSize retriever;
retriever.visit(affineMap.getResult(0));
I suspect this is where affine maps which aren't of a nice form like `#map = affine_map<(d0) -> (d0 * 16)>` get filtered out, but I don't understand how. Am I correct? If so, please add a comment.
Yeah, the affine expressions are filtered here via the affine expression visitor. I added some comments here.
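A hedged sketch of the kind of affine-expression visitor referred to here (the actual `RetrieveStrideSize` implementation in the PR may differ): it only extracts a stride from expressions of the form `d0` or `d0 * c`, so any other expression leaves the result unset and is effectively rejected.

```cpp
#include <cstdint>
#include <optional>

#include "mlir/IR/AffineExpr.h"
#include "mlir/IR/AffineExprVisitor.h"

struct StrideVisitor : public mlir::AffineExprVisitor<StrideVisitor> {
  std::optional<int64_t> stride;

  // Accept `dim * constant` (in either operand order) and record the constant.
  void visitMulExpr(mlir::AffineBinaryOpExpr expr) {
    mlir::AffineExpr lhs = expr.getLHS(), rhs = expr.getRHS();
    if (auto cst = llvm::dyn_cast<mlir::AffineConstantExpr>(rhs)) {
      if (llvm::isa<mlir::AffineDimExpr>(lhs)) stride = cst.getValue();
    } else if (auto cst = llvm::dyn_cast<mlir::AffineConstantExpr>(lhs)) {
      if (llvm::isa<mlir::AffineDimExpr>(rhs)) stride = cst.getValue();
    }
  }

  // A plain `d0` corresponds to a stride of 1.
  void visitDimExpr(mlir::AffineDimExpr) { stride = 1; }
};
```

Usage would then be along the lines of `StrideVisitor v; v.visit(affineMap.getResult(0));` followed by checking whether `v.stride` was set.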
/// (through semaphore locks), the inter-iteration access pattern is typically
/// used as an additional intra-iteration access pattern, resulting in 4 DMA
/// dimensions which can be used to address global memory data.
static const int64_t kAMDAIEShimDmaNbDims = 4; |
The decision of when to consider a BD's inter-iteration dimensions as true dimensions seems a bit fluid, based on your description. Would it be less risky to keep these constants local to the scope they're used in, rather than leak them as global constants? I think having these global constants might be confusing, especially as the .td description of npu.dma_cpy_nd says it has 'unlimited number of dimensions'.
I separated these constants into pure hardware parameters (which will be read from the device model once it's available, see the discussion with Max above) and moved the intricacies of how they're used to the local scope: https://github.com/nod-ai/iree-amd-aie/pull/512/files#diff-06f23460d4354d096b514e0c3a4c3ea0abcd54e766adc9c515ea30973badc9a3R356
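As a purely illustrative sketch of the direction discussed (the names below are hypothetical and not the actual device-model API from #462), the pure hardware parameters could eventually be queried from a device model instead of living as file-local globals:

```cpp
#include <cstdint>

// Hypothetical container for pure hardware parameters; the real values would
// eventually come from the device model (see #462) rather than from global
// constants such as kAMDAIEShimDmaNbDims.
struct DmaHardwareProperties {
  int64_t shimDmaNbDims;  // 4 on the shim DMAs discussed above
};

DmaHardwareProperties getShimDmaProperties() {
  return DmaHardwareProperties{/*shimDmaNbDims=*/4};
}
```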
@@ -1,4 +1,4 @@
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s
// RUN: iree-opt --pass-pipeline="builtin.module(fold-memref-alias-ops,iree-amdaie-pack-to-dma,air-copy-to-dma,iree-amdaie-air-dma-to-amdaie-dma,iree-amdaie-insert-cores,cse,iree-amdaie-localize-logicalobjectfifo,iree-amdaie-distribute-cores-and-objectfifos,cse,canonicalize,iree-amdaie-dma-to-circular-dma,func.func(iree-amdaie-create-aie-workgroup),cse,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-access-to-acquire-release,cse,canonicalize,iree-amdaie-dma-loop-subsumption,cse,canonicalize,iree-amdaie-controlcode-loop-unroll,cse,canonicalize,iree-amdaie-create-logical-objectfifo-link,iree-amdaie-canonicalize-doubly-strided-op,iree-amdaie-lower-to-aie,canonicalize)" --split-input-file %s | FileCheck %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can/should the new pass be added to the C++ pipeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to do that in a follow-up:

> NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline as making this work in E2E needs some changes to how BD IDs are assigned to `amdaie.npu.dma_memcpy_nd` operations. Earlier, all operations would use `id == 0`, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is not the case anymore, some new logic is needed to assign ids to the `amdaie.npu.dma_memcpy_nd` operations. This will be addressed in a follow-up PR.

I manually verified correctness in E2E though.
Merged this, but made some adjustments to it. Addressing a comment from your PR here: there can be nested loop cases where, after a rewrite on the strided operation happens, you want to consider it again within its new scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback
Addresses #495.
DMA loop iteration subsumption tries to subsume `scf.for` loops into the DMA operations by updating the DMA access patterns and hoisting the DMA operations out of the loop. See the issue above for more details.
NOTE: this PR doesn't add the new DMA loop subsumption pass to the objectFifo lowering pipeline, as making this work in E2E needs some changes to how BD IDs are assigned to `amdaie.npu.dma_memcpy_nd` operations. Earlier, all operations would use `id == 0`, which is only valid if the control code is executed synchronously 'operation by operation' (every DMA operation was directly followed by a wait). As this is not the case anymore, some new logic is needed to assign ids to the `amdaie.npu.dma_memcpy_nd` operations. This will be addressed in a follow-up PR.