Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass to transfer the strided access pattern from L3 to L2 #792
Pass to transfer the strided access pattern from L3 to L2 #792
Changes from 2 commits
8e29625
326b930
248fdc0
ca6cf57
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, you're looking for a single dimension that can be combined with the innermost dimension? Ideally, this would work for multiple dimensions as well. for example:
should become:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this case should be considered. However, I'd leave out such change in this revision until the logic of new sizes/strides (see the other comments below) is confirmed correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this copy? I would have expected
copyIncludeDims
, i.e. the dimensions that were excluded on the L3 side, should be included on the L2 side?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is to copy the dimensions that are not supposed to be combined from the original L3 addressing, and "exclude" those dimensions that need to be combined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how to calculate the new strides? I don't understand how it's just
initial *= values[i];
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take the following dma ops for example:
The logic is to create L2 side strides from the innermost dimension, and then reverse the vector to have the final order. The new L2 side strides always start with [1], and should have the same number of dimensions as the original L3 side source addressing. The next dimensions are calculated by the logic
initial *= l3OrigSizes[i]
.The
initial
means the innermost continuous elements which isl3OrigSizes[-1]* l3OrigStrides.back[-1]
(the implementation omit l3OrigStrides[-1] because l3OrigStrides[-1] == 1). The combined elements are now continuous on L3 side, but should have a strided addressing on L2 side, the stride should beinitial * l3OrigSizes[-2]
. So after this iteration, the strides are [1, 32 * 32].Same logic for the next iteration, the strides become [1, 32 * 32, 32 * 32 * 2] = [1, 1024, 2048]. After reversion, it's [2048, 1024, 1]. At last insert the stride for the position of the combined dimension (e.g., index 1 in this example), which is
l3OrigStride[dimForCombine]
, and get final strides [2048, 32, 1024, 1].Let me know if this is the correct logic to get the L2 side strides, or if there's a better way to calculate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the core idea is good: i.e. that the strides should be created as if the original L3 was contiguous and then rearranged based on which dimension(s) are combined with the innermost one.
However, I do think this needs extensive tests to ensure correctness as this will otherwise lead to hard-debug numerical errors in the future. So, it would be good to create a standalone utility function that takes in a set of static offsets/sizes/strides and produces the new static L3 and L2 offsets/sizes/strides, so that it can be tested standalone (ctest, not lit) on a lot of different cases, see for example: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/AMDAIEDmaUtilsTest.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if different NPU DMA users have different strides/sizes/offsets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we would have such cases. I looked through the current IR, and only find the case that the connection op has multiple NpuDmaCpyNdOp users (just with different offsets) and one NpuCircularDmaCpyNdOp user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do see it with peeled matmul and regardless, we should check for it and return/abort.