Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DmaLoopSubsumption] Fix for strided op in loop without induction var dependency #570

Merged
merged 1 commit into from
Jul 18, 2024

Conversation

jtuyls
Copy link
Collaborator

@jtuyls jtuyls commented Jul 18, 2024

Fixes an issue exposed by a 128x32x64 matmul: #556.

This issue would come up in partial loop dependencies in case of scf.forall:

scf.forall (%arg2, %arg3) in (2, 6) {
  %1 = amdaie.npu.dma_cpy_nd %0([0, %arg3] [8, 16] [16, 1], [] [] [])
  amdaie.npu.dma_wait(%2, S2MM)
}

In this case, the outer iteration upon which the NPU DMA operation didn't depend ( %arg2) would not be considered, resulting in incorrect output IR:

%1 = amdaie.npu.dma_cpy_nd %0([0, 0, 0] [6, 8, 16] [1, 16, 1], [] [] [])
amdaie.npu.dma_wait(%2, S2MM)

However, the outer iteration should be considered as well, making sure the above DMA operation is executed twice, resulting in the below output IR:

%1 = amdaie.npu.dma_cpy_nd %0([0, 0, 0, 0] [2, 6, 8, 16] [0, 1, 16, 1], [] [] [])
amdaie.npu.dma_wait(%2, S2MM)

This PR fixes the issue by adding general support for subsuming loop iterations into strided operations without any loop induction variable dependency.

Copy link
Contributor

@nirvedhmeshram nirvedhmeshram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fixes, LGTM!

Copy link
Contributor

@Abhishek-Varma Abhishek-Varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jtuyls
Copy link
Collaborator Author

jtuyls commented Jul 18, 2024

I noticed only the outermost dimension can be zero in HW, so I will create a follow-up to not perform the loop subsumption in that case. As that is a specific HW limitation, I think it's not needed to alter the general behaviour of this transformation for that, so will use a flag to enable/disable that limitation.

@jtuyls jtuyls merged commit a990f77 into nod-ai:main Jul 18, 2024
2 checks passed
@jtuyls jtuyls deleted the dma-loop-subsumption-wo-loop-dep branch July 18, 2024 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants